Sunday, February 28, 2016

How to clean up your fasta file

If you have a fasta file with text inside that is not supposed to be there and you want to get rid of it, this can be done with a simple grep command

mpjanic@valkyr:~$ head -n100 test.fasta
GeneMark.hmm PROKARYOTIC (Version 3.25)
Date: Fri Feb 26 19:38:08 2016
Sequence file name: sequence-2.fasta
Model file name: /home/wangt2/MetaGeneMark_v1.mod
RBS: false

Model information: Heuristic_model_for_genetic_code_11_and_GC_51

FASTA definition line: gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
Predicted genes
   Gene    Strand    LeftEnd    RightEnd       Gene     Class
    #                                         Length
    1        +         190         273           84        1
    2        +         354        2816         2463        1
    3        +        2818        3750          933        1
    4        +        3751        5037         1287        1
    5        +        5390        5551          162        1
    6        -        5700        6476          777        1
    7        -        6546        7976         1431        1
    8        +        8255        9208          954        1
    9        +        9323        9910          588        1
   10        -        9945       10511          567        1
   11        -       10660       11373          714        1
   12        -       11399       11803          405        1
   13        +       12180       14096         1917        1
   14        +       14185       15315         1131        1
   15        -       15419       15628          210        2
   16        +       16157       17323         1167        1
   17        +       17383       18288          906        1
   18        -       18326       18751          426        1
   19        -       18959       19168          210        
Use grep to keep only rows with ATCG letters and those with > sign that the headers begin with:

mpjanic@valkyr:~$ grep "[>ATCG]" test.fasta | head -n  100
GeneMark.hmm PROKARYOTIC (Version 3.25)
Model file name: /home/wangt2/MetaGeneMark_v1.mod
Model information: Heuristic_model_for_genetic_code_11_and_GC_51
FASTA definition line: gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
   Gene    Strand    LeftEnd    RightEnd       Gene     Class
>gene_1|GeneMark.hmm|84_nt|+|190|273    >gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
ATGAAACGCATTAGCACCACCATTACCACCACCATCACCACCACCATCACCATTACCATT
ACCACAGGTAACGGTGCGGGCTGA
>gene_2|GeneMark.hmm|2463_nt|+|354|2816 >gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
ATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGGGTT
GCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCC
GCCAAAATCACCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCT
TTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTCTGACGGGACTCGCCGCC
GCCCAGCCGGGATTCCCGCTGGCGCAATTGAAAACTTTCGTCGACCAGGAATTTGCCCAA
ATAAAACATGTCCTGCATGGCATTAGTTTGTTAGGGCAGTGCCCGGATAGCATTAACGCT
GCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCG
CGCGGTCACAACGTTACCGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTAC
CTCGAATCTACTGTCGATATTGCAGAGTCCACCCGCCGTATTGCGGCAAGTCGTATTCCG
GCTGATCACATGGTGCTGATGGCAGGTTTCACCGCCGGTAATGAAAAAGGCGAACTGGTG
GTACTTGGACGCAACGGTTCCGACTACTCCGCGGCGGTGCTGGCTGCCTGTTTACGCGCC
GATTGTTGCGAGATTTGGACGGACGTTGACGGGGTATATACCTGCGACCCGCGTCAGGTG
CCCGATGCGAGGTTGTTGAAATCGATGTCCTACCAGGAAGCGATGGAGCTTTCCTACTTC
GGCGCTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCCAGTTCCAGATCCCTTGC
CTGATTAAAAATACCGGAAATCCTCAAGCTCCAGGTACGCTCATTGGTGCCAGTCGTGAT
GAAGACGAATTACCGGTCAAGGGCATTTCCAATCTGAATAATATGGCAATGTTCAGCGTT
TCCGGCCCGGGGATGAAAGGAATGGTCGGCATGGCGGCGCGCGTCTTTGCTGCAATGTCA
CGCGCCCGTATTTCCGTGGTGCTGATTACGCAATCATCTTCCGAATACAGTATCAGTTTC
TGCGTTCCGCAAAGCGACTGTGTGCGAGCTGAACGGGCAATGCAGGAAGAGTTCTACCTG
GAACTGAAAGAAGGCTTACTGGAGCCGCTGGCGGTGACGGAACGGCTGGCCATTATCTCG
GTGGTAGGTGATGGTATGCGCACCTTGCGTGGGATCTCGGCGAAATTCTTTGCCGCGCTG
GCCCGCGCCAATATCAACATTGTCGCTATTGCTCAGGGATCTTCTGAACGCTCAATCTCT
GTCGTGGTAAATAACGATGATGCGACCACTGGCGTGCGCGTTACTCATCAGATGCTGTTC
AATACCGATCAGGTTATCGAAGTGTTTGTGATTGGCGTCGGTGGCGTTGGCGGTGCGCTG
CTGGAGCAACTGAAGCGTCAGCAAAGCTGGTTGAAGAATAAACATATCGACTTACGTGTC
TGCGGTGTTGCTAACTCGAAGGCTCTGCTCACCAATGTGCATGGCCTAAATCTGGAAAAC
TGGCAGGAAGAACTGGCGCAAGCCAAAGAGCCGTTTAATCTCGGGCGCTTAATTCGCCTC
GTGAAAGAATATCATCTGCTGAACCCGGTCATTGTTGACTGCACCTCCAGCCAGGCAGTG
GCGGATCAATATGCCGACTTCCTGCGCGAAGGTTTCCACGTTGTCACGCCGAACAAAAAG
GCCAACACCTCGTCGATGGATTACTACCATCTGTTGCGTCATGCGGCTGAAAAATCGCGG
CGTAAATTCCTCTATGACACCAACGTTGGGGCTGGATTACCGGTTATTGAGAACCTGCAA
AATCTGCTCAATGCTGGTGATGAATTGATGAAGTTCTCCGGCATTCTTTCAGGTTCGCTT
TCTTATATCTTCGGCAAGTTAGACGAAGGCATGAGTTTCTCCGAGGCGACTACGCTGGCG
CGGGAAATGGGTTATACCGAACCGGATCCGCGAGATGATCTTTCTGGTATGGATGTAGCG
CGTAAACTATTAATTCTCGCTCGTGAAACGGGACGTGAACTGGAGCTGGCGGATATTGAA
ATTGAACCTGTGCTGCCCGCAGAGTTTAACGCTGAGGGTGATGTTGCCGCTTTTATGGCG
AATCTGTCACAGCTCGACGATCTCTTTGCCGCGCGCGTGGCGAAGGCCCGTGATGAAGGA
AAAGTTTTGCGCTATGTTGGCAATATTGATGAAGATGGCGTCTGCCGCGTGAAGATTGCC
GAAGTGGATGGTAATGATCCGCTGTTCAAAGTGAAAAATGGCGAAAACGCCCTGGCCTTT
TATAGCCACTATTATCAGCCGCTGCCGTTGGTGCTGCGCGGATATGGTGCGGGCAATGAC
GTTACCGCTGCCGGTGTCTTTGCCGATCTGCTACGTACCCTCTCATGGAAGTTAGGAGTC
TGA
>gene_3|GeneMark.hmm|933_nt|+|2818|3750 >gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
ATGGTTAAAGTTTATGCCCCGGCTTCCAGTGCCAATATGAGCGTCGGGTTTGATGTGCTC
GGGGCGGCGGTGACACCCGTTGATGGTGCATTGCTCGGAGATGTAGTCACGGTTGAGTCG
GCAGAGACATTCAGTCTCAACAACCTCGGACGCTTTGCCGATAAGCTGCCGTCAGAACCA
CGGGAAAATATCGTTTATCAGTGCTGGGAGCGTTTTTGCCAGGAGCTGGGCAAGCAAATT
CCAGTGGCGATGACTCTGGAAAAGAATATGCCGATCGGTTCGGGCTTAGGCTCCAGCGCC
In this example in addition you will have to remove the first 6 lines


mpjanic@valkyr:~$ grep "[>ATCG]" test.fasta | tail -n +6 | head -n  100
>gene_1|GeneMark.hmm|84_nt|+|190|273    >gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
ATGAAACGCATTAGCACCACCATTACCACCACCATCACCACCACCATCACCATTACCATT
ACCACAGGTAACGGTGCGGGCTGA
>gene_2|GeneMark.hmm|2463_nt|+|354|2816 >gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
ATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGGGTT
GCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCC
GCCAAAATCACCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCT
TTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTCTGACGGGACTCGCCGCC
GCCCAGCCGGGATTCCCGCTGGCGCAATTGAAAACTTTCGTCGACCAGGAATTTGCCCAA
ATAAAACATGTCCTGCATGGCATTAGTTTGTTAGGGCAGTGCCCGGATAGCATTAACGCT
GCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCG
CGCGGTCACAACGTTACCGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTAC

No comments:

Post a Comment