Sunday, February 28, 2016

How to clean up your fasta file

If your FASTA file contains text that is not supposed to be there, you can get rid of it with a simple grep command. In this example the file begins with a GeneMark report:

mpjanic@valkyr:~$ head -n100 test.fasta
GeneMark.hmm PROKARYOTIC (Version 3.25)
Date: Fri Feb 26 19:38:08 2016
Sequence file name: sequence-2.fasta
Model file name: /home/wangt2/MetaGeneMark_v1.mod
RBS: false

Model information: Heuristic_model_for_genetic_code_11_and_GC_51

FASTA definition line: gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
Predicted genes
  Gene  Strand  LeftEnd  RightEnd    Gene   Class
  #                     Length
  1    +     190     273      84    1
  2    +     354    2816     2463    1
  3    +    2818    3750     933    1
  4    +    3751    5037     1287    1
  5    +    5390    5551     162    1
  6    -    5700    6476     777    1
  7    -    6546    7976     1431    1
  8    +    8255    9208     954    1
  9    +    9323    9910     588    1
  10    -    9945    10511     567    1
  11    -    10660    11373     714    1
  12    -    11399    11803     405    1
  13    +    12180    14096     1917    1
  14    +    14185    15315     1131    1
  15    -    15419    15628     210    2
  16    +    16157    17323     1167    1
  17    +    17383    18288     906    1
  18    -    18326    18751     426    1
  19    -    18959    19168     210    
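
Use grep to keep only the lines that contain at least one of the characters >, A, T, C or G. This keeps the sequence rows and the headers, which begin with the > sign, and drops most of the report, including the table of gene coordinates: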

mpjanic@valkyr:~$ grep "[>ATCG]" test.fasta | head -n 100
GeneMark.hmm PROKARYOTIC (Version 3.25)
Model file name: /home/wangt2/MetaGeneMark_v1.mod
Model information: Heuristic_model_for_genetic_code_11_and_GC_51
FASTA definition line: gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
  Gene  Strand  LeftEnd  RightEnd    Gene   Class
>gene_1|GeneMark.hmm|84_nt|+|190|273  >gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
ATGAAACGCATTAGCACCACCATTACCACCACCATCACCACCACCATCACCATTACCATT
ACCACAGGTAACGGTGCGGGCTGA
>gene_2|GeneMark.hmm|2463_nt|+|354|2816 >gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
ATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGGGTT
GCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCC
GCCAAAATCACCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCT
TTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTCTGACGGGACTCGCCGCC
GCCCAGCCGGGATTCCCGCTGGCGCAATTGAAAACTTTCGTCGACCAGGAATTTGCCCAA
ATAAAACATGTCCTGCATGGCATTAGTTTGTTAGGGCAGTGCCCGGATAGCATTAACGCT
GCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCG
CGCGGTCACAACGTTACCGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTAC
CTCGAATCTACTGTCGATATTGCAGAGTCCACCCGCCGTATTGCGGCAAGTCGTATTCCG
GCTGATCACATGGTGCTGATGGCAGGTTTCACCGCCGGTAATGAAAAAGGCGAACTGGTG
GTACTTGGACGCAACGGTTCCGACTACTCCGCGGCGGTGCTGGCTGCCTGTTTACGCGCC
GATTGTTGCGAGATTTGGACGGACGTTGACGGGGTATATACCTGCGACCCGCGTCAGGTG
CCCGATGCGAGGTTGTTGAAATCGATGTCCTACCAGGAAGCGATGGAGCTTTCCTACTTC
GGCGCTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCCAGTTCCAGATCCCTTGC
CTGATTAAAAATACCGGAAATCCTCAAGCTCCAGGTACGCTCATTGGTGCCAGTCGTGAT
GAAGACGAATTACCGGTCAAGGGCATTTCCAATCTGAATAATATGGCAATGTTCAGCGTT
TCCGGCCCGGGGATGAAAGGAATGGTCGGCATGGCGGCGCGCGTCTTTGCTGCAATGTCA
CGCGCCCGTATTTCCGTGGTGCTGATTACGCAATCATCTTCCGAATACAGTATCAGTTTC
TGCGTTCCGCAAAGCGACTGTGTGCGAGCTGAACGGGCAATGCAGGAAGAGTTCTACCTG
GAACTGAAAGAAGGCTTACTGGAGCCGCTGGCGGTGACGGAACGGCTGGCCATTATCTCG
GTGGTAGGTGATGGTATGCGCACCTTGCGTGGGATCTCGGCGAAATTCTTTGCCGCGCTG
GCCCGCGCCAATATCAACATTGTCGCTATTGCTCAGGGATCTTCTGAACGCTCAATCTCT
GTCGTGGTAAATAACGATGATGCGACCACTGGCGTGCGCGTTACTCATCAGATGCTGTTC
AATACCGATCAGGTTATCGAAGTGTTTGTGATTGGCGTCGGTGGCGTTGGCGGTGCGCTG
CTGGAGCAACTGAAGCGTCAGCAAAGCTGGTTGAAGAATAAACATATCGACTTACGTGTC
TGCGGTGTTGCTAACTCGAAGGCTCTGCTCACCAATGTGCATGGCCTAAATCTGGAAAAC
TGGCAGGAAGAACTGGCGCAAGCCAAAGAGCCGTTTAATCTCGGGCGCTTAATTCGCCTC
GTGAAAGAATATCATCTGCTGAACCCGGTCATTGTTGACTGCACCTCCAGCCAGGCAGTG
GCGGATCAATATGCCGACTTCCTGCGCGAAGGTTTCCACGTTGTCACGCCGAACAAAAAG
GCCAACACCTCGTCGATGGATTACTACCATCTGTTGCGTCATGCGGCTGAAAAATCGCGG
CGTAAATTCCTCTATGACACCAACGTTGGGGCTGGATTACCGGTTATTGAGAACCTGCAA
AATCTGCTCAATGCTGGTGATGAATTGATGAAGTTCTCCGGCATTCTTTCAGGTTCGCTT
TCTTATATCTTCGGCAAGTTAGACGAAGGCATGAGTTTCTCCGAGGCGACTACGCTGGCG
CGGGAAATGGGTTATACCGAACCGGATCCGCGAGATGATCTTTCTGGTATGGATGTAGCG
CGTAAACTATTAATTCTCGCTCGTGAAACGGGACGTGAACTGGAGCTGGCGGATATTGAA
ATTGAACCTGTGCTGCCCGCAGAGTTTAACGCTGAGGGTGATGTTGCCGCTTTTATGGCG
AATCTGTCACAGCTCGACGATCTCTTTGCCGCGCGCGTGGCGAAGGCCCGTGATGAAGGA
AAAGTTTTGCGCTATGTTGGCAATATTGATGAAGATGGCGTCTGCCGCGTGAAGATTGCC
GAAGTGGATGGTAATGATCCGCTGTTCAAAGTGAAAAATGGCGAAAACGCCCTGGCCTTT
TATAGCCACTATTATCAGCCGCTGCCGTTGGTGCTGCGCGGATATGGTGCGGGCAATGAC
GTTACCGCTGCCGGTGTCTTTGCCGATCTGCTACGTACCCTCTCATGGAAGTTAGGAGTC
TGA
>gene_3|GeneMark.hmm|933_nt|+|2818|3750 >gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
ATGGTTAAAGTTTATGCCCCGGCTTCCAGTGCCAATATGAGCGTCGGGTTTGATGTGCTC
GGGGCGGCGGTGACACCCGTTGATGGTGCATTGCTCGGAGATGTAGTCACGGTTGAGTCG
GCAGAGACATTCAGTCTCAACAACCTCGGACGCTTTGCCGATAAGCTGCCGTCAGAACCA
CGGGAAAATATCGTTTATCAGTGCTGGGAGCGTTTTTGCCAGGAGCTGGGCAAGCAAATT
CCAGTGGCGATGACTCTGGAAAAGAATATGCCGATCGGTTCGGGCTTAGGCTCCAGCGCC
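
A few lines of the report contain an uppercase A, T, C or G and slip through, so in this example you also have to remove the first 5 lines of the grep output. tail -n +6 starts printing at line 6:
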
mpjanic@valkyr:~$ grep "[>ATCG]" test.fasta | tail -n +6 | head -n 100
>gene_1|GeneMark.hmm|84_nt|+|190|273  >gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
ATGAAACGCATTAGCACCACCATTACCACCACCATCACCACCACCATCACCATTACCATT
ACCACAGGTAACGGTGCGGGCTGA
>gene_2|GeneMark.hmm|2463_nt|+|354|2816 >gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
ATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGGGTT
GCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCC
GCCAAAATCACCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCT
TTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTCTGACGGGACTCGCCGCC
GCCCAGCCGGGATTCCCGCTGGCGCAATTGAAAACTTTCGTCGACCAGGAATTTGCCCAA
ATAAAACATGTCCTGCATGGCATTAGTTTGTTAGGGCAGTGCCCGGATAGCATTAACGCT
GCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCG
CGCGGTCACAACGTTACCGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTAC
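
If you would rather not count the leftover lines by hand, a stricter pattern can do the whole job in one step. The sketch below is one possible variant, not the command used above: it anchors the match so that a line is kept only if it starts with > or consists entirely of nucleotide letters. The function name clean_fasta and the file sample.fasta are made up for illustration, and the pattern assumes that sequence lines contain nothing but A, C, G, T or N (upper or lower case); extend the character class if your file uses other IUPAC codes.

```shell
# Hypothetical helper: keep only FASTA headers and pure nucleotide lines.
clean_fasta() {
    # ^>               line starts with '>' (a FASTA header)
    # ^[ACGTNacgtn]+$  the whole line is made of nucleotide letters only
    grep -E '^>|^[ACGTNacgtn]+$' "$1"
}

# Small made-up sample that mimics the report shown above
printf '%s\n' \
    'GeneMark.hmm PROKARYOTIC (Version 3.25)' \
    'RBS: false' \
    '  Gene  Strand  LeftEnd  RightEnd    Gene   Class' \
    '  1    +     190     273      84    1' \
    '>gene_1|GeneMark.hmm|84_nt|+|190|273' \
    'ATGAAACGCATTAGCACC' \
    'ACCACAGGTAACGGTGCGGGCTGA' > sample.fasta

# Prints only the header and the two sequence lines
clean_fasta sample.fasta
```

Because the pattern is anchored, report lines such as "GeneMark.hmm PROKARYOTIC" no longer match just because they contain a G, so no tail step is needed. Redirect the output to a new file (for example clean_fasta test.fasta > test.clean.fasta) rather than overwriting the original.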
