Wednesday, April 26, 2017

Awk code for counting isoforms in abundance.tsv files from Kallisto

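If you have a series of abundance.tsv files from the Kallisto RNA-Seq quantification tool, each in its own folder, use this command to count the protein-coding isoforms in each output file. It relies on the target IDs containing the transcript biotype (for example, GENCODE transcript headers), so that grep can match the string protein_coding: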

sudo find . -name abundance.tsv -exec sh -c "echo {}; grep protein_coding {} | wc -l"  \;
./1522_1hr_TNF_2/abundance.tsv
79795
./1522_6hr_TNF_2/abundance.tsv
79795
./1522_6hr_TGFB_2/abundance.tsv
79795
./1522_1hr_PMA_1/abundance.tsv
79795
./2989_6hr_TNF_2/abundance.tsv
79795
./2989_6hr_SF_1/abundance.tsv

...
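
The same count can be sketched with grep -c, passing the file names to sh as arguments instead of pasting {} into the command string, which is more robust for paths containing spaces or quotes. The function name count_protein_coding is hypothetical, and the sketch assumes the same folder layout as above.

```shell
# Hypothetical helper: print "<path><TAB><number of protein_coding lines>"
# for every abundance.tsv found below the given directory (default: .).
count_protein_coding() {
    find "${1:-.}" -type f -name abundance.tsv -exec sh -c '
        for f do
            # grep -c prints the number of matching lines, 0 if none
            printf "%s\t%s\n" "$f" "$(grep -c protein_coding "$f")"
        done' sh {} +
}

# Usage: count_protein_coding .
```

Each path and its count end up on one line, so the result is easy to sort or paste into a spreadsheet.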
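To save the protein-coding isoforms of each sample to a separate file next to its abundance.tsv, use the command below. Drop sudo if you own the folders; otherwise the new files will belong to root: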

sudo find . -name abundance.tsv -exec sh -c "grep protein_coding {} > {}.protein_coding"  \;

To count the isoforms that are not expressed, use awk to count the rows where est_counts (column 4) equals 0:

find . -type f -name abundance.tsv -exec sh -c 'awk "\$4==0{print \$0}" "{}" | wc -l' \;
108088
109840
113075
108092
119730
106363
110323
119521
117195
117358
98931

...
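Similarly, to count the isoforms that are expressed, use the command below. Note that if the file still has its header line, that line is counted too, because its fourth field (the text est_counts) is not equal to 0, so each number is one higher than the actual count of expressed isoforms: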

find . -type f -name abundance.tsv -exec sh -c 'awk "\$4!=0{print \$0}" "{}" | wc -l' \;
90532
88780
85545
90528
78890
92257
88297
79099
81425
81262
99689
91844
89495
...
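
Both counts can also be obtained in a single pass over a file. The sketch below skips the header line and assumes est_counts is column 4 of a tab-separated abundance.tsv, as in the commands above; the function name count_expression is hypothetical.

```shell
# Hypothetical helper: print "<expressed><TAB><not expressed>" for one
# abundance.tsv, ignoring the header line.
count_expression() {
    awk -F '\t' '
        NR == 1      { next }            # skip the header line
        $4 + 0 == 0  { zero++; next }    # est_counts is zero
                     { expressed++ }     # est_counts is non-zero
        END          { printf "%d\t%d\n", expressed, zero }' "$1"
}

# Usage: count_expression ./1522_1hr_TNF_2/abundance.tsv
```

Adding 0 to the field forces a numeric comparison, so values written as 0, 0.0 or 0e+00 are all treated as zero.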
List the sample names, which are the names of the folders containing the files:

find . -type f -name abundance.tsv.protein_coding -exec sh -c 'echo $(basename $(dirname {}))' \;
1522_1hr_TNF_2
1522_6hr_TNF_2
1522_6hr_TGFB_2
1522_1hr_PMA_1
2989_6hr_TNF_2
2989_6hr_SF_1
1522_6hr_SF_2
2989_6hr_TNF_1
1522_1hr_TGFB_2
1522_1hr_TGFB_1
1522_1hr_PDGF_1
1522_6hr_PDGF_2
1522_6hr_TNF_1
2989_1hr_PDGF_1
...
Make a folder named after each sample. The folders are created in the current working directory, and -p keeps mkdir from failing if a folder already exists:

find /home/mpjanic/HCASMC_RNASeq/ -type f -name abundance.tsv.protein_coding -exec sh -c 'mkdir -p $(basename $(dirname {}))' \;

Find out how many isoforms are expressed within the subgroup of protein-coding isoforms:

find . -type f -name abundance.tsv.protein_coding -exec sh -c 'awk "\$4!=0{print \$0}" "{}" | wc -l' \;
43139
42014
41017
43618
38208
43021
42031
38464
40069
40246
46087
43028
42478

...
In each folder, save a file with the isoforms that are expressed within the subgroup of protein-coding isoforms:

find . -type f -name abundance.tsv.protein_coding -exec sh -c 'awk "\$4!=0{print \$0}" "{}" > {}.expressed ' \;
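
To see the sample name next to its counts, the steps above can be combined into one summary line per sample. This is a minimal sketch, assuming the sample name is the folder name and est_counts is column 4; the function name summarize_sample is hypothetical.

```shell
# Hypothetical helper: for one abundance.tsv.protein_coding file, print
# "<sample><TAB><protein-coding isoforms><TAB><expressed isoforms>".
summarize_sample() {
    sample=$(basename "$(dirname "$1")")   # folder name = sample name
    awk -F '\t' -v s="$sample" '
        $4 + 0 != 0 { n++ }                # est_counts is non-zero
        END { printf "%s\t%d\t%d\n", s, NR, n }' "$1"
}

# Usage, run from the folder that contains the sample folders:
#   for f in */abundance.tsv.protein_coding; do summarize_sample "$f"; done
```

The .protein_coding files written by grep contain no header line, so NR is the number of protein-coding isoforms.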
