Tuesday, November 1, 2016

Reassigning GENCODE counted reads on RefSeq identifiers. Solution to removing duplicated rows on ID column and selecting by summing on counts using awk scripting

In case you have used Gencode transcripts with featureCounts and have counted mapped reads on Gencode transcripts, in some cases you need to revert back to RefSeq ID (NM_  and NR_). The problem that appears in this case is that multiple Gencode IDs will correspond to a single RefSeq identifier, e.g. in this example NC000078 and NC000080.

head mastertable.genename
        
NC_000072       2       0       0       3       0       4       0       0
NC_000078       0       0       0       0       0       0       3       1
NC_000078       12      0       0       0       0       0       6       1
NC_000079       0       0       0       0       4       0       2       0
NC_000080       0       0       0       0       0       0       0       0
NC_000080       0       0       0       0       0       0       0       0
NC_000080       2       0       0       2       0       0       0       0
NM_001001130    113     149     119     154     149     149     138     199
NM_001001144    1565    855     947     1501    961     1295    874     1449
What you need to do is to select a most dominant RNA isoform and assign a RefSeq ID to it, and eliminate those isoforms that are not expressed and have mainly 0 counts. To do so first calculate sum for each row, and plot it as first column.

awk '{for(i=1;i<=NF;i++) t+=$i; print t"\t"$0; t=0}' mastertable.genename | head -n10
               
9       NC_000072       2       0       0       3       0       4       0       0
4       NC_000078       0       0       0       0       0       0       3       1
19      NC_000078       12      0       0       0       0       0       6       1
6       NC_000079       0       0       0       0       4       0       2       0
0       NC_000080       0       0       0       0       0       0       0       0
0       NC_000080       0       0       0       0       0       0       0       0
4       NC_000080       2       0       0       2       0       0       0       0
1170    NM_001001130    113     149     119     154     149     149     138     199
9447    NM_001001144    1565    855     947     1501    961     1295    874     1449
Next, sort using column 2 by RefSeq ID, and then by column 1, use reverse numerical sort from the highest to the lowest sum.

awk '{for(i=1;i<=NF;i++) t+=$i; print t"\t"$0; t=0}' mastertable.genename | sort -k2,2 -k1,1nr | head -n10

9       NC_000072       2       0       0       3       0       4       0       0
19      NC_000078       12      0       0       0       0       0       6       1
4       NC_000078       0       0       0       0       0       0       3       1
6       NC_000079       0       0       0       0       4       0       2       0
4       NC_000080       2       0       0       2       0       0       0       0
0       NC_000080       0       0       0       0       0       0       0       0
0       NC_000080       0       0       0       0       0       0       0       0
1170    NM_001001130    113     149     119     154     149     149     138     199
9447    NM_001001144    1565    855     947     1501    961     1295    874     1449
Finally, use awk to eliminate rows duplicated on one column, use awk '!a[$2]++' as a short version of awk '{if(! a[$2]){print; a[$2]++}}' meaning if the current second field ($1) is not present in the a array, which will happen if it is a unique field, print the line and add the 2st field to the a array. Next time we see the same field, it will be already in the array and it will not be printed due to the condition if(!a[$2]).


awk '{for(i=1;i<=NF;i++) t+=$i; print t"\t"$0; t=0}' mastertable.genename | sort -k2,2 -k1,1nr | awk '!a[$2]++' | head -n10

9       NC_000072       2       0       0       3       0       4       0       0
19      NC_000078       12      0       0       0       0       0       6       1
6       NC_000079       0       0       0       0       4       0       2       0
4       NC_000080       2       0       0       2       0       0       0       0
1170    NM_001001130    113     149     119     154     149     149     138     199
9447    NM_001001144    1565    855     947     1501    961     1295    874     1449
211     NM_001001152    46      18      56      56      7       12      3       13
19      NM_001001160    2       6       2       0       2       0       3       4
60      NM_001001177    9       0       4       10      10      2       23      2
Finally remove the first column:

awk '{for(i=1;i<=NF;i++) t+=$i; print t"\t"$0; t=0}' mastertable.genename | sort -k2,2 -k1,1nr | awk '!a[$2]++' | cut -f2-| head -n10
        
NC_000072       2       0       0       3       0       4       0       0
NC_000078       12      0       0       0       0       0       6       1
NC_000079       0       0       0       0       4       0       2       0
NC_000080       2       0       0       2       0       0       0       0
NM_001001130    113     149     119     154     149     149     138     199
NM_001001144    1565    855     947     1501    961     1295    874     1449
NM_001001152    46      18      56      56      7       12      3       13
NM_001001160    2       6       2       0       2       0       3       4
NM_001001177    9       0       4       10      10      2       23      2
We see that only NC_000080 and NC_000078 isoforms that show highest sum expression across samples were kept.

Final code:

awk '{for(i=1;i<=NF;i++) t+=$i; print t"\t"$0; t=0}' mastertable.genename | 
sort -k2,2 -k1,1nr | 
awk '!a[$2]++' | 
cut -f2- > mastertable.genename.cleaned
        

No comments:

Post a Comment