Tuesday, November 1, 2016
Reassigning GENCODE counted reads on RefSeq identifiers. Solution to removing duplicated rows on ID column and selecting by summing on counts using awk scripting
In case you have used Gencode transcripts with featureCounts and have counted mapped reads on Gencode transcripts, in some cases you need to revert back to RefSeq ID (NM_ and NR_). The problem that appears in this case is that multiple Gencode IDs will correspond to a single RefSeq identifier, e.g. in this example NC000078 and NC000080.
What you need to do is to select a most dominant RNA isoform and assign a RefSeq ID to it, and eliminate those isoforms that are not expressed and have mainly 0 counts. To do so first calculate sum for each row, and plot it as first column.
Next, sort using column 2 by RefSeq ID, and then by column 1, use reverse numerical sort from the highest to the lowest sum.
Finally, use awk to eliminate rows duplicated on one column, use awk '!a[$2]++' as a short version of meaning if the current second field ($1) is not present in the a array, which will happen if it is a unique field, print the line and add the 2st field to the a array. Next time we see the same field, it will be already in the array and it will not be printed due to the condition if(!a[$2]).
Finally remove the first column:
We see that only NC_000080 and NC_000078 isoforms that show highest sum expression across samples were kept.