Wednesday, March 8, 2017

Collapsing transcript expression levels into gene levels by sum, maximum and average - awk code

In case you have generated a file containing gene, transcript (isoform), expression level,

DN0a272187:~ milospjanic$ cat file.txt
GENE1,ISOFORM1,4
GENE1,ISOFORM2,6
GENE1,ISOFORM3,9
GENE2,ISOFORM1,43
GENE2,ISOFORM2,16
GENE3,ISOFORM3,19
GENE3,ISOFORM1,43
GENE3,ISOFORM2,4
GENE4,ISOFORM1,21
Use this awk code to collapse this file and sum on the gene level. Sum all $3 in hash with $1 as a key, then loop through the array and print keys and values.

awk -F, '{array[$1]+=$3} END { for (i in array) {print i"," array[i]}}' file.txt
GENE1,19
GENE2,59
GENE3,66
GENE4,21
To collapse the isoform file and select the top isoform file use this awk code. Use hash max with the key $1 and value $3. If $3 is greater than max change max to new $3 value and put $3 in array. Next, loop through the array and print keys and values. 

awk -F, '{if($3>max[$1]){array[$1]=$3; max[$1]=$3}} END { for (i in array) {print i"," array[i]}}' file.txt
GENE1,9
GENE2,43
GENE3,43
GENE4,21
To plot average use the following code. Create array hash with $1 as keys that sums $3 values, and 'no' hash with $1 as key that increases by 1. Loop through the array and print keys and division of hash 'array' and 'no', which is a sum divided by number of transcripts.

awk -F, '{array[$1]+=$3; no[$1]+=1} END { for (i in array) {print i"," array[i]/no[i]}}'  file.txt
GENE1,6.33333
GENE2,29.5
GENE3,22
GENE4,21

No comments:

Post a Comment