Monday, January 29, 2018

Calculate per gene average expression from mastertable using awk/bash scripting

To calculate the average expression of each gene across all conditions in a mastertable, use awk. The first 25 lines of the table look like this:

mpjanic@zoran:~$ head mastertable -n25
        TQ6     TQ7     TQ8     TQ9     TQ10    TQ11
lnc-CCDC77-4:1  0       0       0       0       0       0
lnc-COX10-9:1   0       0       0       0       0       0
lnc-MAGEB2-1:1  0       0       0       0       0       0
lnc-TMEM99-2:1  0       0       0       0       0       0
lnc-COX10-9:2   0       0       0       0       0       0
DDN-AS1:2       0       0       0       0       0       0
lnc-TMEM99-2:2  1       0       0       0       0       0
lnc-SPRY4-3:1   0       0       0       0       0       0
DDN-AS1:3       0       0       0       0       0       0
lnc-TMEM99-2:3  0       0       0       0       0       0
DDN-AS1:4       28      32      10      31      2       13
DDN-AS1:5       0       0       0       0       0       0
lnc-ZNF516-4:10 0       0       0       0       0       0
DDN-AS1:6       0       0       0       0       0       0
lnc-ZNF516-4:11 0       0       0       0       0       0
GSEC:2  15      9       16      10      9       34
lnc-AATK-AS1-2:1        15      17      2       19      28      24
lnc-PLCH1-5:1   0       0       0       0       0       0
GSEC:3  0       0       0       0       0       0
lnc-MFSD9-7:1   53      27      22      41      47      41
lnc-PSMC1-1:1   29      30      54      18      23      58
GSEC:4  0       0       0       0       0       0
GSEC:5  0       0       0       0       0       0
lnc-ZNF780B-1:1 124     125     80      123     110     148

Use awk to print a heading row in place of the first line. Then, for every line with more than two fields (NF > 2), take the first field as the gene name, sum fields 2 through NF, and divide by the number of value fields: {sum=0; for (i=2; i<=NF; i++) sum+=$i; print $1, sum/(NF-1)}.
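
As a quick check of the arithmetic, the sketch below applies the same averaging rule to three rows copied from the head output above. The function name row_average is hypothetical, and the heading row is simply skipped here rather than replaced.

```shell
# Hypothetical helper: average fields 2..NF of every data row.
row_average() {
  awk 'NR == 1 { next }                      # skip the sample-name header
       NF > 2 { sum = 0
                for (i = 2; i <= NF; i++) sum += $i
                print $1, sum / (NF - 1) }' "$@"
}

# Three rows copied from the mastertable excerpt.
row_average <<'EOF'
        TQ6     TQ7     TQ8     TQ9     TQ10    TQ11
DDN-AS1:4       28      32      10      31      2       13
GSEC:2  15      9       16      10      9       34
lnc-ZNF780B-1:1 124     125     80      123     110     148
EOF
# DDN-AS1:4 19.3333
# GSEC:2 15.5
# lnc-ZNF780B-1:1 118.333
```

For DDN-AS1:4 the six counts sum to 116, and 116/6 is printed as 19.3333 because awk formats non-integer results with six significant digits by default.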

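Then pipe the result to sort -gr -k2 to sort by the second column (the average) in descending order. The -g option (--general-numeric-sort) compares values as floating-point numbers, so it also handles scientific notation such as 1.5e+06, which awk can print for large non-integer averages: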


mpjanic@zoran:~$ awk 'NR == 1 { print "lncRNA", "Average"; next }    # Print a heading row
> NF > 2 { sum=0; for (i=2; i<=NF; i++) sum+=$i; print $1, sum/(NF-1) }' mastertable | sort -gr -k2 | head -n 20
lnc-SGCE-3:1 157963
lnc-EIF2AK4-6:1 120530
lnc-SLC3A2-6:1 110467
lnc-ATIC-14:1 66120.8
lnc-TRIM69-3:1 59894.5
lnc-TRDMT1-5:2 52209.8
lnc-ANKRD55-6:1 44934.3
lnc-LRRTM4-6:1 44869.3
lnc-LYN-8:1 39859.2
lnc-VGF-4:1 37230.2
lnc-VGF-3:1 32908.2
lnc-VAT1-4:1 27177.7
lnc-SH3D19-2:1 22266.8
IGFBP7-AS1:16 21963.5
lnc-HSD17B7-1:2 21429.5
lnc-BTD-2:1 21348.2
lnc-ARID2-11:1 21063.7
lnc-CBY3-3:2 19925.3
lnc-DYNC2H1-4:1 19492.8
lnc-C6orf120-1:7 18986.2

Then save the sorted table to a file named expression_average_per_gene by redirecting the output instead of piping it to head:

mpjanic@zoran:~$ awk 'NR == 1 { print "lncRNA", "Average"; next }    # Print a heading row
> NF > 2 { sum=0; for (i=2; i<=NF; i++) sum+=$i; print $1, sum/(NF-1) }' mastertable | sort -gr -k2 > expression_average_per_gene
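
With GNU sort, -g ranks a line whose key is not a number below every numeric value, so after sort -gr the "lncRNA Average" heading row ends up on the last line of the output. A minimal sketch of one way to keep the heading on top is to print it separately and sort only the data rows. The function name average_table is hypothetical, and the tab-separated output is an assumption made here for easier downstream parsing.

```shell
# Hypothetical helper: write the heading first, then the sorted averages.
average_table() {
  printf 'lncRNA\tAverage\n'
  awk 'BEGIN { OFS = "\t" }                  # tab-separated output (assumption)
       NR > 1 && NF > 2 { sum = 0
                          for (i = 2; i <= NF; i++) sum += $i
                          print $1, sum / (NF - 1) }' "$1" |
    sort -t "$(printf '\t')" -k2,2gr         # sort on the average only, descending
}
```

Assuming mastertable is in the current directory, it could be called as average_table mastertable > expression_average_per_gene.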
