Monday, July 4, 2016

Some examples of filtering eQTL (or any) tables using awk scripting

Lets say that you have an output table of eQTLs for example from Matrix eQTL tool, or any other table that you want to filter using awk scripting.
Let say that you want to filter the file so that you take first 10000 most significant eQTLs,, and then select top eQTLs per uniq gene. We know that the structure of the table is the following, SNP/gene/beta/t-stat/p-value/FDR:

SNP gene beta t-stat p-value FDR
chr6_86387888_G_T ENSG00000203875.6 1.31262854934142 19.9689048213817 9.53357865403724e-22 7.80968273626686e-14
chr6_86388223_C_T ENSG00000203875.6 1.34695879685172 19.0475590875782 4.89504726551759e-21 2.00495362287351e-13
chr6_86298063_A_G ENSG00000203875.6 -1.33027793036445 -15.8106562475601 2.66918515248737e-18 8.40974427246143e-12
chr6_86302034_A_G ENSG00000203875.6 -1.33027793036445 -15.8106562475601 2.66918515248737e-18 8.40974427246143e-12
chr6_86325632_C_T ENSG00000203875.6 -1.33027793036445 -15.8106562475601 2.66918515248737e-18 8.40974427246143e-12
chr6_86335150_A_T ENSG00000203875.6 -1.33027793036445 -15.8106562475601 2.66918515248737e-18 8.40974427246143e-12
chr6_86340848_G_C ENSG00000203875.6 -1.33027793036445 -15.8106562475601 2.66918515248737e-18 8.40974427246143e-12
chr6_86347605_C_T ENSG00000203875.6 -1.33027793036445 -15.8106562475601 2.66918515248737e-18 8.40974427246143e-12
chr6_86353621_A_G ENSG00000203875.6 -1.33027793036445 -15.8106562475601 2.66918515248737e-18 8.40974427246143e-12

In this case file is already sorted by the p-values. If not use sort command to sort it by column 5. We are going to head first 10000 lines and pipe it to sort to sort it by second and then fifth column, i.e. by name and then p-value in reverse order. 
Next pipe it to awk and create a hash max with key $2 and values $5 where $5 will overwrite itself with each consecutive uniq $2 (name) using $5>max[$2]. Condition this on $5 being greater than current value for max[$2] to keep the maximum $5, p-value.
Next, when max[$2] is equal to $5 create hash row with key $2 and  value would be whole line, $0.
After END, create for loop that will print row while i cycles through row hash keys.

head -n 10000 pc3.peer8.2.cis.txt | sort -k2,2 -k5,5rg | awk '{if($5>max[$2])$5>max[$2]}{max[$2]=$5; row[$2]=$0} END{for (i in row) print row[i]}' > top.eQTLs

Output would be something like this:

chr10_52516665_C_T ENSG00000204147.5 -0.872804485046258 -7.51465022189187 5.00265034917682e-09 7.10141829780494e-05
chr4_6615617_G_A ENSG00000013288.4 -0.763628877790672 -7.03412616143357 2.21139078977285e-08 0.000236246409916277
chr10_134000422_C_T ENSG00000151640.8 -0.870153140016277 -10.1627004045395 2.17590110963635e-12 5.71663521329993e-08
chr6_31914935_A_G ENSG00000243649.4 -1.45954276289726 -8.03300498011514 1.03119064533423e-09 1.63579986613183e-05
chr11_93564393_C_T ENSG00000182919.10 -0.846058617890931 -7.6040877741813 3.80220782585795e-09 5.47010705454904e-05
chr1_213059806_C_A ENSG00000198468.3 1.0620021274388 13.9721503138632 1.49288298157308e-16 6.64638326845182e-11
chr8_22457206_T_C ENSG00000241852.5 0.813855189958306 6.99530601465363 2.49551796773746e-08 0.000262827134212094
chr16_56671632_A_T ENSG00000169688.10 1.34414988623047 9.50379531720248 1.37766774780468e-11 3.33201334461005e-07
chr19_24236833_T_C ENSG00000268362.1 1.4012063586675 8.05309361348013 9.70512733245545e-10 1.5467337289482e-05
chr2_128146762_G_C ENSG00000236682.1 -1.4679424727015 -9.15709826609346 3.72042826691254e-11 8.60199568544154e-07

Lets check if the top eQTL is in the list:

DN52eii6:Downloads milospjanic$ grep 9.53357865403724e-22 top.eQTLs
chr6_86387888_G_T ENSG00000203875.6 1.31262854934142 19.9689048213817 9.53357865403724e-22 7.80968273626686e-14

Before doing this filtering, it may be good to remove header in eQTL input file with

DN52eii6:Downloads milospjanic$ tail +2 pc3.peer8.2.cis.txt > pc3.peer8.2.cis.txt.noheader

No comments:

Post a Comment