Tuesday, July 5, 2016

Using awk scripting to prepare eQTL file for LocusZoom plot

Lets say that you have eQTL file (e.g. eQTL.input) with following output:

"snps"  "gene"  "statistic"     "pvalue"        "FDR"   "beta"
"6:134214525"   "ENSG00000118526.6"     -8.00604105843143       8.03916943247416e-15    8.27596273928866e-13    -0.472596606956573
"6:134226147"   "ENSG00000118526.6"     -7.76939299996685       4.35609034489325e-14    4.1908757955211e-12     -0.453463080643677
"6:134209837"   "ENSG00000118526.6"     -5.91258104036461       6.16525172030328e-09    3.6491797333611e-07     -0.434239358522576
"6:134215779"   "ENSG00000118526.6"     -5.4626188984354        7.33863755697273e-08    3.7776937355882e-06     -0.335590058154907
"6:134215155"   "ENSG00000118526.6"     -5.46040034813125       7.42581108817843e-08    3.82004547528766e-06    -0.335842843000443
"6:134219229"   "ENSG00000118526.6"     -5.4546658391212        7.6558233417211e-08     3.93297487388231e-06    -0.335279178823917

Lets say that you need prepare file for plotting in LocusZoom plot. You can do this with a couple of bash commands and a simple awk script.
Prepare dbSNP file, for example, from 142 version of the database, such as that it contains chromosome/position/position+1 and SNPID (simply cut these columns from the dbSNP file you downloaded):

head snp142.cut_tab
chr1 10019 10020 rs376643643
chr1 10056 10056 rs373328635
chr1 10107 10108 rs62651026
chr1 10108 10109 rs376007522
chr1 10138 10139 rs368469931
chr1 10144 10145 rs144773400
chr1 10146 10147 rs375931351
chr1 10149 10150 rs371194064
chr1 10176 10177 rs201752861

chr1 101......77 10177 rs367896724

Next, grep gene by search with ENSEMBL ID

grep ENSG00000118526 eQTL.input > ENSG00000118526.eQTL.input

Next, remove quotes form such shortened input file, remove header, cut 1st (position) and 4th (p-value) column, remove chr info from beginning of each line.  

Next grep snp142.cut.tab file and select chr6 corresponding lines, which will make subsequent awk code much faster. Remove all chr6_ entries.

Execute awk code. Loop through the eQTL file using while and getline, then create two hashes: a- with keys of $1 and values of 1, and b- with keys $1 and values $2. Next, if in second file, a- with key from third column ==1, print $4 (SNP ID) and b[$3] ($2 from first file, i.e. p-value).


tr -d "\"" < ENSG00000118526.eQTL.input > ENSG00000118526.eQTL.input.noquote
tail -n+2 ENSG00000118526.eQTL.input.noquote > ENSG00000118526.eQTL.input.noquote.noheader
cut -f1,4 ENSG00000118526.eQTL.input.noquote.noheader > ENSG00000118526.eQTL.input.noquote.noheader.cut
sed 's/6\://g' ENSG00000118526.eQTL.input.noquote.noheader.cut > ENSG00000118526.eQTL.input.noquote.noheader.cut.sed

grep "chr6" snp142.cut_tab > snp142.cut_tab.chr6
grep -v "chr6_" snp142.cut_tab.chr6 >snp142.cut_tab.chr6_clean
awk -F " " 'BEGIN{while(getline<"ENSG00000118526.eQTL.input.noquote.noheader.cut.sed") a[$1]=1&&b[$1]=$2 } ; a[$3] ==1 {print $4, b[$3] } ' snp142.cut_tab.chr6_clean >ENSG00000118526.eQTL.input.noquote.noheader.cut.sed.rsid

Test if the file output is correct:

head ENSG00000118526.eQTL.input.noquote.noheader.cut.sed.rsid
rs111452679 0.488039067262398
rs79242155 0.488039067262398
rs386627460 0.488039067262398
rs75165450 0.488039067262398
rs57994571 0.939031341091344
rs6671424 0.488039067262398
rs3753343 0.488039067262398
rs3753342 0.305948571977292
rs111698464 0.488039067262398


No comments:

Post a Comment