Thursday, July 14, 2016

How to substitute newline with one or multiple characters using tr, sed or awk

If you have a file

cat file.txt 
one
two
three
four
five

and want to substitute every new line with one character, (eg. ;) you can use tr

tr '\n' ';' < file.txt
one;two;three;four;five;

If you want to substitute every new line with multiple characters (eg. :::), you cannot use tr, in this case in bash it would be better to use sed

sed ':a;N;$!ba;s/\n/:::/g' file.txt
one:::two:::three:::four:::five

Or using awk and defining output record separator ORS:

awk 1 ORS=':::' file.txt
one:::two:::three:::four:::five:::

Tuesday, July 5, 2016

Using awk scripting to prepare eQTL file for LocusZoom plot

Lets say that you have eQTL file (e.g. eQTL.input) with following output:

"snps"  "gene"  "statistic"     "pvalue"        "FDR"   "beta"
"6:134214525"   "ENSG00000118526.6"     -8.00604105843143       8.03916943247416e-15    8.27596273928866e-13    -0.472596606956573
"6:134226147"   "ENSG00000118526.6"     -7.76939299996685       4.35609034489325e-14    4.1908757955211e-12     -0.453463080643677
"6:134209837"   "ENSG00000118526.6"     -5.91258104036461       6.16525172030328e-09    3.6491797333611e-07     -0.434239358522576
"6:134215779"   "ENSG00000118526.6"     -5.4626188984354        7.33863755697273e-08    3.7776937355882e-06     -0.335590058154907
"6:134215155"   "ENSG00000118526.6"     -5.46040034813125       7.42581108817843e-08    3.82004547528766e-06    -0.335842843000443
"6:134219229"   "ENSG00000118526.6"     -5.4546658391212        7.6558233417211e-08     3.93297487388231e-06    -0.335279178823917

Lets say that you need prepare file for plotting in LocusZoom plot. You can do this with a couple of bash commands and a simple awk script.
Prepare dbSNP file, for example, from 142 version of the database, such as that it contains chromosome/position/position+1 and SNPID (simply cut these columns from the dbSNP file you downloaded):

head snp142.cut_tab
chr1 10019 10020 rs376643643
chr1 10056 10056 rs373328635
chr1 10107 10108 rs62651026
chr1 10108 10109 rs376007522
chr1 10138 10139 rs368469931
chr1 10144 10145 rs144773400
chr1 10146 10147 rs375931351
chr1 10149 10150 rs371194064
chr1 10176 10177 rs201752861

chr1 101......77 10177 rs367896724

Next, grep gene by search with ENSEMBL ID

grep ENSG00000118526 eQTL.input > ENSG00000118526.eQTL.input

Next, remove quotes form such shortened input file, remove header, cut 1st (position) and 4th (p-value) column, remove chr info from beginning of each line.  

Next grep snp142.cut.tab file and select chr6 corresponding lines, which will make subsequent awk code much faster. Remove all chr6_ entries.

Execute awk code. Loop through the eQTL file using while and getline, then create two hashes: a- with keys of $1 and values of 1, and b- with keys $1 and values $2. Next, if in second file, a- with key from third column ==1, print $4 (SNP ID) and b[$3] ($2 from first file, i.e. p-value).


tr -d "\"" < ENSG00000118526.eQTL.input > ENSG00000118526.eQTL.input.noquote
tail -n+2 ENSG00000118526.eQTL.input.noquote > ENSG00000118526.eQTL.input.noquote.noheader
cut -f1,4 ENSG00000118526.eQTL.input.noquote.noheader > ENSG00000118526.eQTL.input.noquote.noheader.cut
sed 's/6\://g' ENSG00000118526.eQTL.input.noquote.noheader.cut > ENSG00000118526.eQTL.input.noquote.noheader.cut.sed

grep "chr6" snp142.cut_tab > snp142.cut_tab.chr6
grep -v "chr6_" snp142.cut_tab.chr6 >snp142.cut_tab.chr6_clean
awk -F " " 'BEGIN{while(getline<"ENSG00000118526.eQTL.input.noquote.noheader.cut.sed") a[$1]=1&&b[$1]=$2 } ; a[$3] ==1 {print $4, b[$3] } ' snp142.cut_tab.chr6_clean >ENSG00000118526.eQTL.input.noquote.noheader.cut.sed.rsid

Test if the file output is correct:

head ENSG00000118526.eQTL.input.noquote.noheader.cut.sed.rsid
rs111452679 0.488039067262398
rs79242155 0.488039067262398
rs386627460 0.488039067262398
rs75165450 0.488039067262398
rs57994571 0.939031341091344
rs6671424 0.488039067262398
rs3753343 0.488039067262398
rs3753342 0.305948571977292
rs111698464 0.488039067262398


Monday, July 4, 2016

Some examples of filtering eQTL (or any) tables using awk scripting

Lets say that you have an output table of eQTLs for example from Matrix eQTL tool, or any other table that you want to filter using awk scripting.
Let say that you want to filter the file so that you take first 10000 most significant eQTLs,, and then select top eQTLs per uniq gene. We know that the structure of the table is the following, SNP/gene/beta/t-stat/p-value/FDR:

SNP gene beta t-stat p-value FDR
chr6_86387888_G_T ENSG00000203875.6 1.31262854934142 19.9689048213817 9.53357865403724e-22 7.80968273626686e-14
chr6_86388223_C_T ENSG00000203875.6 1.34695879685172 19.0475590875782 4.89504726551759e-21 2.00495362287351e-13
chr6_86298063_A_G ENSG00000203875.6 -1.33027793036445 -15.8106562475601 2.66918515248737e-18 8.40974427246143e-12
chr6_86302034_A_G ENSG00000203875.6 -1.33027793036445 -15.8106562475601 2.66918515248737e-18 8.40974427246143e-12
chr6_86325632_C_T ENSG00000203875.6 -1.33027793036445 -15.8106562475601 2.66918515248737e-18 8.40974427246143e-12
chr6_86335150_A_T ENSG00000203875.6 -1.33027793036445 -15.8106562475601 2.66918515248737e-18 8.40974427246143e-12
chr6_86340848_G_C ENSG00000203875.6 -1.33027793036445 -15.8106562475601 2.66918515248737e-18 8.40974427246143e-12
chr6_86347605_C_T ENSG00000203875.6 -1.33027793036445 -15.8106562475601 2.66918515248737e-18 8.40974427246143e-12
chr6_86353621_A_G ENSG00000203875.6 -1.33027793036445 -15.8106562475601 2.66918515248737e-18 8.40974427246143e-12

In this case file is already sorted by the p-values. If not use sort command to sort it by column 5. We are going to head first 10000 lines and pipe it to sort to sort it by second and then fifth column, i.e. by name and then p-value in reverse order. 
Next pipe it to awk and create a hash max with key $2 and values $5 where $5 will overwrite itself with each consecutive uniq $2 (name) using $5>max[$2]. Condition this on $5 being greater than current value for max[$2] to keep the maximum $5, p-value.
Next, when max[$2] is equal to $5 create hash row with key $2 and  value would be whole line, $0.
After END, create for loop that will print row while i cycles through row hash keys.

head -n 10000 pc3.peer8.2.cis.txt | sort -k2,2 -k5,5rg | awk '{if($5>max[$2])$5>max[$2]}{max[$2]=$5; row[$2]=$0} END{for (i in row) print row[i]}' > top.eQTLs


Output would be something like this:

chr10_52516665_C_T ENSG00000204147.5 -0.872804485046258 -7.51465022189187 5.00265034917682e-09 7.10141829780494e-05
chr4_6615617_G_A ENSG00000013288.4 -0.763628877790672 -7.03412616143357 2.21139078977285e-08 0.000236246409916277
chr10_134000422_C_T ENSG00000151640.8 -0.870153140016277 -10.1627004045395 2.17590110963635e-12 5.71663521329993e-08
chr6_31914935_A_G ENSG00000243649.4 -1.45954276289726 -8.03300498011514 1.03119064533423e-09 1.63579986613183e-05
chr11_93564393_C_T ENSG00000182919.10 -0.846058617890931 -7.6040877741813 3.80220782585795e-09 5.47010705454904e-05
chr1_213059806_C_A ENSG00000198468.3 1.0620021274388 13.9721503138632 1.49288298157308e-16 6.64638326845182e-11
chr8_22457206_T_C ENSG00000241852.5 0.813855189958306 6.99530601465363 2.49551796773746e-08 0.000262827134212094
chr16_56671632_A_T ENSG00000169688.10 1.34414988623047 9.50379531720248 1.37766774780468e-11 3.33201334461005e-07
chr19_24236833_T_C ENSG00000268362.1 1.4012063586675 8.05309361348013 9.70512733245545e-10 1.5467337289482e-05
chr2_128146762_G_C ENSG00000236682.1 -1.4679424727015 -9.15709826609346 3.72042826691254e-11 8.60199568544154e-07


Lets check if the top eQTL is in the list:

DN52eii6:Downloads milospjanic$ grep 9.53357865403724e-22 top.eQTLs
chr6_86387888_G_T ENSG00000203875.6 1.31262854934142 19.9689048213817 9.53357865403724e-22 7.80968273626686e-14

Before doing this filtering, it may be good to remove header in eQTL input file with

DN52eii6:Downloads milospjanic$ tail +2 pc3.peer8.2.cis.txt > pc3.peer8.2.cis.txt.noheader