Tuesday, August 29, 2017

GTExExtractor, a script to parse and plot individual level GTEx data

This is the script I wrote you may find useful in case you are working with GTEx data.

https://github.com/milospjanic/GTExExtractor/

As GTEx currently can only show the box plot distribution for a single gene across tissues, I have made a script (named it GTExExtractor) to show the distribution of multiple genes in a single tissue in a form of more informative violin plot. You just have to type the genes you want and the tissue. You can use it to show how certain members within a gene family are expressed higher in certain tissues.

For example, NFIX is more dominant in the class of NFI transcription factors in coronary arteries, while in the skeletal muscle it is NFIC, and in lung NFIB becomes more expressed.






You can see other examples in the link.

Thursday, August 10, 2017

Extract columns by matching column names from file

If you want to select columns using column names that are read from another file use this code. For example, input file is input.txt and column names to be extracted are in columns.txt.

First save this awk code as extractor.sh, it will read the first row and iterate through the fields to find those that match the first argument $1. After it will save the matched row number as col_num and print it out. For other rows, NR>1, it will print out the field content only if i=col_num.

awk -v COLUMN_ID=$1 '
        NR==1 {
                for (i=1; i<=NF; i++) {
                        if ($i==COLUMN_ID) {
                                col_num=i;
                                print $i;
                        }
                }
        }
        NR>1 {
                if (i=col_num) {
                        print $i;
                }
        }
' $2
Then run this code, it will read the columns.txt file line by line, save it in each loop into variable $line, then it runs the extractor.sh script with $line as $1 and input.txt as $2. It then saves output for each $line as a temp file. Then use paste to connect all .temp files to create the final table.

while IFS= read -r line; 
do ./extractor.sh $line path/to/file/input.txt > $line.temp; 

done < path/to/file/columns.txt 

paste *.temp > output.txt

Friday, July 7, 2017

Copy files using regular expressions

If you need to copy or move only a portion of your files based on some rule for their name, use regular expressions.

For example, these are all fastqc.zip files in a folder:

ls | grep fastqc.zip$
071_2_E7_24h_TAG_fastqc.zip
071_2_E7_72h_TAG_fastqc.zip
071_2_E8_24h_TAG_fastqc.zip
071_2_E8_72h_TAG_fastqc.zip
334_1_E7_24h_TAG_fastqc.zip
334_1_E7_72h_TAG_fastqc.zip
334_1_E8_24h_TAG_fastqc.zip
334_1_E8_72h_TAG_fastqc.zip
756_3_E7_24h_TAG_fastqc.zip
756_3_E7_72h_TAG_fastqc.zip
756_3_E8_24h_TAG_fastqc.zip
756_3_E8_72h_TAG_fastqc.zip
835_1_E7_24h_TAG_fastqc.zip
835_1_E7_72h_TAG_fastqc.zip
835_1_E8_24h_TAG_fastqc.zip
835_1_E8_72h_TAG_fastqc.zip
H1_E7_24h_TAG_fastqc.zip
H1_E7_72h_TAG_fastqc.zip
H1_E8_24h_TAG_fastqc.zip
H1_E8_72h_TAG_fastqc.zip
H7_E7_24h_TAG_fastqc.zip
H7_E7_72h_TAG_fastqc.zip
H7_E8_24h_TAG_fastqc.zip
H7_E8_72h_TAG_fastqc.zip
If you need to find files with specific number, lets say those with _1_ or _2_ in the name of the file, use the following regex:


ls | grep .*_[1-2]_.*_fastqc.zip$
071_2_E7_24h_TAG_fastqc.zip
071_2_E7_72h_TAG_fastqc.zip
071_2_E8_24h_TAG_fastqc.zip
071_2_E8_72h_TAG_fastqc.zip
334_1_E7_24h_TAG_fastqc.zip
334_1_E7_72h_TAG_fastqc.zip
334_1_E8_24h_TAG_fastqc.zip
334_1_E8_72h_TAG_fastqc.zip
835_1_E7_24h_TAG_fastqc.zip
835_1_E7_72h_TAG_fastqc.zip
835_1_E8_24h_TAG_fastqc.zip
835_1_E8_72h_TAG_fastqc.zip
Now simply to copy those files capture the output of the command:

cp $(ls | grep .*_[1-2]_.*_fastqc.zip$) test/
cd test/
ls -ltrh
total 10648
-rw-r--r--  1 milospjanic  staff   443K Jul  7 01:45 835_1_E7_72h_TAG_fastqc.zip
-rw-r--r--  1 milospjanic  staff   439K Jul  7 01:45 835_1_E7_24h_TAG_fastqc.zip
-rw-r--r--  1 milospjanic  staff   438K Jul  7 01:45 334_1_E8_72h_TAG_fastqc.zip
-rw-r--r--  1 milospjanic  staff   443K Jul  7 01:45 334_1_E8_24h_TAG_fastqc.zip
-rw-r--r--  1 milospjanic  staff   446K Jul  7 01:45 334_1_E7_72h_TAG_fastqc.zip
-rw-r--r--  1 milospjanic  staff   436K Jul  7 01:45 334_1_E7_24h_TAG_fastqc.zip
-rw-r--r--  1 milospjanic  staff   441K Jul  7 01:45 071_2_E8_72h_TAG_fastqc.zip
-rw-r--r--  1 milospjanic  staff   441K Jul  7 01:45 071_2_E8_24h_TAG_fastqc.zip
-rw-r--r--  1 milospjanic  staff   444K Jul  7 01:45 071_2_E7_72h_TAG_fastqc.zip
-rw-r--r--  1 milospjanic  staff   438K Jul  7 01:45 071_2_E7_24h_TAG_fastqc.zip
-rw-r--r--  1 milospjanic  staff   447K Jul  7 01:45 835_1_E8_72h_TAG_fastqc.zip
-rw-r--r--  1 milospjanic  staff   443K Jul  7 01:45 835_1_E8_24h_TAG_fastqc.zip

Wednesday, July 5, 2017

Grep extract before and after a line

If you need to extract lines before and after a line with a string you search for using grep use -B flag for before and -A for after to get lines wanted:

grep -A3 -B2 1998 libtool 
# NOTE: Changes made to this file will be lost: look at ltmain.sh.
#
# Copyright (C) 1996, 1997, 1998, 1999, 2000, 2001
# Free Software Foundation, Inc.
#
# This file is part of GNU Libtool:
If you need same number of lines before and after use -C flag

grep -C3 1998 libtool 
# Generated automatically by  (GNU  )
# NOTE: Changes made to this file will be lost: look at ltmain.sh.
#
# Copyright (C) 1996, 1997, 1998, 1999, 2000, 2001
# Free Software Foundation, Inc.
#
# This file is part of GNU Libtool:

Wednesday, June 7, 2017

Awk code for converting gene symbols to Ensembl IDs


Simple awk code to convert gene symbols into Ensembl IDs, using a conversion table. Use awk to place first field of the conversion table into hash with second field as the key h[$2] = $1, then use if statement if(h[$1]) to plot h[$1].

head id_merge_ens_genename.upper.txt
ENSMUSG00000095309 VMN1R125
ENSMUSG00000000126 WNT9A
ENSMUSG00000086196 GM13571
ENSMUSG00000054418 2900041M22RIK
ENSMUSG00000095268 GM2913
ENSMUSG00000082399 GM14036
ENSMUSG00000097090 GM26724
ENSMUSG00000020063 SIRT1
ENSMUSG00000029623 PDAP1
ENSMUSG00000073944 OLFR619
head housekeeping.genes.conversion.to.mouse.gene.names
DAG1
PPIH
RBX1
LAMTOR4
MFSD12
ARPC1A
NDUFA9
COPZ1
ACTR10
DNAJA2
awk 'NR==FNR {h[$2] = $1; next} {if(h[$1]) print h[$1]}' id_merge_ens_genename.upper.txt housekeeping.genes.conversion.to.mouse.gene.names | head
ENSMUSG00000039952
ENSMUSG00000060288
ENSMUSG00000022400
ENSMUSG00000050552
ENSMUSG00000034854
ENSMUSG00000029621
ENSMUSG00000000399
ENSMUSG00000060992
ENSMUSG00000021076
ENSMUSG00000031701

Awk code to subtract list of housekeeping genes from the mastertable

If you have a list of housekeeping genes that you want to subtract from the mastertable (for example if you don't want to include housekeeping genes in DE analysis), use the awk code bellow to place the gene list into a hash h[$1], then use it to subtract with if(!h[$1]):


head housekeeping.genes.conversion.to.mouse.gene.names.ens
ENSMUSG00000039952
ENSMUSG00000060288
ENSMUSG00000022400
ENSMUSG00000050552
ENSMUSG00000034854
ENSMUSG00000029621
ENSMUSG00000000399
ENSMUSG00000060992
ENSMUSG00000021076
ENSMUSG00000031701
head mastertable.RPKM
  control_macs ribo_heart1 ribo_heart2 ribo_kidney1 ribo_kidney2 ribo_liver1 ribo_liver2 tcf21cre_macs
ENSMUSG00000053783 0 0 0 0 0 0 0 0
ENSMUSG00000078607 14.3315290267182 5.27148998470254 9.47657223005189 14.1782105971938 25.9570809377262 12.5352263065328 3.49215569538464 8.12296582986335
ENSMUSG00000021900 65.6250210929769 22.948792387045 60.228922019752 38.7788513986779 52.1120026703478 57.297708533105 19.665464723166 62.9006962103502
ENSMUSG00000021901 63.5325181369401 29.4644784214023 66.5475638175278 26.4470153993863 29.9656600887317 19.5248137747344 9.16367902027884 29.6891300712868
ENSMUSG00000081820 0 0 0 0 0 0 0.111648246991865 0
ENSMUSG00000021902 5.34723506150612 5.33759154872005 9.07451378908043 8.48829107065946 9.13791262030437 7.06216010001703 2.64534314243628 3.86463200913539
ENSMUSG00000021903 15.8545088541307 9.74915150330365 22.1922681668968 0.63286285488678 0.530720326421349 3.22314506289442 3.27454460225776 11.8394474075886
ENSMUSG00000081821 0 0 0 0 0 0 0 0
ENSMUSG00000019710 26.4100432213795 21.5133687933026 30.3889090935622 30.3215943263374 28.0822793064558 19.990967660423 28.4085170598329 24.6477111636308
awk 'NR==FNR {h[$1] = $1; next} {if(!h[$1]) print $0}' housekeeping.genes.conversion.to.mouse.gene.names.ens mastertable.RPKM | head
  control_macs ribo_heart1 ribo_heart2 ribo_kidney1 ribo_kidney2 ribo_liver1 ribo_liver2 tcf21cre_macs
ENSMUSG00000053783 0 0 0 0 0 0 0 0
ENSMUSG00000078607 14.3315290267182 5.27148998470254 9.47657223005189 14.1782105971938 25.9570809377262 12.5352263065328 3.49215569538464 8.12296582986335
ENSMUSG00000081820 0 0 0 0 0 0 0.111648246991865 0
ENSMUSG00000021902 5.34723506150612 5.33759154872005 9.07451378908043 8.48829107065946 9.13791262030437 7.06216010001703 2.64534314243628 3.86463200913539
ENSMUSG00000021903 15.8545088541307 9.74915150330365 22.1922681668968 0.63286285488678 0.530720326421349 3.22314506289442 3.27454460225776 11.8394474075886
ENSMUSG00000081821 0 0 0 0 0 0 0 0
ENSMUSG00000080500 0 0 0 0 0 0 0 0
ENSMUSG00000021904 3.74710427285173 1.01565097626204 69.4171627480469 13.6072031971553 38.4037682639844 0.161290198636122 0.0409393307612302 0.0734611101052428
ENSMUSG00000081822 0.173365270484494 2.52854077807611 0 0 0 0 0.107017674535981 0

Sunday, May 14, 2017

Unique on one column and keep maximum value from another column in Unix

If you have a file, e.g.

head id_merge_ens_length.txt
ENSMUSG00000095309 924
ENSMUSG00000000126 3318
ENSMUSG00000000126 3320
ENSMUSG00000086196 1381
ENSMUSG00000086196 2127
ENSMUSG00000054418 1784
ENSMUSG00000095268 1960
ENSMUSG00000082399 480
ENSMUSG00000097090 1650
ENSMUSG00000020063 3353
And want to unique on first column while keeping the maximum on second column, first use sort to sort on column 1 followed by reverse numerical sort on column 2:

sort -k1,1 -k2,2nr file.txt | head
ENSMUSG00000000001 3262
ENSMUSG00000000003 902
ENSMUSG00000000003 697
ENSMUSG00000000028 2143
ENSMUSG00000000028 1747
ENSMUSG00000000028 832
ENSMUSG00000000031 2286
ENSMUSG00000000031 1853
ENSMUSG00000000031 935
ENSMUSG00000000031 817
Now, to select the column with maximum value for each unique entry from column one, use awk. Use awk '!a[$1]++' as a short version of awk '{if(! a[$1]){print; a[$1]++}}' meaning if the current first field ($1) is not present in the a array, which will happen if it is a unique field, print the line and add the 1st field to the a array. Next time we see the same field, it will be already in the array and it will not be printed due to the condition if(!a[$1]).

sort -k1,1 -k2,2nr file.txt | awk '!a[$1]++' | head
ENSMUSG00000000001 3262
ENSMUSG00000000003 902
ENSMUSG00000000028 2143
ENSMUSG00000000031 2286
ENSMUSG00000000037 4847
ENSMUSG00000000049 1190
ENSMUSG00000000056 4395
ENSMUSG00000000058 2733
ENSMUSG00000000078 4217
ENSMUSG00000000085 3544