Wednesday, June 7, 2017

Awk code for converting gene symbols to Ensembl IDs


Simple awk code to convert gene symbols into Ensembl IDs, using a conversion table. Use awk to place first field of the conversion table into hash with second field as the key h[$2] = $1, then use if statement if(h[$1]) to plot h[$1].

head id_merge_ens_genename.upper.txt
ENSMUSG00000095309 VMN1R125
ENSMUSG00000000126 WNT9A
ENSMUSG00000086196 GM13571
ENSMUSG00000054418 2900041M22RIK
ENSMUSG00000095268 GM2913
ENSMUSG00000082399 GM14036
ENSMUSG00000097090 GM26724
ENSMUSG00000020063 SIRT1
ENSMUSG00000029623 PDAP1
ENSMUSG00000073944 OLFR619
head housekeeping.genes.conversion.to.mouse.gene.names
DAG1
PPIH
RBX1
LAMTOR4
MFSD12
ARPC1A
NDUFA9
COPZ1
ACTR10
DNAJA2
awk 'NR==FNR {h[$2] = $1; next} {if(h[$1]) print h[$1]}' id_merge_ens_genename.upper.txt housekeeping.genes.conversion.to.mouse.gene.names | head
ENSMUSG00000039952
ENSMUSG00000060288
ENSMUSG00000022400
ENSMUSG00000050552
ENSMUSG00000034854
ENSMUSG00000029621
ENSMUSG00000000399
ENSMUSG00000060992
ENSMUSG00000021076
ENSMUSG00000031701

Awk code to subtract list of housekeeping genes from the mastertable

If you have a list of housekeeping genes that you want to subtract from the mastertable (for example if you don't want to include housekeeping genes in DE analysis), use the awk code bellow to place the gene list into a hash h[$1], then use it to subtract with if(!h[$1]):


head housekeeping.genes.conversion.to.mouse.gene.names.ens
ENSMUSG00000039952
ENSMUSG00000060288
ENSMUSG00000022400
ENSMUSG00000050552
ENSMUSG00000034854
ENSMUSG00000029621
ENSMUSG00000000399
ENSMUSG00000060992
ENSMUSG00000021076
ENSMUSG00000031701
head mastertable.RPKM
  control_macs ribo_heart1 ribo_heart2 ribo_kidney1 ribo_kidney2 ribo_liver1 ribo_liver2 tcf21cre_macs
ENSMUSG00000053783 0 0 0 0 0 0 0 0
ENSMUSG00000078607 14.3315290267182 5.27148998470254 9.47657223005189 14.1782105971938 25.9570809377262 12.5352263065328 3.49215569538464 8.12296582986335
ENSMUSG00000021900 65.6250210929769 22.948792387045 60.228922019752 38.7788513986779 52.1120026703478 57.297708533105 19.665464723166 62.9006962103502
ENSMUSG00000021901 63.5325181369401 29.4644784214023 66.5475638175278 26.4470153993863 29.9656600887317 19.5248137747344 9.16367902027884 29.6891300712868
ENSMUSG00000081820 0 0 0 0 0 0 0.111648246991865 0
ENSMUSG00000021902 5.34723506150612 5.33759154872005 9.07451378908043 8.48829107065946 9.13791262030437 7.06216010001703 2.64534314243628 3.86463200913539
ENSMUSG00000021903 15.8545088541307 9.74915150330365 22.1922681668968 0.63286285488678 0.530720326421349 3.22314506289442 3.27454460225776 11.8394474075886
ENSMUSG00000081821 0 0 0 0 0 0 0 0
ENSMUSG00000019710 26.4100432213795 21.5133687933026 30.3889090935622 30.3215943263374 28.0822793064558 19.990967660423 28.4085170598329 24.6477111636308
awk 'NR==FNR {h[$1] = $1; next} {if(!h[$1]) print $0}' housekeeping.genes.conversion.to.mouse.gene.names.ens mastertable.RPKM | head
  control_macs ribo_heart1 ribo_heart2 ribo_kidney1 ribo_kidney2 ribo_liver1 ribo_liver2 tcf21cre_macs
ENSMUSG00000053783 0 0 0 0 0 0 0 0
ENSMUSG00000078607 14.3315290267182 5.27148998470254 9.47657223005189 14.1782105971938 25.9570809377262 12.5352263065328 3.49215569538464 8.12296582986335
ENSMUSG00000081820 0 0 0 0 0 0 0.111648246991865 0
ENSMUSG00000021902 5.34723506150612 5.33759154872005 9.07451378908043 8.48829107065946 9.13791262030437 7.06216010001703 2.64534314243628 3.86463200913539
ENSMUSG00000021903 15.8545088541307 9.74915150330365 22.1922681668968 0.63286285488678 0.530720326421349 3.22314506289442 3.27454460225776 11.8394474075886
ENSMUSG00000081821 0 0 0 0 0 0 0 0
ENSMUSG00000080500 0 0 0 0 0 0 0 0
ENSMUSG00000021904 3.74710427285173 1.01565097626204 69.4171627480469 13.6072031971553 38.4037682639844 0.161290198636122 0.0409393307612302 0.0734611101052428
ENSMUSG00000081822 0.173365270484494 2.52854077807611 0 0 0 0 0.107017674535981 0

Sunday, May 14, 2017

Unique on one column and keep maximum value from another column in Unix

If you have a file, e.g.

head id_merge_ens_length.txt
ENSMUSG00000095309 924
ENSMUSG00000000126 3318
ENSMUSG00000000126 3320
ENSMUSG00000086196 1381
ENSMUSG00000086196 2127
ENSMUSG00000054418 1784
ENSMUSG00000095268 1960
ENSMUSG00000082399 480
ENSMUSG00000097090 1650
ENSMUSG00000020063 3353
And want to unique on first column while keeping the maximum on second column, first use sort to sort on column 1 followed by reverse numerical sort on column 2:

sort -k1,1 -k2,2nr file.txt | head
ENSMUSG00000000001 3262
ENSMUSG00000000003 902
ENSMUSG00000000003 697
ENSMUSG00000000028 2143
ENSMUSG00000000028 1747
ENSMUSG00000000028 832
ENSMUSG00000000031 2286
ENSMUSG00000000031 1853
ENSMUSG00000000031 935
ENSMUSG00000000031 817
Now, to select the column with maximum value for each unique entry from column one, use awk. Use awk '!a[$1]++' as a short version of awk '{if(! a[$1]){print; a[$1]++}}' meaning if the current first field ($1) is not present in the a array, which will happen if it is a unique field, print the line and add the 1st field to the a array. Next time we see the same field, it will be already in the array and it will not be printed due to the condition if(!a[$1]).

sort -k1,1 -k2,2nr file.txt | awk '!a[$1]++' | head
ENSMUSG00000000001 3262
ENSMUSG00000000003 902
ENSMUSG00000000028 2143
ENSMUSG00000000031 2286
ENSMUSG00000000037 4847
ENSMUSG00000000049 1190
ENSMUSG00000000056 4395
ENSMUSG00000000058 2733
ENSMUSG00000000078 4217
ENSMUSG00000000085 3544

Thursday, May 11, 2017

Defining and calculating genome coverage, depth and breadth of coverage

Average estimation of the whole genome coverage in a sequencing assay (i.e. depth of coverage) is calculated with the formula:


coverage = (read count * read length ) / total genome size
Similarly you can calculate the estimation of the coverage for each gene or gene locus:


coverage of the gene= (gene read count * read length ) / gene size 
In reality of the sequencing experiment, distribution of the mapped reads is uneven and in additon not all portions of the reads will map to the genome, therefore for each nucleotide you will calculate specific read coverage (per base read coverage), for example using bedtools genomecov:

bedtools genomecov -ibam file.bam -g my.genome -d | head

chr1  6  0
chr1  7  0
chr1  8  0
chr1  9  0
chr1  10 0
chr1  11 1
chr1  12 1
chr1  13 1
chr1  14 1
chr1  15 1
On the other side, breadth of coverage is a different term than depth of coverage and relates to the proportion of the genome that is covered with reads. Both breadth and depth of coverage correlate to the sequencing depth i.e. the number of reads generated in the sequencing experiment.

For details see publication:

Wednesday, April 26, 2017

Awk code for counting isoforms in abundance.tsv files from Kallisto

If you have a series of abundance.tsv from Kallisto RNA-Seq quantification tool separated in different folder use this command to count protein coding isoforms in each output file

sudo find . -name abundance.tsv -exec sh -c "echo {}; grep protein_coding {} | wc -l"  \;
./1522_1hr_TNF_2/abundance.tsv
79795
./1522_6hr_TNF_2/abundance.tsv
79795
./1522_6hr_TGFB_2/abundance.tsv
79795
./1522_1hr_PMA_1/abundance.tsv
79795
./2989_6hr_TNF_2/abundance.tsv
79795
./2989_6hr_SF_1/abundance.tsv

...
To save the list of protein coding isoforms in each separate file use:

sudo find . -name abundance.tsv -exec sh -c "grep protein_coding {} > {}.protein_coding"  \;

To count isoforms that are not expressed use awk code that counts rows with est_counts==0

find . -type f -name abundance.tsv -exec sh -c 'awk "\$4==0{print \$0}" "{}" | wc -l' \;
108088
109840
113075
108092
119730
106363
110323
119521
117195
117358
98931

...
Similarly to count isoforms that are expressed use:

find . -type f -name abundance.tsv -exec sh -c 'awk "\$4!=0{print \$0}" "{}" | wc -l' \;
90532
88780
85545
90528
78890
92257
88297
79099
81425
81262
99689
91844
89495
...
List their sample names contained in the folder names:

find . -type f -name abundance.tsv.protein_coding -exec sh -c 'echo $(basename $(dirname {}))' \;
1522_1hr_TNF_2
1522_6hr_TNF_2
1522_6hr_TGFB_2
1522_1hr_PMA_1
2989_6hr_TNF_2
2989_6hr_SF_1
1522_6hr_SF_2
2989_6hr_TNF_1
1522_1hr_TGFB_2
1522_1hr_TGFB_1
1522_1hr_PDGF_1
1522_6hr_PDGF_2
1522_6hr_TNF_1
2989_1hr_PDGF_1
...
 Make folders with their sample name:

find /home/mpjanic/HCASMC_RNASeq/ -type f -name abundance.tsv.protein_coding -exec sh -c 'mkdir $(basename $(dirname {}))' \;
Find out how many isoforms are expressed within the subgroup of protein coding isoforms:

find . -type f -name abundance.tsv.protein_coding -exec sh -c 'awk "\$4!=0{print \$0}" "{}" | wc -l' \;
43139
42014
41017
43618
38208
43021
42031
38464
40069
40246
46087
43028
42478

...
Save in each folder files with isoforms that are expressed within the subgroup of protein coding isoforms

find . -type f -name abundance.tsv.protein_coding -exec sh -c 'awk "\$4!=0{print \$0}" "{}" > {}.expressed ' \;

Tuesday, April 4, 2017

Parsing dbSNP for insertions, single nucleotide polymorphisms and large deletions - awk code

If you download dbSNP database file in bed format using dBSNP, MySQL or UCSC Table browser,

(MySQL command for dbSNP147)

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -D hg19 -e 'SELECT chrom, chromStart, chromEnd, name FROM snp147Common' > snp147Common.bed
you will notice that coordinates of variants can be divided roughly into 3 categories:

1. insertions (same base pair coordinates), 
2. SNPs plus simple deletions (single base pair coordinates), 
and 
3. large deletions (more than 1 base pair difference in the coordinates).

To parse and separate these three categories, use the following awk code, checking if $2 equals $3 and placing the rows that fall into this category into .insertions file; then selecting those rows where $3=$2=1 and placing them in .snp.plus.simple.deletions; and finally selecting those rows that do not fall into the previous two selection criteria, that then have $2 and $3 separated with more than 1 bp difference.


#parse dbSNPs into insertions, SNPs and simple deletions, large deletions

if [ ! -f snp147Common.bed.insertions ]
then
awk '$2 == $3 {print $0}' snp147Common.bed > snp147Common.bed.insertions
fi

if [ ! -f snp147Common.bed.snp.plus.simple.deletions ]
then
awk '$3 == $2+1 {print $0}' snp147Common.bed > snp147Common.bed.snp.plus.simple.deletions
fi

if [ ! -f snp147Common.bed.large.deletions ]
then
awk '{if ($3 != $2+1 && $2 != $3) print $0}' snp147Common.bed > snp147Common.bed.large.deletions
fi
Check the output files whether they satisfy the criteria for selection:


mpjanic@zoran:~/chrPos2rsID$ head snp147Common.bed.insertions
chr1 10177 10177 rs367896724
chr1 10352 10352 rs555500075
chr1 13417 13417 rs777038595
chr1 15903 15903 rs557514207
chr1 54712 54712 rs568927205
chr1 91551 91551 rs375085441
chr1 249275 249275 rs200079338
chr1 255923 255923 rs199745078
chr1 363244 363244 rs572571697
chr1 604229 604229 rs556776674
mpjanic@zoran:~/chrPos2rsID$ head snp147Common.bed.snp.plus.simple.deletions
chr1 11007 11008 rs575272151
chr1 11011 11012 rs544419019
chr1 13109 13110 rs540538026
chr1 13115 13116 rs62635286
chr1 13117 13118 rs62028691
chr1 13272 13273 rs531730856
chr1 14463 14464 rs546169444
chr1 14598 14599 rs531646671
chr1 14603 14604 rs541940975
chr1 14672 14673 rs4690
mpjanic@zoran:~/chrPos2rsID$ head snp147Common.bed.large.deletions
chr1 17358 17361 rs749387668
chr1 63735 63738 rs201888535
chr1 66435 66437 rs560481224
chr1 82133 82135 rs550749506
chr1 129010 129013 rs377161483
chr1 267227 267230 rs374780253
chr1 532325 532327 rs577455319
chr1 612688 612691 rs201365517
chr1 691567 691571 rs566250387
chr1 701779 701783 rs201234755

Wednesday, March 8, 2017

Collapsing transcript expression levels into gene levels by sum, maximum and average - awk code

In case you have generated a file containing gene, transcript (isoform), expression level,

DN0a272187:~ milospjanic$ cat file.txt
GENE1,ISOFORM1,4
GENE1,ISOFORM2,6
GENE1,ISOFORM3,9
GENE2,ISOFORM1,43
GENE2,ISOFORM2,16
GENE3,ISOFORM3,19
GENE3,ISOFORM1,43
GENE3,ISOFORM2,4
GENE4,ISOFORM1,21
Use this awk code to collapse this file and sum on the gene level. Sum all $3 in hash with $1 as a key, then loop through the array and print keys and values.

awk -F, '{array[$1]+=$3} END { for (i in array) {print i"," array[i]}}' file.txt
GENE1,19
GENE2,59
GENE3,66
GENE4,21
To collapse the isoform file and select the top isoform file use this awk code. Use hash max with the key $1 and value $3. If $3 is greater than max change max to new $3 value and put $3 in array. Next, loop through the array and print keys and values. 

awk -F, '{if($3>max[$1]){array[$1]=$3; max[$1]=$3}} END { for (i in array) {print i"," array[i]}}' file.txt
GENE1,9
GENE2,43
GENE3,43
GENE4,21
To plot average use the following code. Create array hash with $1 as keys that sums $3 values, and 'no' hash with $1 as key that increases by 1. Loop through the array and print keys and division of hash 'array' and 'no', which is a sum divided by number of transcripts.

awk -F, '{array[$1]+=$3; no[$1]+=1} END { for (i in array) {print i"," array[i]/no[i]}}'  file.txt
GENE1,6.33333
GENE2,29.5
GENE3,22
GENE4,21