Wednesday, June 7, 2017

Awk code for converting gene symbols to Ensembl IDs


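Simple awk code to convert gene symbols into Ensembl IDs using a conversion table. While reading the conversion table (the first file, where NR==FNR), awk stores the first field (the Ensembl ID) in a hash keyed by the second field (the gene symbol): h[$2] = $1. For each line of the gene list (the second file), the if statement if(h[$1]) checks whether that symbol has an entry and, if so, prints h[$1]. Symbols with no entry in the table are skipped silently.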

head id_merge_ens_genename.upper.txt
ENSMUSG00000095309 VMN1R125
ENSMUSG00000000126 WNT9A
ENSMUSG00000086196 GM13571
ENSMUSG00000054418 2900041M22RIK
ENSMUSG00000095268 GM2913
ENSMUSG00000082399 GM14036
ENSMUSG00000097090 GM26724
ENSMUSG00000020063 SIRT1
ENSMUSG00000029623 PDAP1
ENSMUSG00000073944 OLFR619

head housekeeping.genes.conversion.to.mouse.gene.names
DAG1
PPIH
RBX1
LAMTOR4
MFSD12
ARPC1A
NDUFA9
COPZ1
ACTR10
DNAJA2

awk 'NR==FNR {h[$2] = $1; next} {if(h[$1]) print h[$1]}' id_merge_ens_genename.upper.txt housekeeping.genes.conversion.to.mouse.gene.names | head
ENSMUSG00000039952
ENSMUSG00000060288
ENSMUSG00000022400
ENSMUSG00000050552
ENSMUSG00000034854
ENSMUSG00000029621
ENSMUSG00000000399
ENSMUSG00000060992
ENSMUSG00000021076
ENSMUSG00000031701
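
Because the one-liner prints only the IDs and drops unmatched symbols, the output cannot be lined up against the input list. A possible variant is sketched below, assuming the same two-column, whitespace-separated conversion table. The function name symbols_to_ensembl and the "NA" placeholder are illustrative choices, not part of the original command. It keeps one output line per input symbol and uses awk's "in" operator, which tests for a key without creating an empty entry in the hash.

```shell
# symbols_to_ensembl is a hypothetical helper name used only for this sketch.
# Usage: symbols_to_ensembl CONVERSION_TABLE SYMBOL_LIST
symbols_to_ensembl() {
    awk '
        # First file: build the lookup, gene symbol -> Ensembl ID.
        NR == FNR { h[$2] = $1; next }
        # Second file: print the symbol and its ID, or "NA" (an assumed
        # placeholder) when the symbol is missing from the table.
        { print $1 "\t" (($1 in h) ? h[$1] : "NA") }
    ' "$1" "$2"
}
```

With the files from this post, it would be called as: symbols_to_ensembl id_merge_ens_genename.upper.txt housekeeping.genes.conversion.to.mouse.gene.names | head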

1 comment:

  1. Thank you so much for providing such information. I know a couple of friends of mine who were looking for this, so I'll forward this to them. Keep up the good work.
