Tuesday, April 4, 2017

Parsing dbSNP for insertions, single nucleotide polymorphisms and large deletions - awk code

If you download the dbSNP database file in BED format from dbSNP, the UCSC MySQL server or the UCSC Table Browser,

(MySQL command for dbSNP147)

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -D hg19 -e 'SELECT chrom, chromStart, chromEnd, name FROM snp147Common' > snp147Common.bed
you will notice that the coordinates of the variants fall roughly into 3 categories:

1. insertions (chromStart equals chromEnd),
2. SNPs plus simple deletions (chromEnd is chromStart + 1, a single base pair), and
3. large deletions (chromEnd exceeds chromStart by more than 1 base pair).

To separate these three categories, use the following awk code. It first checks whether $2 equals $3 and writes the matching rows to the .insertions file. It then selects the rows where $3 = $2+1 and writes them to .snp.plus.simple.deletions. Finally, it selects the rows that meet neither criterion, that is, rows where $2 and $3 differ by more than 1 bp, and writes them to .large.deletions.


# Parse dbSNP into insertions, SNPs plus simple deletions, and large deletions.
# Each output file is written only if it does not already exist.

# Insertions: chromStart ($2) equals chromEnd ($3)
if [ ! -f snp147Common.bed.insertions ]
then
    awk '$2 == $3 {print $0}' snp147Common.bed > snp147Common.bed.insertions
fi

# SNPs plus simple deletions: chromEnd is exactly chromStart + 1
if [ ! -f snp147Common.bed.snp.plus.simple.deletions ]
then
    awk '$3 == $2+1 {print $0}' snp147Common.bed > snp147Common.bed.snp.plus.simple.deletions
fi

# Large deletions: every row that matches neither test above
if [ ! -f snp147Common.bed.large.deletions ]
then
    awk '$3 != $2+1 && $2 != $3 {print $0}' snp147Common.bed > snp147Common.bed.large.deletions
fi
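
The three awk calls above each read the whole input file. As a minimal sketch, assuming a POSIX-style awk, the same split can probably be done in a single pass by choosing the output file per row. The function name split_bed_by_span and the file demo.bed are hypothetical, and the three demo rows are copied from the listings in this post. Unlike the code above, this version overwrites existing output files instead of skipping them.

```shell
# Hypothetical helper: split a BED file into the three categories in one pass.
# Output names follow the post: <input>.insertions, <input>.snp.plus.simple.deletions,
# <input>.large.deletions
split_bed_by_span() {
    bed="$1"
    awk -v out="$bed" '
        $2 == $3     { print > (out ".insertions"); next }                 # same start and end
        $3 == $2 + 1 { print > (out ".snp.plus.simple.deletions"); next }  # span of 1 bp
                     { print > (out ".large.deletions") }                  # everything else
    ' "$bed"
}

# Small demo file (hypothetical name) with one row per category
printf 'chr1\t10177\t10177\trs367896724\n'  > demo.bed
printf 'chr1\t11007\t11008\trs575272151\n' >> demo.bed
printf 'chr1\t17358\t17361\trs749387668\n' >> demo.bed

split_bed_by_span demo.bed
```

On the full table, the call would be split_bed_by_span snp147Common.bed, which should produce the same three file names as the code above.
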
Check whether the output files satisfy the selection criteria:


mpjanic@zoran:~/chrPos2rsID$ head snp147Common.bed.insertions
chr1 10177 10177 rs367896724
chr1 10352 10352 rs555500075
chr1 13417 13417 rs777038595
chr1 15903 15903 rs557514207
chr1 54712 54712 rs568927205
chr1 91551 91551 rs375085441
chr1 249275 249275 rs200079338
chr1 255923 255923 rs199745078
chr1 363244 363244 rs572571697
chr1 604229 604229 rs556776674
mpjanic@zoran:~/chrPos2rsID$ head snp147Common.bed.snp.plus.simple.deletions
chr1 11007 11008 rs575272151
chr1 11011 11012 rs544419019
chr1 13109 13110 rs540538026
chr1 13115 13116 rs62635286
chr1 13117 13118 rs62028691
chr1 13272 13273 rs531730856
chr1 14463 14464 rs546169444
chr1 14598 14599 rs531646671
chr1 14603 14604 rs541940975
chr1 14672 14673 rs4690
mpjanic@zoran:~/chrPos2rsID$ head snp147Common.bed.large.deletions
chr1 17358 17361 rs749387668
chr1 63735 63738 rs201888535
chr1 66435 66437 rs560481224
chr1 82133 82135 rs550749506
chr1 129010 129013 rs377161483
chr1 267227 267230 rs374780253
chr1 532325 532327 rs577455319
chr1 612688 612691 rs201365517
chr1 691567 691571 rs566250387
chr1 701779 701783 rs201234755
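
head shows only the first ten rows of each file. A rough way to check a whole file is to tally the difference $3 - $2 over all rows. The sketch below assumes a POSIX-style awk; the function name span_tally and the file demo.large.bed are hypothetical, and the demo rows are copied from the large deletions listing above.

```shell
# Hypothetical helper: count rows per span (chromEnd - chromStart), sorted by span
span_tally() {
    awk '{ n[$3 - $2]++ } END { for (d in n) print d, n[d] }' "$1" | sort -n
}

# Demo on three rows from the large deletions listing (spans 3, 2 and 4)
printf 'chr1\t17358\t17361\trs749387668\n'   > demo.large.bed
printf 'chr1\t66435\t66437\trs560481224\n'  >> demo.large.bed
printf 'chr1\t691567\t691571\trs566250387\n' >> demo.large.bed

span_tally demo.large.bed
# prints:
# 2 1
# 3 1
# 4 1
```

If the split worked as intended, snp147Common.bed.insertions should give a single line for span 0, snp147Common.bed.snp.plus.simple.deletions a single line for span 1, and snp147Common.bed.large.deletions only spans of 2 or more.
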