Tuesday, April 4, 2017
Parsing dbSNP for insertions, single nucleotide polymorphisms and large deletions - awk code
If you download a dbSNP database file in BED format from dbSNP, via MySQL, or through the UCSC Table Browser,
(MySQL command for dbSNP147)
you will notice that the variant coordinates fall roughly into 3 categories:
1. insertions (start and end coordinates are the same),
2. SNPs plus simple deletions (start and end coordinates differ by exactly 1 base pair),
3. large deletions (more than 1 base pair difference in the coordinates).
To parse and separate these three categories, use the following awk code. It first checks whether $2 equals $3 and writes the matching rows to the .insertions file. It then selects the rows where $3 - $2 = 1 and writes them to .snp.plus.simple.deletions. Finally, the rows that meet neither criterion, which therefore have $2 and $3 more than 1 bp apart, are collected as the large deletions.
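The three-way split described above can be sketched as follows. This is a minimal sketch rather than the exact code from the original post: the function name split_dbsnp_bed, the .large.deletions suffix and the skipping of header lines are assumptions added for illustration, while the other two suffixes come from the text. It assumes whitespace-separated BED columns with the start in $2 and the end in $3.

```shell
# Split a dbSNP BED file into three files according to the span end - start.
# Assumed layout: $1 = chromosome, $2 = start, $3 = end (BED columns).
# The ".large.deletions" suffix is an assumed name; the text names only the other two.
split_dbsnp_bed() {
    awk -v prefix="$1" '
        # Skip comment, track and browser header lines, if present (an added safeguard)
        /^(#|track|browser)/ { next }
        # Insertions: start and end are identical
        $2 == $3     { print > (prefix ".insertions"); next }
        # SNPs plus simple deletions: exactly 1 bp between start and end
        $3 - $2 == 1 { print > (prefix ".snp.plus.simple.deletions"); next }
        # Everything else: start and end more than 1 bp apart
                     { print > (prefix ".large.deletions") }
    ' "$1"
}
```

For a file called snp147.bed (a hypothetical name), running split_dbsnp_bed snp147.bed would write snp147.bed.insertions, snp147.bed.snp.plus.simple.deletions and snp147.bed.large.deletions next to the input. An output file is only created if at least one row falls into its category.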
Check that the output files satisfy the selection criteria:
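One way to run this check is to count, for each output file, the rows that break its selection rule. The sketch below is an illustration, not the original post's commands: the function name check_dbsnp_split and the .large.deletions suffix are assumptions, and it expects all three files to exist under a common prefix.

```shell
# Count rows in each output file that violate the rule used to select them.
# A correct split prints 0 on every line.
# The ".large.deletions" suffix is an assumed name for the third file.
check_dbsnp_split() {
    printf 'insertions with start != end: '
    awk '$2 != $3 { bad++ } END { print bad + 0 }' "$1.insertions"
    printf 'SNPs/simple deletions with end - start != 1: '
    awk '$3 - $2 != 1 { bad++ } END { print bad + 0 }' "$1.snp.plus.simple.deletions"
    printf 'large deletions with end - start <= 1: '
    awk '$3 - $2 <= 1 { bad++ } END { print bad + 0 }' "$1.large.deletions"
}
```

Calling check_dbsnp_split snp147.bed (again a hypothetical file name) prints one count per file. Any non-zero count points to rows that were written to the wrong category.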