Tuesday, November 17, 2015

Filter dbSNP vcf with a list of SNP IDs

If you want to get vcf file with a specific set of SNPs, e.g. lets say that you have a list of SNP IDs in one file and a complete dbSNP vcf set in another file, you can use awk scripting to quickly obtain a vcf file with your set of SNPs.

File with SNP IDs, named CAD_SNP_positions.bed

chr1    2162944 2162945 rs590367
chr1    2163568 2163569 rs263533
chr1    2164116 2164117 rs263532
chr1    2164699 2164700 rs377599
...

Vcf file with dbSNP variants, named dbSNP-common_all.vcf.gz
...
##INFO=<ID=COMMON,Number=1,Type=Integer,Description="RS is a common SNP.  A common SNP is one that has at least one 1000Genomes population with a minor allele of frequency >= 1% and for which 2 or more founders contribute to that minor allele frequency.">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       10177   rs367896724     A       AC      .       .       RS=367896724;RSPOS=10177;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000020005140026000200;WGT=1;VC=DIV;R5;ASP;VLD;KGPhase3;CAF=0.5747,0.4253;COMMON=1
1       10352   rs145072688     T       TA      .       .       RS=145072688;RSPOS=10353;dbSNPBuildID=134;SSR=0;SAO=0;VP=0x050000020005000002000200;WGT=1;VC=DIV;R5;ASP;CAF=0.5625,0.4375;COMMON=1
...

Do the following:

mpjanic@valkyr:~/vcf_subset_selection_script$ grep "#" <(gzip -dc dbSNP-common_all.vcf.gz) > CAD_SNP_common_all.vcf

mpjanic@valkyr:~/vcf_subset_selection_script$ awk 'NR==FNR {h[$4] = $4; next} {if(h[$3]) print$0}' CAD_SNP_positions.bed <(gzip -dc dbSNP-common_all.vcf.gz) >> CAD_SNP_common_all.vcf


Or in case you want to gzip it immediately pipe it to gzip:

mpjanic@valkyr:~/vcf_subset_selection_script$ grep "#" <(gzip -dc dbSNP-common_all.vcf.gz) | gzip > CAD_SNP_common_all.vcf

mpjanic@valkyr:~/vcf_subset_selection_script$ awk 'NR==FNR {h[$4] = $4; next} {if(h[$3]) print$0}' CAD_SNP_positions.bed <(gzip -dc dbSNP-common_all.vcf.gz) | gzip >> CAD_SNP_common_all.vcf





No comments:

Post a Comment