Sunday, October 9, 2016

Step by step code to extract SNPs in LD that reside in open chromatin regions as defined from combined ENCODE set

This is a step by step code to extract SNPs in LD that reside in open chromatin regions as defined using a combined ENCODE data set.

To create a bed file using the list of input human SNPs you can download latest dbSNP, our link is for version 147. Follow these steps and using awk create a bed file for your input SNP list.

#download dbSNP.147.hg19.bed

#unzip gz
gunzip dbSNP.147.hg19.bed.gz

#create bed file for input SNPs, using awk and dbSNP.147.hg19.bed
awk 'NR==FNR {h[$1] = $1; next} {if(h[$4]) print$0}' SNP.list dbSNP.147.hg19.bed > SNP.list.bed

Follow these steps to perform LD analysis using Broad SNAP tool, and to select SNPs with D'>8. As SNAP doesn't allow to select for D' use r2>0.2 then awk to select D'>0.8.

#perform SNAP on input bed files using SNAP BRoad web toll, e.g. using less stringent R2 >0.2 threshold, download file, save in current folder

#select SNPs in LD with Dprime >0.8 using awk
awk  -F"\t" '{if ($5>0.8)print $0}' SNAPResults_SNP.list_r2_0.2.txt > SNAPResults.dprime0.8

#cut second column with rsID
cut -f2 SNAPResults.dprime0.8 > SNAPResults.dprime0.8.cut

Create bed file for SNPs in LD D'>0.8 using awk and dbSNP file you downloaded.

#create bed using awk and dbSNP.147.hg19.bed
awk 'NR==FNR {h[$1] = $1; next} {if(h[$4]) print$0}' SNAPResults.dprime0.8.cut  dbSNP.147.hg19.bed > SNAPResults.dprime0.8.bed

Download a file I prepared using DNase-Seq open chromatin regions in 125 ENCODE cell lines by combining open chromatin regions form all cell lines.

#download combined Encode open chromatin dataset

Intersect ENCODE open chr with bedtools:

#bedtools intersect with combined Encode openchromatin dataset
bedtools intersect -a SNAPResults.dprime0.8.bed -b background.bed -wa > SNAPResults.dprime0.8.overlap.openchr

Bedtools closest to get closest feature to your original list of variants. Remove 0 or -1 that originate for SNPs in input list that overlap open chromatin.

#bedtools closest to grab variants of the original file that are closest to LD SNPs in open chromatin
bedtools closest -d -a SNP.list.bed -b SNAPResults.dprime0.8.overlap.openchr > bedtools.closest.SNP.list.vs.SNAPResults.dprime0.8.overlap.openchr

#get distance column
cut -f13 bedtools.closest.SNP.list.vs.SNAPResults.dprime0.8.overlap.openchr  > bedtools.closest.SNP.list.vs.SNAPResults.dprime0.8.overlap.openchr.distance

#remove 0
sed -i '/^0$/d' bedtools.closest.SNP.list.vs.SNAPResults.r2.0.8.overlap.openchr.distance

#remove -1
sed -i '/^-1$/d' bedtools.closest.SNP.list.vs.SNAPResults.r2.0.8.overlap.openchr.distance

Remove long distances that originate from other SNPs in the original file that are on the same chromosome:

#remove long distances as they probably originate from another input locus on the same chromosome
awk  -F"\t" '{if ($1<100000)print $0}' bedtools.closest.SNP.list.vs.SNAPResults.r2.0.8.overlap.openchr.distance > bedtools.closest.SNP.list.vs.SNAPResults.dprime0.8.overlap.openchr.distance.clear

Plot the number of original SNPs that are not in open chromatin whose LD SNPs can overlap a potentially open chromatin region.

#plot number of loci from input that have LD variant in combined open chromatin from all ENCODE tissues
wc -l bedtools.closest.SNP.list.vs.SNAPResults.dprime0.8.overlap.openchr.distance.clear

Note these are obtained from a combined ENCODE dataset so this number refers to SNPs in LD that may be in open chromatin, whether or not they will be in the open chromatin it depends on the specific cell type of interest and specific conditions that will determine whether that region will be open or not. However all these regions have a potential to be open regions, meaning they are functional in a certain context.

No comments:

Post a Comment