Sunday, October 30, 2016

Hierarchical clustering of correlation matrices to find co-regulated genes

If you obtained a correlation matrix for a set of genes and would like to cluster it, use the following R code.

In this example, dissimilarity is calculated using 1 - absolute value of the correlation formula that gives the largest discrimination of the correlated pairs compared to other methods, such as, 1-abs(cor^2), 1-cor, or (1-cor)/2. Ref. http://research.stowers-institute.org/mcm/efg/R/Visualization/cor-cluster/index.htm

library("spatstat")
library(gplots)

#load data matrix
data <-read.delim("data.txt", header=T,row.names=1)
data[is.na(data)] <- 1

#list matrix
data

#use cor to make a correlation matrix
correlation<-cor(data)

#make disimilarity matrix as 1 - absolute value of the correlation
dissimilarity <- 1 - abs(correlation)

#calculate distance
distance <- as.dist(dissimilarity)

#create pdf
pdf("cor.matrix.clustered.pdf")

#plot matrix using heatmap.2 from gplots
heatmap.2(as.matrix(distance))

dev.off()
After plotting the resulting graph will show clusters of correlated genes using hierarchical clustering of heatmap.2, that itself uses hclust R function for clustering:




Sunday, October 9, 2016

Step by step code to extract SNPs in LD that reside in open chromatin regions as defined from combined ENCODE set

This is a step by step code to extract SNPs in LD that reside in open chromatin regions as defined using a combined ENCODE data set.

To create a bed file using the list of input human SNPs you can download latest dbSNP, our link is for version 147. Follow these steps and using awk create a bed file for your input SNP list.

#download dbSNP.147.hg19.bed
wget https://stanfordmedicine.box.com/s/o85v60uzj6tnl0cbrk6lnmkmm6k1csnk

#unzip gz
gunzip dbSNP.147.hg19.bed.gz

#create bed file for input SNPs, using awk and dbSNP.147.hg19.bed
awk 'NR==FNR {h[$1] = $1; next} {if(h[$4]) print$0}' SNP.list dbSNP.147.hg19.bed > SNP.list.bed


Follow these steps to perform LD analysis using Broad SNAP tool, and to select SNPs with D'>8. As SNAP doesn't allow to select for D' use r2>0.2 then awk to select D'>0.8.

#perform SNAP on input bed files using SNAP BRoad web toll, e.g. using less stringent R2 >0.2 threshold, download file, save in current folder

#select SNPs in LD with Dprime >0.8 using awk
awk  -F"\t" '{if ($5>0.8)print $0}' SNAPResults_SNP.list_r2_0.2.txt > SNAPResults.dprime0.8

#cut second column with rsID
cut -f2 SNAPResults.dprime0.8 > SNAPResults.dprime0.8.cut



Create bed file for SNPs in LD D'>0.8 using awk and dbSNP file you downloaded.

#create bed using awk and dbSNP.147.hg19.bed
awk 'NR==FNR {h[$1] = $1; next} {if(h[$4]) print$0}' SNAPResults.dprime0.8.cut  dbSNP.147.hg19.bed > SNAPResults.dprime0.8.bed



Download a file I prepared using DNase-Seq open chromatin regions in 125 ENCODE cell lines by combining open chromatin regions form all cell lines.

#download combined Encode open chromatin dataset
wget https://stanfordmedicine.box.com/s/cgke0v3cuskt8rtklh8jkw5w6fgq1qhh



Intersect ENCODE open chr with bedtools:

#bedtools intersect with combined Encode openchromatin dataset
bedtools intersect -a SNAPResults.dprime0.8.bed -b background.bed -wa > SNAPResults.dprime0.8.overlap.openchr



Bedtools closest to get closest feature to your original list of variants. Remove 0 or -1 that originate for SNPs in input list that overlap open chromatin.

#bedtools closest to grab variants of the original file that are closest to LD SNPs in open chromatin
bedtools closest -d -a SNP.list.bed -b SNAPResults.dprime0.8.overlap.openchr > bedtools.closest.SNP.list.vs.SNAPResults.dprime0.8.overlap.openchr

#get distance column
cut -f13 bedtools.closest.SNP.list.vs.SNAPResults.dprime0.8.overlap.openchr  > bedtools.closest.SNP.list.vs.SNAPResults.dprime0.8.overlap.openchr.distance

#remove 0
sed -i '/^0$/d' bedtools.closest.SNP.list.vs.SNAPResults.r2.0.8.overlap.openchr.distance

#remove -1
sed -i '/^-1$/d' bedtools.closest.SNP.list.vs.SNAPResults.r2.0.8.overlap.openchr.distance


Remove long distances that originate from other SNPs in the original file that are on the same chromosome:

#remove long distances as they probably originate from another input locus on the same chromosome
awk  -F"\t" '{if ($1<100000)print $0}' bedtools.closest.SNP.list.vs.SNAPResults.r2.0.8.overlap.openchr.distance > bedtools.closest.SNP.list.vs.SNAPResults.dprime0.8.overlap.openchr.distance.clear



Plot the number of original SNPs that are not in open chromatin whose LD SNPs can overlap a potentially open chromatin region.

#plot number of loci from input that have LD variant in combined open chromatin from all ENCODE tissues
wc -l bedtools.closest.SNP.list.vs.SNAPResults.dprime0.8.overlap.openchr.distance.clear



Note these are obtained from a combined ENCODE dataset so this number refers to SNPs in LD that may be in open chromatin, whether or not they will be in the open chromatin it depends on the specific cell type of interest and specific conditions that will determine whether that region will be open or not. However all these regions have a potential to be open regions, meaning they are functional in a certain context.



Friday, October 7, 2016

In R how to correctly read column names that have spaces in their name

In R if you try to correctly read tab separated table in R that has header names written with spaces you will end up with dots put instead of spaces.




> data <-read.delim("AP1_CADSNPs_repeatmasker", header=T)

> data
       X Non.masked Repeat.Filter.masked 
 
 
To correctly read spaces use check.names=FALSE option with read.delim


data <-read.delim("AP1_CADSNPs_repeatmasker", header=T, check.names=FALSE)
 
data$`Repeat Filter masked`
  [1] 1.077 1.113 1.095 1.099 1.083 1.104 1.082 1.103 1.114 1.114 1.113 1.107
 [13] 1.104 1.120 1.111 1.104 1.113 1.112 1.119 1.109 1.098 1.100 1.116 1.118
 [25] 1.080 1.097 1.104 1.100 1.094 1.102 1.111 1.096 1.085 1.106 1.153 1.090
 [37] 1.110 1.098 1.123 1.084 1.110 1.101 1.098 1.112 1.062 1.092 1.112 1.096
 [49] 1.119 1.106 1.095 1.116 1.130 1.094 1.123 1.105 1.085 1.102 1.098 1.106
 [61] 1.127 1.119 1.106 1.116 1.109 1.107 1.126 1.085 1.097 1.111 1.096 1.103
 [73] 1.101 1.107 1.100 1.142 1.129 1.104 1.119 1.139 1.126 1.116 1.131 1.120
 [85] 1.157 1.126 1.151 1.152 1.139 1.135 1.164 1.167 1.177 1.159 1.167 1.159
 [97] 1.171 1.184 1.185 1.055 1.191 1.172 1.159 1.174 1.154 1.176 1.158 1.146
[109] 1.174 1.168 1.150 1.151 1.141 1.126 1.171 1.118 1.111 1.131 1.115 1.113
[121] 1.121 1.133 1.107 1.148 1.087 1.119 1.098 1.099 1.120 1.110 1.106 1.128
[133] 1.119 1.132 1.120 1.112 1.114 1.125 1.104 1.115 1.102 1.132 1.099 1.121
[145] 1.105 1.095 1.120 1.107 1.110 1.095 1.114 1.105 1.102 1.113 1.103 1.097
[157] 1.095 1.101 1.105 1.100 1.099 1.093 1.091 1.107 1.077 1.095 1.084 1.097
[169] 1.113 1.084 1.093 1.087 1.096 1.080 1.093 1.074 1.109 1.105 1.087 1.104
[181] 1.128 1.107 1.102 1.103 1.108 1.099 1.104 1.111 1.105 1.116 1.085 1.093
[193] 1.095 1.082 1.103 1.082 1.083 1.090 1.098


Sunday, October 2, 2016

Simple solution to select Position Weight Matrix or ChIP-Seq sites in phase and out of phase

If you have two lists of position weight matrix (PWM) or ChiP-Seq protein binding sites and their distances obtained with bedtools closest and you used options -d or -D to output the distance between the two features:


chr1    13747   13753   TGCGTG  1069    +       chr1    14228   14239   AACAGCTGCCC     1440    +       476
chr1    17712   17718   TGCGTG  1069    -       chr1    14228   14239   AACAGCTGCCC     1440    +       3474
chr1    19157   19163   TGCGTG  1069    +       chr1    14228   14239   AACAGCTGCCC     1440    +       -4919
chr1    22861   22867   TGCGTG  1069    -       chr1    28916   28927   AGCAGCTGCGG     1421    +       -6050
chr1    26478   26484   TGCGTG  1069    +       chr1    28916   28927   AGCAGCTGCGG     1421    +       2433
chr1    32585   32591   TGCGTG  1069    -       chr1    32034   32045   AACAGCTGCAG     1579    +       541
chr1    33596   33602   TGCGTG  1069    +       chr1    32034   32045   AACAGCTGCAG     1579    +       -1552
chr1    36978   36984   TGCGTG  1069    +       chr1    32034   32045   AACAGCTGCAG     1579    +       -4934
chr1    41096   41102   TGCGTG  1069    +       chr1    32034   32045   AACAGCTGCAG     1579    +       -9052
chr1    80948   80954   TGCGTG  1069    +       chr1    32034   32045   AACAGCTGCAG     1579    +       -48904
chr1    87253   87259   TGCGTG  1069    +       chr1    131307  131318  AACAGCTGCCA     1433    -       44049
chr1    89356   89362   TGCGTG  1069    -       chr1    131307  131318  AACAGCTGCCA     1433    -       -41946
chr1    92760   92766   TGCGTG  1069    -       chr1    131307  131318  AACAGCTGCCA     1433    -       -38542
chr1    97039   97045   TGCGTG  1069    +       chr1    131307  131318  AACAGCTGCCA     1433    -       34263
chr1    104193  104199  TGCGTG  1069    -       chr1    131307  131318  AACAGCTGCCA     1433    -       -27109
chr1    106687  106693  TGCGTG  1069    -       chr1    131307  131318  AACAGCTGCCA     1433    -       -24615
chr1    109572  109578  TGCGTG  1069    +       chr1    131307  131318  AACAGCTGCCA     1433    -       21730
chr1    140637  140643  TGCGTG  1069    +       chr1    135896  135907  AACAGCTGGGC     1414    -       -4731
chr1    158167  158173  TGCGTG  1069    -       chr1    149357  149368  AACAGCTGCTA     1493    -       8800
chr1    162890  162896  TGCGTG  1069    -       chr1    172532  172543  AGCAGCTGCTG     1453    -       -9637
chr1    165259  165265  TGCGTG  1069    +       chr1    172532  172543  AGCAGCTGCTG     1453    -       7268
chr1    228669  228675  TGCGTG  1069    +       chr1    172533  172544  AGCAGCTGCTG     1453    +       -56126
chr1    231500  231506  TGCGTG  1069    -       chr1    172533  172544  AGCAGCTGCTG     1453    +       58957
chr1    239088  239094  TGCGTG  1069    -       chr1    172533  172544  AGCAGCTGCTG     1453    +       66545
chr1    243374  243380  TGCGTG  1069    +       chr1    172533  172544  AGCAGCTGCTG     1453    +       -70831
chr1    250508  250514  TGCGTG  1069    -       chr1    327446  327457  AACAGCTGGGC     1414    +       -76933
chr1    253002  253008  TGCGTG  1069    -       chr1    327446  327457  AACAGCTGGGC     1414    +       -74439
chr1    255893  255899  TGCGTG  1069    +       chr1    327446  327457  AACAGCTGGGC     1414    +       71548
chr1    395412  395418  TGCGTG  1069    -       chr1    388540  388551  AACAGCTGCAG     1579    -       6862
chr1    404658  404664  TGCGTG  1069    +       chr1    388540  388551  AACAGCTGCAG     1579    -       -16108
chr1    411852  411858  TGCGTG  1069    -       chr1    388540  388551  AACAGCTGCAG     1579    -       23302
chr1    436821  436827  TGCGTG  1069    -       chr1    441048  441059  GACAGCTGCTG     1542    +       -4222
chr1    436923  436929  TGCGTG  1069    -       chr1    441048  441059  GACAGCTGCTG     1542    +       -4120
chr1    437027  437033  TGCGTG  1069    -       chr1    441048  441059  GACAGCTGCTG     1542    +       -4016
chr1    437127  437133  TGCGTG  1069    +       chr1    441048  441059  GACAGCTGCTG     1542    +       3916
You can use this to select sites which are in phase with respect to the DNA rotational pitch and those that are not in phase, i.e. that are located on the opposite sides of the DNA. In these examples, phased sites are localized no more than 50 bp away. For bedtools closest option -d:

for j in {1..52..10}
do
echo $j
awk -F '\t' '{ if ($13 == '$j') print $0 }' PWM1_PWM2_bedtools_closest > "$j"_distance_even
cut -f 1-3 "$j"_distance_even > "$j"_distance_cut_even
done

cat *_cut_even > even_merge

for j in {5..56..10}
do
echo $j
awk -F '\t' '{ if ($13 == '$j') print $0 }' PWM1_PWM2_bedtools_closest > "$j"_distance_odd
cut -f 1-3 "$j"_distance_odd > "$j"overlapping_stringent_cut_odd
done

cat *_cut_odd > odd_merge
For bedtools closest option -D, that reports upstream and downstream closest features:

for j in {-49..52..10}
do
echo $j
awk -F '\t' '{ if ($13 == '$j') print $0 }' PWM1_PWM2_bedtools_closest > "$j"_distance_even
cut -f 1-3 "$j"_distance_even > "$j"_distance_cut_even
done

cat *_cut_even > even_merge

for j in {-45..56..10}
do
echo $j
awk -F '\t' '{ if ($13 == '$j') print $0 }' PWM1_PWM2_bedtools_closest > "$j"_distance_odd
cut -f 1-3 "$j"_distance_odd > "$j"overlapping_stringent_cut_odd
done

cat *_cut_odd > odd_merge