Monday, December 22, 2014

Collapsing snp-pair association table in GWAS-ChIPSeq comparison

If you have a file with SNP pairs associations (e.g. SNPs from GWAS studies and SNPs found within chip-seq peaks) and you want to select those that meet the specified linkage disequilibrium threshold for example r2>0.7, you would need to filter the table which could be done with awk:

http://milospjanic.blogspot.com/2014/09/delete-all-rows-with-column-n-contains.html

However, even with this selection we are still overestimating the association between SNPs, since e.g. SNPs from GWAS may be in high LD with each other and each of these will associate with high LD to a single SNP from chip-seq peaks. Conversely, the SNPs from chip-seq peak/peaks may be in high LD with each other and correlate with high LD to a single GWAS SNP.

This may lead to overestimation of the association between GWAS phenotypes and transcription factor probed with chip-seq.

To collapse the table and keep only a single r2 value (e.g. maximum) for a single chip-seq SNP and a single GWAS SNP use the following code below.

In this example we will use the file r2_from_hapmap_1x56_peak_gwas_ld_li_eur that has SNP pairs from GWAS catalog and chip-seq experiment for a transcription factor.

Lets read the beginning of the file first.

head r2_from_hapmap_1x56_peak_gwas_ld_li_eur.txt
38356410        38461119        CEU     rs13132853      rs10004195      0.751   0.126   2.26    383     Helicobacter pylori serologic status
38461119        38621564        CEU     rs10004195      rs1060582       0.66    0.033   0.58    384     Helicobacter pylori serologic status
38461119        38621612        CEU     rs10004195      rs10011235      1.0     0.023   0.21    384     Helicobacter pylori serologic status
142225023       142416549       CEU     rs10007052      rs354834        0.082   0.0020  0.04    1422    Chronic obstructive pulmonary disease-related biomarkers
61257852        61411881        CEU     rs2013326       rs1000778       0.121   0.015   0.27    612     Sphingolipid levels
61411881        61446981        CEU     rs1000778       rs12420625      1.0     0.0040  0.14    614     Sphingolipid levels
61411881        61447283        CEU     rs1000778       rs741887        0.34    0.049   0.78    614     Sphingolipid levels
61411881        61447477        CEU     rs1000778       rs7927548       0.356   0.0070  0.15    614     Sphingolipid levels
61411881        61479221        CEU     rs1000778       rs1109748       0.162   0.0010  0.01    614     Sphingolipid levels
61411881        61479414        CEU     rs1000778       rs195165        0.139   0.0     0.01    614     Sphingolipid levels

The 4th column has the SNP id for GWAS SNPs and 5th column has SNP id for chip-seq SNPs. 6th column contains r2 values for each column i.e. each SNP pair.

The following R code will collapse and save the file with either collapsed one or the other snp column (test2, test3) or both (test4). It is necessary to transform 6th column to numeric with transform(test, V6 = as.numeric(as.character(V6))), then aggregate will do the collapsing, but it will keep only columns 4 and 6, merge will merge the aggregated data with the original data to obtain all the original columns. 

test <- read.delim("r2_from_hapmap_1x56_peak_gwas_ld_li_eur.txt", header=F)
test <- test[,1:10]
test <- transform(test, V6 = as.numeric(as.character(V6)))
test2<-merge(aggregate(V6~V4, data=test, max), test, all.x=T)
write.table(file="GWAS_collapse_2", test2, sep="\t", quote=F)

test <- read.delim("r2_from_hapmap_1x56_peak_gwas_ld_li_eur.txt", header=F)
test <- test[,1:10]
test <- transform(test, V6 = as.numeric(as.character(V6)))
test3<-merge(aggregate(V6~V5, data=test, max), test, all.x=T)
write.table(file="chip-seq_snps_collapse_2", test3, sep="\t", quote=F)

test4<-merge(aggregate(V6~V5, data=test2, max), test2, all.x=T)
write.table(file="ALL_collapse_2", test4, sep="\t", quote=F)

Lets save the beginning of the file into the test file as an example of culling the data.

head r2_from_hapmap_1x56_peak_gwas_ld_li_eur.txt > r2_from_hapmap_1x56_peak_gwas_ld_li_eur.txt_test

test <- read.delim("r2_from_hapmap_1x56_peak_gwas_ld_li_eur.txt_test", header=F)
test <- test[,1:10]
test
          V1        V2  V3         V4         V5    V6    V7   V8   V9
1   38356410  38461119 CEU rs13132853 rs10004195 0.751 0.126 2.26  383
2   38461119  38621564 CEU rs10004195  rs1060582 0.660 0.033 0.58  384
3   38461119  38621612 CEU rs10004195 rs10011235 1.000 0.023 0.21  384
4  142225023 142416549 CEU rs10007052   rs354834 0.082 0.002 0.04 1422
5   61257852  61411881 CEU  rs2013326  rs1000778 0.121 0.015 0.27  612
6   61411881  61446981 CEU  rs1000778 rs12420625 1.000 0.004 0.14  614
7   61411881  61447283 CEU  rs1000778   rs741887 0.340 0.049 0.78  614
8   61411881  61447477 CEU  rs1000778  rs7927548 0.356 0.007 0.15  614
9   61411881  61479221 CEU  rs1000778  rs1109748 0.162 0.001 0.01  614
10  61411881  61479414 CEU  rs1000778   rs195165 0.139 0.000 0.01  614
                                                        V10
1                      Helicobacter pylori serologic status
2                      Helicobacter pylori serologic status
3                      Helicobacter pylori serologic status
4  Chronic obstructive pulmonary disease-related biomarkers
5                                       Sphingolipid levels
6                                       Sphingolipid levels
7                                       Sphingolipid levels
8                                       Sphingolipid levels
9                                       Sphingolipid levels
10                                      Sphingolipid levels

test <- transform(test, V6 = as.numeric(as.character(V6)))
test2<-merge(aggregate(V6~V4, data=test, max), test, all.x=T)
test2
          V4    V6        V1        V2  V3         V5    V7   V8   V9
1 rs10004195 1.000  38461119  38621612 CEU rs10011235 0.023 0.21  384
2 rs10007052 0.082 142225023 142416549 CEU   rs354834 0.002 0.04 1422
3  rs1000778 1.000  61411881  61446981 CEU rs12420625 0.004 0.14  614
4 rs13132853 0.751  38356410  38461119 CEU rs10004195 0.126 2.26  383
5  rs2013326 0.121  61257852  61411881 CEU  rs1000778 0.015 0.27  612
                                                       V10
1                     Helicobacter pylori serologic status
2 Chronic obstructive pulmonary disease-related biomarkers
3                                      Sphingolipid levels
4                     Helicobacter pylori serologic status
5                                      Sphingolipid levels

What we see here is that we collapsed the data and kept only unique SNP identifiers in 4th column. SNP rs1000778 was culled from 5 to one entry and snp rs1000778 was culled from 2 to one entry and only pairs with maximum r2 values were kept.

Now in this code we have overlooked one important thing that will give rise to multiple pairs with same SNP id in the output. What is happening if there are multiple pairs from one GWAS SNP (or one chip-seq SNP) that have the same r2 value? They will all be kept even though aggregate function will keep only single V4-V6 pair, but after performing merge with the original data, each V4-V6 pair will find multiple corresponding rows if r2 value is the same.

For example, lets grep one snp and save output to a new file and repeat the steps from above:

grep "rs12418204" r2_from_hapmap_1x56_peak_gwas_ld_li_eur.txt
71927771        72074657        CEU     rs12418204      rs11824205      1.0     0.0060  0.23    719     Optic disc size (cup)
71927771        72074848        CEU     rs12418204      rs2291288       1.0     0.0     0.02    719     Optic disc size (cup)
71927771        72091837        CEU     rs12418204      rs3765105       0.461   0.01    0.18    719     Optic disc size (cup)
71927771        72092977        CEU     rs12418204      rs2306615       1.0     0.0080  0.31    719     Optic disc size (cup)
71927771        72124606        CEU     rs12418204      rs12808507      0.171   0.0020  0.04    719     Optic disc size (cup)

grep "rs12418204" r2_from_hapmap_1x56_peak_gwas_ld_li_eur.txt > r2_from_hapmap_1x56_peak_gwas_ld_li_eur.txt_test2

test <- read.delim("r2_from_hapmap_1x56_peak_gwas_ld_li_eur.txt_test2", header=F)
test <- test[,1:10]
test
        V1       V2  V3         V4         V5    V6    V7   V8  V9
1 71927771 72074657 CEU rs12418204 rs11824205 1.000 0.006 0.23 719
2 71927771 72074848 CEU rs12418204  rs2291288 1.000 0.000 0.02 719
3 71927771 72091837 CEU rs12418204  rs3765105 0.461 0.010 0.18 719
4 71927771 72092977 CEU rs12418204  rs2306615 1.000 0.008 0.31 719
5 71927771 72124606 CEU rs12418204 rs12808507 0.171 0.002 0.04 719
                    V10
1 Optic disc size (cup)
2 Optic disc size (cup)
3 Optic disc size (cup)
4 Optic disc size (cup)
5 Optic disc size (cup)

test <- transform(test, V6 = as.numeric(as.character(V6)))
test2<-merge(aggregate(V6~V4, data=test, max), test, all.x=T)
test2
          V4 V6       V1       V2  V3         V5    V7   V8  V9
1 rs12418204  1 71927771 72074657 CEU rs11824205 0.006 0.23 719
2 rs12418204  1 71927771 72074848 CEU  rs2291288 0.000 0.02 719
3 rs12418204  1 71927771 72092977 CEU  rs2306615 0.008 0.31 719
                    V10
1 Optic disc size (cup)
2 Optic disc size (cup)

3 Optic disc size (cup)

What we can see is that all of pairs with maximum r2 value of 1 were kept (3 rows).

To keep only one of them (i.e. the first entry) we could do quick collapse of the output files in unix with sort -u 
Write output as a file in R:

write.table(file="test_file", test2, sep="\t", quote=F)

In bash write:

grep -v "V1" test_file > test_file_removeheader
sort -u -k2,2 test_file_removeheader > test_file_sortuniq
cat test_file_sortuniq

1       rs12418204      1       71927771        72074657        CEU     rs11824205      0.006   0.23    719     Optic disc size (cup)

So to finish the original data set do the same thing with ALL_collapse_2 file:

grep -v "V1" ALL_collapse_2 > ALL_collapse_2_removeheader
sort -u -k2,2 ALL_collapse_2_removeheader > ALL_collapse_2_sortuniq

No comments:

Post a Comment