Integrative functional genomics identifies regulatory mechanisms at coronary artery disease loci.
Clint L. Miller, Milos Pjanic, Ting Wang, Trieu Nguyen, Ariella Cohain, Jonathan D. Lee, Ljubica Perisic, Ulf Hedin, Ramendra K. Kundu, Deshna Majmudar, Juyong B. Kim, Oliver Wang, Christer Betsholtz, Arno Ruusalepp, Oscar Franzén, Themistocles L. Assimes, Stephen B. Montgomery, Eric E. Schadt, Johan L.M. Björkegren & Thomas Quertermous. Nature Communications. 2016 Jul 8;7:12092
We show the importance of functional genomics, i.e. the use of large amounts of data from various genomics and transcriptomics assays, for prioritizing variants (SNPs and indels) and deciphering which of them are the causal GWAS variants associated with the disease (from among the bulk of lead and LD SNPs and indels obtained in a GWAS experiment).
We prioritize coronary artery disease (CAD) GWAS variants based on their overlap with multiple genomic datasets: ATAC-Seq open chromatin, ChIP-Seq for H3K27ac, and ChIP-Seq for the transcription factors TCF21, which is itself a GWAS hit for CAD, and JUN and FOS, which compose the AP1 transcription factor. After defining a set of 64 such hypothetically causal variants from 5,240 lead plus LD variants for CAD, we tested 7 SNPs/indels and show evidence that supports their causality.
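The overlap-based prioritization described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline used in the paper: the variant names, coordinates, and peak intervals are made up, the helper names (build_index, overlaps, prioritize) are hypothetical, and the rule "keep variants supported by at least min_tracks annotation tracks" is an assumption chosen for the example.

```python
def build_index(peaks):
    """Group (chrom, start, end) peaks by chromosome."""
    index = {}
    for chrom, start, end in peaks:
        index.setdefault(chrom, []).append((start, end))
    return index


def overlaps(index, chrom, pos):
    """True if position pos falls inside any half-open peak [start, end)."""
    return any(start <= pos < end for start, end in index.get(chrom, []))


def prioritize(variants, tracks, min_tracks):
    """Keep variants that overlap at least min_tracks annotation tracks."""
    kept = []
    for name, chrom, pos in variants:
        supporting = [t for t, idx in tracks.items() if overlaps(idx, chrom, pos)]
        if len(supporting) >= min_tracks:
            kept.append((name, supporting))
    return kept


# Toy annotation tracks (made-up coordinates, for illustration only).
tracks = {
    "ATAC": build_index([("chr1", 100, 200), ("chr2", 50, 80)]),
    "H3K27ac": build_index([("chr1", 150, 400)]),
    "TCF21": build_index([("chr1", 180, 220), ("chr2", 60, 70)]),
}

# Toy lead/LD variants as (name, chromosome, position).
variants = [
    ("var1", "chr1", 190),
    ("var2", "chr1", 300),
    ("var3", "chr2", 65),
    ("var4", "chr3", 10),
]

candidates = prioritize(variants, tracks, min_tracks=2)
for name, supporting in candidates:
    print(name, ",".join(supporting))
```

For real peak files a dedicated interval tool would be more efficient than the linear scan used here; the scan is kept only to make the logic easy to follow.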
An interesting aspect of the approach is that we took the transcription factor TCF21, which is itself a CAD GWAS causal gene, performed ChIP-Seq in a disease-relevant cell type, defined genome-wide binding sites for TCF21, and used those sites to define causal variants at other CAD GWAS loci, an approach that could be valid for other GWAS phenotypes as well. We show evidence that TCF21 binding is enriched near CAD GWAS loci. Therefore, for any type of GWAS/eQTL study that suffers from a lack of causal inference, it could be possible to define a master regulator transcription factor from the GWAS/eQTL data itself and then infer causal variants at other loci by performing ChIP-Seq (or, if an antibody is not available, by using position weight matrices, PWMs).
The model for defining causal variants is circular, i.e. it starts from GWAS to define a master regulator TF and returns to GWAS at the end:
GWAS >> TF master regulator >> ChIP-Seq (PWM) >> overlap with GWAS variants >> causal variants
This model presumes the existence of a master regulator, such as TCF21 for CAD. However, it is equally possible that no such TF exists for a given GWAS phenotype, in which case this approach won't give good results, or that multiple master regulator TFs exist, in which case one would have to perform multiple ChIP-Seq experiments. Nevertheless, focusing on only one TF may still be enough to capture a proportion of the causal variants, as shown in our paper.
In the integrative functional genomics approach to defining causality in GWAS/QTL data, the main bottleneck is deciding whether a TF is a master regulator or not. The approach we used was to combine a plethora of data from various experiments to define TCF21 as critical for CAD, but the main experiment was to define PWM sites for the TF in the vicinity of GWAS variants. A compilation of PWMs from databases such as JASPAR would be a good choice. If, after a statistical test for genomic overlaps (e.g. Fisher's exact test), the sites of a TF are enriched near GWAS SNPs compared to the sites of other PWMs, one could infer the importance of that particular TF for the GWAS phenotype and proceed to prioritize variants in the next step.
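As a rough sketch of the enrichment test mentioned above, the following Python code computes a one-sided Fisher's exact test on a 2x2 table of variant counts using only the standard library. The function name fisher_exact_greater and all counts are hypothetical and chosen purely for illustration; how "near a PWM site" is defined and which background variant set is used are assumptions that would have to be fixed for a real analysis.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for enrichment on a 2x2 table.

    a: GWAS variants near a PWM site       b: GWAS variants not near a site
    c: background variants near a site     d: background variants not near a site

    Returns P(X >= a) under the hypergeometric null (fixed margins).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)
    k_max = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    tail = sum(comb(row1, k) * comb(row2, col1 - k) for k in range(a, k_max + 1))
    return tail / total


# Hypothetical counts: 20 of 100 GWAS variants and 50 of 1000 background
# variants lie near a predicted site for the TF of interest.
p_value = fisher_exact_greater(20, 80, 50, 950)
print(f"one-sided p-value: {p_value:.3g}")
```

When the same test is repeated for every PWM in a compilation, the resulting p-values would need a multiple-testing correction before TFs are compared.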