Friday, September 30, 2016

Solution to messy gene tables - compare two file contents using grep

If you have a gene list and want to compare it to another gene list you can use awk and create a hash table reading first file that you will use while reading second file for comparison with a certain column of the second file, as discussed previously:

http://www.genomicscode.org/2015/07/finding-differences-in-two-files-on.html

Code:

awk 'NR==FNR {h[$1] = $1; next} {if(!h[$4]) print$0}' file1 file2
However, if you have a second file with a messy structure you would want to scan a complete file without focussing on a individual column of the second file. For, example here, I have a gene list that I want to compare with the list I created from BioGRID, that contains gene names, alternative gene names and other info. Clearly the first awk code would not work in this case:

Miloss-MacBook-Air:test milospjanic$ cat test
SUMO
TANGO
AHR
ARNT
TEST
Miloss-MacBook-Air:test milospjanic$ head -n 20 ahr.biogrid
Displaying 41 total unique interactors
Sort By: [Evidence] [Alphabetical]
10
[details]
AIP
| ARA9, FKBP16, FKBP37, SMTPHN, XAP-2, XAP2
aryl hydrocarbon receptor interacting protein
UBI
9
[details]
ARNT
| HIF-1-beta, HIF-1beta, HIF1-beta, HIF1B, HIF1BETA, TANGO, bHLHe2
aryl hydrocarbon receptor nuclear translocator
UBI
8
[details]
RB1
| RP11-174I10.1, OSRC, PPP1R130, RB, p105-Rb, pRb, pp110
retinoblastoma 1
UBISUMO
Instead, easy solution is to use grep with -Fwf to comprehensively search the second file and find lines that contain gene name from the first file. In this case all lines from file 2 that contain gene name from file 1 will be grepped.

Miloss-MacBook-Air:test milospjanic$ grep -Fwf test ahr.biogrid 
ARNT
| HIF-1-beta, HIF-1beta, HIF1-beta, HIF1B, HIF1BETA, TANGO, bHLHe2
SUMO
AHR
SUMO

Sunday, September 18, 2016

Simple solutions of filtering by sum and repeats with awk scripting

In a file lets says you need to eliminate rows that have a certain sum or that have certain number repeated N times, use awk. 

Example content of file.name:
cat file.name
0 0 0 0
0 1 1 1
0 3 4 9
5 5 5 5
5 5 5 5
0 0 0 0
1 1 3 4


1. To filter out lines that have all 0s.

Create variable sum, let i go from 1 to NF, and add 0 or 1 to sum depending on the awk ternary operator (?:).  With $i to mark field content, add to sum 0 or 1, using sum += $i ? 1 : 0. In brief, if $i (i.e if $i!=0) use 1 for sum+=, if false (i.e. if $i=0) use 0 for sum+=.

In case sum=0, the only time this would happen is if every field is 0, thus remove those fields with if (sum!=0) print.


awk '{sum=0; for (i=1; i<=NF; i++){sum += $i ? 1 : 0} if (sum!=0) print}' file.name
Example:
awk '{sum=0; for (i=1; i<=NF; i++){sum += $i ? 1 : 0} if (sum!=0) print}' file.name
0 1 1 1
0 3 4 9
5 5 5 5
5 5 5 5
1 1 3 4


2. To filter out lines that do not contain 0

To filter out only those lines that have non 0 numbers repeated N times (for example 4) substitute 0 with 4 in if (sum!=0) print. This code will basically keep only those lines that contain at least one 0. 

awk '{sum=0; for (i=1; i<=NF; i++){sum += $i ? 1 : 0} if (sum!=4) print}' file.name
Example:
awk '{sum=0; for (i=1; i<=NF; i++){sum += $i ? 1 : 0} if (sum!=4) print}' file.name
0 0 0 0
0 1 1 1
0 3 4 9
0 0 0 0

To filter out those lines that have non-zero number repeated 3 times use:


awk '{sum=0; for (i=1; i<=NF; i++){sum += $i ? 1 : 0} if (sum!=3) print}' file.name
0 0 0 0
5 5 5 5
5 5 5 5
0 0 0 0
1 1 3 4


To filter out lines that have non-zero number repeated 3 or 4 times use:

awk '{sum=0; for (i=1; i<=NF; i++){sum += $i ? 1 : 0} if (sum<3) print}' file.name
0 0 0 0
0 0 0 0


3. To filter out lines that do not contain certain number

Modification of this code is easy, i.e. to keep lines that contain at least one number 3, change the condition in ternary operator (?:) to $i==3:


awk '{sum=0; for (i=1; i<=NF; i++){sum += $i==3 ? 1 : 0} if (sum!=0) print}' file.name
Example:
awk '{sum=0; for (i=1; i<=NF; i++){sum += $i==3 ? 1 : 0} if (sum!=0) print}' file.name
0 3 4 9
1 1 3 4


4. To filter out lines that contain certain number


awk '{sum=0; for (i=1; i<=NF; i++){sum += $i==3 ? 1 : 0} if (sum==0) print}' file.name
Example:
awk '{sum=0; for (i=1; i<=NF; i++){sum += $i==3 ? 1 : 0} if (sum==0) print}' file.name
0 0 0 0
0 1 1 1
5 5 5 5
5 5 5 5
0 0 0 0


5. To filter out lines that contain ceratin number repeated N times

For example to keep only lines that contain 1 repeated 3 times:

awk '{sum=0; for (i=1; i<=NF; i++){sum += $i==1 ? 1 : 0} if (sum==3) print}' file.name
Example:
awk '{sum=0; for (i=1; i<=NF; i++){sum += $i==1 ? 1 : 0} if (sum==3) print}' file.name
0 1 1 1


6. To filter out lines that do not contain number repeated N times

For example to remove only lines that contain 1 repeated 3 times:

awk '{sum=0; for (i=1; i<=NF; i++){sum += $i==1 ? 1 : 0} if (sum!=3) print}' file.name
Example:
awk '{sum=0; for (i=1; i<=NF; i++){sum += $i==1 ? 1 : 0} if (sum!=3) print}' file.name
0 0 0 0
0 3 4 9
5 5 5 5
5 5 5 5
0 0 0 0
1 1 3 4


7. To count how many times certain number is repeated in each row

You can modify this code to count how many times in a row you have repeated 0, or any other number. To find out how many times 0 is in fields of each row:


awk '{sum=0; for (i=1; i<=NF; i++){sum += $i==0 ? 1 : 0} print sum}' file.name
Example:
awk '{sum=0; for (i=1; i<=NF; i++){sum += $i==0 ? 1 : 0} print sum}' file.name 
4
1
1
0
0
4
0

To find out how many times 5 is repeated in each row:

awk '{sum=0; for (i=1; i<=NF; i++){sum += $i==5 ? 1 : 0} print sum}' file.name
0
0
0
4
4
0
0


8. To count how many times certain combination of numbers is repeated in each row

You can modify this code to count how many times in a row you have repeated two or more numbers, for example 1 or 5. Change the code so that if $i is either 1 or 5 sum will be increased by 1.





awk '{sum=0; for (i=1; i<=NF; i++){sum += $i==1||$i==5 ? 1 : 0} print sum}' file.name
Example:
awk '{sum=0; for (i=1; i<=NF; i++){sum += $i==1||$i==5 ? 1 : 0} print sum}' file.name
0
3
0
4
4
0
2

How to remove all files except one in unix using find

In some case you need to delete all files in current folder except for a file of interest. You can do this with find command by adding ! to specify that you don't want a match with a file of that particular name.


find . ! -name 'file.name' -type f -exec rm -f {} +

Check the contents of the folder:

DN52eik9:testttt milospjanic$ ls -ltrh
total 0
-rw-r--r--  1 milospjanic  staff     0B Sep 18 11:54 file.name
-rw-r--r--  1 milospjanic  staff     0B Sep 18 11:54 file2.name
-rw-r--r--  1 milospjanic  staff     0B Sep 18 11:54 file3.name
DN52eik9:testttt milospjanic$ find . ! -name 'file.name' -type f -exec rm -f {} +
DN52eik9:testttt milospjanic$ ls -ltrh
total 0
-rw-r--r--  1 milospjanic  staff     0B Sep 18 11:54 file.name

Friday, September 2, 2016

Functional genomics as integrative approach applied in GWAS to define causal variants

In our recent paper,

Integrative functional genomics identifies regulatory mechanisms at coronary artery disease loci. 
Clint L. Miller, Milos Pjanic, Ting Wang, Trieu Nguyen, Ariella Cohain, Jonathan D. Lee, Ljubica Perisic, Ulf Hedin, Ramendra K. Kundu, Deshna Majmudar, Juyong B. Kim, Oliver Wang, Christer Betsholtz, Arno Ruusalepp, Oscar Franzén, Themistocles L. Assimes, Stephen B. Montgomery, Eric E. Schadt, Johan L.M. Björkegren & Thomas Quertermous. Nature Communications. 2016 Jul 8;7:12092

we show the importance of functional genomics, i.e. the use of vast amount of big data from various genomics and transcriptiomics asseys, to prioritize variants (SNPs and indels) to decipher which are the causal GWAS variants associated to the disease (from the bulk of lead and LD SNPs and indels obtained in a GWAS experiment).

http://www.nature.com/articles/ncomms12092

We prioritize coronary artery disease (CAD) GWAS variants based on overlaps with multiple genomics data, ATAC-Seq open chromatin, ChIP-Seq for H3K27ac, and ChIP-Seq for transcription factors, TCF21, that is itself a GWAS hit for CAD, and, JUN and FOS, that compose AP1 transcription factor. By defining a set of 64 such hypothetically causal variants from 5,240 lead plus LD variants for CAD, we tested 7 SNPs/indels and show evidence that supports their causality.

Interesting approach here is that we used transcription factor TCF21, that itself is a CAD GWAS causal gene, performed ChIP-Seq in disease relevant cell type, defined genome wide binding sites for TCF21, and used those to define causal variants for other CAD GWAS loci, an approach that could be valid for other GWAS phenotypes as well. We show evidence that TCF21 is enriched near CAD GWAS loci and therefore, for any type of GWAS/eQTL study that suffers from a lack of causal inference, by defining a master regulator transcription factor from GWAS/eQTL data itself, it could be possible to infer causal variants at other loci, by performing ChIP-Seq (or if the antibody is not available using position weight matrices, PWM).

Model for defining causal variants is circular, i.e. starting from GWAS to define TF master regulator to return to GWAS at the end:

GWAS >> TF master regulator >>  ChIP-Seq (PWM) >> overlap with GWAS variants >> causal variants

This model presumes the existence of a master regulator, such as TCF21 for CAD, however it is equally possible that no such TF exists for a tested GWAS phenotype, in which case this approach wont show good results, or it is possible that multiple master regulator TFs exist in which case one would have to perform multiple ChIP-Seq experiments. Nevertheless, focusing on only one TF may still be enough to capture a proportion of the causal variants, as shown in our paper.

In the integrative functional genomics approach to define causality in GWAS/QTL data the main bottleneck would be a way to define whether a TF is a master regulator or not. Approach that we used is to combine a plethora of data from various experiment to define TCF21 as critical for CAD, but the main experiment was to define PWM sites for the TF in the vicinity of GWAS variants. A compilation of PWMs from the databases such as JASPAR would be a good choice. In case sites are enriched compared to other PWMs, after performing statistical tests e.g. Fisher exact test for genomic overlaps, one could infer if sites are indeed enriched near GWAS SNPs, and infer the importance of that particular TF for the GWAS phenotype, and proceed to prioritize variants in the next step.