Monday, April 4, 2016

Bash/R combined script for Fisher test of genomic overlaps in a given background

I made a bash script that will invoke R through RScript and will generate a Fisher test p-value that shows significance of the overlap of two sets of genomic regions (for example from ChIP-Seq experiments).

To calculate Fisher p-value you would need three BED files: 1. First bed file to overlap, i.e. file1.bed 2. Second bed file for overlap, i.e. file2.bed 3. Bed file that will serve as genomic background for overlaps, i.e. background.bed

Note that file names do not have to be file1.bed, file2.bed, background.bed, this is just an example.

As genomic background you can use e.g. combined ENCODE set of open chromatin regions or a similar data set. Genomic background is necessary to calculate the constituents of the Fisher test contingency matrix: overlap of A and B in the background BG set, A but not B in BG, B but not A in BG, BG without A and B.

On my github page I provided a combined ENCODE DHS data set in a bed file which you can use as a background set, i.e. background.bed.


#intersecting with bedtools
bedtools intersect -a $1 -b $3 > A_BG
bedtools intersect -a $2 -b $3 > B_BG

bedtools intersect -wa -a A_BG -b B_BG | uniq > A_B_BG
bedtools intersect -wa -a B_BG -b A_BG | uniq > B_A_BG
bedtools intersect -v -a A_BG -b A_B_BG | uniq > A_BG_noB
bedtools intersect -v -a B_BG -b B_A_BG | uniq > B_BG_noA
bedtools intersect -v -a $3 -b A_B_BG | uniq > BG_noA_noB

echo "Number of uniq A overlapping B in genomic background"
wc -l A_B_BG | cut -f1 -d ' '
echo "Number of uniq B overlapping A in genomic background"
wc -l B_A_BG | cut -f1 -d ' '
echo "Number of uniq A not overlapping B in genomic background"
wc -l A_BG_noB | cut -f1 -d ' '
echo "Number of uniq B not overlapping A in genomic background"
wc -l B_BG_noA | cut -f1 -d ' '
echo "Number of genomic background not overlapping A or B"
wc -l BG_noA_noB | cut -f1 -d ' '

#writing and executing R script
echo "#!/usr/bin/Rscript
A_B  <-
matrix(c($(wc -l A_B_BG | cut -f1 -d ' '), $(wc -l A_BG_noB | cut -f1 -d ' '), $(wc -l B_BG_noA | cut -f1 -d ' '), $(wc -l BG_noA_noB | cut -f1 -d ' ')),
       nrow = 2,
       dimnames = list(Guess = c(\"b+\", \"b-\"),
                       Truth = c(\"a+\", \"a-\")))
A_B.fisher <- fisher.test(A_B)
str(A_B.fisher)" > script.r
chmod 775 script.r

#removing intermediary files
rm A_BG
rm B_BG
rm A_B_BG
rm B_A_BG
rm A_BG_noB
rm B_BG_noA
rm BG_noA_noB
rm script.r

No comments:

Post a Comment