Monday, April 11, 2016

Binomial test for genomic overlaps of two bed files

Final script I wrote in a series - genomic overlaps - performs binomial statistical test using two bed files with genomic coordinates. This is a combined bash/R script that will generate a binomial p-value that shows significance for the overlap of two sets of genomic regions in hg19 human genome (for example from ChIP-Seq experiments).

To calculate binomial p-value you would need two bed files as inputs. Note that binomial test does not require background set as do Fisher and hypergeometric tests.

Note that file names are optional.

Example output:

mpjanic@zoran:~/AHR-TCF21$ ./binomial.sh ARNT.chipseq.cut TCF21_sigma_not_normalized_peaks.bed background.bed
Number of unique overlaps of A and B
143
Calculating coverage of ./TCF21_sigma_not_normalized_peaks.bed in human genome hg19
Base pair coverage of B in hg19 is: 0.00566281
Calculating binomial p-value with pbinom(143, 42920, 0.00566281)
Binomial p-value:
[1] 2.196093e-12
 num 2.2e-12
Code:

#!/bin/bash

# binomial-test-for-genomic-overlaps1

#intersecting with bedtools
bedtools intersect -a $1 -b $2 | uniq > A_B_uniq.bed

echo "Number of unique overlaps of A and B"
wc -l A_B_uniq.bed | cut -f1 -d ' '

mkdir TEMP

for f in $(find . -maxdepth 1 -name $2)
do
echo "Calculating coverage of $f in human genome hg19"
cat "$f" | awk '
{
    FS="\t"
    {
        print $1"\t"$2"\t"$3"\t"($3-$2)
}
}' > TEMP/"$f".temp
done

for f in $(find TEMP/ -maxdepth 1 -name \*temp\*);
do
echo "$f"
cat "$f" | awk '
{s+=$4}END{print s}';
done > out.dat

awk '{if (count++%2!=0) print ($0/2630301437);print $0}' out.dat  > out.dat2

declare bpcoverage=$(awk '{if (count++%2!=0) print ($0/2630301437)}' out.dat)

echo "Base pair coverage of B in hg19 is: $bpcoverage"
sed -i 's/TEMP\//print\(\"/' out.dat2
sed -i 's/\.temp/\"\)/' out.dat2

awk '{if (count++%3!=2) print$0}' out.dat2 > out.dat3

awk '{if (count++%2!=0) print ("dbinom("$0")");print $0}' out.dat3  > out.dat4

#writing and executing R script
echo "#!/usr/bin/Rscript
A_B.binomial  <-
pbinom($(wc -l A_B_uniq.bed | cut -f1 -d ' '), $(wc -l $2 | cut -f1 -d ' '), $bpcoverage)
A_B.binomial
str(A_B.binomial)" > script.r

chmod 775 script.r
echo "Calculating binomial p-value with pbinom($(wc -l A_B_uniq.bed | cut -f1 -d ' '), $(wc -l $2 | cut -f1 -d ' '), $bpcoverage)"
echo "Binomial p-value:"
./script.r

#removing intermediary files

rm A_B_uniq.bed
rm script.r
rm -r TEMP

No comments:

Post a Comment