Thursday, April 28, 2016

How to search barcoded files using first two columns in the barcode list and modify their names with third column

Continuing on the previous post.
Lets say you have barcodes from a sequencing experiment in a file and you want to find files that correspond to those barcodes and modify their names. You will use awk to print columns with barcodes separated with a symbol in between that is present in the actual file names (in this case -) and pipe it to xargs and find to list files.

mpjanic@zoran:~$ cat tmp.tmp
ATCG GCTA first
GGGG CCCC second
CCCC TTTT third
ATAT CGCG fourth
mpjanic@zoran:~$ ls -ltrh | tail -n4
-rw-rw-r--  1 mpjanic mpjanic     0 Apr 28 16:57 tmpATCG-GCTAtmp
-rw-rw-r--  1 mpjanic mpjanic     0 Apr 28 16:57 tmpGGGG-CCCCtmp
-rw-rw-r--  1 mpjanic mpjanic     0 Apr 28 16:57 tmpCCCC-TTTTtmp
-rw-rw-r--  1 mpjanic mpjanic     0 Apr 28 16:57 tmpATAT-CGCGtmp
Write a short script where you will grab the output of the awk/xargs/find command for each line in your input file. Use another awk to grab 3rd column from the input to specify new filename:

#!/bin/bash
while read -r line; do
cp $(echo $(awk '{print $1"-"$2}' <<<$line | xargs -I % sh -c 'find . -name "*"%"*" -exec ls -1 {} \;;') $(awk '{print $3}' <<<$line).filename)

done < tmp.tmp
Check if the output files are present:

ls -ltrh | tail -n4
-rw-rw-r--  1 mpjanic mpjanic     0 Apr 28 18:06 first.filename
-rw-rw-r--  1 mpjanic mpjanic     0 Apr 28 18:06 second.filename
-rw-rw-r--  1 mpjanic mpjanic     0 Apr 28 18:06 third.filename
-rw-rw-r--  1 mpjanic mpjanic     0 Apr 28 18:06 fourth.filename


How to search barcoded files using a barcode list

Lets say you have barcodes from a sequencing experiment in a file and you want to find files that correspond to those barcodes. You will use awk to print columns with barcodes separated with a symbol in between them that is present in the actual file names (in this case -) and pipe it to xargs and find to list files:


mpjanic@zoran:~$ cat tmp.tmp 
ATCG GCTA
GGGG CCCC
CCCC TTTT
ATAT CGCG
mpjanic@zoran:~$ ls -ltrh | tail -n4
-rw-rw-r--  1 mpjanic mpjanic     0 Apr 28 16:57 tmpATCG-GCTAtmp
-rw-rw-r--  1 mpjanic mpjanic     0 Apr 28 16:57 tmpGGGG-CCCCtmp
-rw-rw-r--  1 mpjanic mpjanic     0 Apr 28 16:57 tmpCCCC-TTTTtmp
-rw-rw-r--  1 mpjanic mpjanic     0 Apr 28 16:57 tmpATAT-CGCGtmp
mpjanic@zoran:~$ awk -F"\t" '{print $1"-"$2}' tmp.tmp | xargs -I % sh -c 'echo %; find . -name "*"%"*" -exec ls -l {} \;;'
ATCG-GCTA
-rw-rw-r-- 1 mpjanic mpjanic 0 Apr 28 16:57 ./tmpATCG-GCTAtmp
GGGG-CCCC
-rw-rw-r-- 1 mpjanic mpjanic 0 Apr 28 16:57 ./tmpGGGG-CCCCtmp
CCCC-TTTT
-rw-rw-r-- 1 mpjanic mpjanic 0 Apr 28 16:57 ./tmpCCCC-TTTTtmp
ATAT-CGCG
-rw-rw-r-- 1 mpjanic mpjanic 0 Apr 28 16:57 ./tmpATAT-CGCGtmp

Wednesday, April 27, 2016

Bulk URL download using xargs

If you are able to get bulk URLs for download fro a server, (which can be the case for some sequencing facilities) use this code, to pipe the list to xargs and then run wget sequentially on each of the input URLs:

cat file.url| xargs -I % sh -c 'echo %; wget %;'

Friday, April 22, 2016

Extend the range of bed file using awk

If you have a bed file with 3 tab separated columns and want to extend the range, e.g. +/-2000bp, it can be done easily with awk:

awk  -F" " '{print $1 "\t" $2-2000 "\t"$3+2000}' file.bed

Wednesday, April 20, 2016

Tools to convert bam to bigwig

These are two scripts I wrote, bam2bigwig and bamsplit2bigwig, to convert bam files to bigwig for UCSC Genome Browser viewing. The difference is that bamsplit2bigwig will separate reads according to the mapping strand and the separately convert them to bigwig, which may be useful in some cases.

bam2bigwig
This script will take a bam file mapped to the human genome hg19 and convert it to a bigwig file that could be loaded to UCSC Genome Browser. It depends on bedtools bamtobed (aka bamToBed), and it will attempt to download and execute three UCSC scripts, bedItemOverlapCount, bedGraphToBigWig and fetchChromSizes, so make sure your connection is working.

Usage
chmod 775 bam2bigwig
./bam2bigwig file.bam
Code:

#! /bin/bash

#create bed from bam, requires bedtools bamToBed
bamToBed -i $1 -split > accepted_hits.bed

#download bedItemOverlapCount
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedItemOverlapCount
chmod 775 bedItemOverlapCount

#fetch hg19 chromosome sizes
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/fetchChromSizes
chmod 775 fetchChromSizes
./fetchChromSizes hg19 > hg19.chrom.sizes

#create plus and minus strand bedgraph
cat accepted_hits.bed | sort -k1,1 | ./bedItemOverlapCount hg19 -chromSize=hg19.chrom.sizes stdin | sort -k1,1 -k2,2n > accepted_hits.bedGraph

#download bedGraphToBigWig
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedGraphToBigWig
chmod 755 bedGraphToBigWig

#create plus and minus strand bigwig
./bedGraphToBigWig accepted_hits.bedGraph hg19.chrom.sizes $1.bw

#removing intermediery files
rm accepted_hits.bed
rm accepted_hits.bedGraph
rm bedItemOverlapCount
rm bedGraphToBigWig
rm fetchChromSizes
rm hg19.chrom.sizes

bamsplit2bigwig
This script will take a bam file mapped to the human genome hg19 and convert it to two strand specific bigwig files that could be loaded to UCSC Genome Browser. It depends on bedtools bamtobed (aka bamToBed), and it will attempt to download and execute three UCSC scripts, bedItemOverlapCount, bedGraphToBigWig and fetchChromSizes, so make sure your connection is working. Converting bam to bigwig and separating according to strand could be useful to visualize nucleosomal or protein binding 'footprints' in ChIP-Seq experiments as at the boundaries of protein-DNA interactions will be marked with reads mapping on different DNA strands. 

Usage
chmod 775 bamsplit2bigwig
./bamsplit2bigwig file.bam
Code:

#! /bin/bash

#create bed from bam, requires bedtools bamToBed
bamToBed -i $1 -split > accepted_hits.bed

#download bedItemOverlapCount
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedItemOverlapCount
chmod 775 bedItemOverlapCount

#fetch hg19 chromosome sizes
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/fetchChromSizes
chmod 775 fetchChromSizes
./fetchChromSizes hg19 > hg19.chrom.sizes

#create plus and minus strand bedgraph
awk '{if($6=="+") print}' accepted_hits.bed | sort -k1,1 | ./bedItemOverlapCount hg19 -chromSize=hg19.chrom.sizes stdin | sort -k1,1 -k2,2n > accepted_hits.plus.bedGraph
awk '{if($6=="-") print}' accepted_hits.bed | sort -k1,1 | ./bedItemOverlapCount hg19 -chromSize=hg19.chrom.sizes stdin | sort -k1,1 -k2,2n | awk '{OFS="\t"; print $1,$2,$3,"-"$4}' > accepted_hits.minus.bedGraph

#download bedGraphToBigWig
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedGraphToBigWig
chmod 755 bedGraphToBigWig

#create plus and minus strand bigwig
./bedGraphToBigWig accepted_hits.plus.bedGraph hg19.chrom.sizes $1.plus.bw
./bedGraphToBigWig accepted_hits.minus.bedGraph hg19.chrom.sizes $1.minus.bw

#removing intermediery files
rm accepted_hits.bed
rm accepted_hits.plus.bedGraph
rm accepted_hits.minus.bedGraph
rm bedItemOverlapCount
rm bedGraphToBigWig
rm fetchChromSizes
rm hg19.chrom.sizes

gwasCatalog2Bed2Category

This is the script that will connect to http://www.genome.gov, download GWAS Catalog, convert it to a bed file with "chr position position+1 proxy_gene phenotype", and then take all your input search terms and select those entries in the catalog that match the term, and create separate bed files for each search term.

Script will create and call two 'subscripts' GwasCatalog2Bed.sh and main.sh. Main.sh will be removed while GwasCatalog2Bed.sh will not be removed and you can use it afterwards to download whole GWAS Catalog in bed format.

Files that will be generated are: GwasCatalog.bed, which contains shorter bed version of the entire GWAS Catalog and phenotype specific bed files.

gwasCatalog2Bed2Category will also remove duplicates from phenotype/term specific bed files, that may occur as you may greb similar but different phenotypes that contain identical SNPs. Script will save those files as .bed.cut.sort.uniq and report the number of SNPs before and after removing deplicates.

Enter all search terms as arguments passed to a Bash script and if your serach term is a phrase use double quotes. Note that names of phenotypes in GWAS Catalog start with capital letter but then next word is with small letter.
chmod 755 gwasCatalog2Bed2Category.sh
~/gwasCatalog2Bed2Category "Coronary artery" "Coronary heart" "Bipolar disorder"

Code:

#!/bin/bash
#download gwas catalog and create bed file with "chr position position+1 proxy_gene phenotype"
echo "
#download gwas catalog 
wget http://www.genome.gov/admin/gwascatalog.txt
awk -F\"\t\" '{if (\$12!=\"\") print \$12\"\t\"\$13\"\t\"\$15\"\t\"\$8}' gwascatalog.txt > tmp
awk -F\"\t\" '{print \$1\"\t\"\$2\"\t\"\$2+1\"\t\"\$3\"\t\"\$4}' tmp > GwasCatalog.bed
rm tmp" > GwasCatalog2Bed.sh
chmod 775 GwasCatalog2Bed.sh
./GwasCatalog2Bed.sh

#writing main to perform grep on gwascatalog.txt.cut
echo "
#grep categories
for var in \"\$@\"
do
  echo Received: \$var
grep \"\$var\" GwasCatalog.bed > \"\$var\".gwascatalog.bed 
echo Done: \$var
done
#print stats
echo "Gwas Catalog number of SNP-phenotype associations:"
wc -l GwasCatalog.bed
echo "Gwas Catalog number of SNP-phenotype associations per category:"
for var in \"\$@\"
do
  echo Phenotype: \$var
wc -l \"\$var\".gwascatalog.bed 
done
for var in \"\$@\"
do
cut -f1-3 \"\$var\".gwascatalog.bed > \"\$var\".gwascatalog.bed.cut
sort -k1,1V -k2,2n \"\$var\".gwascatalog.bed.cut > \"\$var\".gwascatalog.bed.cut.sort
uniq \"\$var\".gwascatalog.bed.cut.sort > \"\$var\".gwascatalog.bed.cut.sort.uniq 
rm \"\$var\".gwascatalog.bed.cut
rm \"\$var\".gwascatalog.bed.cut.sort
done
echo "Gwas Catalog number of SNP-phenotype associations per category AFTER REMOVING DUPLICATES:"
for var in \"\$@\"
do
  echo Phenotype: \$var
wc -l \"\$var\".gwascatalog.bed.cut.sort.uniq  
done
" > main.sh

#calling main
chmod 775 main.sh
./main.sh "$@"

#removing files
rm gwascatalog.txt
rm main.sh

Monday, April 18, 2016

Using xargs to parallelize bedrolls intersect


One of the most useful commands in bash is xargs, that allows you to substitute simple loops with a one liner command. Xargs takes output of the e.g. find command and sequentially runs the issued command/s to all files that match the find attributes. In this example I have parallelized bedrolls intersect with xargs to intersect 125 ENCODE datasets with our test dataset.

find ENCODE_DNASE_FOR_PCA_PLOT/*narrowPeak.cut.merge | xargs -I % sh -c 'echo %; bedtools intersect -wa -wb -a snp.position.rs1537373 -b %;'
ENCODE_DNASE_FOR_PCA_PLOT/wgEncodeAwgDnaseDuke8988tUniPk.narrowPeak.cut.merge
ENCODE_DNASE_FOR_PCA_PLOT/wgEncodeAwgDnaseDukeAosmcUniPk.narrowPeak.cut.merge
ENCODE_DNASE_FOR_PCA_PLOT/wgEncodeAwgDnaseDukeChorionUniPk.narrowPeak.cut.merge
ENCODE_DNASE_FOR_PCA_PLOT/wgEncodeAwgDnaseDukeCllUniPk.narrowPeak.cut.merge
ENCODE_DNASE_FOR_PCA_PLOT/wgEncodeAwgDnaseDukeFibroblUniPk.narrowPeak.cut.merge
chr9 22103340 22103341 rs1537373 chr9 22103265 22103415 16
ENCODE_DNASE_FOR_PCA_PLOT/wgEncodeAwgDnaseDukeFibropUniPk.narrowPeak.cut.merge
ENCODE_DNASE_FOR_PCA_PLOT/wgEncodeAwgDnaseDukeGlioblaUniPk.narrowPeak.cut.merge
ENCODE_DNASE_FOR_PCA_PLOT/wgEncodeAwgDnaseDukeGm12891UniPk.narrowPeak.cut.merge
...

Monday, April 11, 2016

Binomial test for genomic overlaps of two bed files

Final script I wrote in a series - genomic overlaps - performs binomial statistical test using two bed files with genomic coordinates. This is a combined bash/R script that will generate a binomial p-value that shows significance for the overlap of two sets of genomic regions in hg19 human genome (for example from ChIP-Seq experiments).

To calculate binomial p-value you would need two bed files as inputs. Note that binomial test does not require background set as do Fisher and hypergeometric tests.

Note that file names are optional.

Example output:

mpjanic@zoran:~/AHR-TCF21$ ./binomial.sh ARNT.chipseq.cut TCF21_sigma_not_normalized_peaks.bed background.bed
Number of unique overlaps of A and B
143
Calculating coverage of ./TCF21_sigma_not_normalized_peaks.bed in human genome hg19
Base pair coverage of B in hg19 is: 0.00566281
Calculating binomial p-value with pbinom(143, 42920, 0.00566281)
Binomial p-value:
[1] 2.196093e-12
 num 2.2e-12
Code:

#!/bin/bash

# binomial-test-for-genomic-overlaps1

#intersecting with bedtools
bedtools intersect -a $1 -b $2 | uniq > A_B_uniq.bed

echo "Number of unique overlaps of A and B"
wc -l A_B_uniq.bed | cut -f1 -d ' '

mkdir TEMP

for f in $(find . -maxdepth 1 -name $2)
do
echo "Calculating coverage of $f in human genome hg19"
cat "$f" | awk '
{
    FS="\t"
    {
        print $1"\t"$2"\t"$3"\t"($3-$2)
}
}' > TEMP/"$f".temp
done

for f in $(find TEMP/ -maxdepth 1 -name \*temp\*);
do
echo "$f"
cat "$f" | awk '
{s+=$4}END{print s}';
done > out.dat

awk '{if (count++%2!=0) print ($0/2630301437);print $0}' out.dat  > out.dat2

declare bpcoverage=$(awk '{if (count++%2!=0) print ($0/2630301437)}' out.dat)

echo "Base pair coverage of B in hg19 is: $bpcoverage"
sed -i 's/TEMP\//print\(\"/' out.dat2
sed -i 's/\.temp/\"\)/' out.dat2

awk '{if (count++%3!=2) print$0}' out.dat2 > out.dat3

awk '{if (count++%2!=0) print ("dbinom("$0")");print $0}' out.dat3  > out.dat4

#writing and executing R script
echo "#!/usr/bin/Rscript
A_B.binomial  <-
pbinom($(wc -l A_B_uniq.bed | cut -f1 -d ' '), $(wc -l $2 | cut -f1 -d ' '), $bpcoverage)
A_B.binomial
str(A_B.binomial)" > script.r

chmod 775 script.r
echo "Calculating binomial p-value with pbinom($(wc -l A_B_uniq.bed | cut -f1 -d ' '), $(wc -l $2 | cut -f1 -d ' '), $bpcoverage)"
echo "Binomial p-value:"
./script.r

#removing intermediary files

rm A_B_uniq.bed
rm script.r
rm -r TEMP

Sunday, April 10, 2016

ensemble2genename

This is combined bash/R script that will use a file with human ENSEMBLE geneIDs in a first column of a file and append a gene name to it, while keeping the structure of the file from other columns. Ensemble2genename sets its host to ensembl.org thus it could be especially useful when biomaRt site is down.

Download and examples on GitHub:
https://github.com/milospjanic/ensemble2genename

#!/bin/bash

#write R script, needs biomaRt

echo "#!/usr/bin/Rscript
library(biomaRt)
listMarts(host=\"grch37.ensembl.org\")
ensembl = useMart(\"ENSEMBL_MART_ENSEMBL\",dataset=\"hsapiens_gene_ensembl\", host=\"grch37.ensembl.org\")
id_merge = getBM(attributes=c(\"ensembl_gene_id\",\"external_gene_name\"),mart=ensembl)
write.table(id_merge, file=\"id_merge.txt\", sep = \"\t\", quote =F, col.names=F, row.names=F)
" > script.r

#run R script

chmod 775 script.r
./script.r

#Use awk to append gene names

awk 'NR==FNR {h[$1] = $1; h2[$1] = $2; next} {print h2[$1], $0}' id_merge.txt $1 >$1.genename

#remove temporary files

rm id_merge.txt
rm script.r

Friday, April 8, 2016

Multiple spaces/tabs to single tab - code

To convert multiple spaces and tabs to a single tab use the code below:

Input file.txt

chr1       10100           10330            
chr1     10345    10590                                 
chr1    16100       16315      
chr1    20060      20210   
chr1      56200   56350   
Code. First use tr -s to squeeze spaces, then sed to substitute spaces to tabs, then tr -s to squeeze tabs.

tr -s " " < file.txt > file.tmp
sed -i 's/\s/\t/g' file.tmp
tr -s "\t" < file.tmp > file.txt
Output file.txt

chr1    10100   10330
chr1    10345   10590
chr1    16100   16315
chr1    20060   20210
chr1    56200   56350

Thursday, April 7, 2016

Hypergeometric statistical test for overlaps of two sets of genomic regions in a predefined genomic background

This is a bash script that will invoke R through RScript and generate a hypergeometric p-value that shows significance of the overlap of two sets of genomic regions (for example from ChIP-Seq experiments).

To calculate hypergeometric p-value you would need three BED files: 1. First bed file to overlap, i.e. file1.bed 2. Second bed file for overlap, i.e. file2.bed 3. Bed file that will serve as genomic background for overlaps, i.e. background.bed

Note that file names are optional.

As genomic background in human you can use e.g. combined ENCODE set of open chromatin regions or a similar data set. Genomic background is necessary to calculate the constituents of the hypergeometric test elements: A-B overlap in BG, A no B in BG, total BG minus BG-A overlap, B no A in BG.

I provided a combined ENCODE DHS data set in a bed file which you can use as a background set, background.bed.

https://github.com/milospjanic/hypergeometric-test-for-genomic-overlaps


#!/bin/bash

# hypergeometric-test-for-genomic-overlaps

#intersecting with bedtools
bedtools intersect -a $1 -b $3 > A_BG
bedtools intersect -a $2 -b $3 > B_BG

bedtools intersect -wa -a A_BG -b B_BG | uniq > A_B_BG
bedtools intersect -wa -a B_BG -b A_BG | uniq > B_A_BG
bedtools intersect -v -a A_BG -b A_B_BG | uniq > A_BG_noB
bedtools intersect -v -a B_BG -b B_A_BG | uniq > B_BG_noA
bedtools intersect -v -a $3 -b A_BG | uniq > BG_noA

echo "Number of uniq A overlapping B in genomic background"
wc -l A_B_BG | cut -f1 -d ' '
echo "Number of uniq B overlapping A in genomic background"
wc -l B_A_BG | cut -f1 -d ' '
echo "Number of uniq A not overlapping B in genomic background"
wc -l A_BG_noB | cut -f1 -d ' '
echo "Number of uniq B not overlapping A in genomic background"
wc -l B_BG_noA | cut -f1 -d ' '
echo "Number of genomic background not overlapping A or B"
wc -l BG_noA | cut -f1 -d ' '

echo "Hypergeometric p-value will be calculated with phyper in R as phyper(A-B overlap in BG, A no B in BG, total BG minus BG-A overlap, B no A in BG, lower.tail = FALSE, log.p = FALSE)"
#writing and executing R script
echo "#!/usr/bin/Rscript
A_B.hypergeometric  <-
phyper($(wc -l A_B_BG | cut -f1 -d ' '), $(wc -l A_BG_noB | cut -f1 -d ' '), $(wc -l BG_noA | cut -f1 -d ' '), $(wc -l B_BG_noA | cut -f1 -d ' '), lower.tail = FALSE, log.p = FALSE)
A_B.hypergeometric
str(A_B.hypergeometric)" > script.r

chmod 775 script.r
echo "Hypergeometric p-value:"
./script.r

#removing intermediary files
rm A_BG
rm B_BG
rm A_B_BG
rm B_A_BG
rm A_BG_noB
rm B_BG_noA
rm BG_noA
rm script.r

Wednesday, April 6, 2016

How to change multiple spaces to a single tab in Unix

If you copy/paste tab separated files in terminal, it may happen that tabs are changed into multiple spaces. In case you end up with a file with multiple spaces and you need them to be changed to a single tab use tr and sed commands, tr with -s option will squeeze multiple spaces and sed will substitute a space \s with a tab \t.

tr -s " " < file.txt > file.txt2
mv file.txt2 file.txt
sed -i 's/\s/\t/g' file.txt

Monday, April 4, 2016

Bash/R combined script for Fisher test of genomic overlaps in a given background

I made a bash script that will invoke R through RScript and will generate a Fisher test p-value that shows significance of the overlap of two sets of genomic regions (for example from ChIP-Seq experiments).

To calculate Fisher p-value you would need three BED files: 1. First bed file to overlap, i.e. file1.bed 2. Second bed file for overlap, i.e. file2.bed 3. Bed file that will serve as genomic background for overlaps, i.e. background.bed

Note that file names do not have to be file1.bed, file2.bed, background.bed, this is just an example.

As genomic background you can use e.g. combined ENCODE set of open chromatin regions or a similar data set. Genomic background is necessary to calculate the constituents of the Fisher test contingency matrix: overlap of A and B in the background BG set, A but not B in BG, B but not A in BG, BG without A and B.

On my github page I provided a combined ENCODE DHS data set in a bed file which you can use as a background set, i.e. background.bed.

https://github.com/milospjanic/fisher-test-for-genomic-overlaps

Code:

#intersecting with bedtools
bedtools intersect -a $1 -b $3 > A_BG
bedtools intersect -a $2 -b $3 > B_BG

bedtools intersect -wa -a A_BG -b B_BG | uniq > A_B_BG
bedtools intersect -wa -a B_BG -b A_BG | uniq > B_A_BG
bedtools intersect -v -a A_BG -b A_B_BG | uniq > A_BG_noB
bedtools intersect -v -a B_BG -b B_A_BG | uniq > B_BG_noA
bedtools intersect -v -a $3 -b A_B_BG | uniq > BG_noA_noB

echo "Number of uniq A overlapping B in genomic background"
wc -l A_B_BG | cut -f1 -d ' '
echo "Number of uniq B overlapping A in genomic background"
wc -l B_A_BG | cut -f1 -d ' '
echo "Number of uniq A not overlapping B in genomic background"
wc -l A_BG_noB | cut -f1 -d ' '
echo "Number of uniq B not overlapping A in genomic background"
wc -l B_BG_noA | cut -f1 -d ' '
echo "Number of genomic background not overlapping A or B"
wc -l BG_noA_noB | cut -f1 -d ' '

#writing and executing R script
echo "#!/usr/bin/Rscript
A_B  <-
matrix(c($(wc -l A_B_BG | cut -f1 -d ' '), $(wc -l A_BG_noB | cut -f1 -d ' '), $(wc -l B_BG_noA | cut -f1 -d ' '), $(wc -l BG_noA_noB | cut -f1 -d ' ')),
       nrow = 2,
       dimnames = list(Guess = c(\"b+\", \"b-\"),
                       Truth = c(\"a+\", \"a-\")))
A_B.fisher <- fisher.test(A_B)
str(A_B.fisher)" > script.r
chmod 775 script.r
./script.r

#removing intermediary files
rm A_BG
rm B_BG
rm A_B_BG
rm B_A_BG
rm A_BG_noB
rm B_BG_noA
rm BG_noA_noB
rm script.r

Calculate Fisher test p-values for overlaps of two bed files in a given genomic background

This is a code that will generate a Fisher test p-value that shows significance for the overlap of two sets of genomic regions (for example from ChIP-Seq experiments). To calculate Fisher p-value you would need three BED files:
1. First bed file to overlap
2. Second bed file for overlap
3. Bed file that will serve as genomic background for overlaps.

As genomic background you can use e.g. combined ENCODE set of open chromatin regions or a similar data set. Genomic background is necessary to calculate the constituents of the Fisher test contingency matrix: overlap of A and B in the background BG set, A but not B in BG, B but not A in BG, BG without A and B.

Code:

bedtools intersect -a A.bed -b background.bed > 1_BG
bedtools intersect -a B.bed -b background.bed > 2_BG

bedtools intersect -wa -a 1_BG -b 2_BG | uniq > 1_2_BG
bedtools intersect -wa -a 2_BG -b 1_BG | uniq > 2_1_BG
bedtools intersect -v -a 1_BG -b 1_2_BG | uniq > 1_BG_no2
bedtools intersect -v -a 2_BG -b 2_1_BG | uniq > 2_BG_no1
bedtools intersect -v -a background.bed -b 1_2_BG | uniq > BG_no1_no2

wc -l 1_2_BG
wc -l 2_1_BG
wc -l 1_BG_no2
wc -l 2_BG_no1
wc -l BG_no1_no2
Next, create a contingency matrix in R using the outputs of wc -l command, e.g.

365 1_2_BG
360 2_1_BG
45854 1_BG_no2
3670 2_BG_no1
2233184 BG_no1_no2
Create R code:

AHR_TCF21  <-
matrix(c(365, 3670, 45854, 2233184),
       nrow = 2,
       dimnames = list(Guess = c("b+", "b-"),
                       Truth = c("a+", "a-")))

AHR_TCF21.fisher <- fisher.test(AHR_TCF21)
str(AHR_TCF21.fisher)
Output in R:

> AHR_TCF21  <-
+ matrix(c(365, 3670, 45854, 2233184),
+        nrow = 2,
+        dimnames = list(Guess = c("b+", "b-"),
+                        Tru .... [TRUNCATED]

> AHR_TCF21.fisher <- fisher.test(AHR_TCF21)

> str(AHR_TCF21.fisher)
List of 7
 $ p.value    : num 1.88e-121
 $ conf.int   : atomic [1:2] 4.34 5.4
  ..- attr(*, "conf.level")= num 0.95
 $ estimate   : Named num 4.84
  ..- attr(*, "names")= chr "odds ratio"
 $ null.value : Named num 1
  ..- attr(*, "names")= chr "odds ratio"
 $ alternative: chr "two.sided"
 $ method     : chr "Fisher's Exact Test for Count Data"
 $ data.name  : chr "AHR_TCF21"
 - attr(*, "class")= chr "htest"
In this example, we can see that the overlap is highly significant, with p-value of 1.88e-121 and CI 4.34-5.4.