Wednesday, December 14, 2016

Introducing rsID2Bed, a script to convert SNP rsIDs to genomics positions in a bed format using MySQL for hg19 dataset

rsID2Bed is a script to convert SNP rsIDs to a list of genomics positions in a bed format: chr, position, position+1, rsID. Script is useful for quick conversion of SNPs to a bed format for downstream analysis (e.g. overlaps with bedtools).

Download from GitHub:
https://github.com/milospjanic/rsID2Bed


# rsID2Bed

rsID2Bed is a script to convert SNP rsIDs to a list of genomics positions in a bed format: chr, position, position+1, rsID. Script is useful for quick conversion of SNPs to a bed format for downstream analysis (e.g. overlaps with bedtools).

**Usage**

This script will check if the working folder is present and if not it will create ~/rsID2Bed. Next, script will go into ~/rsID2Bed and check if dbSNP bed file for human genome hg19 is present or not, version 147 (the latest one present in the mySQL database currently), and case it is not present it will download it using mySQL from snp147Common table of hg19 dataset on genome-mysql.cse.ucsc.edu.

Next an akw code will perform comparison and output bed file for your input snps and save it as $1.bed, $1 being first parameter provided to the script that should be the file name containing SNPs.

Output file will be placed in ~/rsID2Bed

MySQL download will produce a file snp147Common.bed, with 14,815,821 SNPs:

head snp147Common.bed
chr1 10177 10177 rs367896724
chr1 10352 10352 rs555500075
chr1 11007 11008 rs575272151
chr1 11011 11012 rs544419019
chr1 13109 13110 rs540538026
chr1 13115 13116 rs62635286
chr1 13117 13118 rs62028691
chr1 13272 13273 rs531730856
chr1 13417 13417 rs777038595
chr1 14463 14464 rs546169444

mpjanic@zoran:~/rsID2Bed$ wc -l snp147Common.bed 
14815821 snp147Common.bed


**Running**

To run the script type:

chmod 775 rsID2Bed.sh 
./rsID2Bed.sh path/to/file


**Example**


cat SNP.file
rs575272151
rs62028691
rs546169444
rs541940975
rs4690
rs201327123
rs11586607
rs62636497
rs369786760
rs143379270
rs764954080
rs762420258

./rsID2Bed.sh SNP.file

cat SNP.file.bed

chr1 11007 11008 rs575272151
chr1 13117 13118 rs62028691
chr1 14463 14464 rs546169444
chr1 14603 14604 rs541940975
chr1 14672 14673 rs4690
chr1 14676 14677 rs201327123
chr1 15210 15211 rs11586607
chr1 15273 15274 rs62636497
chrX 155258430 155258431 rs369786760
chrX 155258653 155258654 rs143379270
chrX 155259767 155259767 rs764954080
chrX 155259783 155259783 rs762420258
chrY 59361436 59361437 rs369786760
chrY 59361659 59361660 rs143379270
chrY 59362773 59362773 rs764954080
chrY 59362789 59362789 rs762420258

Saturday, December 10, 2016

Introducing ChIPSeqFPro, a workflow for full processing of ChIPSeq data, from fastq to bigwig

ChIPSeqFPro (beta) is a script for full processing of ChIPSeq data starting from fastq files. It performs fastqc quality control, mapping to the human genome hg19 or mouse mm10 using bwa, sam to bam conversion, peak calling with MACS, creates bigwig files from bam files using bam2bigwig.

ChIPSeqFPro

ChiPSeqFPro (short from ChIP-Seq Full Processing) is a pipeline that will perform full processing of ChIPSeq data starting from the fastq.gz files. It performs fastqc quality control, mapping to the human genome hg19 or mouse mm10 using bwa, sam to bam conversion, peak calling with MACS, and finally creates bigwig files from bam files using bam2BigWig tool.

Dependencies

Place fastqc.gz in a working folder

mkdir work.folder
cp path-to-files/*fastq.gz work.folder
FastQC

Instalation (Linux), place FastQC folder in working directory:

cd work.folder
wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip
unzip fastqc_v0.11.5.zip
chmod 755 ./FastQC/fastqc
Reference genome

Download the reference genome, in this example it is human hg19:

mkdir ~/reference_genomes
cd ~/reference_genomes
mkdir hg19
cd hg19
wget --timestamping 
        'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit ' 
        -O hg19.2bit 
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa
chmod 755 twoBitToFa
./twoBitToFa hg19.2bit hg19.fa
For the mouse genome:

mkdir mm10
cd mm10
wget --timestamping 
        ' http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/mm10.2bit' 
        -O mm10.2bit    
./twoBitToFa mm10.2bit mm10.fa
Install BWA*

#Download to your ~ folder the latest version from http://sourceforge.net/projects/bio-bwa/files/
bunzip2 bwa-0.7.15.tar.bz2 
tar xvf bwa-0.7.15.tar
cd bwa-0.7.15
make

#edit ~/.bashrc to add bwa to your PATH 
nano ~/.bashrc
export PATH=$PATH:~/bwa-0.7.15
source ~/.bashrc
#test the installation
bwa
Indexing the reference genome with BWA

Use BWA to index the reference genome, use number of core on your machine, e.g. 64.

cd ~/reference_genomes
#for human genome
bwa index -a bwtsw hg19.fa
#for mouse genome
bwa index -a bwtsw mm10.fa
Download bam2bigwig

wget https://raw.githubusercontent.com/milospjanic/bam2bigwig/master/bam2bigwig.sh
chmod 775 bam2bigwig
Running

ChIPSeqFPro is composed of four pipelines that will run on either human genome hg19 or mouse genome mm10, using either paired-end (PE) or single-read (SR) sequences.

ChIPSeqFPro.PE.hg19.sh
ChIPSeqFPro.PE.mm10.sh
ChIPSeqFPro.SR.hg19.sh
ChIPSeqFPro.SR.mm10.sh
After placing fastq.gz files in the working folder run the script that is suitable for your experiment, e.g:

chmod 755 ChIPSeqFPro.PE.hg19.sh
./ChIPSeqFPro.PE.hg19.sh 64
The number indicates number of cores you want to allocate for the analysis.

Don't forget to place the FastQC folder into the working folder! These are the only requirment neccesary to be in the working directory, in addition to the fastqc.gz files

Dont forget that reference genome needs to be in your ~/reference_genomes folder, in case you switch to another user account script may not work because it searches for ~/reference_genomes folder.