Wednesday, May 20, 2015

Code for BWA mapping of the paired end reads - Picard SAM to BAM conversion - GATK calling of variants in VCF format

Note this code works for GATK version 1.6 with CountCovariates and TableRecalibration. In GATK 2.0 this process is now done with a dedicated tool called BaseRecalibrator followed by PrintReads

Code for:
- BWA mapping of paired end reads (files in example 141125_H22TJ_2305_3_L005_R1.fastq.gz and 141125_H22TJ_2305_3_L005_R2.fastq.gz)
- Picard Mark Duplicates that originate from PCR amplification (and that map at the same location)
- GATK realigning reads around indels
- Picard fixing mate information
- GATK quality score recalibration
- GATK UnifiedGenotyper SNP calling
- GATK filter probable false SNPs



#genome indexing with BWA - needs hg19.fa
bwa index -a bwtsw -p hg19 hg19.fa

#alignment with BWA
bwa aln -t 64 -f 141125_H22TJ_2305_3_L005_R1.sai hg19 141125_H22TJ_2305_3_L005_R1.fastq.gz
bwa aln -t 64 -f 141125_H22TJ_2305_3_L005_R2.sai hg19 141125_H22TJ_2305_3_L005_R2.fastq.gz


#BWA conversion sai to sam
bwa sampe -f 141125_H22TJ_2305_3_L005.sam -r "@RG\tID:2305\tLB:2305\tSM:2305\tPL:ILLUMINA" hg19 141125_H22TJ_2305_3_L005_R1.sai 141125_H22TJ_2305_3_L005_R2.sai 141125_H22TJ_2305_3_L005_R1.fastq.gz 141125_H22TJ_2305_3_L005_R2.fastq.gz

#Picard make genome dictionary and Samtools genome index
java -jar /usr/bin/picard.jar CreateSequenceDictionary R= hg19.fa O= hg19.dict
samtools faidx hg19.fa

#Picard sorting of sam and conversion to bam
java -Xmx4g -Djava.io.tmpdir=/tmp \
-jar picard.jar SortSam \
SO=coordinate \
INPUT=141125_H22TJ_2305_3_L005.sam \
OUTPUT=141125_H22TJ_2305_3_L005.bam \
VALIDATION_STRINGENCY=LENIENT \
CREATE_INDEX=true


#Picard marking of PCR duplicates (same start and end position in both paired reads)
java -Xmx4g -Djava.io.tmpdir=/tmp \
-jar picard.jar MarkDuplicates \
INPUT=141125_H22TJ_2305_3_L005.bam \
OUTPUT=141125_H22TJ_2305_3_L005.marked.bam \
METRICS_FILE=metrics \
CREATE_INDEX=true \
VALIDATION_STRINGENCY=LENIENT


#GATK creates a table of possible indels
java -Xmx4g -jar GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R hg19.fa \
-o 141125_H22TJ_2305_3_L005.marked.bam.list \
-I 141125_H22TJ_2305_3_L005.marked.bam


#GATK realigning reads around those indels
java -Xmx4g -Djava.io.tmpdir=/tmp \
-jar GenomeAnalysisTK.jar \
-I 141125_H22TJ_2305_3_L005.marked.bam \
-R hg19.fa \
-T IndelRealigner \
-targetIntervals 141125_H22TJ_2305_3_L005.marked.bam.list \
-o 141125_H22TJ_2305_3_L005.marked.realigned.bam


#Picard fixes mate information that may be changed during the realignment process
java -Djava.io.tmpdir=/tmp/flx-auswerter \
-jar picard.jar FixMateInformation \
INPUT=141125_H22TJ_2305_3_L005.marked.realigned.bam \
OUTPUT=141125_H22TJ_2305_3_L005.marked.realigned.fixed.bam \
SO=coordinate \
VALIDATION_STRINGENCY=LENIENT \
CREATE_INDEX=true


#GATK quality score recalibration as quality data from the sequencer isn't always very accurate - requires a dbSNP file
java -Xmx4g -jar GenomeAnalysisTK.jar \-l INFO \
-R hg19.fa \
--DBSNP dbsnp142.txt \
-I 141125_H22TJ_2305_3_L005.marked.realigned.fixed.bam \
-T CountCovariates \
-cov ReadGroupCovariate \
-cov QualityScoreCovariate \
-cov CycleCovariate \
-cov DinucCovariate \
-recalFile 141125_H22TJ_2305_3_L005.recal_data.csv


#GATK quality score recalibration - table recalibration
java -Xmx4g -jar GenomeAnalysisTK.jar \
-l INFO \
-R hg19.fa \
-I 141125_H22TJ_2305_3_L005.marked.realigned.fixed.bam \
-T TableRecalibration \
--out 141125_H22TJ_2305_3_L005.marked.realigned.fixed.recal.bam \
-recalFile 141125_H22TJ_2305_3_L005.recal_data.csv

#GATK UnifiedGenotyper SNP calling 
java -Xmx4g -jar GenomeAnalysisTK.jar \
-glm BOTH \
-R hg19.fa \
-T UnifiedGenotyper \
-I 141125_H22TJ_2305_3_L005.marked.realigned.fixed.recal.bam \
-D dbsnp142.txt \
-o 141125_H22TJ_2305_3_L005.marked.realigned.fixed.recal.vcf \
-metrics snps.metrics \
-stand_call_conf 50.0 \
-stand_emit_conf 10.0 \
-dcov 1000 \
-A DepthOfCoverage \
-A AlleleBalance \


#filter (flag) probable false SNPs (clustered in window of 10bp; with low coverage; very low quality etc.)
java -Xmx4g -jar GenomeAnalysisTK.jar \
-R hg19.fa \
-T VariantFiltration \
-B:variant,VCF 141125_H22TJ_2305_3_L005.marked.realigned.fixed.recal.vcf \
-o 141125_H22TJ_2305_3_L005.marked.realigned.fixed.recal.filtered.vcf \
--clusterWindowSize 10 \
--filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" \
--filterName "HARD_TO_VALIDATE" \
--filterExpression "DP < 5 " \
--filterName "LowCoverage" \
--filterExpression "QUAL < 30.0 " \
--filterName "VeryLowQual" \
--filterExpression "QUAL > 30.0 && QUAL < 50.0 " \
--filterName "LowQual" \
--filterExpression "QD < 1.5 " \
--filterName "LowQD" \
--filterExpression "SB > -10.0 " \
--filterName "StrandBias"

No comments:

Post a Comment