Friday, May 29, 2015

Add chr to vcf file

GATK may give you an error message like this indicating that your vcf file does not contain chr identifier:

##### ERROR   /home/mpjanic/alea/bin/All.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT]
##### ERROR   reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr20, chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, chr19_gl000208_random, chr9_gl000198_random, chr17_gl000204_random, chrUn_gl000233, chrUn_gl000237, chrUn_gl000230, chrUn_gl000242, chrUn_gl000243, chrUn_gl000241, chrUn_gl000236, chrUn_gl000240, chr17_gl000206_random, chrUn_gl000232, chrUn_gl000234, chr11_gl000202_random, chrUn_gl000238, chrUn_gl000244, chrUn_gl000248, chr8_gl000196_random, chrUn_gl000249, chrUn_gl000246, chr17_gl000203_random, chr8_gl000197_random, chrUn_gl000245, chrUn_gl000247, chr9_gl000201_random, chrUn_gl000235, chrUn_gl000239, chr21_gl000210_random, chrUn_gl000231, chrUn_gl000229, chrM, chrUn_gl000226, chr18_gl000207_random]
##### ERROR ------------------------------------------------------------------------------------------


Here is a simple awk code to add chr to the VCF file that is missing it:

awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' file.vcf > file_with_chr.vcf

Monday, May 25, 2015

Turning off Illumina specific FASTQ encoding

One of the common errors in the code for BWA mapping is -I option that will make it use older Illumina specific FASTQ encoding instead of Sanger style FASTQ that is now used by Illumina pipes.
If you try to use Picard to convert SAM to BAM afterwards you will get this error:

java -Xmx4g -Djava.io.tmpdir=/tmp \
-jar /usr/bin/picard.jar SortSam \
SO=coordinate \
INPUT=141125_H22TJ_2305_3_L005.sam \
OUTPUT=141125_H22TJ_2305_3_L005.bam \
VALIDATION_STRINGENCY=LENIENT \
CREATE_INDEX=true

picard.sam.SortSam INPUT=141125_H22TJ_2305_3_L005.sam OUTPUT=141125_H22TJ_2305_3_L005.bam SORT_ORDER=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true    VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false
[Sat May 23 01:50:25 PDT 2015] Executing as mpjanic@valkyr on Linux 3.13.0-44-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_79-b14; Picard version: 1.129(b508b2885562a4e932d3a3a60b8ea283b7ec78e2_1424706677) IntelDeflater
Ignoring SAM validation error due to lenient parsing:
Error parsing text SAM file. length(QUAL) != length(SEQ); File 141125_H22TJ_2305_3_L005.sam; Line 96
Line: ST-E00136:51:H22TJCCXX:5:1101:2502:1783   77      *       0       0       *       *       0       0       AATAAACCTCTCCTCCCCGGCGCCGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNC
ANNANGCGANAGGGCAGAGACAGAGTGGGGGCGGAGCAGGT     "'''","'""""'"
[Sat May 23 01:50:25 PDT 2015] picard.sam.SortSam done. Elapsed time: 0.00 minutes.


So remove -I option from bwa code, and it should be fine.

Wednesday, May 20, 2015

Code for BWA mapping of the paired end reads - Picard SAM to BAM conversion - GATK calling of variants in VCF format

Note this code works for GATK version 1.6 with CountCovariates and TableRecalibration. In GATK 2.0 this process is now done with a dedicated tool called BaseRecalibrator followed by PrintReads

Code for:
- BWA mapping of paired end reads (files in example 141125_H22TJ_2305_3_L005_R1.fastq.gz and 141125_H22TJ_2305_3_L005_R2.fastq.gz)
- Picard Mark Duplicates that originate from PCR amplification (and that map at the same location)
- GATK realigning reads around indels
- Picard fixing mate information
- GATK quality score recalibration
- GATK UnifiedGenotyper SNP calling
- GATK filter probable false SNPs



#genome indexing with BWA - needs hg19.fa
bwa index -a bwtsw -p hg19 hg19.fa

#alignment with BWA
bwa aln -t 64 -f 141125_H22TJ_2305_3_L005_R1.sai hg19 141125_H22TJ_2305_3_L005_R1.fastq.gz
bwa aln -t 64 -f 141125_H22TJ_2305_3_L005_R2.sai hg19 141125_H22TJ_2305_3_L005_R2.fastq.gz


#BWA conversion sai to sam
bwa sampe -f 141125_H22TJ_2305_3_L005.sam -r "@RG\tID:2305\tLB:2305\tSM:2305\tPL:ILLUMINA" hg19 141125_H22TJ_2305_3_L005_R1.sai 141125_H22TJ_2305_3_L005_R2.sai 141125_H22TJ_2305_3_L005_R1.fastq.gz 141125_H22TJ_2305_3_L005_R2.fastq.gz

#Picard make genome dictionary and Samtools genome index
java -jar /usr/bin/picard.jar CreateSequenceDictionary R= hg19.fa O= hg19.dict
samtools faidx hg19.fa

#Picard sorting of sam and conversion to bam
java -Xmx4g -Djava.io.tmpdir=/tmp \
-jar picard.jar SortSam \
SO=coordinate \
INPUT=141125_H22TJ_2305_3_L005.sam \
OUTPUT=141125_H22TJ_2305_3_L005.bam \
VALIDATION_STRINGENCY=LENIENT \
CREATE_INDEX=true


#Picard marking of PCR duplicates (same start and end position in both paired reads)
java -Xmx4g -Djava.io.tmpdir=/tmp \
-jar picard.jar MarkDuplicates \
INPUT=141125_H22TJ_2305_3_L005.bam \
OUTPUT=141125_H22TJ_2305_3_L005.marked.bam \
METRICS_FILE=metrics \
CREATE_INDEX=true \
VALIDATION_STRINGENCY=LENIENT


#GATK creates a table of possible indels
java -Xmx4g -jar GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R hg19.fa \
-o 141125_H22TJ_2305_3_L005.marked.bam.list \
-I 141125_H22TJ_2305_3_L005.marked.bam


#GATK realigning reads around those indels
java -Xmx4g -Djava.io.tmpdir=/tmp \
-jar GenomeAnalysisTK.jar \
-I 141125_H22TJ_2305_3_L005.marked.bam \
-R hg19.fa \
-T IndelRealigner \
-targetIntervals 141125_H22TJ_2305_3_L005.marked.bam.list \
-o 141125_H22TJ_2305_3_L005.marked.realigned.bam


#Picard fixes mate information that may be changed during the realignment process
java -Djava.io.tmpdir=/tmp/flx-auswerter \
-jar picard.jar FixMateInformation \
INPUT=141125_H22TJ_2305_3_L005.marked.realigned.bam \
OUTPUT=141125_H22TJ_2305_3_L005.marked.realigned.fixed.bam \
SO=coordinate \
VALIDATION_STRINGENCY=LENIENT \
CREATE_INDEX=true


#GATK quality score recalibration as quality data from the sequencer isn't always very accurate - requires a dbSNP file
java -Xmx4g -jar GenomeAnalysisTK.jar \-l INFO \
-R hg19.fa \
--DBSNP dbsnp142.txt \
-I 141125_H22TJ_2305_3_L005.marked.realigned.fixed.bam \
-T CountCovariates \
-cov ReadGroupCovariate \
-cov QualityScoreCovariate \
-cov CycleCovariate \
-cov DinucCovariate \
-recalFile 141125_H22TJ_2305_3_L005.recal_data.csv


#GATK quality score recalibration - table recalibration
java -Xmx4g -jar GenomeAnalysisTK.jar \
-l INFO \
-R hg19.fa \
-I 141125_H22TJ_2305_3_L005.marked.realigned.fixed.bam \
-T TableRecalibration \
--out 141125_H22TJ_2305_3_L005.marked.realigned.fixed.recal.bam \
-recalFile 141125_H22TJ_2305_3_L005.recal_data.csv

#GATK UnifiedGenotyper SNP calling 
java -Xmx4g -jar GenomeAnalysisTK.jar \
-glm BOTH \
-R hg19.fa \
-T UnifiedGenotyper \
-I 141125_H22TJ_2305_3_L005.marked.realigned.fixed.recal.bam \
-D dbsnp142.txt \
-o 141125_H22TJ_2305_3_L005.marked.realigned.fixed.recal.vcf \
-metrics snps.metrics \
-stand_call_conf 50.0 \
-stand_emit_conf 10.0 \
-dcov 1000 \
-A DepthOfCoverage \
-A AlleleBalance \


#filter (flag) probable false SNPs (clustered in window of 10bp; with low coverage; very low quality etc.)
java -Xmx4g -jar GenomeAnalysisTK.jar \
-R hg19.fa \
-T VariantFiltration \
-B:variant,VCF 141125_H22TJ_2305_3_L005.marked.realigned.fixed.recal.vcf \
-o 141125_H22TJ_2305_3_L005.marked.realigned.fixed.recal.filtered.vcf \
--clusterWindowSize 10 \
--filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" \
--filterName "HARD_TO_VALIDATE" \
--filterExpression "DP < 5 " \
--filterName "LowCoverage" \
--filterExpression "QUAL < 30.0 " \
--filterName "VeryLowQual" \
--filterExpression "QUAL > 30.0 && QUAL < 50.0 " \
--filterName "LowQual" \
--filterExpression "QD < 1.5 " \
--filterName "LowQD" \
--filterExpression "SB > -10.0 " \
--filterName "StrandBias"

Monday, May 18, 2015

List files that match a pattern and do not match another pattern

 If you want to list files that match certain pattern use ls and grep in a pipe:
  
ls | grep '.vcf$'

1410UNHS-0007_2108_1_SNP_INDEL.vcf
1410UNHS-0007_2305_3_SNP_INDEL.vcf
1411KHX-0032_1522-2_SNP_INDEL.vcf
1411KHX-0032_2356-3_SNP_INDEL.vcf
1503UNHX-0001_0020805_4_SNP_INDEL.vcf
1503UNHX-0001_200212_2_SNP_INDEL.vcf
1503UNHX-0001_2510_4_SNP_INDEL.vcf
1503UNHX-0001_2989_3_SNP_INDEL.vcf
1503UNHX-0001_CA1346_SNP_INDEL.vcf
1503UNHX-0001_CA1508_SNP_INDEL.vcf
new_1410UNHS-0007_2108_1_SNP_INDEL.vcf
new_1410UNHS-0007_2305_3_SNP_INDEL.vcf
new_1411KHX-0032_1522-2_SNP_INDEL.vcf
new_1411KHX-0032_2356-3_SNP_INDEL.vcf
new_1503UNHX-0001_0020805_4_SNP_INDEL.vcf
new_1503UNHX-0001_200212_2_SNP_INDEL.vcf
new_1503UNHX-0001_2510_4_SNP_INDEL.vcf
new_1503UNHX-0001_2989_3_SNP_INDEL.vcf
new_1503UNHX-0001_CA1346_SNP_INDEL.vcf
new_1503UNHX-0001_CA1508_SNP_INDEL.vcf


But now if you want to exclude from this list those files that contain "new" at the beginning use grep -v

ls | grep '.vcf$' | grep -v '^new'

1410UNHS-0007_2108_1_SNP_INDEL.vcf
1410UNHS-0007_2305_3_SNP_INDEL.vcf
1411KHX-0032_1522-2_SNP_INDEL.vcf
1411KHX-0032_2356-3_SNP_INDEL.vcf
1503UNHX-0001_0020805_4_SNP_INDEL.vcf
1503UNHX-0001_200212_2_SNP_INDEL.vcf
1503UNHX-0001_2510_4_SNP_INDEL.vcf
1503UNHX-0001_2989_3_SNP_INDEL.vcf
1503UNHX-0001_CA1346_SNP_INDEL.vcf
1503UNHX-0001_CA1508_SNP_INDEL.vcf

Monday, May 11, 2015

Combine two variables in the same order in bash

If you want to combine two for loops in bash, you will get all possible combinations, e.g.

A="0 1 2 3"
B="ab CD ef GH"
for i in $A ; do
    for j in $B ; do
       echo $i$j
    done
done


0ab
0CD
0ef
0GH
1ab
1CD
1ef
1GH
2ab
2CD
2ef
2GH
3ab
3CD
3ef
3GH



If you want to combine variables of the same order use the following code:

A=(0 1 2 3); B=(ab CD ef GH)
for i in $(seq 0 8)
do
echo ${A[$i]}${B[$i]}
done


0ab
1CD
2ef
3GH


For example if you have files that you have to merge you can use this loop.n this example we merge 4 fastq files for 9 samples with different names.


A=(1 2 3 4 5 6 7 14 15); B=("90715018_S1" "317155-2" "7103002-3" "1042702-4" "313605-4" "24635-2" "2040401" "24156" "1483"); for i in $(seq 0 9);
do
cat ${B[$i]}_S${A[$i]}_L001_R1_001.fastq.gz ${B[$i]}_S${A[$i]}_L002_R1_001.fastq.gz ${B[$i]}_S${A[$i]}_L003_R1_001.fastq.gz ${B[$i]}_S${A[$i]}_L004_R1_001.fastq.gz >  ${B[$i]}_S${A[$i]}_L001_L004_R1_001$.fastq.gz


cat ${B[$i]}_S${A[$i]}_L001_R2_001.fastq.gz ${B[$i]}_S${A[$i]}_L002_R2_001.fastq.gz ${B[$i]}_S${A[$i]}_L003_R2_001.fastq.gz ${B[$i]}_S${A[$i]}_L004_R2_001.fastq.gz >  ${B[$i]}_S${A[$i]}_L001_L004_R2_001$.fastq.gz


done

For loop with discontinuous numbers

To make a for loop in bash with e.g. 7 consecutive numbers from 1-7 do:

for j in {1..7}
do
echo $j
done


1
2
3
4
5
6
7



But if you want to make a loop that contains discontinuous numbers 1-7 and 14-15 simply add:

for j in {1..7} 14 15
do
echo $j
done


1
2
3
4
5
6
7

14
15

or:


for j in {1..7} {14..15}
do
echo $j
done


1
2
3
4
5
6
7

14
15

Thursday, May 7, 2015

Select columns that are delimited with irregular number of multiple spaces

To select columns in a file that are delimited with irregular number of consecutive spaces you may think of using cut command but this will not work.

In a file that is delimited with 7, 6, 5, 4 etc. space characters following commands will not select needed columns

cut -d ' ' -f 2,3,4,5 snp.142.txt| head

585     chr1    10019   10020   rs376643643     0       +       A       A       -/A     genomic deletion        unknown 0       0       near-gene-5     exact   1               1       SSMP,   0
585     chr1    10056   10056   rs373328635     0       +       -       -       -/A     genomic insertion       unknown 0       0       near-gene-5     between 1               1       SSMP,   0
585     chr1    10107   10108   rs62651026      0       +       C       C       C/T     genomic single  unknown 0       0       near-gene-5     exact   1               1       BCMHGSC_JDW,    0
585     chr1    10108   10109   rs376007522     0       +       A       A       A/T     genomic single  unknown 0       0       near-gene-5     exact   1               1       BILGI_BIOE,     0
585     chr1    10138   10139   rs368469931     0       +       A       A       A/T     genomic single  unknown 0       0       near-gene-5     exact   1               1       BILGI_BIOE,     0
585     chr1    10144   10145   rs144773400     0       +       A       A       -/A     genomic deletion        unknown 0       0       near-gene-5     exact   1               1       BL,     0



Also specifying multiple spaces is not possible:

cut -d '     ' -f 2,3,4,5 snp142.txt |head
 

cut: the delimiter must be a single character


Instead use awk command


awk '{print $2,$3,$4,$5}' snp142.txt | head

chr1 10019 10020 rs376643643
chr1 10056 10056 rs373328635
chr1 10107 10108 rs62651026
chr1 10108 10109 rs376007522
chr1 10138 10139 rs368469931
chr1 10144 10145 rs144773400

Monday, May 4, 2015

Delete files with size 0 in Unix

 If you are writing a script and execute it and something's not working, you may end up with a lot of empty files that you want to delete. 
Use a simple find command with the -size argument

find . -size 0 -delete