Wednesday, November 25, 2015

How to delete large folders with rsync

If a folder contains a very large number of files, or many big ones, deleting it with rm can cause problems or be relatively slow.
An elegant alternative is to sync an empty folder onto it with rsync --delete, which removes the whole content and can be substantially faster than the rm command.


mkdir empty_folder

rsync -a --delete empty_folder/ folder_to_delete/
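After rsync finishes, folder_to_delete still exists but is empty, so both folders have to be removed afterwards. The following is a minimal sketch of the whole procedure, assuming rsync and mktemp are installed; purge_dir is a hypothetical helper name, and the scratch folder from mktemp -d stands in for empty_folder.

```shell
#!/bin/sh
# purge_dir is a hypothetical helper name, not part of rsync.
purge_dir() {
    target=$1
    # Refuse to run on an empty argument or on something that is not a directory.
    [ -n "$target" ] && [ -d "$target" ] || { echo "not a directory: $target" >&2; return 1; }
    # Scratch directory that plays the role of empty_folder.
    empty=$(mktemp -d) || return 1
    # --delete removes everything in the target that is absent from the source,
    # and the source is empty, so the target ends up empty as well.
    rsync -a --delete "$empty"/ "$target"/
    status=$?
    # rmdir only succeeds on empty directories, so it doubles as a check.
    rmdir "$empty"
    [ "$status" -eq 0 ] && rmdir "$target"
}
```

Usage: purge_dir folder_to_delete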



Tuesday, November 17, 2015

Filter dbSNP vcf with a list of SNP IDs

If you want to get a vcf file with a specific set of SNPs, e.g. you have a list of SNP IDs in one file and a complete dbSNP vcf set in another file, you can use awk to quickly obtain a vcf file containing only your set of SNPs.

File with SNP IDs, named CAD_SNP_positions.bed

chr1    2162944 2162945 rs590367
chr1    2163568 2163569 rs263533
chr1    2164116 2164117 rs263532
chr1    2164699 2164700 rs377599
...

Vcf file with dbSNP variants, named dbSNP-common_all.vcf.gz
...
##INFO=<ID=COMMON,Number=1,Type=Integer,Description="RS is a common SNP.  A common SNP is one that has at least one 1000Genomes population with a minor allele of frequency >= 1% and for which 2 or more founders contribute to that minor allele frequency.">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       10177   rs367896724     A       AC      .       .       RS=367896724;RSPOS=10177;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000020005140026000200;WGT=1;VC=DIV;R5;ASP;VLD;KGPhase3;CAF=0.5747,0.4253;COMMON=1
1       10352   rs145072688     T       TA      .       .       RS=145072688;RSPOS=10353;dbSNPBuildID=134;SSR=0;SAO=0;VP=0x050000020005000002000200;WGT=1;VC=DIV;R5;ASP;CAF=0.5625,0.4375;COMMON=1
...

Do the following:

mpjanic@valkyr:~/vcf_subset_selection_script$ grep "^#" <(gzip -dc dbSNP-common_all.vcf.gz) > CAD_SNP_common_all.vcf

mpjanic@valkyr:~/vcf_subset_selection_script$ awk 'NR==FNR {h[$4]; next} ($3 in h)' CAD_SNP_positions.bed <(gzip -dc dbSNP-common_all.vcf.gz) >> CAD_SNP_common_all.vcf
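The awk command relies on NR==FNR being true only while the first file is read. Here is a small sketch of that idiom on toy data, followed by a sanity check that compares the number of requested IDs with the number of records kept. The file names ids.bed, all.vcf and subset.txt and the records rsA, rsB and rsC are made up for illustration.

```shell
#!/bin/sh
# Toy inputs in a scratch directory; file names and records are made up.
tmp=$(mktemp -d)
printf 'chr1\t100\t101\trsA\nchr1\t200\t201\trsC\n' > "$tmp/ids.bed"
printf '##fileformat=VCFv4.0\n#CHROM\tPOS\tID\n1\t101\trsA\n1\t150\trsB\n1\t201\trsC\n' > "$tmp/all.vcf"

# First file (NR==FNR): remember column 4, the SNP ID, as an array key.
# Second file: print a line only when its column 3 is one of those keys.
awk 'NR==FNR {h[$4]; next} ($3 in h)' "$tmp/ids.bed" "$tmp/all.vcf" > "$tmp/subset.txt"
cat "$tmp/subset.txt"

# Sanity check: how many distinct IDs were requested, how many records were kept.
n_ids=$(awk '{print $4}' "$tmp/ids.bed" | sort -u | wc -l | tr -d ' ')
n_kept=$(wc -l < "$tmp/subset.txt" | tr -d ' ')
echo "requested: $n_ids, kept: $n_kept"
rm -r "$tmp"
```

If the two numbers differ, some IDs from the list are missing from the vcf file, or some IDs occur in more than one record.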


Or, in case you want to compress the output immediately, pipe it to gzip and write to a .vcf.gz file:

mpjanic@valkyr:~/vcf_subset_selection_script$ grep "^#" <(gzip -dc dbSNP-common_all.vcf.gz) | gzip > CAD_SNP_common_all.vcf.gz

mpjanic@valkyr:~/vcf_subset_selection_script$ awk 'NR==FNR {h[$4]; next} ($3 in h)' CAD_SNP_positions.bed <(gzip -dc dbSNP-common_all.vcf.gz) | gzip >> CAD_SNP_common_all.vcf.gz
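Both variants above decompress the dbSNP file twice, once for the header and once for the records. A possible single-pass alternative lets awk pass through every line that starts with # in addition to the matching records. This is only a sketch on toy data; the file names ids.bed, all.vcf.gz and subset.vcf.gz and the records rsA and rsB are made up.

```shell
#!/bin/sh
# Toy inputs in a scratch directory; file names and records are made up.
tmp=$(mktemp -d)
printf 'chr1\t100\t101\trsA\n' > "$tmp/ids.bed"
printf '##fileformat=VCFv4.0\n#CHROM\tPOS\tID\n1\t101\trsA\n1\t150\trsB\n' | gzip > "$tmp/all.vcf.gz"

# The "-" operand makes awk read the decompressed vcf from standard input
# as its second file. Header lines (^#) and records whose ID is in the
# list are printed; everything else is dropped.
gzip -dc "$tmp/all.vcf.gz" |
    awk 'NR==FNR {h[$4]; next} /^#/ || ($3 in h)' "$tmp/ids.bed" - |
    gzip > "$tmp/subset.vcf.gz"

subset=$(gzip -dc "$tmp/subset.vcf.gz")
printf '%s\n' "$subset"
rm -r "$tmp"
```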





Saturday, November 14, 2015

How to execute multiple unix commands within find -exec command

A previous post showed how to pipe multiple commands within the unix find -exec command.

If you need to execute multiple commands without piping them, you can call -exec multiple times within a single find command.

The following command first invokes a shell to run ls -lh on each file, then runs echo to print the name of the file.

sudo find ./  -iname '*fastq.gz' -exec sh -c 'ls -lh "$1"' sh {} \; -exec echo {} \;

-rw-rw-r-- 1 clint clint 23G Jan 26  2015 ./1410UNHS-0007/2102/141107_H0E37_2102_L002_R1.fastq.gz
./1410UNHS-0007/2102/141107_H0E37_2102_L002_R1.fastq.gz
-rw-rw-r-- 1 clint clint 24G Jan 26  2015 ./1410UNHS-0007/2102/141107_H0E37_2102_L002_R2.fastq.gz
./1410UNHS-0007/2102/141107_H0E37_2102_L002_R2.fastq.gz
-rw-rw-r-- 1 clint clint 30G Jan 26  2015 ./1410UNHS-0007/2102/141125_H22TJ_2102_L002_R2.fastq.gz
./1410UNHS-0007/2102/141125_H22TJ_2102_L002_R2.fastq.gz
-rw-rw-r-- 1 clint clint 28G Jan 26  2015 ./1410UNHS-0007/2102/141125_H22TJ_2102_L002_R1.fastq.gz
./1410UNHS-0007/2102/141125_H22TJ_2102_L002_R1.fastq.gz
-rw-rw-r-- 1 clint clint 40G Jan 26  2015 ./1410UNHS-0007/1448_1/141125_H22TC_1448_1_L007_R1.fastq.gz
./1410UNHS-0007/1448_1/141125_H22TC_1448_1_L007_R1.fastq.gz
-rw-rw-r-- 1 clint clint 41G Jan 26  2015 ./1410UNHS-0007/1448_1/141125_H22TC_1448_1_L007_R2.fastq.gz
./1410UNHS-0007/1448_1/141125_H22TC_1448_1_L007_R2.fastq.gz
-rw-rw-r-- 1 clint clint 36G Jan 26  2015 ./1410UNHS-0007/2228_1/141125_H22TJ_2228_1_L004_R2.fastq.gz
./1410UNHS-0007/2228_1/141125_H22TJ_2228_1_L004_R2.fastq.gz
-rw-rw-r-- 1 clint clint 32G Jan 26  2015 ./1410UNHS-0007/2228_1/141125_H22TJ_2228_1_L004_R1.fastq.gz
./1410UNHS-0007/2228_1/141125_H22TJ_2228_1_L004_R1.fastq.gz
...
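Each -exec also acts as a test: it is true when its command exits with status 0, and find evaluates the chain from left to right, so a later -exec runs only for files where the earlier one succeeded. A small sketch on scratch files (a.txt and b.txt are made-up names) shows this behavior:

```shell
#!/bin/sh
# Scratch directory with two toy files; the names are made up.
tmp=$(mktemp -d)
printf 'hit\n' > "$tmp/a.txt"
printf 'miss\n' > "$tmp/b.txt"

# First -exec: grep -q exits 0 only for files that contain "hit".
# Second -exec: echo runs only for files where the first command succeeded.
matched=$(find "$tmp" -type f -name '*.txt' -exec grep -q hit {} \; -exec echo {} \;)
printf '%s\n' "$matched"
rm -r "$tmp"
```

Only the path of a.txt is printed, because grep fails on b.txt and the second -exec is skipped for it.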

How to pipe multiple unix commands within find -exec command

If you want to e.g. find all fastq.gz files in your current folder and all its subfolders, use the sudo find command:

sudo find ./  -iname '*fastq.gz' -exec ls -l {} \;


This will list all fastq.gz files in the current folder and its subfolders:
-rw-rw-r-- 1 clint clint 24254450268 Jan 26  2015 ./1410UNHS-0007/2102/141107_H0E37_2102_L002_R1.fastq.gz
-rw-rw-r-- 1 clint clint 24760915673 Jan 26  2015 ./1410UNHS-0007/2102/141107_H0E37_2102_L002_R2.fastq.gz
-rw-rw-r-- 1 clint clint 31186684227 Jan 26  2015 ./1410UNHS-0007/2102/141125_H22TJ_2102_L002_R2.fastq.gz
-rw-rw-r-- 1 clint clint 29152539538 Jan 26  2015 ./1410UNHS-0007/2102/141125_H22TJ_2102_L002_R1.fastq.gz
-rw-rw-r-- 1 clint clint 42068280338 Jan 26  2015 ./1410UNHS-0007/1448_1/141125_H22TC_1448_1_L007_R1.fastq.gz
-rw-rw-r-- 1 clint clint 43958530387 Jan 26  2015 ./1410UNHS-0007/1448_1/141125_H22TC_1448_1_L007_R2.fastq.gz
-rw-rw-r-- 1 clint clint 37881056340 Jan 26  2015 ./1410UNHS-0007/2228_1/141125_H22TJ_2228_1_L004_R2.fastq.gz
-rw-rw-r-- 1 clint clint 34110547387 Jan 26  2015 ./1410UNHS-0007/2228_1/141125_H22TJ_2228_1_L004_R1.fastq.gz
...

Now, if you want to perform an operation on these files that involves piping multiple unix commands, invoke a shell with sh -c:

sudo find ./  -iname '*fastq.gz' -exec sh -c 'zcat "$1" | paste - - - - | wc -l' sh {} \;


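This example counts the number of sequences in every fastq.gz file found, with a pipe composed of the zcat, paste and wc commands. Assuming the standard four-line FASTQ layout, paste - - - - joins every four lines into one, so wc -l returns the number of records.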
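The command above prints only the counts, so with many files it is hard to tell which count belongs to which file. The sketch below prints the file name next to each count. It runs on a toy file in a scratch directory (toy.fastq.gz and its two reads are made up), and it uses gzip -dc in place of zcat, which should behave the same for .gz files.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Toy FASTQ with two records of four lines each (made-up reads).
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip > "$tmp/toy.fastq.gz"

# The file name is handed to the inner shell as $1 instead of being pasted
# into the command string, so names with spaces or quotes stay intact.
counts=$(find "$tmp" -iname '*fastq.gz' -exec sh -c '
    n=$(gzip -dc "$1" | paste - - - - | wc -l | tr -d " ")
    printf "%s\t%s\n" "$1" "$n"
' sh {} \;)
printf '%s\n' "$counts"
rm -r "$tmp"
```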