Thursday, May 26, 2016

Looping samtools for multiple sam and converting sam>bam through find/xargs

It might be the case that you have lots of sam files in multiple subfolders and you want to issue one command to convert them to bam files. This might be particularly useful when sam files have the same name and you want to keep converted bam files in the subfolders. To convert sam to bam you can use these three commands that include find piped to xargs:

find . -name *sam | xargs -I % sh -c 'echo %; samtools view -bS % > %.bam;'
find . -name *sam.bam | xargs -I % sh -c 'echo %; samtools sort % %.sorted;'
find . -name *sam.bam.sorted | xargs -I % sh -c 'echo %; samtools index %;'

Here is the output of the second command where you can see that find command will use files in subfolders N deep named /Pass2 generated by STAR and pipe it to xargs.

find . -name *sam.bam | xargs -I % sh -c 'echo %; samtools sort % %.sorted;'
./mS10_1.fastq.gz_mS10_2.fastq.gz/Pass2/Aligned.out.sam.bam
[bam_sort_core] merging from 80 files...
./FBS2_S4_merged_R1_001.fastq.gz_FBS2_S4_merged_R2_001.fastq.gz/Pass2/Aligned.out.sam.bam
[bam_sort_core] merging from 101 files...
./FBS4_S5_merged_R1_001.fastq.gz_FBS4_S5_merged_R2_001.fastq.gz/Pass2/Aligned.out.sam.bam
[bam_sort_core] merging from 54 files...
./FBS6_S6_merged_R1_001.fastq.gz_FBS6_S6_merged_R2_001.fastq.gz/Pass2/Aligned.out.sam.bam
[bam_sort_core] merging from 52 files...
./SF4_S1_merged_R1_001.fastq.gz_SF4_S1_merged_R2_001.fastq.gz/Pass2/Aligned.out.sam.bam
[bam_sort_core] merging from 93 files...

Thursday, May 12, 2016

How to remove transcript version from GENCODE/ENSEMBL annotations

If you have downloaded GENCODE/ENSEMBL annotations and used it for example with featurecounts or HTSeq to create reads counts and now you need to convert back ENSEMBL IDs to gene symbols you can use my script ensemble2genename that will pull annotations from Biomart:

https://github.com/milospjanic/ensemble2genename

However to use this script on GENCODE annotations you will have to remove transcript version eg. ENSG00000223972.4 has to be converted to ENSG00000223972. To do this use this sed oneliner:


sed 's/\.[0-9]*//g' file.txt

Thursday, May 5, 2016

Merging multiple files using first column as hash keys with awk - part 3

Modification of the script from the previous post, where rows with missing values are not excluded but rather missing values are set to 0.

NR==FNR { h1[$1] = $1; h2[$1] = $2; next; }
NR!=FNR { l=FNR-1;}
NF{ if($1 in h1) h2[$1] = h2[$1] OFS $2;}
NF{ if($1 in h1 && $2=="") h2[$1] = h2[$1] OFS 0}
END { for(k in h2)
  if(split(h2[k], a) > ARGC-2)
    print k OFS h2[k]
}
Example, 4th row in 4.test is missing a value in col2.

mail 5
now 7
tomorrow 7
string 5
do 6
DN52ei1f:~ milospjanic$ cat 2.test 
mail 4
now 4
tomorrow 4
string 4
do 4
DN52ei1f:~ milospjanic$ cat 3.test 
mail 6
now 6
tomorrow 7
string 5
do 67
DN52ei1f:~ milospjanic$ cat 4.test 
mail 89
now 75
tomorrow 75
string 
do 555
DN52ei1f:~ milospjanic$ awk -f script.awk 1.test 2.test 3.test 4.test 
tomorrow 7 4 7 75
do 6 4 67 555
now 7 4 6 75
string 5 4 5  0
mail 5 4 6 89

Wednesday, May 4, 2016

Merging multiple files using first column as hash keys with awk - part 2


Continuing on the previous post, if one of the files has a missing value(s) even though they contain the correct element in first column, this may cause the output table to be in wrong order due to skipping of the values. Use this code to eliminate rows that do not contain values across all tested files:

NR==FNR { h1[$1] = $1; h2[$1] = $2; next }
NR!=FNR { l=FNR-1 }
NF{ if($1 in h1) h2[$1] = h2[$1] OFS $2;}
END { for(k in h2)
  if(split(h2[k], a) > ARGC-2)
    print k OFS h2[k]
}
Output without string due to missing value in 4.test:

DN52ei1m:~ milospjanic$ cat 1.test 
mail 5
now 7
tomorrow 7
string 5
do 6
DN52ei1m:~ milospjanic$ cat 2.test 
mail 4
now 4
tomorrow 4
string 4
do 4
DN52ei1m:~ milospjanic$ cat 3.test 
mail 6
now 6
tomorrow 7
string 5
do 67
DN52ei1m:~ milospjanic$ cat 4.test 
mail 89
now 75
tomorrow 75
string 
do 555
DN52ei1m:~ milospjanic$ awk -f script.awk 1.test 2.test 3.test 4.test 
tomorrow 7 4 7 75
do 6 4 67 555
now 7 4 6 75
mail 5 4 6 89

Merging multiple files using first column as hash keys with awk

Lets say you have multiple files with the same first column and you want to merge them. Use this awk code, note that this makes sense only if the elements of the first column are the same (order is not important) as only in that case the output table order is preserved.


NR==FNR { h1[$1] = $1; h2[$1] = $2; next }
NF{ if($1 in h1) h2[$1] = h2[$1] OFS $2;}
END { for(k in h2)
  if(split(h2[k], a) > 1)
    print k OFS h2[k]
}
Save as e.g. script.awk. Here is an example of merging 4 files into one composite:

DN52ei1m:~ milospjanic$ cat 1.test 
mail 5
now 7
tomorrow 7
string 5
do 6
DN52ei1m:~ milospjanic$ cat 2.test 
mail 4
now 4
tomorrow 4
string 4
do 4
DN52ei1m:~ milospjanic$ cat 3.test 
mail 6
now 6
tomorrow 7
string 5
do 67
DN52ei1m:~ milospjanic$ cat 4.test 
mail 89
now 75
tomorrow 75
string 5
do 555
DN52ei1m:~ milospjanic$ awk -f script.awk 1.test 2.test 3.test 4.test 
tomorrow 7 4 7 75
do 6 4 67 555
now 7 4 6 75
string 5 4 5 5
mail 5 4 6 89