Sunday, November 13, 2016

Introducing rnaSeqFPro, a workflow for RNA-Seq differential gene expression analysis

rnaSeqFPro is a script that will do full processing of paired RNA-Seq data starting from fastq.gz files placed in the same folder. Script will sort files and process paired .fastq.gz files. rnaSeqFPro will perform Fastqc quality control, it will map paired fastq files to the reference genome hg19 using STAR's second pass mapping.


Dependencies

Place fastqc.gz in a working folder

mkdir work.folder
cp path-to-files/*fastq.gz work.folder
FastQC

Instalation (Linux):

cd work.folder
wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip
unzip fastqc_v0.11.5.zip
chmod 755 ./FastQC/fastqc
cp ./FastQC/fastqc /usr/local/bin #may not work run it locally via link:
ln -s ./FastQC/fastqc .
STAR

Instalation (Linux):

# Get the latest STAR source
wget https://github.com/alexdobin/STAR/archive/2.5.2b.tar.gz
tar -xzf 2.5.2b.tar.gz
cd STAR-2.5.2b

# Build STAR
make STAR

# If you have a TeX environment, build the documentation
make manual

chmod 755 STAR
cp STAR /usr/local/bin
Reference genome

Download the reference genome, in this example it is human hg19:

mkdir ~/reference_genomes
cd ~/reference_genomes
mkdir hg19
cd hg19
wget --timestamping 
        'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit ' 
        -O hg19.2bit 
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa
chmod 755 twoBitToFa
./twoBitToFa hg19.2bit hg19.fa
Indexing the reference genome

Use STAR to index the reference genome, use number of core on your machine, e.g. 64.

cd ~/reference_genomes
STAR  --runMode genomeGenerate --runThreadN 64 --genomeDir ./ --genomeFastaFiles hg19.fa
Download GENCODE transcript annotation

For example for human hg19 genome:

wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/GRCh37_mapping/gencode.v25lift37.annotation.gtf.gz
Install featureCounts

Install featureCounts. Downoad Subread binary from Sourceforge.

wget https://sourceforge.net/projects/subread/files/subread-1.5.1/subread-1.5.1-Linux-x86_64.tar.gz/download
Download fileMulti2TableMod1.

wget https://raw.githubusercontent.com/milospjanic/fileMulti2TableMod1/master/fileMulti2TableMod1.awk
In R install RGSEPD from Bioconductor.

source("https://bioconductor.org/biocLite.R")
biocLite("rgsepd")
In R install DESeq2.

source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
In R install goseq

source("https://bioconductor.org/biocLite.R")
biocLite("goseq")
Download Kallisto binary.

wget https://github.com/pachterlab/kallisto/releases/download/v0.43.0/kallisto_linux-v0.43.0.tar.gz
Creating meta data table is necessary for RGSEPD to perform analysis using DESeq2/goseq. Table 1. shows an example of a meta data sheet. Note that sample names must be shown without .fastq.gz extension.

Sample  Condition       SHORTNAME
file.name.1 A CONDITION1
file.name.2 A CONDITION1
file.name.3 B CONDITION2
file.name.3 B CONDITION2
rnaSeqFPro is composed of four pipelines that will run a RGSEPD version on either human genome hg19 or mouse genome mm10, using either paired-end (PE) or single-read (SR) sequences.

rnaSeqFPro.PE.hg19.sh
rnaSeqFPro.PE.mm10.sh
rnaSeqFPro.SR.hg19.sh
rnaSeqFPro.SR.mm10.sh
Four additional pipelines are available to run a Kallisto version: PE hg19, SR hg19, PE mm10, and SR mm10.

rnaSeqFPro.PE.hg19.Kallisto.sh
rnaSeqFPro.PE.mm10.Kallisto.sh
rnaSeqFPro.SR.hg19.Kallisto.sh
rnaSeqFPro.SR.mm10.Kallisto.sh
After placing files in the working folder run the script that is suitable for your experiment, e.g:

chmod 755 rnaSeqFPro.PE.hg19.sh
./rnaSeqFPro.PE.hg19.sh
Download from GitHub:
https://github.com/milospjanic/rnaSeqFPro

Friday, November 4, 2016

Parallelize DESeq2 processing of RNA-Seq data with xargs. Code to perform multiple comparison of DE gene overlaps.

This is a code to automate the downstream comparisons of DE gene lists obtained using DESeq2. In this example, lists are located in various subfolders of the current folder.

Follow the code to calculate the number of DE genes in each subfolder, cut the first column with ReFSeq ID, save as separate file, remove quotations, use nested for loops to iterate through every combination of DE list pairs, compare each pair of DE gene lists using awk hash tables, save the overlap, calculate percentage of overlapping genes using the initial number of genes in each DE list in a pair.


#DESeq2 output file with ReFSeq IDs. In this example output files are located in multiple subfolders of the current folder
head ./CTR-TCF/DESEQ.RES.Ax1.Ex1.Annote_Filter.csv
"REFSEQ","baseMean","Ax1","Ex1","(X/Y)","LOG2(X/Y)","lfcSE","PVAL","PADJ","HGNC","ENTREZ"
"NM_001004366",52.8590278877529,6.79294520135589,1.80248640318681,45.9629650035973,5.52239996201874,1.4097928326663,8.95941020937956e-05,0.0164558142882275,"NM_001004366","NM_001004366"
"NM_001007223",41.7568221621828,8.16964446568062,1.80248640318681,118.282063619244,6.88608750868106,1.5958035927171,1.5951463982438e-05,0.0053387777564333,"NM_001007223","NM_001007223"
"NM_001008231",1696.15244217266,9.34709439325593,6.55127081394245,6.68546502082769,2.74102791333421,0.721750109768267,0.000146014310245597,0.0224400155776422,"NM_001008231","NM_001008231"
"NM_001025388",1817.64545598915,5.50851247905198,10.3459116079456,0.0362363048504379,-4.78642034382073,0.730277856835578,5.5923907091726e-11,1.16930170591509e-07,"NM_001025388","NM_001025388"
"NM_001033331",89.0716650432561,7.88692209047355,3.62921761184198,20.6510055475495,4.36814012656324,1.19391005598866,0.000253513276060988,0.0310419792744272,"NM_001033331","NM_001033331"
"NM_001037809",81.4745806890257,6.51501911066681,2.76040726966177,26.5682533972088,4.73163148274551,1.14953225287611,3.85276448555085e-05,0.0096669604860005,"NM_001037809","NM_001037809"
"NM_001037859",650.913051455931,11.2354640663983,7.20231733786188,13.4780131829126,3.75253593692515,0.949172065524538,7.7021632800311e-05,0.0152634580474406,"NM_001037859","NM_001037859"
"NM_001037865",39.3802742679229,7.4021164714241,1.80248640318681,60.2392280286425,5.9126313767434,1.51175340371297,9.18735384352608e-05,0.0166711730406441,"NM_001037865","NM_001037865"
"NM_001044751",23759.7730889387,10.6451432030245,14.122465637817,0.107058595986039,-3.22352745714107,0.869196041818117,0.000208382776310629,0.0275302894211788,"NM_001044751","NM_001044751"

#Calculate number of DE genes in each subfolder with xargs
find . -name "*Annote_Filter.csv" | xargs -I % sh -c 'wc -l %;'
1869 ./TCF-KDN/DESEQ.RES.Cx2.Ex1.Annote_Filter.csv
2926 ./HRT-LVR/DESEQ.RES.Bx2.Dx2.Annote_Filter.csv
157 ./CTR-TCF/DESEQ.RES.Ax1.Ex1.Annote_Filter.csv
384 ./TCF-HRT/DESEQ.RES.Bx2.Ex1.Annote_Filter.csv
2526 ./CTR-LVR/DESEQ.RES.Ax1.Dx2.Annote_Filter.csv
2515 ./HRT-KDN/DESEQ.RES.Bx2.Cx2.Annote_Filter.csv
1826 ./KDN-LVR/DESEQ.RES.Cx2.Dx2.Annote_Filter.csv
2053 ./TCF-LVR/DESEQ.RES.Dx2.Ex1.Annote_Filter.csv
175 ./CTR-HRT/DESEQ.RES.Ax1.Bx2.Annote_Filter.csv
2170 ./CTR-KDN/DESEQ.RES.Ax1.Cx2.Annote_Filter.csv

#Cut the first column with RefSeq ID in each DE gene list
find . -name "*Annote_Filter.csv" | xargs -I % sh -c 'cut -f1 -d "," % | head;'
"REFSEQ"
"NM_001001180"
"NM_001001444"
"NM_001001488"
"NM_001001489"
"NM_001001490"
"NM_001001979"
"NM_001001986"
"NM_001004141"
"NM_001004143"
"REFSEQ"
"NM_001001179"
"NM_001001309"
"NM_001001322"
"NM_001001892"
"NM_001001979"
"NM_001002012"
"NM_001002927"
"NM_001003719"
"NM_001003893"
"REFSEQ"
"NM_001004366"
"NM_001007223"
"NM_001008231"
"NM_001025388"
"NM_001033331"
"NM_001037809"
"NM_001037859"
"NM_001037865"
"NM_001044751"

...

#Cut first column on each DE list, save as separate file, filter quotations
find . -name "*Annote_Filter.csv" | xargs -I % sh -c 'cut -f1 -d "," % > %.cut;'

find . -name "*Annote_Filter.csv.cut" | xargs -I % sed -i 's/"//g' %

find . -name "*Annote_Filter.csv.cut" | xargs -I % sh -c 'head %;'
REFSEQ
NM_001001180
NM_001001444
NM_001001488
NM_001001489
NM_001001490
NM_001001979
NM_001001986
NM_001004141
NM_001004143
REFSEQ
NM_001001179
NM_001001309
NM_001001322
NM_001001892
NM_001001979
NM_001002012
NM_001002927
NM_001003719
NM_001003893
REFSEQ
NM_001004366
NM_001007223
NM_001008231
NM_001025388
NM_001033331
NM_001037809
NM_001037859
NM_001037865
NM_001044751

...

#List .cut files and save as file
find . -name "*Annote_Filter.csv.cut" > list

#Remove end of lines and substitute with space
tr '\n' ' ' < list > list2

mv list2 list

#Create two for loops and list through all combination of DE list pairs
#Compare each pair with awk, create hash tag from first file and use it to search second file
#Save comparison as .intersect file
#Count the number of line in each .intersect file

for i in $(cat list)
do for j in $(cat list)
do
awk 'NR==FNR {h[$1] = $1; next} {if (h[$1]) print $0}' $i $j > $(basename $(dirname $i)).$(basename $(dirname $j)).intersect
echo Total number of overlaps $(basename $(dirname $i)) - $(basename $(dirname $j)) $(wc -l $(basename $(dirname $i)).$(basename $(dirname $j)).intersect | cut -d " " -f1)
done
done

Total number of overlaps TCF-KDN - TCF-KDN 1869
Total number of overlaps TCF-KDN - HRT-LVR 933
Total number of overlaps TCF-KDN - CTR-TCF 71
Total number of overlaps TCF-KDN - TCF-HRT 187
Total number of overlaps TCF-KDN - CTR-LVR 887
Total number of overlaps TCF-KDN - HRT-KDN 1413
Total number of overlaps TCF-KDN - KDN-LVR 805
Total number of overlaps TCF-KDN - TCF-LVR 1077
Total number of overlaps TCF-KDN - CTR-HRT 84
Total number of overlaps TCF-KDN - CTR-KDN 1247
Total number of overlaps HRT-LVR - TCF-KDN 933
Total number of overlaps HRT-LVR - HRT-LVR 2926
Total number of overlaps HRT-LVR - CTR-TCF 76
Total number of overlaps HRT-LVR - TCF-HRT 221
Total number of overlaps HRT-LVR - CTR-LVR 1878
Total number of overlaps HRT-LVR - HRT-KDN 1411
Total number of overlaps HRT-LVR - KDN-LVR 1026
Total number of overlaps HRT-LVR - TCF-LVR 1589

...

#Create two nested for loops, iterate through all pairs of DE lists
#Calculate percent of overlapping genes using initial number of genes in each DE list in a pair

for i in $(cat list)
do for j in $(cat list)
do
echo Percent of overlaps $(basename $(dirname $i)) - $(basename $(dirname $j)) $(wc -l $(basename $(dirname $i)).$(basename $(dirname $j)).intersect | cut -d " " -f1)/$(wc -l $i | cut -d " " -f1) $(wc -l $(basename $(dirname $i)).$(basename $(dirname $j)).intersect | cut -d " " -f1)/$(wc -l $j |  cut -d " " -f1);
echo $(basename $(dirname $i))
echo "scale=4;$(wc -l $(basename $(dirname $i)).$(basename $(dirname $j)).intersect | cut -d " " -f1)/$(wc -l $i | cut -d " " -f1)" | bc
echo $(basename $(dirname $j))
echo "scale=4;$(wc -l $(basename $(dirname $i)).$(basename $(dirname $j)).intersect | cut -d " " -f1)/$(wc -l $j | cut -d " " -f1)" | bc
done
done

Percent of overlaps TCF-KDN - TCF-KDN 1869/1869 1869/1869
TCF-KDN
1.0000
TCF-KDN
1.0000
Percent of overlaps TCF-KDN - HRT-LVR 933/1869 933/2926
TCF-KDN
.4991
HRT-LVR
.3188
Percent of overlaps TCF-KDN - CTR-TCF 71/1869 71/157
TCF-KDN
.0379
CTR-TCF
.4522
Percent of overlaps TCF-KDN - TCF-HRT 187/1869 187/384
TCF-KDN
.1000
TCF-HRT
.4869
Percent of overlaps TCF-KDN - CTR-LVR 887/1869 887/2526
TCF-KDN
.4745
CTR-LVR
.3511
Percent of overlaps TCF-KDN - HRT-KDN 1413/1869 1413/2515
TCF-KDN
.7560
HRT-KDN
.5618

...

#Save output as a file

for i in $(cat list)
do for j in $(cat list)
do
echo Percent of overlaps $(basename $(dirname $i)) - $(basename $(dirname $j)) $(wc -l $(basename $(dirname $i)).$(basename $(dirname $j)).intersect | cut -d " " -f1)/$(wc -l $i | cut -d " " -f1) $(wc -l $(basename $(dirname $i)).$(basename $(dirname $j)).intersect | cut -d " " -f1)/$(wc -l $j |  cut -d " " -f1) >> output.txt 
echo $(basename $(dirname $i)) >> output.txt
echo "scale=4;$(wc -l $(basename $(dirname $i)).$(basename $(dirname $j)).intersect | cut -d " " -f1)/$(wc -l $i | cut -d " " -f1)" | bc >> output.txt
echo $(basename $(dirname $j)) >> output.txt
echo "scale=4;$(wc -l $(basename $(dirname $i)).$(basename $(dirname $j)).intersect | cut -d " " -f1)/$(wc -l $j | cut -d " " -f1)" | bc >> output.txt
done
done


Tuesday, November 1, 2016

Reassigning GENCODE counted reads on RefSeq identifiers. Solution to removing duplicated rows on ID column and selecting by summing on counts using awk scripting

In case you have used Gencode transcripts with featureCounts and have counted mapped reads on Gencode transcripts, in some cases you need to revert back to RefSeq ID (NM_  and NR_). The problem that appears in this case is that multiple Gencode IDs will correspond to a single RefSeq identifier, e.g. in this example NC000078 and NC000080.

head mastertable.genename
        
NC_000072       2       0       0       3       0       4       0       0
NC_000078       0       0       0       0       0       0       3       1
NC_000078       12      0       0       0       0       0       6       1
NC_000079       0       0       0       0       4       0       2       0
NC_000080       0       0       0       0       0       0       0       0
NC_000080       0       0       0       0       0       0       0       0
NC_000080       2       0       0       2       0       0       0       0
NM_001001130    113     149     119     154     149     149     138     199
NM_001001144    1565    855     947     1501    961     1295    874     1449
What you need to do is to select a most dominant RNA isoform and assign a RefSeq ID to it, and eliminate those isoforms that are not expressed and have mainly 0 counts. To do so first calculate sum for each row, and plot it as first column.

awk '{for(i=1;i<=NF;i++) t+=$i; print t"\t"$0; t=0}' mastertable.genename | head -n10
               
9       NC_000072       2       0       0       3       0       4       0       0
4       NC_000078       0       0       0       0       0       0       3       1
19      NC_000078       12      0       0       0       0       0       6       1
6       NC_000079       0       0       0       0       4       0       2       0
0       NC_000080       0       0       0       0       0       0       0       0
0       NC_000080       0       0       0       0       0       0       0       0
4       NC_000080       2       0       0       2       0       0       0       0
1170    NM_001001130    113     149     119     154     149     149     138     199
9447    NM_001001144    1565    855     947     1501    961     1295    874     1449
Next, sort using column 2 by RefSeq ID, and then by column 1, use reverse numerical sort from the highest to the lowest sum.

awk '{for(i=1;i<=NF;i++) t+=$i; print t"\t"$0; t=0}' mastertable.genename | sort -k2,2 -k1,1nr | head -n10

9       NC_000072       2       0       0       3       0       4       0       0
19      NC_000078       12      0       0       0       0       0       6       1
4       NC_000078       0       0       0       0       0       0       3       1
6       NC_000079       0       0       0       0       4       0       2       0
4       NC_000080       2       0       0       2       0       0       0       0
0       NC_000080       0       0       0       0       0       0       0       0
0       NC_000080       0       0       0       0       0       0       0       0
1170    NM_001001130    113     149     119     154     149     149     138     199
9447    NM_001001144    1565    855     947     1501    961     1295    874     1449
Finally, use awk to eliminate rows duplicated on one column, use awk '!a[$2]++' as a short version of awk '{if(! a[$2]){print; a[$2]++}}' meaning if the current second field ($1) is not present in the a array, which will happen if it is a unique field, print the line and add the 2st field to the a array. Next time we see the same field, it will be already in the array and it will not be printed due to the condition if(!a[$2]).


awk '{for(i=1;i<=NF;i++) t+=$i; print t"\t"$0; t=0}' mastertable.genename | sort -k2,2 -k1,1nr | awk '!a[$2]++' | head -n10

9       NC_000072       2       0       0       3       0       4       0       0
19      NC_000078       12      0       0       0       0       0       6       1
6       NC_000079       0       0       0       0       4       0       2       0
4       NC_000080       2       0       0       2       0       0       0       0
1170    NM_001001130    113     149     119     154     149     149     138     199
9447    NM_001001144    1565    855     947     1501    961     1295    874     1449
211     NM_001001152    46      18      56      56      7       12      3       13
19      NM_001001160    2       6       2       0       2       0       3       4
60      NM_001001177    9       0       4       10      10      2       23      2
Finally remove the first column:

awk '{for(i=1;i<=NF;i++) t+=$i; print t"\t"$0; t=0}' mastertable.genename | sort -k2,2 -k1,1nr | awk '!a[$2]++' | cut -f2-| head -n10
        
NC_000072       2       0       0       3       0       4       0       0
NC_000078       12      0       0       0       0       0       6       1
NC_000079       0       0       0       0       4       0       2       0
NC_000080       2       0       0       2       0       0       0       0
NM_001001130    113     149     119     154     149     149     138     199
NM_001001144    1565    855     947     1501    961     1295    874     1449
NM_001001152    46      18      56      56      7       12      3       13
NM_001001160    2       6       2       0       2       0       3       4
NM_001001177    9       0       4       10      10      2       23      2
We see that only NC_000080 and NC_000078 isoforms that show highest sum expression across samples were kept.

Final code:

awk '{for(i=1;i<=NF;i++) t+=$i; print t"\t"$0; t=0}' mastertable.genename | 
sort -k2,2 -k1,1nr | 
awk '!a[$2]++' | 
cut -f2- > mastertable.genename.cleaned