Tuesday, August 30, 2016

Example of cleaning GENCODE table with sed

This is one example how to clean your master table for RNA-Seq created using GENCODE transcript collection.
Table will contain first row with multiple gene IDs separated by |. If you need to keep only the transcript ID use the following approach with sed :


root@valkyr:/home/towerraid/mo/QC_Z259.B925_Genesips_Cundiff_Ins_071615.SE.RNASeqPolyA.RAPiD.Human# head mastertable.roundup.3
X071_2_E7_24h_TAG X071_2_E7_72h_TAG X071_2_E8_24h_TAG X071_2_E8_72h_TAG X334_1_E7_24h_TAG X334_1_E7_72h_TAG X334_1_E8_24h_TAG X334_1_E8_72h_TAG X756_3_E7_24h_TAG X756_3_E7_72h_TAG X756_3_E8_24h_TAG X756_3_E8_72h_TAG X835_1_E7_24h_TAG X835_1_E7_72h_TAG X835_1_E8_24h_TAG X835_1_E8_72h_TAG H1_E7_24h_TAG H1_E7_72h_TAG H1_E8_24h_TAG H1_E8_72h_TAG H7_E7_24h_TAG H7_E7_72h_TAG H7_E8_24h_TAG H7_E8_72h_TAG
ENST00000466638.5|ENSG00000130822.15|OTTHUMG00000024216.6|OTTHUMT00000337661.1|PNCK-021|PNCK|1204|retained_intron| 5 5 0 10 8 4 26 83 7 11 6 64 4 015 47 0 0 4 6 7 4 0 25
ENST00000499522.6|ENSG00000247081.7|OTTHUMG00000164798.4|OTTHUMT00000380348.1|BAALC-AS1-004|BAALC-AS1|1353|antisense| 4 6 5 10 15 12 12 9 7 4 13 12 14 810 11 18 25 9 6 8 10 2 9
ENST00000451029.5|ENSG00000105792.19|OTTHUMG00000023434.8|OTTHUMT00000139892.3|CFAP69-002|CFAP69|2614|nonsense_mediated_decay| 0 0 0 0 0 5 0 0 0 0 5 0 00
ENST00000616193.1|ENSG00000275771.1|OTTHUMG00000187876.1|OTTHUMT00000475652.1|AC140725.8-001|AC140725.8|880|processed_pseudogene| 0 0 0 0 0 0 0 1 0 0 0 01
ENST00000498318.1|ENSG00000204116.11|OTTHUMG00000021835.5|OTTHUMT00000057234.2|CHIC1-002|CHIC1|2800|nonsense_mediated_decay| 0 0 0 0 0 0 19 6 0 0 0 0 036 61 0 0 0 1 0 13 9 0


root@valkyr:/home/towerraid/mo/QC_Z259.B925_Genesips_Cundiff_Ins_071615.SE.RNASeqPolyA.RAPiD.Human# sed -e 's/|.*|.*|.*|.*|.*|//g' mastertable.roundup.3 | head
X071_2_E7_24h_TAG X071_2_E7_72h_TAG X071_2_E8_24h_TAG X071_2_E8_72h_TAG X334_1_E7_24h_TAG X334_1_E7_72h_TAG X334_1_E8_24h_TAG X334_1_E8_72h_TAG X756_3_E7_24h_TAG X756_3_E7_72h_TAG X756_3_E8_24h_TAG X756_3_E8_72h_TAG X835_1_E7_24h_TAG X835_1_E7_72h_TAG X835_1_E8_24h_TAG X835_1_E8_72h_TAG H1_E7_24h_TAG H1_E7_72h_TAG H1_E8_24h_TAG H1_E8_72h_TAG H7_E7_24h_TAG H7_E7_72h_TAG H7_E8_24h_TAG H7_E8_72h_TAG
ENST00000466638.5 5 5 0 10 8 4 26 83 7 11 6 64 4 0 15 47 0 0 4 6 7 4 0 25
ENST00000499522.6 4 6 5 10 15 12 12 9 7 4 13 12 14 8 10 11 18 25 9 6 8 10 2 9
ENST00000451029.5 0 0 0 0 0 5 0 0 0 0 5 0 0 0 3 0 0 0 0 0 0 0 3 0
ENST00000616193.1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 1
ENST00000498318.1 0 0 0 0 0 0 19 6 0 0 0 0 0 0 36 61 0 0 0 1 0 13 9 0


Next, clear the transcript verson number also using sed :

root@valkyr:/home/towerraid/mo/QC_Z259.B925_Genesips_Cundiff_Ins_071615.SE.RNASeqPolyA.RAPiD.Human# sed -e 's/\.[0-9]*//g' mastertable.roundup.4 | head 
X071_2_E7_24h_TAG X071_2_E7_72h_TAG X071_2_E8_24h_TAG X071_2_E8_72h_TAG X334_1_E7_24h_TAG X334_1_E7_72h_TAG X334_1_E8_24h_TAG X334_1_E8_72h_TAG X756_3_E7_24h_TAG X756_3_E7_72h_TAG X756_3_E8_24h_TAG X756_3_E8_72h_TAG X835_1_E7_24h_TAG X835_1_E7_72h_TAG X835_1_E8_24h_TAG X835_1_E8_72h_TAG H1_E7_24h_TAG H1_E7_72h_TAG H1_E8_24h_TAG H1_E8_72h_TAG H7_E7_24h_TAG H7_E7_72h_TAG H7_E8_24h_TAG H7_E8_72h_TAG
ENST00000466638 5 5 0 10 8 4 26 83 7 11 6 64 4 0 15 47 0 0 4 6 7 4 0 25
ENST00000499522 4 6 5 10 15 12 12 9 7 4 13 12 14 8 10 11 18 25 9 6 8 10 2 9
ENST00000451029 0 0 0 0 0 5 0 0 0 0 5 0 0 0 3 0 0 0 0 0 0 0 3 0
ENST00000616193 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 1
ENST00000498318 0 0 0 0 0 0 19 6 0 0 0 0 0 0 36 61 0 0 0 1 0 13 9 0

Monday, August 29, 2016

Useful script for preprocessing FeatureCounts to DESeq

This is a useful script if you do RNAseq for multiple samples/conditions and have featureCounts (or HTSeq, Kallisto, etc.) output files located in different subfolders with only the subfolder name distinguishing samples/conditions.

Script will go into subfolders, copy the output files, give them the correct sample names, and create a master table for DESeq or EdgeR using the subfolder names and column you want to extract from the files.

#! /bin/bash

# Find all files with extension $1 in subfolders and copy them into the current folder with changing the name of the file to their last subfolder plus .file ext

find -name *$1 | xargs -I % sh -c ' fileName=$(basename %); \
filePath=$(dirname %); \
lastDir=$(basename $filePath); \
echo copying $fileName to $(pwd)/$lastDir.file; \
cp % $lastDir.file \
;'

# Find all files with .file extension and cut the first and $2 column, save it as .cut file

find -name '*.file' | xargs -I % sh -c 'cut -f 1,'$2' %  > %.cut1;'

# remove header
find -name '*.file.cut1' | xargs -I % sh -c 'tail -n+2 % > %.cut2;'

# download fileMulti2TableMod1.awk

wget https://raw.githubusercontent.com/milospjanic/fileMulti2TableMod1/master/fileMulti2TableMod1.awk

# Find .file.cut files and call fileMulti2TableMod1.awk script to create master table

filescut=$(ls *.file.cut1.cut2) 
awk -f fileMulti2TableMod1.awk $(echo $filescut)> mastertable

# add header to mastertable

files=$(ls *.file) 
echo ${files} | sed 's/.file//g' > header
awk '{$1=" "$1}1' header > header2
cat header2 mastertable > mastertable.2

# remove .cut files
rm *.file.cut1
rm *.file.cut1.cut2

# remove fileMulti2TableMod1.awk
rm fileMulti2TableMod1.awk

#remove header
rm header
rm header2

mv mastertable.2 mastertable

Download and usage on GitHub

Thursday, August 25, 2016

Extract chromosome specific ChIA-PET interaction from joint file

If you have ChIA-PET output file (in this example downloaded from UCSC Genome Browser) with inter- and intra- chromosomal interactions:

mpjanic@valkyr:~/EMES/ChIA-PET_bed_files$ cut -f4 wgEncodeGisChiaPetK562CtcfInteractionsRep1.bed | head
chr1:121484520..121485419-chr10:42530561..42531081,2
chr1:121484520..121485419-chr10:42530561..42531081,2
chr1:214066326..214066826-chr10:72518556..72519163,2
chr1:214066326..214066826-chr10:72518556..72519163,2
chr1:31180640..31181156-chr11:66049646..66050541,2
chr1:31180640..31181156-chr11:66049646..66050541,2
chr1:68651784..68652284-chr11:1100945..1101530,2
chr1:68651784..68652284-chr11:1100945..1101530,2
chr1:121484496..121485051-chr11:48865024..48865528,2
chr1:121484496..121485051-chr11:48865024..48865528,2

And want to extract only intra- chromosomal interactions use the code:

mpjanic@valkyr:~/EMES/ChIA-PET_bed_files$ cut -f4 wgEncodeGisChiaPetK562CtcfInteractionsRep1.bed | grep chr1.*chr1: | head
chr1:839717..840790-chr1:872838..874070,6
chr1:839813..840540-chr1:998738..999431,2
chr1:839841..840733-chr1:855600..856780,3
chr1:839974..840594-chr1:847929..848569,2
chr1:913753..914320-chr1:1199829..1200629,2
chr1:919077..920172-chr1:998023..999932,8
chr1:919212..920171-chr1:1219377..1219906,2
chr1:919215..919720-chr1:1027180..1027941,2
chr1:919216..919764-chr1:936906..937428,2

Loop for all chromosomes:

mpjanic@valkyr:~/EMES/ChIA-PET_bed_files$ cat commands 
for i in {1..22} X Y;
do
echo chr$i specific interaction
cut -f4 wgEncodeGisChiaPetK562CtcfInteractionsRep1.bed | grep chr$i.*chr$i: | wc -l
done

Extract chromosome specific ChIA-PET interaction from joint file

If you have ChIA-PET output file with inter- and intra- chromosomal interactions:

mpjanic@valkyr:~/EMES/ChIA-PET_bed_files$ cut -f4 wgEncodeGisChiaPetK562CtcfInteractionsRep1.bed | head
chr1:121484520..121485419-chr10:42530561..42531081,2
chr1:121484520..121485419-chr10:42530561..42531081,2
chr1:214066326..214066826-chr10:72518556..72519163,2
chr1:214066326..214066826-chr10:72518556..72519163,2
chr1:31180640..31181156-chr11:66049646..66050541,2
chr1:31180640..31181156-chr11:66049646..66050541,2
chr1:68651784..68652284-chr11:1100945..1101530,2
chr1:68651784..68652284-chr11:1100945..1101530,2
chr1:121484496..121485051-chr11:48865024..48865528,2
chr1:121484496..121485051-chr11:48865024..48865528,2

And want to extract only intra- use the code:

mpjanic@valkyr:~/EMES/ChIA-PET_bed_files$ cut -f4 wgEncodeGisChiaPetK562CtcfInteractionsRep1.bed | grep chr1.*chr1: | head
chr1:839717..840790-chr1:872838..874070,6
chr1:839813..840540-chr1:998738..999431,2
chr1:839841..840733-chr1:855600..856780,3
chr1:839974..840594-chr1:847929..848569,2
chr1:913753..914320-chr1:1199829..1200629,2
chr1:919077..920172-chr1:998023..999932,8
chr1:919212..920171-chr1:1219377..1219906,2
chr1:919215..919720-chr1:1027180..1027941,2
chr1:919216..919764-chr1:936906..937428,2
chr1:919228..920215-chr1:967920..968557,2

Loop for all chromosomes:

mpjanic@valkyr:~/EMES/ChIA-PET_bed_files$ cat commands 
for i in {1..22} X Y;
do
echo chr$i specific interaction
cut -f4 wgEncodeGisChiaPetK562CtcfInteractionsRep1.bed | grep chr$i.*chr$i: | wc -l
done

Friday, August 19, 2016

GOSeq code for GO term enrichment of differentially expressed genes

If you have an output of the DESeq analysis of differentially expressed genes:

    X                id   baseMean  baseMeanA  baseMeanB foldChange
1   1 ENSG00000223972.4  0.3089417  0.5406481  0.0000000   0.000000
2   2 ENSG00000227232.4 51.6856965 46.1776533 59.0297541   1.278319
3   3 ENSG00000243485.2  0.0000000  0.0000000  0.0000000         NA
4   4 ENSG00000237613.2  0.0000000  0.0000000  0.0000000         NA
5   5 ENSG00000268020.2  0.0000000  0.0000000  0.0000000         NA
6   6 ENSG00000240361.1  0.0000000  0.0000000  0.0000000         NA
7   7 ENSG00000186092.4  0.0000000  0.0000000  0.0000000         NA
8   8 ENSG00000238009.2  0.5177663  0.2550477  0.8680578   3.403512
9   9 ENSG00000239945.1  0.0000000  0.0000000  0.0000000         NA
10 10 ENSG00000233750.3  0.4583062  0.8020358  0.0000000   0.000000
   log2FoldChange      pval      padj
1            -Inf 0.7521333 1.0000000
2       0.3542475 0.1023805 0.6247104
3              NA        NA        NA
4              NA        NA        NA
5              NA        NA        NA
6              NA        NA        NA
7              NA        NA        NA
8       1.7670242 0.6498254 1.0000000
9              NA        NA        NA
10           -Inf 0.5045118 1.0000000

and want to perform GOSeq analysis, that corrects for the length bias (shown in figure below - gene size effect shown in 400 gene bins as proportion of differentially expressed genes),



use the following code that will correct the ENSEMBL gene IDs (remove appended version number) and select genes with significant adjusted p-value:

library(goseq)
table<-read.delim("Result_Table.csv", header=T, sep=",")
table$id<-gsub("\\.[0-9].*", "",table$id)
genes=as.integer(table$padj<.05)
genes[is.na(genes)] <- 0
names(genes)=table$id
head(genes)
head(supportedOrganisms())
supportedOrganisms()[supportedOrganisms()$Genome=="hg19",]
pwf=nullp(genes,"hg19","ensGene")
GO.wall=goseq(pwf,"hg19","ensGene")
head(GO.wall)


Output in GO.wall table are significant GO terms:

> head(GO.wall)
        category over_represented_pvalue under_represented_pvalue numDEInCat
11673 GO:0044421            1.202999e-27                        1        610
2646  GO:0005576            1.593504e-26                        1        683
2677  GO:0005615            1.531784e-25                        1        261
11741 GO:0044699            1.158197e-22                        1       1764
11767 GO:0044763            3.101037e-22                        1       1645
5893  GO:0016477            1.439573e-21                        1        267
      numInCat                             term ontology
11673     3665        extracellular region part       CC
2646      4353             extracellular region       CC
2677      1302              extracellular space       CC
11741    13254          single-organism process       BP
11767    12106 single-organism cellular process       BP
5893      1176                   cell migration       BP

Wednesday, August 3, 2016

Making grouped barplot in ggplot2

Lets say you want to make a simple grouped barplot in R
using the data

cat for.R.plot
0.12 TF1-a
0.10 TF1-b
0.78 TF2-a
0.67 TF2-b
0.39 TF3-a

0.34 TF3-b

load ggplot2 package

library(ggplot2)


Read data

x<-read.delim("for.R.plot", header=F,sep=" ")

Add column with group names

x$V3<- c("TF1","TF1","TF2","TF2","TF3","TF3")


Plot in ggplot

pdf("output.pdf")
> ggplot(data=x, aes(x=V2,y=V1,fill=V3)) + scale_fill_manual(values=c("purple", "blue", "darkgreen")) +geom_bar(stat = "identity") +labs(title="Title of graph") + xlab("xlab title") + ylab("ylab title")+ theme(axis.text.x = element_text(colour="grey20",size=15,angle=45,hjust=.5,vjust=.5,face="plain"),axis.text.y = element_text(colour="grey20",size=14),axis.title.x = element_text(colour="grey20",size=14),axis.title.y = element_text(colour="grey20",size=14),plot.title = element_text(size=15))+theme(legend.title=element_blank())
> dev.off()


Output file: