Wednesday, March 8, 2017

Collapsing transcript expression levels into gene levels by sum, maximum and average - awk code

In case you have generated a file containing gene, transcript (isoform), expression level,

DN0a272187:~ milospjanic$ cat file.txt
GENE1,ISOFORM1,4
GENE1,ISOFORM2,6
GENE1,ISOFORM3,9
GENE2,ISOFORM1,43
GENE2,ISOFORM2,16
GENE3,ISOFORM3,19
GENE3,ISOFORM1,43
GENE3,ISOFORM2,4
GENE4,ISOFORM1,21
Use this awk code to collapse this file and sum on the gene level. Sum all $3 in hash with $1 as a key, then loop through the array and print keys and values.

awk -F, '{array[$1]+=$3} END { for (i in array) {print i"," array[i]}}' file.txt
GENE1,19
GENE2,59
GENE3,66
GENE4,21
To collapse the isoform file and select the top isoform file use this awk code. Use hash max with the key $1 and value $3. If $3 is greater than max change max to new $3 value and put $3 in array. Next, loop through the array and print keys and values. 

awk -F, '{if($3>max[$1]){array[$1]=$3; max[$1]=$3}} END { for (i in array) {print i"," array[i]}}' file.txt
GENE1,9
GENE2,43
GENE3,43
GENE4,21
To plot average use the following code. Create array hash with $1 as keys that sums $3 values, and 'no' hash with $1 as key that increases by 1. Loop through the array and print keys and division of hash 'array' and 'no', which is a sum divided by number of transcripts.

awk -F, '{array[$1]+=$3; no[$1]+=1} END { for (i in array) {print i"," array[i]/no[i]}}'  file.txt
GENE1,6.33333
GENE2,29.5
GENE3,22
GENE4,21

Wednesday, March 1, 2017

Resolving DESeq error: non-numeric argument to mathematical function

Running the following DESeq code,

#!/usr/bin/Rscript
library(DESeq)
data<-read.table("mastertable.diff", header=T, row.names = 1, check.names=F)
meta<-read.delim("meta.data.diff", header=T)
conds<-factor(meta$Condition)
sampleTable<-data.frame(sampleName=colnames(data), condition=conds)
countsTable = data
cds <- newCountDataSet( countsTable, conds)
cds <- estimateSizeFactors( cds )
sizeFactors( cds )
head(counts(cds))
head(counts(cds,normalized=TRUE))
cds = estimateDispersions( cds, method="blind", sharingMode="fit-only" )
str( fitInfo(cds) )
plotDispEsts( cds )
res = nbinomTest( cds, "CTRL-MOCK", "CRISPR-DOWN" )
head(res)
plotMA(res)
hist(res$pval, breaks=100, col="skyblue", border="slateblue", main="")
resSig = res[ res$padj < 0.1, ]
write.csv( res, file="Result_Table.csv" )
write.csv( resSig[ order( resSig$foldChange, -resSig$baseMean ), ] , file="DownReg_Result_Table.csv" )
write.csv( resSig[ order( -resSig$foldChange, -resSig$baseMean ), ], file="UpReg_Result_Table.csv" )
cdsBlind = estimateDispersions( cds, method="blind" )
vsd = varianceStabilizingTransformation( cdsBlind )
library("RColorBrewer")
library("gplots")
select = order(rowMeans(counts(cds)), decreasing=TRUE)[1:250]
hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100)
heatmap.2(exprs(vsd)[select,], col = hmcol, trace="none", margin=c(10, 6))
select = order(rowMeans(counts(cds)), decreasing=TRUE)[1:500]
heatmap.2(exprs(vsd)[select,], col = hmcol, trace="none", margin=c(10, 6))
select = order(rowMeans(counts(cds)), decreasing=TRUE)[1:1000]
heatmap.2(exprs(vsd)[select,], col = hmcol, trace="none", margin=c(10, 6))
print(plotPCA(vsd, intgroup=c("condition")))
I got an error,

> cds <- newCountDataSet( countsTable, conds)
Error in round(countData) : non-numeric argument to mathematical function
Data frame looked fairly ok,

> head(data)
             up_207_1 up_207_5 up_207_7 up_207_11 up_207_13 up_207_17
ARL14EPL            0        0        0         0         1         0
LOC100169752        0        0        0         0         0         0
CAPN6               0        0        0         0         0         0
IFNA13             19       20       58        14        17        13
HSP90AB1         6030    10422    15058      3030      8696      3848
DACH1               0        0        0         0         0         0
But after checking classes and modes of R objects with sapply, columns were class: factor.

> sapply(data, mode)
 up_207_1  up_207_5  up_207_7 up_207_11 up_207_13 up_207_17 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
> sapply(data, class)
 up_207_1  up_207_5  up_207_7 up_207_11 up_207_13 up_207_17 
 "factor"  "factor"  "factor"  "factor"  "factor"  "factor" 
Changing the class of columns to numeric,

> numc <- sapply(data, is.factor)
> numc
 up_207_1  up_207_5  up_207_7 up_207_11 up_207_13 up_207_17 
     TRUE      TRUE      TRUE      TRUE      TRUE      TRUE 
> data[numc] <- lapply(data[numc], function(x) as.numeric(as.character(x)))
Warning messages:
1: In FUN(X[[i]], ...) : NAs introduced by coercion
2: In FUN(X[[i]], ...) : NAs introduced by coercion
3: In FUN(X[[i]], ...) : NAs introduced by coercion
4: In FUN(X[[i]], ...) : NAs introduced by coercion
5: In FUN(X[[i]], ...) : NAs introduced by coercion
6: In FUN(X[[i]], ...) : NAs introduced by coercion

Rerunning the script with these lines gave the same error, as NAs were introduced. Then I checked what lines contain NA's in the data frame.

> data[!complete.cases(data),]
     up_207_1 up_207_5 up_207_7 up_207_11 up_207_13 up_207_17
Gene       NA       NA       NA        NA        NA        NA
So in data frame there was an empty line that was assigned NAs after converting with as.numeric. Removing all the lines with NAs,

> data<-na.omit(data)
> data[!complete.cases(data),]
[1] up_207_1  up_207_5  up_207_7  up_207_11 up_207_13 up_207_17
<0 rows> (or 0-length row.names)
Data frame is now numeric class and free of NAs. Rerunning the script was successful.