Wednesday, April 26, 2017

Awk code for counting isoforms in abundance.tsv files from Kallisto

If you have a series of abundance.tsv from Kallisto RNA-Seq quantification tool separated in different folder use this command to count protein coding isoforms in each output file

sudo find . -name abundance.tsv -exec sh -c "echo {}; grep protein_coding {} | wc -l"  \;
./1522_1hr_TNF_2/abundance.tsv
79795
./1522_6hr_TNF_2/abundance.tsv
79795
./1522_6hr_TGFB_2/abundance.tsv
79795
./1522_1hr_PMA_1/abundance.tsv
79795
./2989_6hr_TNF_2/abundance.tsv
79795
./2989_6hr_SF_1/abundance.tsv

...
To save the list of protein coding isoforms in each separate file use:

sudo find . -name abundance.tsv -exec sh -c "grep protein_coding {} > {}.protein_coding"  \;

To count isoforms that are not expressed use awk code that counts rows with est_counts==0

find . -type f -name abundance.tsv -exec sh -c 'awk "\$4==0{print \$0}" "{}" | wc -l' \;
108088
109840
113075
108092
119730
106363
110323
119521
117195
117358
98931

...
Similarly to count isoforms that are expressed use:

find . -type f -name abundance.tsv -exec sh -c 'awk "\$4!=0{print \$0}" "{}" | wc -l' \;
90532
88780
85545
90528
78890
92257
88297
79099
81425
81262
99689
91844
89495
...
List their sample names contained in the folder names:

find . -type f -name abundance.tsv.protein_coding -exec sh -c 'echo $(basename $(dirname {}))' \;
1522_1hr_TNF_2
1522_6hr_TNF_2
1522_6hr_TGFB_2
1522_1hr_PMA_1
2989_6hr_TNF_2
2989_6hr_SF_1
1522_6hr_SF_2
2989_6hr_TNF_1
1522_1hr_TGFB_2
1522_1hr_TGFB_1
1522_1hr_PDGF_1
1522_6hr_PDGF_2
1522_6hr_TNF_1
2989_1hr_PDGF_1
...
 Make folders with their sample name:

find /home/mpjanic/HCASMC_RNASeq/ -type f -name abundance.tsv.protein_coding -exec sh -c 'mkdir $(basename $(dirname {}))' \;
Find out how many isoforms are expressed within the subgroup of protein coding isoforms:

find . -type f -name abundance.tsv.protein_coding -exec sh -c 'awk "\$4!=0{print \$0}" "{}" | wc -l' \;
43139
42014
41017
43618
38208
43021
42031
38464
40069
40246
46087
43028
42478

...
Save in each folder files with isoforms that are expressed within the subgroup of protein coding isoforms

find . -type f -name abundance.tsv.protein_coding -exec sh -c 'awk "\$4!=0{print \$0}" "{}" > {}.expressed ' \;

Tuesday, April 4, 2017

Parsing dbSNP for insertions, single nucleotide polymorphisms and large deletions - awk code

If you download dbSNP database file in bed format using dBSNP, MySQL or UCSC Table browser,

(MySQL command for dbSNP147)

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -D hg19 -e 'SELECT chrom, chromStart, chromEnd, name FROM snp147Common' > snp147Common.bed
you will notice that coordinates of variants can be divided roughly into 3 categories:

1. insertions (same base pair coordinates), 
2. SNPs plus simple deletions (single base pair coordinates), 
and 
3. large deletions (more than 1 base pair difference in the coordinates).

To parse and separate these three categories, use the following awk code, checking if $2 equals $3 and placing the rows that fall into this category into .insertions file; then selecting those rows where $3=$2=1 and placing them in .snp.plus.simple.deletions; and finally selecting those rows that do not fall into the previous two selection criteria, that then have $2 and $3 separated with more than 1 bp difference.


#parse dbSNPs into insertions, SNPs and simple deletions, large deletions

if [ ! -f snp147Common.bed.insertions ]
then
awk '$2 == $3 {print $0}' snp147Common.bed > snp147Common.bed.insertions
fi

if [ ! -f snp147Common.bed.snp.plus.simple.deletions ]
then
awk '$3 == $2+1 {print $0}' snp147Common.bed > snp147Common.bed.snp.plus.simple.deletions
fi

if [ ! -f snp147Common.bed.large.deletions ]
then
awk '{if ($3 != $2+1 && $2 != $3) print $0}' snp147Common.bed > snp147Common.bed.large.deletions
fi
Check the output files whether they satisfy the criteria for selection:


mpjanic@zoran:~/chrPos2rsID$ head snp147Common.bed.insertions
chr1 10177 10177 rs367896724
chr1 10352 10352 rs555500075
chr1 13417 13417 rs777038595
chr1 15903 15903 rs557514207
chr1 54712 54712 rs568927205
chr1 91551 91551 rs375085441
chr1 249275 249275 rs200079338
chr1 255923 255923 rs199745078
chr1 363244 363244 rs572571697
chr1 604229 604229 rs556776674
mpjanic@zoran:~/chrPos2rsID$ head snp147Common.bed.snp.plus.simple.deletions
chr1 11007 11008 rs575272151
chr1 11011 11012 rs544419019
chr1 13109 13110 rs540538026
chr1 13115 13116 rs62635286
chr1 13117 13118 rs62028691
chr1 13272 13273 rs531730856
chr1 14463 14464 rs546169444
chr1 14598 14599 rs531646671
chr1 14603 14604 rs541940975
chr1 14672 14673 rs4690
mpjanic@zoran:~/chrPos2rsID$ head snp147Common.bed.large.deletions
chr1 17358 17361 rs749387668
chr1 63735 63738 rs201888535
chr1 66435 66437 rs560481224
chr1 82133 82135 rs550749506
chr1 129010 129013 rs377161483
chr1 267227 267230 rs374780253
chr1 532325 532327 rs577455319
chr1 612688 612691 rs201365517
chr1 691567 691571 rs566250387
chr1 701779 701783 rs201234755

Wednesday, March 8, 2017

Collapsing transcript expression levels into gene levels by sum, maximum and average - awk code

In case you have generated a file containing gene, transcript (isoform), expression level,

DN0a272187:~ milospjanic$ cat file.txt
GENE1,ISOFORM1,4
GENE1,ISOFORM2,6
GENE1,ISOFORM3,9
GENE2,ISOFORM1,43
GENE2,ISOFORM2,16
GENE3,ISOFORM3,19
GENE3,ISOFORM1,43
GENE3,ISOFORM2,4
GENE4,ISOFORM1,21
Use this awk code to collapse this file and sum on the gene level. Sum all $3 in hash with $1 as a key, then loop through the array and print keys and values.

awk -F, '{array[$1]+=$3} END { for (i in array) {print i"," array[i]}}' file.txt
GENE1,19
GENE2,59
GENE3,66
GENE4,21
To collapse the isoform file and select the top isoform file use this awk code. Use hash max with the key $1 and value $3. If $3 is greater than max change max to new $3 value and put $3 in array. Next, loop through the array and print keys and values. 

awk -F, '{if($3>max[$1]){array[$1]=$3; max[$1]=$3}} END { for (i in array) {print i"," array[i]}}' file.txt
GENE1,9
GENE2,43
GENE3,43
GENE4,21
To plot average use the following code. Create array hash with $1 as keys that sums $3 values, and 'no' hash with $1 as key that increases by 1. Loop through the array and print keys and division of hash 'array' and 'no', which is a sum divided by number of transcripts.

awk -F, '{array[$1]+=$3; no[$1]+=1} END { for (i in array) {print i"," array[i]/no[i]}}'  file.txt
GENE1,6.33333
GENE2,29.5
GENE3,22
GENE4,21

Wednesday, March 1, 2017

Resolving DESeq error: non-numeric argument to mathematical function

Running the following DESeq code,

#!/usr/bin/Rscript
library(DESeq)
data<-read.table("mastertable.diff", header=T, row.names = 1, check.names=F)
meta<-read.delim("meta.data.diff", header=T)
conds<-factor(meta$Condition)
sampleTable<-data.frame(sampleName=colnames(data), condition=conds)
countsTable = data
cds <- newCountDataSet( countsTable, conds)
cds <- estimateSizeFactors( cds )
sizeFactors( cds )
head(counts(cds))
head(counts(cds,normalized=TRUE))
cds = estimateDispersions( cds, method="blind", sharingMode="fit-only" )
str( fitInfo(cds) )
plotDispEsts( cds )
res = nbinomTest( cds, "CTRL-MOCK", "CRISPR-DOWN" )
head(res)
plotMA(res)
hist(res$pval, breaks=100, col="skyblue", border="slateblue", main="")
resSig = res[ res$padj < 0.1, ]
write.csv( res, file="Result_Table.csv" )
write.csv( resSig[ order( resSig$foldChange, -resSig$baseMean ), ] , file="DownReg_Result_Table.csv" )
write.csv( resSig[ order( -resSig$foldChange, -resSig$baseMean ), ], file="UpReg_Result_Table.csv" )
cdsBlind = estimateDispersions( cds, method="blind" )
vsd = varianceStabilizingTransformation( cdsBlind )
library("RColorBrewer")
library("gplots")
select = order(rowMeans(counts(cds)), decreasing=TRUE)[1:250]
hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100)
heatmap.2(exprs(vsd)[select,], col = hmcol, trace="none", margin=c(10, 6))
select = order(rowMeans(counts(cds)), decreasing=TRUE)[1:500]
heatmap.2(exprs(vsd)[select,], col = hmcol, trace="none", margin=c(10, 6))
select = order(rowMeans(counts(cds)), decreasing=TRUE)[1:1000]
heatmap.2(exprs(vsd)[select,], col = hmcol, trace="none", margin=c(10, 6))
print(plotPCA(vsd, intgroup=c("condition")))
I got an error,

> cds <- newCountDataSet( countsTable, conds)
Error in round(countData) : non-numeric argument to mathematical function
Data frame looked fairly ok,

> head(data)
             up_207_1 up_207_5 up_207_7 up_207_11 up_207_13 up_207_17
ARL14EPL            0        0        0         0         1         0
LOC100169752        0        0        0         0         0         0
CAPN6               0        0        0         0         0         0
IFNA13             19       20       58        14        17        13
HSP90AB1         6030    10422    15058      3030      8696      3848
DACH1               0        0        0         0         0         0
But after checking classes and modes of R objects with sapply, columns were class: factor.

> sapply(data, mode)
 up_207_1  up_207_5  up_207_7 up_207_11 up_207_13 up_207_17 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
> sapply(data, class)
 up_207_1  up_207_5  up_207_7 up_207_11 up_207_13 up_207_17 
 "factor"  "factor"  "factor"  "factor"  "factor"  "factor" 
Changing the class of columns to numeric,

> numc <- sapply(data, is.factor)
> numc
 up_207_1  up_207_5  up_207_7 up_207_11 up_207_13 up_207_17 
     TRUE      TRUE      TRUE      TRUE      TRUE      TRUE 
> data[numc] <- lapply(data[numc], function(x) as.numeric(as.character(x)))
Warning messages:
1: In FUN(X[[i]], ...) : NAs introduced by coercion
2: In FUN(X[[i]], ...) : NAs introduced by coercion
3: In FUN(X[[i]], ...) : NAs introduced by coercion
4: In FUN(X[[i]], ...) : NAs introduced by coercion
5: In FUN(X[[i]], ...) : NAs introduced by coercion
6: In FUN(X[[i]], ...) : NAs introduced by coercion

Rerunning the script with these lines gave the same error, as NAs were introduced. Then I checked what lines contain NA's in the data frame.

> data[!complete.cases(data),]
     up_207_1 up_207_5 up_207_7 up_207_11 up_207_13 up_207_17
Gene       NA       NA       NA        NA        NA        NA
So in data frame there was an empty line that was assigned NAs after converting with as.numeric. Removing all the lines with NAs,

> data<-na.omit(data)
> data[!complete.cases(data),]
[1] up_207_1  up_207_5  up_207_7  up_207_11 up_207_13 up_207_17
<0 rows> (or 0-length row.names)
Data frame is now numeric class and free of NAs. Rerunning the script was successful.

Wednesday, February 22, 2017

Transfer folder contents via FTP without prompting

If you need to transfer multiple files via ftp from a folder, as folder transfer per se is not supported, go into local folder, connect to ftp server and use mput to place multiple files. The problem with this option is that you would be prompted to confirm each file, that may be a problem for large folders. Thus, to avoid this connect to ftp with -i option.


Miloss-MacBook-Air:~ cd local_folder
Miloss-MacBook-Air:~ milospjanic$ ftp -i sftp.lsnet.ucla.edu
Connected to sftp.lsnet.ucla.edu.
220 (vsFTPd 2.2.2)
Name (sftp.lsnet.ucla.edu:milospjanic): username
331 Please specify the password.
Password: password
230 Login successful.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> mput *

Friday, February 17, 2017

Removing duplicate genes based on the conditions of another column for differential expression and gene ontology pipelines in R

Some differential gene expression tools may reject your input table if for some reason gene name has been duplicated in another row. If your data frame contains duplicated rows on a single column (e.g. gene name) you can remove them in R using the following code. Note that this will remove any subsequent occurrence of the duplicated gene, therefore preserving only the first occurrence.

table<-read.delim("table.csv", header=T, sep=",")
table
    X         id    baseMean   baseMeanA  baseMeanB foldChange log2FoldChange
1   1 SEC24B-AS1    3.837647   3.0343409   4.640954  1.5294767     0.61303813
2   2       A1BG    1.769987   0.7314147   2.808559  3.8398998     1.94106866
3   3       A1CF    0.000000   0.0000000   0.000000         NA             NA
4   4      GGACT    5.722972   3.5491983   7.896745  2.2249378     1.15376499
5   5        A2M    0.000000   0.0000000   0.000000         NA             NA
6   6      A2ML1    0.000000   0.0000000   0.000000         NA             NA
7   7      A2MP1    0.000000   0.0000000   0.000000         NA             NA
8   8     A4GALT  261.976303 281.3563018 242.596304  0.8622387    -0.21384071
9   9      A4GNT    0.000000   0.0000000   0.000000         NA             NA
10 10       AAAS  237.463538 240.8107614 234.116315  0.9722004    -0.04067439
11 11       AACS  262.054727 268.0227018 256.086753  0.9554666    -0.06572258
12 12      GGACT 1000.000000   0.0000000   0.000000         NA             NA
13 13     A4GALT 1000.000000   0.0000000   0.000000         NA             NA
14 14 SEC24B-AS1 1000.000000   0.0000000   0.000000         NA             NA
        pval padj
1  0.4469177    1
2  0.3359994    1
3         NA   NA
4  0.1943902    1
5         NA   NA
6         NA   NA
7         NA   NA
8  0.3760542    1
9         NA   NA
10 0.8189333    1
11 0.6515563    1
12        NA   NA
13        NA   NA
14        NA   NA
> table.2 <- subset(table, !duplicated(table[,2])) > table.2 X id baseMean baseMeanA baseMeanB foldChange log2FoldChange 1 1 SEC24B-AS1 3.837647 3.0343409 4.640954 1.5294767 0.61303813 2 2 A1BG 1.769987 0.7314147 2.808559 3.8398998 1.94106866 3 3 A1CF 0.000000 0.0000000 0.000000 NA NA 4 4 GGACT 5.722972 3.5491983 7.896745 2.2249378 1.15376499 5 5 A2M 0.000000 0.0000000 0.000000 NA NA 6 6 A2ML1 0.000000 0.0000000 0.000000 NA NA 7 7 A2MP1 0.000000 0.0000000 0.000000 NA NA 8 8 A4GALT 261.976303 281.3563018 242.596304 0.8622387 -0.21384071 9 9 A4GNT 0.000000 0.0000000 0.000000 NA NA 10 10 AAAS 237.463538 240.8107614 234.116315 0.9722004 -0.04067439 11 11 AACS 262.054727 268.0227018 256.086753 0.9554666 -0.06572258 pval padj 1 0.4469177 1 2 0.3359994 1 3 NA NA 4 0.1943902 1 5 NA NA 6 NA NA 7 NA NA 8 0.3760542 1 9 NA NA 10 0.8189333 1 11 0.6515563 1
You can see that the last three rows were removed as those genes were previously repeated. In some cases you would need to remove duplicated genes based on some condition in another column.

If you need to filter rows duplicated on a certain column by, e.g. sorting on another column, use this rather elegant code that involved presorting followed by subsetting.

table<-read.delim("table.csv", header=T, sep=",")
> table
    X         id    baseMean   baseMeanA  baseMeanB foldChange log2FoldChange
1   1 SEC24B-AS1    3.837647   3.0343409   4.640954  1.5294767     0.61303813
2   2       A1BG    1.769987   0.7314147   2.808559  3.8398998     1.94106866
3   3       A1CF    0.000000   0.0000000   0.000000         NA             NA
4   4      GGACT    5.722972   3.5491983   7.896745  2.2249378     1.15376499
5   5        A2M    0.000000   0.0000000   0.000000         NA             NA
6   6      A2ML1    0.000000   0.0000000   0.000000         NA             NA
7   7      A2MP1    0.000000   0.0000000   0.000000         NA             NA
8   8     A4GALT  261.976303 281.3563018 242.596304  0.8622387    -0.21384071
9   9      A4GNT    0.000000   0.0000000   0.000000         NA             NA
10 10       AAAS  237.463538 240.8107614 234.116315  0.9722004    -0.04067439
11 11       AACS  262.054727 268.0227018 256.086753  0.9554666    -0.06572258
12 12      GGACT 1000.000000   0.0000000   0.000000         NA             NA
13 13     A4GALT 1000.000000   0.0000000   0.000000         NA             NA
14 14 SEC24B-AS1 1000.000000   0.0000000   0.000000         NA             NA
        pval padj
1  0.4469177    1
2  0.3359994    1
3         NA   NA
4  0.1943902    1
5         NA   NA
6         NA   NA
7         NA   NA
8  0.3760542    1
9         NA   NA
10 0.8189333    1
11 0.6515563    1
12        NA   NA
13        NA   NA
14        NA   NA
> table = table[order(table[,'id'],-table[,'baseMean']),]
> table
    X         id    baseMean   baseMeanA  baseMeanB foldChange log2FoldChange
2   2       A1BG    1.769987   0.7314147   2.808559  3.8398998     1.94106866
3   3       A1CF    0.000000   0.0000000   0.000000         NA             NA
5   5        A2M    0.000000   0.0000000   0.000000         NA             NA
6   6      A2ML1    0.000000   0.0000000   0.000000         NA             NA
7   7      A2MP1    0.000000   0.0000000   0.000000         NA             NA
13 13     A4GALT 1000.000000   0.0000000   0.000000         NA             NA
8   8     A4GALT  261.976303 281.3563018 242.596304  0.8622387    -0.21384071
9   9      A4GNT    0.000000   0.0000000   0.000000         NA             NA
10 10       AAAS  237.463538 240.8107614 234.116315  0.9722004    -0.04067439
11 11       AACS  262.054727 268.0227018 256.086753  0.9554666    -0.06572258
12 12      GGACT 1000.000000   0.0000000   0.000000         NA             NA
4   4      GGACT    5.722972   3.5491983   7.896745  2.2249378     1.15376499
14 14 SEC24B-AS1 1000.000000   0.0000000   0.000000         NA             NA
1   1 SEC24B-AS1    3.837647   3.0343409   4.640954  1.5294767     0.61303813
        pval padj
2  0.3359994    1
3         NA   NA
5         NA   NA
6         NA   NA
7         NA   NA
13        NA   NA
8  0.3760542    1
9         NA   NA
10 0.8189333    1
11 0.6515563    1
12        NA   NA
4  0.1943902    1
14        NA   NA
1  0.4469177    1
> table.2 <- subset(table, !duplicated(table[,2]))
> table.2
    X         id    baseMean   baseMeanA  baseMeanB foldChange log2FoldChange
2   2       A1BG    1.769987   0.7314147   2.808559  3.8398998     1.94106866
3   3       A1CF    0.000000   0.0000000   0.000000         NA             NA
5   5        A2M    0.000000   0.0000000   0.000000         NA             NA
6   6      A2ML1    0.000000   0.0000000   0.000000         NA             NA
7   7      A2MP1    0.000000   0.0000000   0.000000         NA             NA
13 13     A4GALT 1000.000000   0.0000000   0.000000         NA             NA
9   9      A4GNT    0.000000   0.0000000   0.000000         NA             NA
10 10       AAAS  237.463538 240.8107614 234.116315  0.9722004    -0.04067439
11 11       AACS  262.054727 268.0227018 256.086753  0.9554666    -0.06572258
12 12      GGACT 1000.000000   0.0000000   0.000000         NA             NA
14 14 SEC24B-AS1 1000.000000   0.0000000   0.000000         NA             NA
        pval padj
2  0.3359994    1
3         NA   NA
5         NA   NA
6         NA   NA
7         NA   NA
13        NA   NA
9         NA   NA
10 0.8189333    1
11 0.6515563    1
12        NA   NA
14        NA   NA

Wednesday, February 1, 2017

Code for plotting up- and down-regulated gene FC and p-val distributions in R

In differential expression analysis, use the DE output table:

> head(res)
          id baseMean baseMeanA baseMeanB foldChange log2FoldChange      pval
1 SEC24B-AS1 3.837647 3.0343409  4.640954   1.529477      0.6130381 0.4469177
2       A1BG 1.769987 0.7314147  2.808559   3.839900      1.9410687 0.3359994
3       A1CF 0.000000 0.0000000  0.000000        NaN            NaN        NA
4      GGACT 5.722972 3.5491983  7.896745   2.224938      1.1537650 0.1943902
5        A2M 0.000000 0.0000000  0.000000        NaN            NaN        NA
6      A2ML1 0.000000 0.0000000  0.000000        NaN            NaN        NA
  padj
1    1
2    1
3   NA
4    1
5   NA
6   NA
Subset and create two data frames with up and down-regulated genes. Use na.omit to remove row with NA.

> down<-na.omit(res[res$log2FoldChange<(-0.5)&res$padj<0.05,])
> head (down)
            id     baseMean   baseMeanA    baseMeanB foldChange log2FoldChange
139        ACE    81.090090   128.42927 3.375091e+01 0.26279761     -1.9279759
266   ADAMTS15   109.686212   142.19707 7.717536e+01 0.54273520     -0.8816796
315      ADH1B   224.356545   348.75853 9.995456e+01 0.28660106     -1.8028842
421     AHNAK2   493.239479   623.24438 3.632346e+02 0.58281245     -0.7788964
435       AIM1   311.990989   409.72136 2.142606e+02 0.52294227     -0.9352764
              pval         padj
139   3.009363e-13 6.979143e-10
266   1.234959e-04 1.774189e-02
315   9.575011e-18 5.181357e-14
421   2.785485e-05 5.652445e-03
435   1.215970e-05 2.860877e-03

> up<-na.omit(res[res$log2FoldChange>0.5&res$padj<0.05,])
> head(up)
            id    baseMean    baseMeanA   baseMeanB foldChange log2FoldChange
198      ACTG2    28.00957    13.571106    42.44803   3.127824      1.6451595
206      ACTN1  7221.10709  5591.227639  8850.98654   1.583013      0.6626732
783      APLP1    56.91524    36.938228    76.89226   2.081645      1.0577237
1416      BEX1   134.33394    83.364379   185.30350   2.222814      1.1523872
2832      CCL2  3879.93627  2577.263209  5182.60934   2.010896      1.0078388
              pval         padj
198   2.693990e-05 5.535979e-03
206   2.481289e-04 3.196924e-02
783   1.122944e-04 1.672465e-02
1416  1.364952e-07 5.831217e-05
2832  3.065576e-08 1.605373e-05
Next, concatenate log2FoldChange from up and down dataframes to the dat$dens, and concatenate and repeat DOWN and UP elements exactly length(down$log2FoldChange) and length(up$log2FoldChange) times to the dat$lines column of the dat data frame.

> dat <- data.frame(dens = c(down$log2FoldChange, up$log2FoldChange), lines = rep(c("DOWN", "UP"), c(length(down$log2FoldChange), length(up$log2FoldChange))))

> head (dat)
          dens lines
1   -1.9279759  DOWN
2   -0.8816796  DOWN
3   -1.8028842  DOWN
4   -0.7788964  DOWN
5   -0.9352764  DOWN

> tail (dat)
143  1.2182830    UP
144  1.0698604    UP
145  0.7518891    UP
146  1.1097739    UP
147  0.6468804    UP
148  0.9988166    UP
Plot dat as density plot using ggplot2,

> library(ggplot2)

> pdf("pvalues.pdf")
> ggplot(dat, aes(x = dens, fill = lines)) + geom_density(alpha = 0.5) + ggtitle("Upregulated and downregulated gene -logFC distribution") +xlab("-logFC")+ylab("Density")+theme(axis.text=element_text(size=18),
+         axis.title=element_text(size=16), plot.title=element_text(size=16))
> dev.off()



Similarly, make a density plot for adjusted p-values.

> dat <- data.frame(dens = c(-log(down$padj), -log(up$padj)), lines = rep(c("DOWN", "UP"), c(length(down$padj), length(up$padj))))

> head (dat)
       dens lines
1 21.082925  DOWN
2  4.031827  DOWN
3 30.591124  DOWN
4  5.175667  DOWN
5  5.856627  DOWN
6  4.568621  DOWN
> tail (dat)
         dens lines
143 10.512156    UP
144 12.505760    UP
145  6.208198    UP
146 14.345963    UP
147  3.691205    UP
148 10.043970    UP
> pdf("pvalues.pdf")
> ggplot(dat, aes(x = dens, fill = lines)) + geom_density(alpha = 0.5) + ggtitle("Upregulated and downregulated gene pvalues distribution") +xlab("-logPvalue")+ylab("Density")+theme(axis.text=element_text(size=18),
+         axis.title=element_text(size=16), plot.title=element_text(size=16))
> dev.off()