Friday, January 12, 2018

Code to plot GTEx color coded expression levels

These are the instructions to plot GTEx color coded expression levels in R and ggplot2. You need to have two files, one with GTEx tissue names and expression values,

> RPKM<-read.table("RPKM.txt", head=TRUE,stringsAsFactors=FALSE,sep="\t")

> RPKM
                                 GTEx_tissue       RPKM
1                     Adipose - Subcutaneous  66.859651
2               Adipose - Visceral (Omentum)  24.028442
3                              Adrenal Gland 116.200920
4                             Artery - Aorta   2.133841
5                          Artery - Coronary  67.965125
6                            Artery - Tibial  86.771264
7                                    Bladder  52.417672
8                           Brain - Amygdala  48.820335
9   Brain - Anterior cingulate cortex (BA24)  51.514548
10           Brain - Caudate (basal ganglia)   1.032300
...
And a file with GTEx tissues and color code.

> COLOR<-read.table("COLOR.txt", head=TRUE,stringsAsFactors=FALSE,sep="\t")
> COLOR
                          tissue_site_detail tissue_color_hex
1                     Adipose - Subcutaneous           FF6600
2               Adipose - Visceral (Omentum)           FFAA00
3                              Adrenal Gland           33DD33
4                             Artery - Aorta           FF5555
5                          Artery - Coronary           FFAA99
6                            Artery - Tibial           FF0000
7                                    Bladder           AA0000
8                           Brain - Amygdala           EEEE00
9   Brain - Anterior cingulate cortex (BA24)           EEEE00
10           Brain - Caudate (basal ganglia)           EEEE00
...
First lets order the RPKM data frame.

> RPKM<-RPKM[order(RPKM$RPKM),]
> RPKM
                                 GTEx_tissue       RPKM
10           Brain - Caudate (basal ganglia)   1.032300
53                               Whole Blood   1.249288
26                           Colon - Sigmoid   1.515215
4                             Artery - Aorta   2.133841
18           Brain - Putamen (basal ganglia)   5.461759
32                  Heart - Atrial Appendage   6.369740
42                                 Pituitary   6.862820
39                            Nerve - Tibial   9.316173
35                                     Liver  11.389639
29                        Esophagus - Mucosa  12.715260
25                       Cervix - Endocervix  21.005066
...
Next, merge RPKM and COLOR data frames while preserving the original order.

> DATA<-merge(RPKM, COLOR, by.x="GTEx_tissue", by.y="tissue_site_detail", sort=F)
> DATA
                                 GTEx_tissue       RPKM tissue_color_hex
1            Brain - Caudate (basal ganglia)   1.032300           EEEE00
2                                Whole Blood   1.249288           FF00BB
3                            Colon - Sigmoid   1.515215           EEBB77
4                             Artery - Aorta   2.133841           FF5555
5            Brain - Putamen (basal ganglia)   5.461759           EEEE00
6                   Heart - Atrial Appendage   6.369740           9900FF
7                                  Pituitary   6.862820           AAFF99
8                             Nerve - Tibial   9.316173           FFD700
9                                      Liver  11.389639           AABB66
10                        Esophagus - Mucosa  12.715260           552200
11                       Cervix - Endocervix  21.005066           CCAADD
...
Next, add # sign for color codes.

> DATA$tissue_color_hex = paste("#", DATA$tissue_color_hex, sep="")
> DATA
                                 GTEx_tissue       RPKM tissue_color_hex
1            Brain - Caudate (basal ganglia)   1.032300          #EEEE00
2                                Whole Blood   1.249288          #FF00BB
3                            Colon - Sigmoid   1.515215          #EEBB77
4                             Artery - Aorta   2.133841          #FF5555
5            Brain - Putamen (basal ganglia)   5.461759          #EEEE00
6                   Heart - Atrial Appendage   6.369740          #9900FF
7                                  Pituitary   6.862820          #AAFF99
8                             Nerve - Tibial   9.316173          #FFD700
9                                      Liver  11.389639          #AABB66
10                        Esophagus - Mucosa  12.715260          #552200
11                       Cervix - Endocervix  21.005066          #CCAADD
...
It is important to reorder levels not to be alphabetically sorted as this is how they will appear on the graph. 

> factor (DATA$GTEx_tissue)
 [1] Brain - Caudate (basal ganglia)          
 [2] Whole Blood                              
 [3] Colon - Sigmoid                          
 [4] Artery - Aorta                           
 [5] Brain - Putamen (basal ganglia)          
 [6] Heart - Atrial Appendage                 
 [7] Pituitary                                
 [8] Nerve - Tibial                           
 [9] Liver                                    
[10] Esophagus - Mucosa                       
...
[53] Brain - Nucleus accumbens (basal ganglia)
[54] Esophagus - Gastroesophageal Junction    
54 Levels: Adipose - Subcutaneous ... Whole Blood
First convert the column of the data frame to character vector, then recreate factor levels. Check the levels order.

> DATA$GTEx_tissue<-as.character(DATA$GTEx_tissue)

> DATA$GTEx_tissue<-factor(DATA$GTEx_tissue, levels=unique(DATA$GTEx_tissue))

> factor (DATA$GTEx_tissue)
 [1] Brain - Caudate (basal ganglia)          
 [2] Whole Blood                              
 [3] Colon - Sigmoid                          
 [4] Artery - Aorta                           
 [5] Brain - Putamen (basal ganglia)          
 [6] Heart - Atrial Appendage                 
 [7] Pituitary                                
 [8] Nerve - Tibial                           
 [9] Liver                                    
[10] Esophagus - Mucosa                       
...        
[53] Brain - Nucleus accumbens (basal ganglia)
[54] Esophagus - Gastroesophageal Junction    
54 Levels: Brain - Caudate (basal ganglia) Whole Blood ... Esophagus - Gastroesophageal Junction
Check if DATA$RPKM levels are in the correct order.

> factor (DATA$RPKM)
 [1] 1.03230021638509 1.24928759400422 1.51521544457643 2.13384071515083
 [5] 5.46175870783362 6.36974006723701 6.86281980596968 9.31617267159933
 [9] 11.3896391114977 12.7152598395188 21.0050660365301 23.4022785915606
[13] 23.816931737888  24.0284416972293 26.7461670932036 27.055894746729 
[17] 28.3445527395585 28.472923085106  29.1229822421017 31.6109435670403
[21] 33.1257268898051 33.6736188322218 33.7810087909512 37.907376687972 
[25] 37.9136671742278 38.9443756171557 40.791872968003  48.5668687423655
[29] 48.8203345391151 49.3179500223535 51.5145476186135 52.4176718915877
[33] 53.7403789981182 53.9869116218224 56.9835507558119 58.0722270819484
[37] 63.5714835062821 65.0641288989535 66.859651074282  67.9651249298835
[41] 68.7278064101102 72.1703782792782 77.0099742646478 77.727954691935 
[45] 86.771263595058  87.8177781990037 88.3191874893887 90.0888195183517
[49] 101.738608629986 102.486500884127 103.895843148456 116.200919797288
[53] 125.497316610595 127.021903203954
54 Levels: 1.03230021638509 1.24928759400422 ... 127.021903203954
Similarly reorder factor levels for DATA$tissue_color_hex.

> factor (DATA$tissue_color_hex)
 [1] #EEEE00 #FF00BB #EEBB77 #FF5555 #EEEE00 #9900FF #AAFF99 #FFD700 #AABB66
[10] #552200 #CCAADD #555522 #778855 #FFAA00 #995522 #EEEE00 #FFDD99 #EEEE00
[19] #006600 #7777FF #EEEE00 #AAEEFF #FFAA99 #EEEE00 #660099 #BB9988 #DDDDDD
[28] #FF66FF #EEEE00 #AAAAAA #EEEE00 #AA0000 #22FFDD #0000FF #FFCCCC #99FF00
[37] #CC9955 #FFAAFF #FF6600 #FFAA99 #FFCCCC #CC66FF #EEEE00 #FF5599 #FF0000
[46] #EEEE00 #EEEE00 #33CCCC #EEEE00 #AAAAFF #99BB88 #33DD33 #EEEE00 #8B7355
40 Levels: #0000FF #006600 #22FFDD #33CCCC #33DD33 #552200 #555522 ... #FFDD99
> DATA$tissue_color_hex<-as.character(DATA$tissue_color_hex)
> DATA$tissue_color_hex<-factor(DATA$tissue_color_hex, levels=unique(DATA$tissue_color_hex))
> factor (DATA$tissue_color_hex)
 [1] #EEEE00 #FF00BB #EEBB77 #FF5555 #EEEE00 #9900FF #AAFF99 #FFD700 #AABB66
[10] #552200 #CCAADD #555522 #778855 #FFAA00 #995522 #EEEE00 #FFDD99 #EEEE00
[19] #006600 #7777FF #EEEE00 #AAEEFF #FFAA99 #EEEE00 #660099 #BB9988 #DDDDDD
[28] #FF66FF #EEEE00 #AAAAAA #EEEE00 #AA0000 #22FFDD #0000FF #FFCCCC #99FF00
[37] #CC9955 #FFAAFF #FF6600 #FFAA99 #FFCCCC #CC66FF #EEEE00 #FF5599 #FF0000
[46] #EEEE00 #EEEE00 #33CCCC #EEEE00 #AAAAFF #99BB88 #33DD33 #EEEE00 #8B7355
40 Levels: #EEEE00 #FF00BB #EEBB77 #FF5555 #9900FF #AAFF99 #FFD700 ... #8B7355
Use ggplot2 to plot the graph. Use + aes(colour=as.character(DATA$tissue_color_hex)) + scale_colour_identity() for custom colors. 

> library(ggplot2) 

> pdf("GTEx.pdf")

> p <- ggplot(DATA,aes(x=DATA$GTEx_tissue,y=DATA$RPKM, color=DATA$tissue_color_hex))+geom_point(size=3)

> p + ylab("RPKM") + xlab("GTEx_tissue") + coord_flip() + theme_bw()  + aes(colour=as.character(DATA$tissue_color_hex)) + scale_colour_identity()

> dev.off()
The graph has the expression values of interest and it is color coded using GTEx colors.


Merge in R and preserve order of the input file

If you want to merge two data frames in R, the merge function will automatically sort alphabetically the resulting output data frame. Adding sort= might help but in that case the file is left unsorted so it might not always work.

> table.2<-read.table("test.2", head=TRUE,stringsAsFactors=FALSE)
> table.1<-read.table("test.1", head=TRUE,stringsAsFactors=FALSE)
> table.1
  SAMPLE   NUM
1      B  23.0
2      D  11.0
3      C 121.0
4      A   3.4
> table.2
  SAMPLE CODE
1      A    7
2      B   11
3      C   56
4      D   45



> merge(table.1, table.2, by.x="SAMPLE", by.y="SAMPLE")
  SAMPLE   NUM CODE
1      A   3.4    7
2      B  23.0   11
3      C 121.0   56
4      D  11.0   45

> merge(table.1, table.2, by.x="SAMPLE", by.y="SAMPLE", sort=F)
  SAMPLE   NUM CODE
1      B  23.0   11
2      D  11.0   45
3      C 121.0   56
4      A   3.4    7

In case sort=F doesn't work, create a new column in the data frame you wish to sort by, in this case data frame 1 and use for sorting after the merge.


> merger<-merge(table.1, table.2, by.x="SAMPLE", by.y="SAMPLE")

> merger[order(merger$order),]
  SAMPLE   NUM order CODE
2      B  23.0     1   11
4      D  11.0     2   45
3      C 121.0     3   56
1      A   3.4     4    7

Convert DOS to UNIX file type: eliminate ^M at the end of the line

To convert DOS with ^M at the end of the line to UNIX file types,


tissue_site_detail      tissue_site_detail_abbr tissue_site_detail_id   tissue_site     tissue_color_hex        tissue_color_rgb^MAdipose - Subcutaneous        ADPSBQ  Adipose_Subcutaneous    Adipose Tissue  FF6600  "255,102,0"^MAdipose - Visceral (Omentum)       ADPVSC  Adipose_Visceral_Omentum        Adipose Tissue  FFAA00  "255,170,0"^MAdrenal Gland      ADRNLG  Adrenal_Gland   Adrenal Gland   33DD33  "51,221,51"^MArtery - Aorta     ARTAORT Artery_Aorta    Blood Vessel    FF5555  "255,85,85"^MArtery - Coronary  ARTCRN  Artery_Coronary Blood Vessel    FFAA99  "255,170,153"^MArtery - Tibial  ARTTBL  Artery_Tibial   Blood Vessel    FF0000  "255,0,0"^MBladder      BLDDER  Bladder Bladder AA0000  "170,0,0"^MBrain - Amygdala     BRNAMY  Brain_Amygdala  Brain   EEEE00  

Use sed command:

sed -i "s/^M/\n/g" gtex_tissue_colors.txt
Be sure to enter ^M, with CTRL-V CTRL-M.

tissue_site_detail      tissue_site_detail_abbr tissue_site_detail_id   tissue_site     tissue_color_hex        tissue_color_rgb
Adipose - Subcutaneous  ADPSBQ  Adipose_Subcutaneous    Adipose Tissue  FF6600  "255,102,0"
Adipose - Visceral (Omentum)    ADPVSC  Adipose_Visceral_Omentum        Adipose Tissue  FFAA00  "255,170,0"
Adrenal Gland   ADRNLG  Adrenal_Gland   Adrenal Gland   33DD33  "51,221,51"
Artery - Aorta  ARTAORT Artery_Aorta    Blood Vessel    FF5555  "255,85,85"
Artery - Coronary       ARTCRN  Artery_Coronary Blood Vessel    FFAA99  "255,170,153"
Artery - Tibial ARTTBL  Artery_Tibial   Blood Vessel    FF0000  "255,0,0"
Bladder BLDDER  Bladder Bladder AA0000  "170,0,0"
Brain - Amygdala        BRNAMY  Brain_Amygdala  Brain   EEEE00  "238,238,0"
"238,238,0"