Friday, January 12, 2018

Code to plot GTEx color coded expression levels

These are the instructions to plot GTEx color coded expression levels in R and ggplot2. You need to have two files, one with GTEx tissue names and expression values,

> RPKM<-read.table("RPKM.txt", head=TRUE,stringsAsFactors=FALSE,sep="\t")

> RPKM
                                 GTEx_tissue       RPKM
1                     Adipose - Subcutaneous  66.859651
2               Adipose - Visceral (Omentum)  24.028442
3                              Adrenal Gland 116.200920
4                             Artery - Aorta   2.133841
5                          Artery - Coronary  67.965125
6                            Artery - Tibial  86.771264
7                                    Bladder  52.417672
8                           Brain - Amygdala  48.820335
9   Brain - Anterior cingulate cortex (BA24)  51.514548
10           Brain - Caudate (basal ganglia)   1.032300
...
And a file with GTEx tissues and color code.

> COLOR<-read.table("COLOR.txt", head=TRUE,stringsAsFactors=FALSE,sep="\t")
> COLOR
                          tissue_site_detail tissue_color_hex
1                     Adipose - Subcutaneous           FF6600
2               Adipose - Visceral (Omentum)           FFAA00
3                              Adrenal Gland           33DD33
4                             Artery - Aorta           FF5555
5                          Artery - Coronary           FFAA99
6                            Artery - Tibial           FF0000
7                                    Bladder           AA0000
8                           Brain - Amygdala           EEEE00
9   Brain - Anterior cingulate cortex (BA24)           EEEE00
10           Brain - Caudate (basal ganglia)           EEEE00
...
First lets order the RPKM data frame.

> RPKM<-RPKM[order(RPKM$RPKM),]
> RPKM
                                 GTEx_tissue       RPKM
10           Brain - Caudate (basal ganglia)   1.032300
53                               Whole Blood   1.249288
26                           Colon - Sigmoid   1.515215
4                             Artery - Aorta   2.133841
18           Brain - Putamen (basal ganglia)   5.461759
32                  Heart - Atrial Appendage   6.369740
42                                 Pituitary   6.862820
39                            Nerve - Tibial   9.316173
35                                     Liver  11.389639
29                        Esophagus - Mucosa  12.715260
25                       Cervix - Endocervix  21.005066
...
Next, merge RPKM and COLOR data frames while preserving the original order.

> DATA<-merge(RPKM, COLOR, by.x="GTEx_tissue", by.y="tissue_site_detail", sort=F)
> DATA
                                 GTEx_tissue       RPKM tissue_color_hex
1            Brain - Caudate (basal ganglia)   1.032300           EEEE00
2                                Whole Blood   1.249288           FF00BB
3                            Colon - Sigmoid   1.515215           EEBB77
4                             Artery - Aorta   2.133841           FF5555
5            Brain - Putamen (basal ganglia)   5.461759           EEEE00
6                   Heart - Atrial Appendage   6.369740           9900FF
7                                  Pituitary   6.862820           AAFF99
8                             Nerve - Tibial   9.316173           FFD700
9                                      Liver  11.389639           AABB66
10                        Esophagus - Mucosa  12.715260           552200
11                       Cervix - Endocervix  21.005066           CCAADD
...
Next, add # sign for color codes.

> DATA$tissue_color_hex = paste("#", DATA$tissue_color_hex, sep="")
> DATA
                                 GTEx_tissue       RPKM tissue_color_hex
1            Brain - Caudate (basal ganglia)   1.032300          #EEEE00
2                                Whole Blood   1.249288          #FF00BB
3                            Colon - Sigmoid   1.515215          #EEBB77
4                             Artery - Aorta   2.133841          #FF5555
5            Brain - Putamen (basal ganglia)   5.461759          #EEEE00
6                   Heart - Atrial Appendage   6.369740          #9900FF
7                                  Pituitary   6.862820          #AAFF99
8                             Nerve - Tibial   9.316173          #FFD700
9                                      Liver  11.389639          #AABB66
10                        Esophagus - Mucosa  12.715260          #552200
11                       Cervix - Endocervix  21.005066          #CCAADD
...
It is important to reorder levels not to be alphabetically sorted as this is how they will appear on the graph. 

> factor (DATA$GTEx_tissue)
 [1] Brain - Caudate (basal ganglia)          
 [2] Whole Blood                              
 [3] Colon - Sigmoid                          
 [4] Artery - Aorta                           
 [5] Brain - Putamen (basal ganglia)          
 [6] Heart - Atrial Appendage                 
 [7] Pituitary                                
 [8] Nerve - Tibial                           
 [9] Liver                                    
[10] Esophagus - Mucosa                       
...
[53] Brain - Nucleus accumbens (basal ganglia)
[54] Esophagus - Gastroesophageal Junction    
54 Levels: Adipose - Subcutaneous ... Whole Blood
First convert the column of the data frame to character vector, then recreate factor levels. Check the levels order.

> DATA$GTEx_tissue<-as.character(DATA$GTEx_tissue)

> DATA$GTEx_tissue<-factor(DATA$GTEx_tissue, levels=unique(DATA$GTEx_tissue))

> factor (DATA$GTEx_tissue)
 [1] Brain - Caudate (basal ganglia)          
 [2] Whole Blood                              
 [3] Colon - Sigmoid                          
 [4] Artery - Aorta                           
 [5] Brain - Putamen (basal ganglia)          
 [6] Heart - Atrial Appendage                 
 [7] Pituitary                                
 [8] Nerve - Tibial                           
 [9] Liver                                    
[10] Esophagus - Mucosa                       
...        
[53] Brain - Nucleus accumbens (basal ganglia)
[54] Esophagus - Gastroesophageal Junction    
54 Levels: Brain - Caudate (basal ganglia) Whole Blood ... Esophagus - Gastroesophageal Junction
Check if DATA$RPKM levels are in the correct order.

> factor (DATA$RPKM)
 [1] 1.03230021638509 1.24928759400422 1.51521544457643 2.13384071515083
 [5] 5.46175870783362 6.36974006723701 6.86281980596968 9.31617267159933
 [9] 11.3896391114977 12.7152598395188 21.0050660365301 23.4022785915606
[13] 23.816931737888  24.0284416972293 26.7461670932036 27.055894746729 
[17] 28.3445527395585 28.472923085106  29.1229822421017 31.6109435670403
[21] 33.1257268898051 33.6736188322218 33.7810087909512 37.907376687972 
[25] 37.9136671742278 38.9443756171557 40.791872968003  48.5668687423655
[29] 48.8203345391151 49.3179500223535 51.5145476186135 52.4176718915877
[33] 53.7403789981182 53.9869116218224 56.9835507558119 58.0722270819484
[37] 63.5714835062821 65.0641288989535 66.859651074282  67.9651249298835
[41] 68.7278064101102 72.1703782792782 77.0099742646478 77.727954691935 
[45] 86.771263595058  87.8177781990037 88.3191874893887 90.0888195183517
[49] 101.738608629986 102.486500884127 103.895843148456 116.200919797288
[53] 125.497316610595 127.021903203954
54 Levels: 1.03230021638509 1.24928759400422 ... 127.021903203954
Similarly reorder factor levels for DATA$tissue_color_hex.

> factor (DATA$tissue_color_hex)
 [1] #EEEE00 #FF00BB #EEBB77 #FF5555 #EEEE00 #9900FF #AAFF99 #FFD700 #AABB66
[10] #552200 #CCAADD #555522 #778855 #FFAA00 #995522 #EEEE00 #FFDD99 #EEEE00
[19] #006600 #7777FF #EEEE00 #AAEEFF #FFAA99 #EEEE00 #660099 #BB9988 #DDDDDD
[28] #FF66FF #EEEE00 #AAAAAA #EEEE00 #AA0000 #22FFDD #0000FF #FFCCCC #99FF00
[37] #CC9955 #FFAAFF #FF6600 #FFAA99 #FFCCCC #CC66FF #EEEE00 #FF5599 #FF0000
[46] #EEEE00 #EEEE00 #33CCCC #EEEE00 #AAAAFF #99BB88 #33DD33 #EEEE00 #8B7355
40 Levels: #0000FF #006600 #22FFDD #33CCCC #33DD33 #552200 #555522 ... #FFDD99
> DATA$tissue_color_hex<-as.character(DATA$tissue_color_hex)
> DATA$tissue_color_hex<-factor(DATA$tissue_color_hex, levels=unique(DATA$tissue_color_hex))
> factor (DATA$tissue_color_hex)
 [1] #EEEE00 #FF00BB #EEBB77 #FF5555 #EEEE00 #9900FF #AAFF99 #FFD700 #AABB66
[10] #552200 #CCAADD #555522 #778855 #FFAA00 #995522 #EEEE00 #FFDD99 #EEEE00
[19] #006600 #7777FF #EEEE00 #AAEEFF #FFAA99 #EEEE00 #660099 #BB9988 #DDDDDD
[28] #FF66FF #EEEE00 #AAAAAA #EEEE00 #AA0000 #22FFDD #0000FF #FFCCCC #99FF00
[37] #CC9955 #FFAAFF #FF6600 #FFAA99 #FFCCCC #CC66FF #EEEE00 #FF5599 #FF0000
[46] #EEEE00 #EEEE00 #33CCCC #EEEE00 #AAAAFF #99BB88 #33DD33 #EEEE00 #8B7355
40 Levels: #EEEE00 #FF00BB #EEBB77 #FF5555 #9900FF #AAFF99 #FFD700 ... #8B7355
Use ggplot2 to plot the graph. Use + aes(colour=as.character(DATA$tissue_color_hex)) + scale_colour_identity() for custom colors. 

> library(ggplot2) 

> pdf("GTEx.pdf")

> p <- ggplot(DATA,aes(x=DATA$GTEx_tissue,y=DATA$RPKM, color=DATA$tissue_color_hex))+geom_point(size=3)

> p + ylab("RPKM") + xlab("GTEx_tissue") + coord_flip() + theme_bw()  + aes(colour=as.character(DATA$tissue_color_hex)) + scale_colour_identity()

> dev.off()
The graph has the expression values of interest and it is color coded using GTEx colors.


Merge in R and preserve order of the input file

If you want to merge two data frames in R, the merge function will automatically sort alphabetically the resulting output data frame. Adding sort= might help but in that case the file is left unsorted so it might not always work.

> table.2<-read.table("test.2", head=TRUE,stringsAsFactors=FALSE)
> table.1<-read.table("test.1", head=TRUE,stringsAsFactors=FALSE)
> table.1
  SAMPLE   NUM
1      B  23.0
2      D  11.0
3      C 121.0
4      A   3.4
> table.2
  SAMPLE CODE
1      A    7
2      B   11
3      C   56
4      D   45



> merge(table.1, table.2, by.x="SAMPLE", by.y="SAMPLE")
  SAMPLE   NUM CODE
1      A   3.4    7
2      B  23.0   11
3      C 121.0   56
4      D  11.0   45

> merge(table.1, table.2, by.x="SAMPLE", by.y="SAMPLE", sort=F)
  SAMPLE   NUM CODE
1      B  23.0   11
2      D  11.0   45
3      C 121.0   56
4      A   3.4    7

In case sort=F doesn't work, create a new column in the data frame you wish to sort by, in this case data frame 1 and use for sorting after the merge.


> merger<-merge(table.1, table.2, by.x="SAMPLE", by.y="SAMPLE")

> merger[order(merger$order),]
  SAMPLE   NUM order CODE
2      B  23.0     1   11
4      D  11.0     2   45
3      C 121.0     3   56
1      A   3.4     4    7

Convert DOS to UNIX file type: eliminate ^M at the end of the line

To convert DOS with ^M at the end of the line to UNIX file types,


tissue_site_detail      tissue_site_detail_abbr tissue_site_detail_id   tissue_site     tissue_color_hex        tissue_color_rgb^MAdipose - Subcutaneous        ADPSBQ  Adipose_Subcutaneous    Adipose Tissue  FF6600  "255,102,0"^MAdipose - Visceral (Omentum)       ADPVSC  Adipose_Visceral_Omentum        Adipose Tissue  FFAA00  "255,170,0"^MAdrenal Gland      ADRNLG  Adrenal_Gland   Adrenal Gland   33DD33  "51,221,51"^MArtery - Aorta     ARTAORT Artery_Aorta    Blood Vessel    FF5555  "255,85,85"^MArtery - Coronary  ARTCRN  Artery_Coronary Blood Vessel    FFAA99  "255,170,153"^MArtery - Tibial  ARTTBL  Artery_Tibial   Blood Vessel    FF0000  "255,0,0"^MBladder      BLDDER  Bladder Bladder AA0000  "170,0,0"^MBrain - Amygdala     BRNAMY  Brain_Amygdala  Brain   EEEE00  

Use sed command:

sed -i "s/^M/\n/g" gtex_tissue_colors.txt
Be sure to enter ^M, with CTRL-V CTRL-M.

tissue_site_detail      tissue_site_detail_abbr tissue_site_detail_id   tissue_site     tissue_color_hex        tissue_color_rgb
Adipose - Subcutaneous  ADPSBQ  Adipose_Subcutaneous    Adipose Tissue  FF6600  "255,102,0"
Adipose - Visceral (Omentum)    ADPVSC  Adipose_Visceral_Omentum        Adipose Tissue  FFAA00  "255,170,0"
Adrenal Gland   ADRNLG  Adrenal_Gland   Adrenal Gland   33DD33  "51,221,51"
Artery - Aorta  ARTAORT Artery_Aorta    Blood Vessel    FF5555  "255,85,85"
Artery - Coronary       ARTCRN  Artery_Coronary Blood Vessel    FFAA99  "255,170,153"
Artery - Tibial ARTTBL  Artery_Tibial   Blood Vessel    FF0000  "255,0,0"
Bladder BLDDER  Bladder Bladder AA0000  "170,0,0"
Brain - Amygdala        BRNAMY  Brain_Amygdala  Brain   EEEE00  "238,238,0"
"238,238,0"

Sunday, December 31, 2017

Calculate average for each row using awk

Lets say you have a file, and you want to calculate average for each row with the ID located in the first column, and want to keep ID and the new value.

head mastertable 
#header
LINC00969-220 0 0 0 0 0 0 0 0 0 0 0 0
LINC00969-221 0 0 1 0 0 0 0 0 13 14 22 15
LINC00969-222 0 0 0 0 0 0 0 0 0 0 2 0
AC063926.2-201 0 0 0 0 0 0 0 0 1 0 0 0
MEIS1-AS3-201 0 0 0 0 0 0 0 0 0 0 0 0
AC026992.2-201 0 0 0 0 0 0 0 0 0 0 1 0
LINC00969-223 0 0 0 0 0 0 0 0 0 0 0 0
AC026992.2-202 0 0 0 0 0 0 0 0 0 0 1 0
LINC00969-224 0 0 0 0 0 0 0 0 0 0 0 0


Use awk, For NR ==1 just print new header, then for each row start adding every value starting from $2 to the variable sum. At the end print $1 and sum/(NF-1):


awk 'NR == 1 { print "lncRNA", "Average"; next }    # Print a heading row\
NF > 2 { sum=0; for (i=2; i<=NF; i++) sum+=$i; print $1, sum/(NF-1) }' \ 
mastertable | sort -gr -k2 | head
RMRP-201 90047.4
AC007336.1-201 49627.7
AC026691.1-201 36003.8
AC034102.6-202 23611.6
AC116535.1-201 17051.9
LINC02022-201 15407.2
HELLPAR-201 14754.3
AC005089.1-201 12971.1
AC034102.7-201 12796.6
LINC00868-203 12116.7

Friday, September 22, 2017

Two scripts for clustering and coloring of single cell RNAseq data

These are the two scripts that I wrote that will cluster single cell RNAseq data using principal component analysis and color the individual cells with the expression of one or two genes in a gradient scale.

The first script is SingleCellPCAplot, an R script that performs principal component analysis of single cell RNAseq data starting from an RPKM mastertable and outputs a PCA plot with cells colored according to the expression of a gene of interest in a color gradient scale. It is useful if you want quickly to cluster the cells and check the expression of your genes in different clusters. When running the script just input the gene of interest and you will get the pdf output.

https://github.com/milospjanic/SingleCellPCAplot


The second script is SingleCellPCAplotMultiGene, an extension of the previous script that may be better for the visual cluster separation. It will use a ratio of expression of two genes for the two-color gradient. In the examples, I show how you can achieve cluster separation within the same cell type on the example of fibroblasts using two different fibroblast markers. 
When colored with fibroblast/cardiomyocyte markers the cells show clear fibroblast lineage.


However when colored with two fibroblast markers cells cluster into two separate subclusters of fibroblasts.

Tuesday, August 29, 2017

GTExExtractor, a script to parse and plot individual level GTEx data

This is the script I wrote you may find useful in case you are working with GTEx data.

https://github.com/milospjanic/GTExExtractor/

As GTEx currently can only show the box plot distribution for a single gene across tissues, I have made a script (named it GTExExtractor) to show the distribution of multiple genes in a single tissue in a form of more informative violin plot. You just have to type the genes you want and the tissue. You can use it to show how certain members within a gene family are expressed higher in certain tissues.

For example, NFIX is more dominant in the class of NFI transcription factors in coronary arteries, while in the skeletal muscle it is NFIC, and in lung NFIB becomes more expressed.






You can see other examples in the link.

Thursday, August 10, 2017

Extract columns by matching column names from file

If you want to select columns using column names that are read from another file use this code. For example, input file is input.txt and column names to be extracted are in columns.txt.

First save this awk code as extractor.sh, it will read the first row and iterate through the fields to find those that match the first argument $1. After it will save the matched row number as col_num and print it out. For other rows, NR>1, it will print out the field content only if i=col_num.

awk -v COLUMN_ID=$1 '
        NR==1 {
                for (i=1; i<=NF; i++) {
                        if ($i==COLUMN_ID) {
                                col_num=i;
                                print $i;
                        }
                }
        }
        NR>1 {
                if (i=col_num) {
                        print $i;
                }
        }
' $2
Then run this code, it will read the columns.txt file line by line, save it in each loop into variable $line, then it runs the extractor.sh script with $line as $1 and input.txt as $2. It then saves output for each $line as a temp file. Then use paste to connect all .temp files to create the final table.

while IFS= read -r line; 
do ./extractor.sh $line path/to/file/input.txt > $line.temp; 

done < path/to/file/columns.txt 

paste *.temp > output.txt