Thursday, June 25, 2015

Scp to transfer files from one server to another

To transfer a file from one server to another, use the scp command:

scp ~/path_to_file/file.txt username@server.address:"destination_path/"

If the destination folder name contains a space, quote the remote path and escape the space with a backslash (the path is expanded again on the remote side):

scp ~/path_to_file/file.txt username@server.address:"destination\ path/"
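The quoting problem is a general shell issue, not specific to scp, so it can be sanity-checked locally with cp as a stand-in (the file and folder names below are made up for the demo):

```shell
# a destination folder whose name contains a space
mkdir -p "destination path"
echo "hello" > file.txt

# unquoted, the shell would split 'destination path' into two arguments;
# quoting keeps it as a single path
cp file.txt "destination path/"

ls "destination path"
```

With scp the remote shell expands the path a second time, which is why the example above both quotes the remote path and backslash-escapes the space.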

Tuesday, June 23, 2015

How to calculate the sum of a column with awk

Use awk to sum column 4: for each line, add field 4 to the variable 'sum', then print the total in an END block.
If your file starts with header lines, skip them with a condition on the record number NR; NR>2 in the example below starts summing from the third line, i.e. it skips a two-line header.

awk 'NR>2 {sum += $4 } END { print sum }' file.txt

To do this for multiple columns use two or more variables:

awk 'NR>2 {sum += $2; sum2 +=$3} END { print sum "\t" sum2 }' file.txt
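Both one-liners can be checked on a small made-up file (two header lines, then three data lines):

```shell
# build a test file: two header lines followed by data
cat > file.txt <<'EOF'
name colA colB colC
units u1 u2 u3
a 10 100 1
b 20 200 2
c 30 300 3
EOF

# sum of column 4, skipping the two header lines -> 6
awk 'NR>2 {sum += $4} END {print sum}' file.txt

# sums of columns 2 and 3, tab-separated -> 60 and 600
awk 'NR>2 {sum += $2; sum2 += $3} END {print sum "\t" sum2}' file.txt
```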

Friday, June 19, 2015

Awk one-liner to compute the sum of each row

If you need the sum of all values in each row of a file, use this awk one-liner (t accumulates the fields of a row and is reset after each line):

awk '{for(i=1;i<=NF;i++) t+=$i; print t; t=0}' file.txt
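A quick check with a two-row made-up file: the one-liner prints one sum per input row.

```shell
# two rows of numbers
cat > rows.txt <<'EOF'
1 2 3
10 20 30
EOF

# loop over all NF fields of each row, print the row total, reset t
# prints 6, then 60
awk '{for(i=1;i<=NF;i++) t+=$i; print t; t=0}' rows.txt
```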

Thursday, June 18, 2015

How to batch download data from UCSC Golden Path using curl

To batch download data from, for example, the UCSC Golden Path, first fetch the file called files.txt, which lists all the files available under a given URL.

For example, files.txt at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgDnaseUniform/ looks like this:

wgEncodeAwgDnaseDukeAosmcUniPk.narrowPeak.gz    project=wgEncode; lab=Duke; composite=wgEncodeAwgDnaseUniPk; dataType=DnaseSeq; view=Peaks; cell=AoSMC; treatment=None; dataVersion=ENCODE Jan 2011 Freeze; tableName=wgEncodeAwgDnaseDukeAosmcUniPk; type=narrowPeak; md5sum=957b3477d43cef1c6abd41182b053418; size=1.5M
wgEncodeAwgDnaseDukeChorionUniPk.narrowPeak.gz  project=wgEncode; lab=Duke; composite=wgEncodeAwgDnaseUniPk; dataType=DnaseSeq; view=Peaks; cell=Chorion; treatment=None; dataVersion=ENCODE Jan 2011 Freeze; dccAccession=wgEncodeEH000595; tableName=wgEncodeAwgDnaseDukeChorionUniPk; type=narrowPeak; md5sum=f0ce90b72c1cfaceda456e0dfd10db1e; size=1.6M
...


We can see that files.txt contains file names but not full URLs, so it cannot be fed to curl directly.

First, use awk to prepend the base URL to each file name:

awk '{$1 = "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgDnaseUniform/"$1;}1' files.txt > files.url.txt

less files.url.txt:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgDnaseUniform/wgEncodeAwgDnaseDuke8988tUniPk.narrowPeak.gz project=wgEncode; lab=Duke; composite=wgEncodeAwgDnaseUniPk; dataType=DnaseSeq; view=Peaks; cell=8988T; treatment=None; dataVersion=ENCODE Jan 2011 Freeze; dccAccession=wgEncodeEH001103; tableName=wgEncodeAwgDnaseDuke8988tUniPk; type=narrowPeak; md5sum=80fadeb7a14a72add38203910d937f50; size=1.7M
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgDnaseUniform/wgEncodeAwgDnaseDukeAosmcUniPk.narrowPeak.gz project=wgEncode; lab=Duke; composite=wgEncodeAwgDnaseUniPk; dataType=DnaseSeq; view=Peaks; cell=AoSMC; treatment=None; dataVersion=ENCODE Jan 2011 Freeze; tableName=wgEncodeAwgDnaseDukeAosmcUniPk; type=narrowPeak; md5sum=957b3477d43cef1c6abd41182b053418; size=1.5M
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgDnaseUniform/wgEncodeAwgDnaseDukeChorionUniPk.narrowPeak.gz project=wgEncode; lab=Duke; composite=wgEncodeAwgDnaseUniPk; dataType=DnaseSeq ...
...

Then cut the first column from the file:
cut -d' ' -f1 files.url.txt > files.url.cut.txt

less files.url.cut.txt
Then use curl to batch download all the files:
xargs -n 1 curl -O -L < files.url.cut.txt


% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1730k  100 1730k    0     0  16.7M      0 --:--:-- --:--:-- --:--:-- 17.7M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1510k  100 1510k    0     0  15.1M      0 --:--:-- --:--:-- --:--:-- 16.0M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1588k  100 1588k    0     0  16.0M      0 --:--:-- --:--:-- --:--:-- 17.0M
...
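The awk and cut steps can be verified on a two-line stand-in for files.txt (file names taken from the listing above; no download involved, and the base URL is passed with -v instead of being hard-coded):

```shell
# miniature files.txt: file name, then metadata fields
cat > files.txt <<'EOF'
wgEncodeAwgDnaseDukeAosmcUniPk.narrowPeak.gz project=wgEncode; size=1.5M
wgEncodeAwgDnaseDukeChorionUniPk.narrowPeak.gz project=wgEncode; size=1.6M
EOF

base="http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgDnaseUniform/"

# prepend the base URL to the first field; the trailing 1 prints every line
awk -v base="$base" '{$1 = base $1}1' files.txt > files.url.txt

# keep only the URL column
cut -d' ' -f1 files.url.txt > files.url.cut.txt

cat files.url.cut.txt
```

The resulting file is exactly what xargs feeds to curl, one URL per invocation.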

Wednesday, June 17, 2015

How to compare two files using awk

If you have two files that need to be compared, you can do it with awk.

For example, file1

0010705-3-1     3405456
1020301-7-2     2032766
102901-8-1      2093898
2030801-6-5     2595293
2040401-3-1     2319647
3100203-1-3     2181363
3101801-2-3     2987284
9070202-27-5    2539899
9071501-8-1     2364814
CA-1401-1       2500080

file2

0010705-3-1 1012215012
9070202-27-5 776575978
1020301-7-2 699304881
102901-8-1 622053170
2030801-6-5 785789447
1347-1 713688560
1483-5 743332308
1522-2 707074454
1559-1 713053387


Suppose you want to take the rows whose IDs (column 1) appear in both files and print column 2 from each file side by side:

awk 'NR==FNR {h[$1] = $1; h2[$1] = $2; next} {print $1,$2,h[$1],h2[$1]}' file1 file2

0010705-3-1 1012215012 0010705-3-1 3405456
9070202-27-5 776575978 9070202-27-5 2539899
1020301-7-2 699304881 1020301-7-2 2032766
102901-8-1 622053170 102901-8-1 2093898
2030801-6-5 785789447 2030801-6-5 2595293
1347-1 713688560
1483-5 743332308
1522-2 707074454
1559-1 713053387



The awk condition NR==FNR is true only for the first file: NR (the cumulative record number across all input files) equals FNR (the record number within the current file) only while the first file is being read.
While reading file1, build array h with column 1 as both key and value, and array h2 with column 1 as key and column 2 as value; next skips to the next record, so the print block runs only for file2.
Then, for each line of file2, print its columns 1 and 2 followed by the values of h and h2 for the key $1.
If the key is missing from the arrays, the file2 line is still printed, but the last two fields are empty (as in the last four lines of the output above).
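A reduced demo with one shared ID and one ID unique to file2, using a two-line slice of the data above:

```shell
cat > file1 <<'EOF'
0010705-3-1 3405456
9070202-27-5 2539899
EOF

cat > file2 <<'EOF'
0010705-3-1 1012215012
1347-1 713688560
EOF

# matched IDs get file1's columns appended;
# unmatched IDs print with the last two fields empty
awk 'NR==FNR {h[$1] = $1; h2[$1] = $2; next} {print $1,$2,h[$1],h2[$1]}' file1 file2
```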

Thursday, June 11, 2015

How to collapse columns using data.table in R

If you need to collapse duplicate rows of a matrix in R and, e.g., calculate their mean, you can use the data.table package.

Install:

install.packages("data.table")
library(data.table)

Read the test file; row.names=NULL allows duplicate row names:
ex.1 <- read.delim("test",sep="", row.names=NULL, stringsAsFactors=F)

Make a data.table keyed on row.names:

ex.2 <- data.table(ex.1,key="row.names")

> ex.2
       row.names GSM1289335 GSM1289336 GSM1289337 GSM1289338 GSM1289339
 1:   1415670_at    8.24126    8.52633    8.33998    8.86572    8.90274
 2:   1415670_at   80.24126   80.52633    8.33998    8.86572    8.90274
 3:   1415671_at   11.13090   11.24580   11.17340   10.94100   11.20140
 4:   1415671_at  110.13090  110.24580   11.17340   10.94100   11.20140
 5:   1415672_at   11.45830   11.44230   11.46260   11.18010   11.28670
 6:   1415672_at  110.45830  110.44230   11.46260   11.18010   11.28670
 7:   1415673_at    6.88159    6.98538    6.82758    6.41075    6.20351
 8:   1415673_at   60.88159   60.98538    6.82758    6.41075    6.20351
 9: 1415674_a_at    9.48842    9.45669    9.44673    9.92935    9.85057
10: 1415674_a_at   90.48842   90.45669    9.44673    9.92935    9.85057
11:   1415675_at    8.83436    8.97068    8.87163    9.22176    9.44220
12:   1415675_at   80.83436   80.97068    8.87163    9.22176    9.44220
13: 1415676_a_at   10.22460   10.42270   10.26540   11.08470   11.31620
14: 1415676_a_at  100.22460  100.42270   10.26540   11.08470   11.31620
15:   1415677_at    9.64835    9.77467    9.59813    9.44781    9.64694
16:   1415677_at   90.64835   90.77467    9.59813    9.44781    9.64694
17:   1415678_at    9.72289    9.64673    9.66538    9.66327    9.74859
18:   1415678_at   90.72289   90.64673    9.66538    9.66327    9.74859
19:   1415679_at   11.23940   11.20380   11.15410   11.38530   11.53210
20:   1415679_at  110.23940  110.20380   11.15410   11.38530   11.53210
    GSM1289340

 1:    8.70942
 2:    8.70942
 3:   11.04830
 4:   11.04830
 5:   11.21400
 6:   11.21400
 7:    6.48295
 8:    6.48295
 9:    9.86870
10:    9.86870
11:    9.33622
12:    9.33622
13:   11.04220
14:   11.04220
15:    9.38489
16:    9.38489
17:    9.52396
18:    9.52396
19:   11.38500
20:   11.38500



Collapse by row.names and compute the mean of each column (with many columns, ex.2[, lapply(.SD, mean), by=row.names] does the same more compactly):

> ex.2[,list(GSM1289335=mean(GSM1289335), GSM1289336=mean(GSM1289336), GSM1289337=mean(GSM1289337), GSM1289338=mean(GSM1289338), GSM1289339=mean(GSM1289339), GSM1289340=mean(GSM1289340)), by=row.names]
       row.names GSM1289335 GSM1289336 GSM1289337 GSM1289338 GSM1289339

 1:   1415670_at   44.24126   44.52633    8.33998    8.86572    8.90274
 2:   1415671_at   60.63090   60.74580   11.17340   10.94100   11.20140
 3:   1415672_at   60.95830   60.94230   11.46260   11.18010   11.28670
 4:   1415673_at   33.88159   33.98538    6.82758    6.41075    6.20351
 5: 1415674_a_at   49.98842   49.95669    9.44673    9.92935    9.85057
 6:   1415675_at   44.83436   44.97068    8.87163    9.22176    9.44220
 7: 1415676_a_at   55.22460   55.42270   10.26540   11.08470   11.31620
 8:   1415677_at   50.14835   50.27467    9.59813    9.44781    9.64694
 9:   1415678_at   50.22289   50.14673    9.66538    9.66327    9.74859
10:   1415679_at   60.73940   60.70380   11.15410   11.38530   11.53210
    GSM1289340
 1:    8.70942
 2:   11.04830
 3:   11.21400
 4:    6.48295
 5:    9.86870
 6:    9.33622
 7:   11.04220
 8:    9.38489
 9:    9.52396
10:   11.38500



Wednesday, June 10, 2015

How to load file with multiple spaces in R

In R, if you try to load a file whose columns are separated by multiple spaces using sep=" " (a single space), you may get:

ex <- read.delim("ex.1.merge.1", sep=" ", header=T)

Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
  more columns than column names


Instead use sep="", which treats any run of whitespace as the delimiter, unlike " ", which matches exactly one space.

ex <- read.delim("ex.1.merge.1", sep="", header=T)
> head(ex)
             GSM1289335 GSM1289336 GSM1289337 GSM1289338 GSM1289339 GSM1289340
1415670_at      8.24126    8.52633    8.33998    8.86572    8.90274    8.70942
1415671_at     11.13090   11.24580   11.17340   10.94100   11.20140   11.04830
1415672_at     11.45830   11.44230   11.46260   11.18010   11.28670   11.21400
1415673_at      6.88159    6.98538    6.82758    6.41075    6.20351    6.48295
1415674_a_at    9.48842    9.45669    9.44673    9.92935    9.85057    9.86870
1415675_at      8.83436    8.97068    8.87163    9.22176    9.44220    9.33622
             GSM1289341 GSM1289342 GSM1289343    affy_mouse430_2
1415670_at      8.43722    8.57212    8.62239 ENSMUSG00000030058
1415671_at     11.30660   11.25010   11.37650 ENSMUSG00000013160
1415672_at     11.30770   11.28530   11.28260 ENSMUSG00000015341
1415673_at      6.64169    6.77523    6.87241 ENSMUSG00000029446
1415674_a_at    9.74287    9.70161    9.85818 ENSMUSG00000032112
1415675_at      8.85549    8.97766    9.03381 ENSMUSG00000026810
             ensembl_gene_id
1415670_at        1415670_at
1415671_at        1415671_at
1415672_at        1415672_at
1415673_at        1415673_at
1415674_a_at    1415674_a_at
1415675_at        1415675_at

Select rows based on column values in R

To select the rows whose column 5 (name_1006) is "neurological system process", use the subset function:

head(id_merge)
  affy_mouse430_2    ensembl_gene_id external_gene_name      go_id
1      1445219_at ENSMUSG00000056959            Olfr315 GO:0005575
2      1445219_at ENSMUSG00000056959            Olfr315 GO:0008150
3      1445219_at ENSMUSG00000056959            Olfr315 GO:0050877
4      1445219_at ENSMUSG00000056959            Olfr315 GO:0007165
5      1445219_at ENSMUSG00000056959            Olfr315 GO:0005623
6      1445219_at ENSMUSG00000056959            Olfr315 GO:0005886
                    name_1006
1          cellular_component
2          biological_process
3 neurological system process
4         signal transduction
5                        cell
6             plasma membrane


newdata<-subset(id_merge,name_1006=="neurological system process")

head(newdata)
     affy_mouse430_2    ensembl_gene_id external_gene_name      go_id
3         1445219_at ENSMUSG00000056959            Olfr315 GO:0050877
417       1443936_at ENSMUSG00000047352            Olfr976 GO:0050877
1582      1420784_at ENSMUSG00000034115             Scn11a GO:0050877
3259      1417415_at ENSMUSG00000021609             Slc6a3 GO:0050877
4109      1448026_at ENSMUSG00000041235               Chd7 GO:0050877
4110      1437745_at ENSMUSG00000041235               Chd7 GO:0050877
                       name_1006
3    neurological system process
417  neurological system process
1582 neurological system process
3259 neurological system process
4109 neurological system process
4110 neurological system process

Tuesday, June 9, 2015

Using biomaRt to match Affymetrix and Ensembl IDs

Let's say that you have two files that contain different gene/probe IDs and you want to match them.
The first one contains Affymetrix probe IDs and the second Ensembl IDs, in their first columns respectively.

head M0_M1_M2
"ID"    "adj.P.Val"     "P.Value"       "F"     "Gene.symbol"   "Gene.title"
"1458381_at"    "6.98e-13"      "1.55e-17"      "4.82e+03"      "Clic5" "chloride intracellular channel 5"
"1455899_x_at"  "5.14e-12"      "2.28e-16"      "3.02e+03"      "Socs3" "suppressor of cytokine signaling 3"
"1422619_at"    "9.82e-12"      "7.82e-16"      "2.44e+03"      "Ppap2a"        "phosphatidic acid phosphatase type 2A"
"1433741_at"    "9.82e-12"      "8.95e-16"      "2.38e+03"      "Cd38"  "CD38 antigen"
"1434537_at"    "9.82e-12"      "1.09e-15"      "2.30e+03"      "Slco3a1"       "solute carrier organic anion transporter family, member 3a1"
"1424339_at"    "1.09e-11"      "1.44e-15"      "2.19e+03"      "Oasl1" "2'-5' oligoadenylate synthetase-like 1"
"1418652_at"    "1.62e-11"      "2.95e-15"      "1.93e+03"      "Cxcl9" "chemokine (C-X-C motif) ligand 9"
"1420549_at"    "1.62e-11"      "3.23e-15"      "1.90e+03"      "Gbp2b" "guanylate binding protein 2b"
"1448918_at"    "1.62e-11"      "3.73e-15"      "1.85e+03"      "Slco3a1"       "solute carrier organic anion transporter family, member 3a1"


head smc_chl
"ID"    "adj.P.Val"     "P.Value"       "t"     "B"     "logFC" "SPOT_ID"
"ENSMUSG00000029371_at" "5.25e-14"      "3.11e-18"      "214.094964"    "27.467113"     "9.0131515"     "ENSMUSG00000029371"
"ENSMUSG00000040026_at" "3.51e-12"      "4.16e-16"      "125.316488"    "25.629056"     "7.8254984"     "ENSMUSG00000040026"
"ENSMUSG00000029380_at" "2.89e-11"      "5.13e-15"      "95.181841"     "24.148872"     "5.1525411"     "ENSMUSG00000029380"
"ENSMUSG00000026822_at" "3.13e-11"      "7.43e-15"      "91.407928"     "23.899528"     "9.707142"      "ENSMUSG00000026822"
"ENSMUSG00000018920_at" "4.87e-11"      "1.45e-14"      "84.986439"     "23.431046"     "3.3194647"     "ENSMUSG00000018920"
"ENSMUSG00000021390_at" "5.17e-11"      "1.84e-14"      "-82.765402"    "23.254652"     "-5.7210961"    "ENSMUSG00000021390"
"ENSMUSG00000024030_at" "5.40e-11"      "2.24e-14"      "80.996784"     "23.10843"      "3.0392955"     "ENSMUSG00000024030"
"ENSMUSG00000019929_at" "5.46e-11"      "2.59e-14"      "79.724354"     "22.999927"     "5.0517855"     "ENSMUSG00000019929"
"ENSMUSG00000001349_at" "8.16e-11"      "4.41e-14"      "-75.214362"    "22.591543"     "-3.8921629"    "ENSMUSG00000001349"


If you need to extract and match identifiers from different databases, it is best to do this with the biomaRt Bioconductor package.

Install biomaRt:

source("http://bioconductor.org/biocLite.R") 
biocLite("biomaRt")

To extract mouse Affy IDs from the Affymetrix GeneChip Mouse Genome 430 2.0 Array and match them with Ensembl IDs, use the following code. Note that the Affy IDs are in Mart: ensembl, Dataset: mmusculus_gene_ensembl, Attributes: affy_mouse430_2, while the Ensembl IDs are in Attributes: ensembl_gene_id. Note also that we pulled the Affy IDs and Ensembl IDs from the first columns of the two files defined above and used the Affy IDs as the filter values for the getBM function.


library(biomaRt)
listMarts()

ensembl = useMart("ensembl",dataset="mmusculus_gene_ensembl")
listDatasets(ensembl)

affy_id = read.delim("M0_M1_M2", header=TRUE, quote="\"", sep="\t")
affy_id_col1 = affy_id[,1]

ensembl_id = read.delim("smc_chl", header=TRUE, quote="\"", sep="\t")
ensembl_id_col1 = ensembl_id[,1]

id_merge = getBM(attributes=c("affy_mouse430_2","ensembl_gene_id","external_gene_name"),filters="affy_mouse430_2",values=affy_id_col1,mart=ensembl)
head(id_merge, 50)


Output:

affy_mouse430_2    ensembl_gene_id external_gene_name
1        1432227_at ENSMUSG00000087193            Gm14820
2        1441786_at ENSMUSG00000063087            Gm10125
3        1420957_at ENSMUSG00000005871                Apc
4        1450056_at ENSMUSG00000005871                Apc
5        1435543_at ENSMUSG00000005871                Apc
6        1420956_at ENSMUSG00000005871                Apc
7        1428301_at ENSMUSG00000094811            Gm21103
8        1420614_at ENSMUSG00000031176             Dynlt3
9        1449929_at ENSMUSG00000031176             Dynlt3
10       1449928_at ENSMUSG00000031176             Dynlt3
11     1459854_s_at ENSMUSG00000031176             Dynlt3
12     1453241_a_at ENSMUSG00000092523            Gm18191
13     1415833_x_at ENSMUSG00000068243             Gm7079
14       1427542_at ENSMUSG00000038248               Sobp
15     1427382_a_at ENSMUSG00000039231            Suv39h1
16     1432236_a_at ENSMUSG00000039231            Suv39h1
17       1426107_at ENSMUSG00000051977              Prdm9
18     1450331_s_at ENSMUSG00000096707         Vmn2r-ps44
19     1436000_a_at ENSMUSG00000054115               Skp2
20     1437033_a_at ENSMUSG00000054115               Skp2
21       1418969_at ENSMUSG00000054115               Skp2
22     1460247_a_at ENSMUSG00000054115               Skp2
23     1449293_a_at ENSMUSG00000054115               Skp2
24       1425072_at ENSMUSG00000054115               Skp2
25     1415976_a_at ENSMUSG00000044434             Gm9791
26       1442926_at ENSMUSG00000102099      1700011B04Rik
27       1422182_at ENSMUSG00000017688              Hnf4g
28       1450518_at ENSMUSG00000017688              Hnf4g
29     1423696_a_at ENSMUSG00000021737              Psmd6
30       1423697_at ENSMUSG00000021737              Psmd6
31       1418815_at ENSMUSG00000024304               Cdh2
32       1449244_at ENSMUSG00000024304               Cdh2
33       1439307_at ENSMUSG00000024304               Cdh2
34       1422089_at ENSMUSG00000062524               Ncr1
35       1416335_at ENSMUSG00000059658            Gm16379
36       1439973_at ENSMUSG00000074365              Crxos
37       1460605_at ENSMUSG00000074365              Crxos
38       1427317_at ENSMUSG00000037262                Kin
39       1448167_at ENSMUSG00000020009             Ifngr1
40       1418190_at ENSMUSG00000002588               Pon1
41       1460281_at ENSMUSG00000029685              Asb15
42       1439836_at ENSMUSG00000029685              Asb15
43       1456774_at ENSMUSG00000040734           Ppp1r13l
44     1459592_a_at ENSMUSG00000040734           Ppp1r13l
45     1439530_a_at ENSMUSG00000040734           Ppp1r13l
46     1459593_x_at ENSMUSG00000040734           Ppp1r13l
47       1446394_at ENSMUSG00000025905              Oprk1
48       1451813_at ENSMUSG00000025905              Oprk1
49     1429993_s_at ENSMUSG00000033219             Gm9758
50     1435732_x_at ENSMUSG00000024121            Atp6v0c

Friday, June 5, 2015

Code to separate vcf files into chromosomes and prepend vcf header


Some scripts will require you to split a vcf file by chromosome while keeping the header; in that case, use this bash code:

# separate vcf into chromosomes
cat file.vcf | awk '!/^#/{print>$1}'

#make chromosome list
awk '!/^#/ { a[$1]++ } END { for (b in a) { print b } }' file.vcf | sort > chr_list

#prepend the header from the vcf to each chromosome file
for i in $(cat chr_list)
do
echo $i
cat <(grep "^#" file.vcf) $i > "$i"_header
done
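The whole pipeline can be tried on a tiny made-up vcf (header grouping is used here instead of bash's process substitution, so it also runs under plain sh):

```shell
# minimal fake vcf: header lines start with '#', records start with the chromosome
cat > file.vcf <<'EOF'
##fileformat=VCFv4.1
#CHROM POS ID REF ALT
chr1 100 . A T
chr1 200 . G C
chr2 300 . T A
EOF

# write each record to a file named after column 1 (chr1, chr2)
awk '!/^#/{print > $1}' file.vcf

# build the sorted chromosome list
awk '!/^#/ { a[$1]++ } END { for (b in a) print b }' file.vcf | sort > chr_list

# prepend the vcf header to each per-chromosome file
for i in $(cat chr_list); do
    { grep "^#" file.vcf; cat "$i"; } > "$i"_header
done

head chr1_header
```

This produces chr1_header and chr2_header, each starting with the two header lines followed by that chromosome's records.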

How to separate vcf file into separate chromosomes

Certain tools will need vcf files to be separated by chromosome.
You can do this with awk:
!/^#/ selects all the lines that are not comments (header lines)
print>$1 writes each line to a file named after its first column (the chromosome)


cat file.vcf | awk '!/^#/{print>$1}'