Thursday, June 18, 2015

How to batch download data from UCSC Golden Path using curl

If you want to batch download data from UCSC Golden Path, first download the file called files.txt, which lists every file under the directory URL along with its metadata.

For example, take files.txt at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgDnaseUniform/.
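A quick way to grab it, assuming curl is installed (-O saves the file under its remote name):

curl -O http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgDnaseUniform/files.txt

The first few lines of files.txt look like this: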

wgEncodeAwgDnaseDukeAosmcUniPk.narrowPeak.gz    project=wgEncode; lab=Duke; composite=wgEncodeAwgDnaseUniPk; dataType=DnaseSeq; view=Peaks; cell=AoSMC; treatment=None; dataVersion=ENCODE Jan 2011 Freeze; tableName=wgEncodeAwgDnaseDukeAosmcUniPk; type=narrowPeak; md5sum=957b3477d43cef1c6abd41182b053418; size=1.5M
wgEncodeAwgDnaseDukeChorionUniPk.narrowPeak.gz  project=wgEncode; lab=Duke; composite=wgEncodeAwgDnaseUniPk; dataType=DnaseSeq; view=Peaks; cell=Chorion; treatment=None; dataVersion=ENCODE Jan 2011 Freeze; dccAccession=wgEncodeEH000595; tableName=wgEncodeAwgDnaseDukeChorionUniPk; type=narrowPeak; md5sum=f0ce90b72c1cfaceda456e0dfd10db1e; size=1.6M
...


We can see that files.txt contains only the file names, not full URLs, so we cannot feed it to curl directly for batch downloading.

First, use awk to prepend the base URL to each file name:

awk '{$1 = "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgDnaseUniform/"$1;}1' files.txt > files.url.txt

Viewing files.url.txt with less now shows the base URL in front of each file name:

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgDnaseUniform/wgEncodeAwgDnaseDuke8988tUniPk.narrowPeak.gz project=wgEncode; lab=Duke; composite=wgEncodeAwgDnaseUniPk; dataType=DnaseSeq; view=Peaks; cell=8988T; treatment=None; dataVersion=ENCODE Jan 2011 Freeze; dccAccession=wgEncodeEH001103; tableName=wgEncodeAwgDnaseDuke8988tUniPk; type=narrowPeak; md5sum=80fadeb7a14a72add38203910d937f50; size=1.7M
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgDnaseUniform/wgEncodeAwgDnaseDukeAosmcUniPk.narrowPeak.gz project=wgEncode; lab=Duke; composite=wgEncodeAwgDnaseUniPk; dataType=DnaseSeq; view=Peaks; cell=AoSMC; treatment=None; dataVersion=ENCODE Jan 2011 Freeze; tableName=wgEncodeAwgDnaseDukeAosmcUniPk; type=narrowPeak; md5sum=957b3477d43cef1c6abd41182b053418; size=1.5M
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgDnaseUniform/wgEncodeAwgDnaseDukeChorionUniPk.narrowPeak.gz project=wgEncode; lab=Duke; composite=wgEncodeAwgDnaseUniPk; dataType=DnaseSeq; view=Peaks; cell=Chorion; treatment=None; dataVersion=ENCODE Jan 2011 Freeze; dccAccession=wgEncodeEH000595; tableName=wgEncodeAwgDnaseDukeChorionUniPk; type=narrowPeak; md5sum=f0ce90b72c1cfaceda456e0dfd10db1e; size=1.6M
...

Then cut out the first column (the full URL) from the file:

cut -d' ' -f1 files.url.txt > files.url.cut.txt

Check the result with less files.url.cut.txt.
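As a side note, the prepend-and-cut steps can also be done in a single awk pass that prints only the full URL; this is just an equivalent sketch of the same idea:

awk '{print "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgDnaseUniform/"$1}' files.txt > files.url.cut.txt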
Then use curl to batch download all the files:
xargs -n 1 curl -O -L < files.url.cut.txt


% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1730k  100 1730k    0     0  16.7M      0 --:--:-- --:--:-- --:--:-- 17.7M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1510k  100 1510k    0     0  15.1M      0 --:--:-- --:--:-- --:--:-- 16.0M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1588k  100 1588k    0     0  16.0M      0 --:--:-- --:--:-- --:--:-- 17.0M
...
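If there are many files, you can let xargs run several curl processes at once with -P; a sketch, assuming a GNU or BSD xargs that supports -P (the -s -S flags silence the interleaved progress meters while still showing errors, and the process count of 4 is just an example, so be considerate of the UCSC servers):

xargs -n 1 -P 4 curl -O -L -s -S < files.url.cut.txt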
