Thursday, January 5, 2017

Parsing tsv output files from Kallisto

Suppose you have a TSV file from Kallisto (abundance.tsv) that you need to parse:

head abundance.tsv
target_id       length  eff_length      est_counts      tpm
ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-002|DDX11L1|1657|processed_transcript|    1657    1478    12.1601 0.153207
ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-001|DDX11L1|632|transcribed_unprocessed_pseudogene|       632     453     0       0
ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-001|WASH7P|1351|unprocessed_pseudogene|    1351    1172    243.528 3.86933
ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|       68      3.2997  0.5     2.8217
ENST00000473358.1|ENSG00000243485.3|OTTHUMG00000000959.2|OTTHUMT00000002840.1|RP11-34P13.3-001|RP11-34P13.3|712|lincRNA|        712     533     0       0
ENST00000469289.1|ENSG00000243485.3|OTTHUMG00000000959.2|OTTHUMT00000002841.2|RP11-34P13.3-002|RP11-34P13.3|535|lincRNA|        535     356     0       0
ENST00000607096.1|ENSG00000274890.1|-|-|MIR1302-2-201|MIR1302-2|138|miRNA|      138     6.64697 0       0
ENST00000417324.1|ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002842.1|FAM138A-001|FAM138A|1187|lincRNA| 1187    1008    0       0
ENST00000461467.1|ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002843.1|FAM138A-002|FAM138A|590|lincRNA|  590     411     0       0
For example, say you need the ENSG gene ID, which is the second |-delimited field of the first column, together with the estimated counts from the fourth tab-delimited column. You can do this in one command. Use a find/xargs pipe to process every abundance.tsv, then cut -f1,4 to select columns 1 and 4 using the default tab separator (thus grabbing the target ID and the estimated counts). Then use sed "s/.*|E/E/g" to delete everything up to the | in front of ENSG (i.e. the leading ENST transcript ID).

find -name '*abundance.tsv' | xargs -I % sh -c 'cut -f1,4 % | sed "s/.*|E/E/g";' | head
target_id       est_counts
ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-002|DDX11L1|1657|processed_transcript|      12.1601
ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-001|DDX11L1|632|transcribed_unprocessed_pseudogene| 0
ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-001|WASH7P|1351|unprocessed_pseudogene|      243.528
ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA| 0.5
ENSG00000243485.3|OTTHUMG00000000959.2|OTTHUMT00000002840.1|RP11-34P13.3-001|RP11-34P13.3|712|lincRNA|  0
ENSG00000243485.3|OTTHUMG00000000959.2|OTTHUMT00000002841.2|RP11-34P13.3-002|RP11-34P13.3|535|lincRNA|  0
ENSG00000274890.1|-|-|MIR1302-2-201|MIR1302-2|138|miRNA|        0
ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002842.1|FAM138A-001|FAM138A|1187|lincRNA|   0
ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002843.1|FAM138A-002|FAM138A|590|lincRNA|    0
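
To see what this first sed substitution does, it can help to run it on a single row. The sketch below is not part of the original pipeline. Its second row is made up (EXAMPLE is a hypothetical gene symbol) to illustrate an assumption the pattern relies on: because .* is greedy, the match extends to the last |E on the line, so the command behaves as intended only when ENSG is the last |-delimited field that begins with E.

```shell
# A real row from the listing above (column 1 and est_counts, tab-separated).
row=$(printf '%s\t%s' 'ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-002|DDX11L1|1657|processed_transcript|' '12.1601')

# ".*|E" swallows everything up to the last "|E" and puts back a single "E".
trimmed=$(printf '%s\n' "$row" | sed "s/.*|E/E/g")
printf '%s\n' "$trimmed"

# Hypothetical row (not from the data) whose gene symbol starts with "E".
# Here the last "|E" sits in front of the gene symbol, so the ENSG ID is lost.
fake=$(printf '%s\t%s' 'ENST00000000001.1|ENSG00000000001.1|-|-|EXAMPLE-001|EXAMPLE|100|lincRNA|' '5')
fake_trimmed=$(printf '%s\n' "$fake" | sed "s/.*|E/E/g")
printf '%s\n' "$fake_trimmed"
```

If your annotation contains gene symbols starting with E, it may be worth checking a few of those rows by hand before trusting the output.
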
Next, use sed "s/|.*|.*|.*|.*|//g" to delete the rest of the text in the first column, keeping only the ENSG gene ID.

find -name '*abundance.tsv' | xargs -I % sh -c 'cut -f1,4 % | sed "s/.*|E/E/g" | sed "s/|.*|.*|.*|.*|//g";' | head
target_id       est_counts
ENSG00000223972.5       12.1601
ENSG00000223972.5       0
ENSG00000227232.5       243.528
ENSG00000278267.1       0.5
ENSG00000243485.3       0
ENSG00000243485.3       0
ENSG00000274890.1       0
ENSG00000237613.2       0
ENSG00000237613.2       0
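
The effect of sed "s/|.*|.*|.*|.*|//g" can also be checked on a single row. This is only an illustrative sketch using one row from the output above. The pattern needs at least five | characters to match, and since each .* is greedy it removes everything from the first | through the last one, leaving the versioned gene ID, the tab, and the count.

```shell
# Column 1 after the first sed step, plus est_counts, separated by a tab.
trimmed=$(printf '%s\t%s' 'ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-002|DDX11L1|1657|processed_transcript|' '12.1601')

# Delete from the first "|" through the last "|" on the line.
gene_and_count=$(printf '%s\n' "$trimmed" | sed "s/|.*|.*|.*|.*|//g")
printf '%s\n' "$gene_and_count"
```
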
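Finally, use sed -e "s/\.[0-9]*//g" to delete every dot and the digits that follow it. This strips the version suffix from the gene ID and also the fractional part of the counts, so counts are truncated rather than rounded (0.5 becomes 0).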

find -name '*abundance.tsv' | xargs -I % sh -c 'cut -f1,4 % | sed "s/.*|E/E/g" | sed "s/|.*|.*|.*|.*|//g" | sed -e "s/\.[0-9]*//g";' | head
target_id       est_counts
ENSG00000223972 12
ENSG00000223972 0
ENSG00000227232 243
ENSG00000278267 0
ENSG00000243485 0
ENSG00000243485 0
ENSG00000274890 0
ENSG00000237613 0
ENSG00000237613 0
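
As an alternative, the same extraction can be sketched with a single awk call per file. This is not the approach used above, just one possible equivalent: it picks the gene ID by field position (the second |-delimited field of column 1) instead of by pattern, and it truncates the counts with %d to match the output shown. The function name parse_abundance is made up for this example.

```shell
# parse_abundance FILE
# Prints the header, then "gene_id<TAB>truncated_count" for each row of a
# Kallisto abundance.tsv whose target_id follows the layout shown above.
parse_abundance() {
  awk -F'\t' '
    NR == 1 { print $1 "\t" $4; next }   # keep target_id and est_counts headers
    {
      split($1, id, "|")                 # id[2] holds the versioned ENSG ID
      sub(/\..*/, "", id[2])             # drop the version suffix (".5")
      printf "%s\t%d\n", id[2], $4       # %d truncates the count to an integer
    }
  ' "$1"
}
```

To run it over every file, something like find . -name '*abundance.tsv' | while read -r f; do parse_abundance "$f"; done | head should work, assuming the file paths contain no newlines.
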
