Tuesday, August 30, 2016

Example of cleaning GENCODE table with sed

This is an example of how to clean a master table for RNA-Seq that was built from the GENCODE transcript collection.
The first column of each data row holds several IDs and annotations for the transcript, separated by |. If you only need to keep the transcript ID, use the following approach with sed:


root@valkyr:/home/towerraid/mo/QC_Z259.B925_Genesips_Cundiff_Ins_071615.SE.RNASeqPolyA.RAPiD.Human# head mastertable.roundup.3
X071_2_E7_24h_TAG X071_2_E7_72h_TAG X071_2_E8_24h_TAG X071_2_E8_72h_TAG X334_1_E7_24h_TAG X334_1_E7_72h_TAG X334_1_E8_24h_TAG X334_1_E8_72h_TAG X756_3_E7_24h_TAG X756_3_E7_72h_TAG X756_3_E8_24h_TAG X756_3_E8_72h_TAG X835_1_E7_24h_TAG X835_1_E7_72h_TAG X835_1_E8_24h_TAG X835_1_E8_72h_TAG H1_E7_24h_TAG H1_E7_72h_TAG H1_E8_24h_TAG H1_E8_72h_TAG H7_E7_24h_TAG H7_E7_72h_TAG H7_E8_24h_TAG H7_E8_72h_TAG
ENST00000466638.5|ENSG00000130822.15|OTTHUMG00000024216.6|OTTHUMT00000337661.1|PNCK-021|PNCK|1204|retained_intron| 5 5 0 10 8 4 26 83 7 11 6 64 4 0 15 47 0 0 4 6 7 4 0 25
ENST00000499522.6|ENSG00000247081.7|OTTHUMG00000164798.4|OTTHUMT00000380348.1|BAALC-AS1-004|BAALC-AS1|1353|antisense| 4 6 5 10 15 12 12 9 7 4 13 12 14 8 10 11 18 25 9 6 8 10 2 9
ENST00000451029.5|ENSG00000105792.19|OTTHUMG00000023434.8|OTTHUMT00000139892.3|CFAP69-002|CFAP69|2614|nonsense_mediated_decay| 0 0 0 0 0 5 0 0 0 0 5 0 0 0 3 0 0 0 0 0 0 0 3 0
ENST00000616193.1|ENSG00000275771.1|OTTHUMG00000187876.1|OTTHUMT00000475652.1|AC140725.8-001|AC140725.8|880|processed_pseudogene| 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 1
ENST00000498318.1|ENSG00000204116.11|OTTHUMG00000021835.5|OTTHUMT00000057234.2|CHIC1-002|CHIC1|2800|nonsense_mediated_decay| 0 0 0 0 0 0 19 6 0 0 0 0 0 0 36 61 0 0 0 1 0 13 9 0


root@valkyr:/home/towerraid/mo/QC_Z259.B925_Genesips_Cundiff_Ins_071615.SE.RNASeqPolyA.RAPiD.Human# sed -e 's/|.*|.*|.*|.*|.*|//g' mastertable.roundup.3 | head
X071_2_E7_24h_TAG X071_2_E7_72h_TAG X071_2_E8_24h_TAG X071_2_E8_72h_TAG X334_1_E7_24h_TAG X334_1_E7_72h_TAG X334_1_E8_24h_TAG X334_1_E8_72h_TAG X756_3_E7_24h_TAG X756_3_E7_72h_TAG X756_3_E8_24h_TAG X756_3_E8_72h_TAG X835_1_E7_24h_TAG X835_1_E7_72h_TAG X835_1_E8_24h_TAG X835_1_E8_72h_TAG H1_E7_24h_TAG H1_E7_72h_TAG H1_E8_24h_TAG H1_E8_72h_TAG H7_E7_24h_TAG H7_E7_72h_TAG H7_E8_24h_TAG H7_E8_72h_TAG
ENST00000466638.5 5 5 0 10 8 4 26 83 7 11 6 64 4 0 15 47 0 0 4 6 7 4 0 25
ENST00000499522.6 4 6 5 10 15 12 12 9 7 4 13 12 14 8 10 11 18 25 9 6 8 10 2 9
ENST00000451029.5 0 0 0 0 0 5 0 0 0 0 5 0 0 0 3 0 0 0 0 0 0 0 3 0
ENST00000616193.1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 1
ENST00000498318.1 0 0 0 0 0 0 19 6 0 0 0 0 0 0 36 61 0 0 0 1 0 13 9 0
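
The pattern above works because sed matches greedily: everything from the first | to the last | on the line is removed. A slightly more explicit variant is sketched below. It deletes the first | and everything after it up to the next whitespace character. The function name keep_transcript_id is a hypothetical helper, and the sketch assumes that the ID column itself contains no spaces or tabs.

```shell
#!/bin/sh
# Hypothetical helper: keep only the transcript ID from the first column.
# Assumption: the pipe-separated annotation contains no whitespace, and the
# count columns start after the first space or tab on the line.
keep_transcript_id() {
    # Remove the first "|" and every following non-whitespace character.
    # Lines without "|" (such as the header) pass through unchanged.
    sed -e 's/|[^[:space:]]*//' "$@"
}
```

It could be used as keep_transcript_id mastertable.roundup.3 > mastertable.roundup.4 to write the cleaned table to a new file rather than only previewing it with head.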


Next, strip the transcript version number, also with sed (the output of the previous step is assumed to have been saved as mastertable.roundup.4):

root@valkyr:/home/towerraid/mo/QC_Z259.B925_Genesips_Cundiff_Ins_071615.SE.RNASeqPolyA.RAPiD.Human# sed -e 's/\.[0-9]*//g' mastertable.roundup.4 | head 
X071_2_E7_24h_TAG X071_2_E7_72h_TAG X071_2_E8_24h_TAG X071_2_E8_72h_TAG X334_1_E7_24h_TAG X334_1_E7_72h_TAG X334_1_E8_24h_TAG X334_1_E8_72h_TAG X756_3_E7_24h_TAG X756_3_E7_72h_TAG X756_3_E8_24h_TAG X756_3_E8_72h_TAG X835_1_E7_24h_TAG X835_1_E7_72h_TAG X835_1_E8_24h_TAG X835_1_E8_72h_TAG H1_E7_24h_TAG H1_E7_72h_TAG H1_E8_24h_TAG H1_E8_72h_TAG H7_E7_24h_TAG H7_E7_72h_TAG H7_E8_24h_TAG H7_E8_72h_TAG
ENST00000466638 5 5 0 10 8 4 26 83 7 11 6 64 4 0 15 47 0 0 4 6 7 4 0 25
ENST00000499522 4 6 5 10 15 12 12 9 7 4 13 12 14 8 10 11 18 25 9 6 8 10 2 9
ENST00000451029 0 0 0 0 0 5 0 0 0 0 5 0 0 0 3 0 0 0 0 0 0 0 3 0
ENST00000616193 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 1
ENST00000498318 0 0 0 0 0 0 19 6 0 0 0 0 0 0 36 61 0 0 0 1 0 13 9 0
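
Note that the expression \.[0-9]* removes a dot followed by digits anywhere on the line, so it would also truncate decimal values (1.5 would become 1) if the table held normalized values instead of integer counts. A more conservative sketch, assuming every transcript ID sits at the start of the line and begins with ENST as in the rows above, anchors the substitution to the ID. The function name strip_version is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: drop the ".N" version suffix from a leading ENST ID.
# Assumption: the transcript ID is the first field and starts with "ENST".
strip_version() {
    # Capture "ENST" plus its digits, then discard the dot and version digits.
    # Anything else on the line, including decimal values, is left untouched.
    sed -e 's/^\(ENST[0-9][0-9]*\)\.[0-9][0-9]*/\1/' "$@"
}
```

For example, strip_version mastertable.roundup.4 | head should give the same preview as above for this integer count table.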
