Monday, February 1, 2016

Extracting genomic coordinates using sed and regex

If you need to parse a file in which genomic positions are embedded in underscore-delimited identifiers, you can extract them with sed and regular expressions (regex). In the example file below, each identifier holds the chromosome, position, major allele, minor allele, and genome version, separated by underscores, and is followed by a p-value.

head SMAD3_locus_ciseQTLs_Thyroid_cut
15_66966370_T_C_b37     0.339782620541945
15_66966548_A_G_b37     0.87406702584693
15_66966951_C_G_b37     0.915454245030286
15_66967170_A_G_b37     0.362167974353386
15_66967196_C_T_b37     0.388221676710983
15_66967247_C_T_b37     0.969315725318175
15_66967308_A_C_b37     0.476873693739952
15_66967398_T_C_b37     0.487932672811565
15_66967473_G_C_b37     0.360506443754836
15_66967504_T_C_b37     0.453793537166135

If you need only the position and p-value, e.g. to plot them in a graph, use sed twice. The first call matches ^15_ at the beginning of the line and replaces it with an empty string. Its output is piped to a second sed, which matches an underscore followed by zero or more characters from [ATGC], then another underscore and [ATGC]*, then _b37, and replaces all of that with an empty string as well. That will do the trick.

sed 's/^15_//g' SMAD3_locus_ciseQTLs_Thyroid_cut |
sed 's/_[ATGC]*_[ATGC]*_b37//g' |
head
66966370        0.339782620541945
66966548        0.87406702584693
66966951        0.915454245030286
66967170        0.362167974353386
66967196        0.388221676710983
66967247        0.969315725318175
66967308        0.476873693739952
66967398        0.487932672811565
66967473        0.360506443754836
66967504        0.453793537166135
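
The two substitutions can also be folded into one sed call with a capture group, so that the chromosome and build number are not hard-coded. This is only a sketch: the function name extract_pos is made up for illustration, the -E flag (extended regex) is assumed to be available (it is in GNU and BSD sed), and the identifier is assumed to always have the five-field layout shown above.

```shell
# Hypothetical helper (not part of the original post): reduce an identifier of
# the form chromosome_position_allele_allele_bNN to just the position.
# Reads the named files, or standard input if no file is given.
extract_pos() {
  # \1 is the captured run of digits; everything after the ID is left untouched
  sed -E 's/^[^_]+_([0-9]+)_[ACGT]+_[ACGT]+_b[0-9]+/\1/' "$@"
}

# Demo on one line shaped like the file above (tab-separated here by assumption)
printf '15_66966370_T_C_b37\t0.339782620541945\n' | extract_pos
```

On the real file this would be called as extract_pos SMAD3_locus_ciseQTLs_Thyroid_cut | head. Lines whose identifier does not match the pattern are passed through unchanged, which makes unexpected formats easy to spot in the output.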
