Monday, December 7, 2015

How to transform a list of GWAS-associated genes into a single-column list with the sed command

If you have a list of GWAS-associated genes in which each line holds either a single gene or several genes from one locus separated by commas, and you want one gene per line in a single column, use the sed command to substitute each comma followed by a space with a newline character. Note that reading \n in the replacement as a newline is GNU sed behavior; other sed implementations may not support it.


mpjanic@valkyr:~/REBUTTAL$ cat genes
ZNF259, APOA1, APOC3, APOA4, APOA5
UBE3B, MVK, MMAB, MYO1H, KCTD10
LIPC
CETP
GFOD2, LCAT
LIPG
APOB
ZNF259, APOA1, APOC3, APOA4, APOA5, BUD13
PCSK9
CELSR2
APOB
HMGCR
TRIB1
ZNF259, APOA1, APOC3, APOA4, APOA5, BUD13
LDLR
SF4, CILP2
APOC2, APOE, APOC4, APOC1
DOCK7, ANGPTL3
GCKR
TBL2, MLXIPL, BAZ1B, BCL7B
LPL
TRIB1
CILP2, ZNF101
PPP1R3B
AFF1
SELP, F5
LOC653163, SURF2, SURF4, ADAMTS13, C9orf7, ABO
RGS14, PRR7, DBN1, GRK6, UIMC1, SLC34A1, F12, FGFR4, NSD1, PRELID1, MXD3, LMAN2
F11
RFC4, ADIPOQ, KNG1
mpjanic@valkyr:~/REBUTTAL$ sed 's/,\ /\n/g' genes
ZNF259
APOA1
APOC3
APOA4
APOA5
UBE3B
MVK
MMAB
MYO1H
KCTD10
LIPC
CETP
GFOD2
LCAT
LIPG
APOB
ZNF259
APOA1
APOC3
APOA4
APOA5
BUD13
PCSK9
CELSR2
APOB
HMGCR
TRIB1
ZNF259
APOA1
APOC3
APOA4
APOA5
BUD13
LDLR
SF4
CILP2
APOC2
APOE
APOC4
APOC1
DOCK7
ANGPTL3
GCKR
TBL2
MLXIPL
BAZ1B
BCL7B
LPL
TRIB1
CILP2
ZNF101
PPP1R3B
AFF1
SELP
F5
LOC653163
SURF2
SURF4
ADAMTS13
C9orf7
ABO
RGS14
PRR7
DBN1
GRK6
UIMC1
SLC34A1
F12
FGFR4
NSD1
PRELID1
MXD3
LMAN2
F11
RFC4
ADIPOQ
KNG1
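
The \n in the sed replacement above relies on GNU sed. As a minimal sketch of a more portable alternative, assuming gene names contain no spaces or commas, tr can translate both characters to newlines and squeeze the repeats into one. The sample file below is made up for illustration, and the sort -u step is an optional extra that is not part of the original command.

```shell
# Hypothetical sample input; the temporary file and its contents are for illustration only.
tmpdir=$(mktemp -d)
printf '%s\n' 'GFOD2, LCAT' 'LIPG' 'SF4, CILP2' 'CILP2, ZNF101' > "$tmpdir/genes"

# Translate every comma and every space to a newline, then squeeze
# consecutive newlines (-s) so that ", " yields a single line break.
single_column=$(tr -s ', ' '\n\n' < "$tmpdir/genes")
printf '%s\n' "$single_column"

# Optional: keep each gene name only once (C locale for a predictable order).
unique_genes=$(printf '%s\n' "$single_column" | LC_ALL=C sort -u)
printf '%s\n' "$unique_genes"

rm -r "$tmpdir"
```

Because tr splits on every space, this variant would also break up entries that contain spaces, and it drops blank lines; the sed command is the safer choice when only the ", " separator should be touched.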

How to compare two files with grep


Let's say that you have two lists of entries, e.g. two lists of genes, that you want to compare.
To compare the two lists you can use a simple grep command in Unix with three flags: -f obtains the patterns from a file, -x selects only those matches that span a whole line, and -F treats the patterns as a set of newline-separated fixed strings. With the -x flag, a pattern from the first file has to match a complete line in the second file. A partial match, which would otherwise happen whenever a pattern is contained in a longer string in the second file, is not reported. grep prints the matching lines of the second file in their original order, so an entry that occurs several times in that file is printed several times.

An example below:

mpjanic@valkyr:~/REBUTTAL$ head extrinsic_cardiomiopathy_disease_ontology_cut
ABCA1
ADH1B
ADIPOQ
ADM
ALDH2
ALOX5
ALOX5AP
ANGPT1
APOA1
APOA4

mpjanic@valkyr:~/REBUTTAL$ head coronary_artery_disease_gwas_cut 
Reported Gene(s)
intergenic
PHACTR1
LIPA
PDGFD
intergenic
KIAA1462
PHACTR1
intergenic
intergenic

mpjanic@valkyr:~/REBUTTAL$ grep -F -x -f extrinsic_cardiomiopathy_disease_ontology_cut coronary_artery_disease_gwas_cut 
SH2B3
SORT1
SORT1
LPL
SH2B3
CXCL12
SH2B3
SORT1
TRIB1
LIPG
CETP
CETP
CETP
LIPC
LPL
ABCA1
LCAT
LIPC
LPL
ESR1

...
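
To make the effect of the -x flag concrete, here is a small sketch that runs the same comparison with and without it. The two temporary files and their contents are made up for illustration and are not the files used above; the final sort -u step is an optional addition for listing each shared entry once.

```shell
# Hypothetical sample lists; file names and contents are for illustration only.
tmpdir=$(mktemp -d)
printf '%s\n' 'LIPG' 'APOA1' > "$tmpdir/patterns"
printf '%s\n' 'LIPG' 'APOA1BP' 'intergenic' 'LIPG' > "$tmpdir/targets"

# Without -x, the pattern APOA1 also matches the longer line APOA1BP.
without_x=$(grep -F -f "$tmpdir/patterns" "$tmpdir/targets")
printf 'without -x:\n%s\n' "$without_x"

# With -x, only lines that are identical to a pattern are printed.
with_x=$(grep -F -x -f "$tmpdir/patterns" "$tmpdir/targets")
printf 'with -x:\n%s\n' "$with_x"

# Optional: list each shared entry only once.
shared_unique=$(printf '%s\n' "$with_x" | LC_ALL=C sort -u)
printf 'shared, unique:\n%s\n' "$shared_unique"

rm -r "$tmpdir"
```

In this sample, LIPG appears twice in the -x output because it occurs twice in the second file, which is the same reason SORT1 and CETP are repeated in the output above.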