Wednesday, November 14, 2012

How to use grep to select specific lines from a file in Unix

Let's say you have a file with over 1,000,000 lines, too many to load into and manipulate in Excel (a worksheet is limited to 1,048,576 rows).
Open a terminal in Unix and use the powerful grep command to select lines containing specific characters, strings, or even specific combinations of strings.

E.g. the file gene_exp.diff contains 2289014 lines. Each line contains multiple strings:
XLOC_000001    XLOC_000001    Lypla1    chr1:4797973-4836816    Random_without_RA    Knockdown_without_RA    OK    125.173    61.8913    -1.01611    6.08719    1.1491e-09    1.67416e-08    yes

To select only the lines containing a specific string, use grep:
grep "Knockdown_without_RA" gene_exp.diff
This prints every line of gene_exp.diff that contains "Knockdown_without_RA".
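
Before printing millions of lines, it can help to see how many lines match. Here is a minimal sketch, assuming a tiny made-up stand-in file rather than the real gene_exp.diff (the names GeneB and GeneC and the shortened rows are hypothetical). The -c option makes grep print the number of matching lines, and -F makes it treat the pattern as a literal string instead of a regular expression.

```shell
# Build a tiny three-line stand-in file (hypothetical data, not the real gene_exp.diff)
sample=$(mktemp)
printf '%s\n' \
  'XLOC_000001 Lypla1 Random_without_RA Knockdown_without_RA OK yes' \
  'XLOC_000002 GeneB Random_without_RA Random_with_RA OK no' \
  'XLOC_000003 GeneC Knockdown_without_RA Knockdown_with_RA OK yes' > "$sample"

# -c prints the number of matching lines instead of the lines themselves
match_count=$(grep -c "Knockdown_without_RA" "$sample")
echo "matching lines: $match_count"

# -F treats the pattern as a literal string rather than a regular expression
fixed_count=$(grep -cF "Knockdown_without_RA" "$sample")
echo "matching lines (literal search): $fixed_count"

# Clean up the temporary file
rm -f "$sample"
```

On the real file the same idea would be grep -c "Knockdown_without_RA" gene_exp.diff.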

To select lines containing a specific combination of strings:
grep "Knockdown_without_RA.*Knockdown_with_RA.*OK.*yes" gene_exp.diff
This will select lines that contain the strings "Knockdown_without_RA", "Knockdown_with_RA", "OK" and "yes" in that order, no matter what characters are in between them (.* matches any run of characters, including none).
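
A pattern joined with .* only matches when the strings appear left to right in the order written. If the order may vary, one option is to chain several grep commands with pipes, so each stage keeps only the lines containing its own string. The sketch below compares the two approaches on a small made-up file (the three rows are hypothetical, not taken from gene_exp.diff).

```shell
# Build a tiny stand-in file; row 2 has the two sample names in swapped order
sample=$(mktemp)
printf '%s\n' \
  'XLOC_000001 Knockdown_without_RA Knockdown_with_RA OK yes' \
  'XLOC_000002 Knockdown_with_RA Knockdown_without_RA OK yes' \
  'XLOC_000003 Random_without_RA Knockdown_with_RA OK yes' > "$sample"

# Single pattern: the strings must appear in this left-to-right order
ordered=$(grep -c "Knockdown_without_RA.*Knockdown_with_RA.*OK.*yes" "$sample")

# Chained greps: each stage filters the previous output, so order does not matter
unordered=$(grep "Knockdown_without_RA" "$sample" \
  | grep "Knockdown_with_RA" \
  | grep "OK" \
  | grep -c "yes")

echo "ordered pattern: $ordered, any order: $unordered"

# Clean up the temporary file
rm -f "$sample"
```

The chained version reads the data once per stage, so on a very large file the single pattern is usually the cheaper choice when the column order is fixed.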

To send the output to less, which lets you page through it (rather than to the screen, which may scroll through an endless number of lines):
grep "Knockdown_without_RA.*Knockdown_with_RA.*OK.*yes" gene_exp.diff | less

To write the output to a file, use > filename:
grep "Knockdown_without_RA.*Knockdown_with_RA.*OK" gene_exp.diff > knockdown_without_ra_vs_knockdown_with_ra.diff
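
After redirecting to a file, a quick sanity check is to compare line counts of the input and the output with wc -l. The sketch below does this on a small made-up file in a temporary directory (the file names sample.diff and filtered.diff and the rows are hypothetical). Keep in mind that > overwrites an existing file, while >> appends to it.

```shell
# Work in a temporary directory so no real files are touched
workdir=$(mktemp -d)
printf '%s\n' \
  'XLOC_000001 Knockdown_without_RA Knockdown_with_RA OK yes' \
  'XLOC_000002 Knockdown_without_RA Knockdown_with_RA OK no' \
  'XLOC_000003 Random_without_RA Knockdown_without_RA OK yes' > "$workdir/sample.diff"

# Write the matching lines to a new file instead of the screen
grep "Knockdown_without_RA.*Knockdown_with_RA.*OK" "$workdir/sample.diff" > "$workdir/filtered.diff"

# Count lines in the input and in the filtered output
input_lines=$(wc -l < "$workdir/sample.diff")
kept_lines=$(wc -l < "$workdir/filtered.diff")
echo "kept $kept_lines of $input_lines lines"

# Clean up the temporary directory
rm -r "$workdir"
```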
