Friday, September 30, 2016

Solution to messy gene tables - compare two file contents using grep

If you have a gene list and want to compare it to another gene list, you can use awk: build a hash table while reading the first file, then look up a chosen column of the second file in that table while reading the second file, as discussed previously. The command below prints the lines of file2 whose fourth column does not appear in the first column of file1:

http://www.genomicscode.org/2015/07/finding-differences-in-two-files-on.html

Code:

awk 'NR==FNR {h[$1] = $1; next} {if (!h[$4]) print $0}' file1 file2
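
To make the two passes concrete, here is a minimal sketch on made-up data. The file names, coordinates and gene rows are hypothetical, and the command is a slight variant of the one above: it tests key membership with `in` rather than the stored value, which should behave the same for ordinary gene names.

```shell
# Hypothetical sample data; names and contents are made up for illustration.
tmpdir=$(mktemp -d)
printf 'AHR\nARNT\n' > "$tmpdir/file1"
printf 'chr7\t100\t200\tAHR\nchr1\t300\t400\tARNT\nchr13\t500\t600\tRB1\n' > "$tmpdir/file2"

# Pass 1 (NR==FNR is true only for the first file): store column 1 as a key of h.
# Pass 2: print the lines of file2 whose column 4 is not a key of h.
missing=$(awk 'NR==FNR {h[$1]; next} !($4 in h)' "$tmpdir/file1" "$tmpdir/file2")
printf '%s\n' "$missing"

rm -r "$tmpdir"
```

Only the RB1 row is printed, because RB1 is the only fourth-column value that is missing from the first file.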

However, if the second file has a messy structure, you will want to scan the complete file rather than focus on an individual column. For example, here I have a gene list that I want to compare with a list I created from BioGRID, which contains gene names, alternative gene names and other information. Clearly, the awk code above would not work in this case:

Miloss-MacBook-Air:test milospjanic$ cat test
SUMO
TANGO
AHR
ARNT
TEST
Miloss-MacBook-Air:test milospjanic$ head -n 20 ahr.biogrid
Displaying 41 total unique interactors
Sort By: [Evidence] [Alphabetical]
10
[details]
AIP
| ARA9, FKBP16, FKBP37, SMTPHN, XAP-2, XAP2
aryl hydrocarbon receptor interacting protein
UBI
9
[details]
ARNT
| HIF-1-beta, HIF-1beta, HIF1-beta, HIF1B, HIF1BETA, TANGO, bHLHe2
aryl hydrocarbon receptor nuclear translocator
UBI
8
[details]
RB1
| RP11-174I10.1, OSRC, PPP1R130, RB, p105-Rb, pRb, pp110
retinoblastoma 1
UBISUMO
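
Instead, an easy solution is to use grep with -Fwf to search the whole second file for lines that contain a gene name from the first file. Here -f reads the patterns from the first file (one per line), -F treats them as fixed strings rather than regular expressions, and -w matches whole words only. Every line of file 2 that contains a gene name from file 1 is printed: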

Miloss-MacBook-Air:test milospjanic$ grep -Fwf test ahr.biogrid 
ARNT
| HIF-1-beta, HIF-1beta, HIF1-beta, HIF1B, HIF1BETA, TANGO, bHLHe2
SUMO
AHR
SUMO
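
The output above shows the matching lines, but not which genes from the list produced them. As a possible extension (not part of the original workflow), adding -o prints only the matched text, which gives the list of genes that were found; a second grep with -v and -x then gives the genes that were not. The sketch below uses small made-up files, so the names `genes` and `messy` and their contents are assumptions.

```shell
# Hypothetical sample data modelled loosely on the files above.
tmpdir=$(mktemp -d)
printf 'SUMO\nTANGO\nAHR\nARNT\nTEST\n' > "$tmpdir/genes"
printf 'ARNT\n| HIF1B, TANGO, bHLHe2\nUBISUMO\nRB1\n' > "$tmpdir/messy"

# -o prints only the matched text, so this lists which genes were found.
# UBISUMO is not reported as SUMO because -w requires a whole-word match.
found=$(grep -oFwf "$tmpdir/genes" "$tmpdir/messy" | sort -u)
printf 'found:\n%s\n' "$found"

# Genes from the list that never appear as a whole word in the messy file:
# -x matches entire lines, -v keeps the lines that do NOT match.
printf '%s\n' "$found" > "$tmpdir/found"
not_found=$(grep -vxFf "$tmpdir/found" "$tmpdir/genes")
printf 'not found:\n%s\n' "$not_found"

rm -r "$tmpdir"
```

With this sample data, ARNT and TANGO are reported as found, and SUMO, AHR and TEST as not found.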
