Friday, July 31, 2015

Finding differences in two files on specific columns using awk - example of comparing two SNP data files

Let's say you have two files and you want to check whether the entries in a column of one file are present in a column of the other file, and vice versa.

head IBD-SNAPResults.txt.snps
rs12103
rs12142199
rs11590283
rs35675666
rs3766606
rs17523802
rs34157438
rs7517357
rs12740409
rs34124834


head IBD-SNAPResults.txt.snps.pos
chr1 1245367 1245368 rs11590283
chr1 1247493 1247494 rs12103
chr1 1249186 1249187 rs12142199
chr1 7989021 7989022 rs9658012
chr1 7997182 7997183 rs7545687
chr1 8014567 8014568 rs35731977
chr1 8021739 8021740 rs17523802
chr1 8021972 8021973 rs35675666
chr1 8022196 8022197 rs3766606
chr1 8023585 8023586 rs34157438


The columns to compare are $1 in file 1 and $4 in file 2. Using awk, we can load all of column 1 from file 1 into a hash h and then test each line of file 2 by using its column 4 entry as a key into that hash.
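As a minimal sketch of this idea, here is the same pattern on toy data. The file names ids.txt and pos.txt and their contents are made up for illustration and are not the real SNP files. The sketch tests membership with awk's `in` operator, which checks whether a key exists without looking at the stored value.

```shell
# Toy data, invented for illustration only (not the real SNP files).
tmp=$(mktemp -d)
printf 'rs1\nrs2\nrs3\n' > "$tmp/ids.txt"
printf 'chr1 10 11 rs2\nchr1 20 21 rs9\n' > "$tmp/pos.txt"

# NR==FNR is true only while awk reads the first file, so the first
# block stores column 1 as hash keys. The second pattern then prints
# lines of the second file whose column 4 is not a key in the hash.
missing=$(awk 'NR==FNR {h[$1]; next} !($4 in h)' "$tmp/ids.txt" "$tmp/pos.txt")
echo "$missing"    # prints: chr1 20 21 rs9
rm -rf "$tmp"
```

Only rs9 is reported, because rs2 appears in both toy files.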


Check if there are entries in column 4 of file 2 that are not present in column 1 of file 1.

awk 'NR==FNR {h[$1] = $1; next} !($4 in h) {print $0}' IBD-SNAPResults.txt.snps IBD-SNAPResults.txt.snps.pos


No output means that every entry in column 4 of file 2 is also present in column 1 of file 1.


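Check if there are entries in column 1 of file 1 that are not present in column 4 of file 2. Note that the file order on the command line is swapped, because the file listed first is the one loaded into the hash.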


awk 'NR==FNR {h[$4] = $4; next} !($1 in h) {print $0}' IBD-SNAPResults.txt.snps.pos IBD-SNAPResults.txt.snps

rs115258758
WARNING
WARNING
rs114188880
rs115853148
rs67626681
rs63368160
rs74848575
rs79569409
rs41313148

...
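The WARNING lines in the output above appear to be lines of file 1 itself. If only rs-prefixed IDs are of interest (an assumption about this data, not something the files confirm), the check could be restricted with a pattern. A sketch on made-up toy files:

```shell
# Toy data, invented for illustration only (not the real SNP files).
tmp=$(mktemp -d)
printf 'rs1\nWARNING\nrs2\nrs3\n' > "$tmp/ids.txt"
printf 'chr1 10 11 rs2\n' > "$tmp/pos.txt"

# Load column 4 of pos.txt into the hash, then report only those lines
# of ids.txt whose column 1 looks like an rs ID (rs followed by digits)
# and is not a key in the hash. The rs pattern is an assumption.
filtered=$(awk 'NR==FNR {h[$4]; next} $1 ~ /^rs[0-9]+$/ && !($1 in h)' "$tmp/pos.txt" "$tmp/ids.txt")
echo "$filtered"
rm -rf "$tmp"
```

On the toy data this reports rs1 and rs3 and skips the WARNING line.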

In conclusion, column 1 of file 1 contains entries that are not present in column 4 of file 2, while every entry in column 4 of file 2 is present in column 1 of file 1.
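Both directions can also be checked in a single awk run. The following is a rough sketch on made-up toy files; the array names in1 and in2 and the report labels are hypothetical. Because the order of `for (id in ...)` is unspecified in awk, the output is piped through sort.

```shell
# Toy data, invented for illustration only (not the real SNP files).
tmp=$(mktemp -d)
printf 'rs1\nrs2\nrs3\n' > "$tmp/ids.txt"
printf 'chr1 10 11 rs2\nchr1 20 21 rs9\n' > "$tmp/pos.txt"

report=$(awk '
    NR==FNR { in1[$1]; next }                     # first file: remember column 1
    { in2[$4] }                                   # second file: remember column 4
    !($4 in in1) { print "only in file 2:", $4 }  # column 4 entry missing from file 1
    END {
        # after both files are read, list column 1 entries never seen in file 2
        for (id in in1)
            if (!(id in in2)) print "only in file 1:", id
    }
' "$tmp/ids.txt" "$tmp/pos.txt" | sort)
echo "$report"
rm -rf "$tmp"
```

This keeps both columns in memory, which is fine for ID lists of this kind but worth keeping in mind for very large files.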
