Wednesday, June 17, 2015

How to compare two files using awk

If you have two files that need to be compared, you can do it using awk.

For example, file1

0010705-3-1     3405456
1020301-7-2     2032766
102901-8-1      2093898
2030801-6-5     2595293
2040401-3-1     2319647
3100203-1-3     2181363
3101801-2-3     2987284
9070202-27-5    2539899
9071501-8-1     2364814
CA-1401-1       2500080

file2

0010705-3-1 1012215012
9070202-27-5 776575978
1020301-7-2 699304881
102901-8-1 622053170
2030801-6-5 785789447
1347-1 713688560
1483-5 743332308
1522-2 707074454
1559-1 713053387


Suppose you want to match rows by the id in column 1 and put column 2 from both files side by side, for example to plot one against the other:

awk 'NR==FNR {h[$1] = $1; h2[$1] = $2; next} {print $1,$2,h[$1],h2[$1]}' file1 file2

0010705-3-1 1012215012 0010705-3-1 3405456
9070202-27-5 776575978 9070202-27-5 2539899
1020301-7-2 699304881 1020301-7-2 2032766
102901-8-1 622053170 102901-8-1 2093898
2030801-6-5 785789447 2030801-6-5 2595293
1347-1 713688560
1483-5 743332308
1522-2 707074454
1559-1 713053387



The awk condition NR==FNR is true only while the first file is being read: NR is the cumulative number of records read across all input files, while FNR is the record number within the current file, so the two are equal only in the first file.
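To see how the two counters behave, here is a minimal sketch that prints FILENAME, NR and FNR for every line. The tiny two-column files and the variable name nr_demo are made up for this illustration and are not the data above.

```shell
# Work in a scratch directory so nothing existing is overwritten.
cd "$(mktemp -d)"

# Two small made-up input files (2 lines and 3 lines).
printf 'a 1\nb 2\n' > file1
printf 'a 9\nc 7\nd 5\n' > file2

# Print both counters for every record, plus which branch NR==FNR selects.
nr_demo=$(awk '{print FILENAME, NR, FNR, (NR==FNR ? "first file" : "later file")}' file1 file2)
echo "$nr_demo"
# file1 1 1 first file
# file1 2 2 first file
# file2 3 1 later file
# file2 4 2 later file
# file2 5 3 later file
```

NR keeps counting when awk moves on to file2, while FNR starts again from 1, so the two stop being equal at that point.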
For each line of file1, store column 1 in hash h and column 2 in hash h2, both keyed by column 1. The next statement then skips the second block, so that block runs only for file2.
For each line of file2, print columns 1 and 2, followed by the values stored in h and h2 for the key $1.
If the key is missing from h and h2, the lookups return empty strings, so only the two columns from file2 are printed, as in the last four lines of the output above.
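If only the rows with a common id are wanted, one possible variant is to test the key with awk's in operator before printing. This is a sketch, not part of the command above: it drops hash h, since the id is already in $1, and it uses shortened copies of the sample files. The variable name common is made up for the example.

```shell
# Work in a scratch directory so nothing existing is overwritten.
cd "$(mktemp -d)"

# Shortened copies of the sample files (a few rows each).
cat > file1 <<'EOF'
0010705-3-1     3405456
1020301-7-2     2032766
CA-1401-1       2500080
EOF

cat > file2 <<'EOF'
0010705-3-1 1012215012
1020301-7-2 699304881
1347-1 713688560
EOF

# First file: remember column 2 under the id from column 1.
# Second file: print a line only when its id was seen in file1.
common=$(awk 'NR==FNR {h2[$1] = $2; next} $1 in h2 {print $1, $2, h2[$1]}' file1 file2)
echo "$common"
# 0010705-3-1 1012215012 3405456
# 1020301-7-2 699304881 2032766
```

The id 1347-1 exists only in file2 and CA-1401-1 only in file1, so neither appears in the output.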
