Monday, December 10, 2012

Comparing two files using grep/awk in Unix

If you have file1:

1
3
5
6
8

and file2:
1       a
2       b
3       c
4       d
5       e
6       f
7       g
8       h

Suppose you want all the rows of file2 whose column 1 value also appears in file1:
grep "$(cat file1)" file2 | awk '{print $0}' > file3
nano file3
1       a
3       c
5       e
6       f
8       h
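
Why this works: "$(cat file1)" expands to the contents of file1, one number per line, and grep treats each line of a multi-line pattern as a separate pattern. The sketch below is an illustration, not part of the original recipe. It assumes a grep with the standard -f option (read one pattern per line from a file). The printf lines only recreate the sample files, so run it in an empty directory:

```shell
# Recreate the sample files from above (file2 columns separated by a tab).
printf '%s\n' 1 3 5 6 8 > file1
printf '%s\t%s\n' 1 a 2 b 3 c 4 d 5 e 6 f 7 g 8 h > file2

# Original form: the patterns are passed as one newline-separated string.
grep "$(cat file1)" file2 > file3

# Equivalent form: grep reads one pattern per line directly from file1.
grep -f file1 file2 > file3_f

# cmp prints nothing and exits 0 when both files are identical.
cmp file3 file3_f && echo "same output"
```

Both forms match substrings anywhere on the line, so they have the same limits.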


If you want the opposite, that is, all the rows of file2 whose column 1 value does not appear in file1:

grep -v "$(cat file1)" file2 | awk '{print $0}' > file3
nano file3
2       b
4       d
7       g

If you want to output only column 1 of those rows from file2:
grep -v "$(cat file1)" file2 | awk '{print $1}' > file3
nano file3

2
4
7

or to print just column 2:
grep -v "$(cat file1)" file2 | awk '{print $2}' > file3
nano file3

b
d
g

____________


Now suppose file1 has more than one column:
1       y
3       y
5       y
6       y
8       y

and file2:
1       a
2       b
3       c
4       d
5       e
6       f
7       g
8       h

Only column 1 of file1 should be compared with file2, so extract it with cut first (cut -f1 assumes the columns are separated by tabs):
grep "$(cut -f1 file1)" file2 | awk '{print $0}' > file3
nano file3
1       a
3       c
5       e
6       f
8       h
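
cut -f1 splits on tabs only. If the columns are separated by spaces, a minimal sketch (an assumption about the input, not something the files above require) is to take column 1 with awk, which splits on any run of blanks. The printf lines recreate space-separated versions of the sample files, and file3_cut is a hypothetical name used only for the comparison:

```shell
# Recreate the two-column sample files, this time separated by spaces.
printf '%s       %s\n' 1 y 3 y 5 y 6 y 8 y > file1
printf '%s       %s\n' 1 a 2 b 3 c 4 d 5 e 6 f 7 g 8 h > file2

# With no tab on the line, cut -f1 returns the whole line ("1       y"),
# so none of the patterns match and the result is empty.
grep "$(cut -f1 file1)" file2 > file3_cut

# awk splits on any run of blanks, so $1 is the first column either way.
grep "$(awk '{print $1}' file1)" file2 > file3
cat file3
```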

___________

The problem with this code is that grep matches substrings anywhere on the line. If file2 contains a line with, say, the number 88, it will be picked up because it contains the character 8, which is present in file1:
1       a
2       b
3       c
4       d
5       e
6       f
7       g
8       h
88       h

grep "$(cut -f1 file1)" file2 | awk '{print $0}' > file3
nano file3
1       a
3       c
5       e
6       f
8       h
88       h

So this code is only reliable for simple files, where no value in file1 can match part of a different value in file2.
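
A partial workaround that stays with grep is the -w option, which only accepts matches that form a whole word. This is a sketch under assumptions: it needs a grep that supports -w (GNU grep does), and it still looks at the whole line, so an 8 standing alone in column 2 would also match. The output names file3_substring and file3_word are hypothetical:

```shell
# Recreate the sample files, including the extra line starting with 88.
printf '%s\n' 1 3 5 6 8 > file1
printf '%s\t%s\n' 1 a 2 b 3 c 4 d 5 e 6 f 7 g 8 h 88 h > file2

# Plain substring match: the pattern 8 also matches the line starting with 88.
grep "$(cat file1)" file2 > file3_substring

# -w requires the match to be a whole word, so 8 no longer matches 88.
grep -w "$(cat file1)" file2 > file3_word
cat file3_word
```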
___________

SOLUTION

One solution is to use awk, which compares the whole of column 1 instead of searching for substrings:

awk -F " " 'BEGIN { while ((getline < "file1") > 0) a[$1] = 1 } a[$1] == 1 { print $0 }' file2 > file3
nano file3
1       a
3       c
5       e
6       f
8       h
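
To see how the one-liner works, here is the same program spread over several lines with comments. It is a sketch of the same logic rather than a different method. The explicit "> 0" test and the close() call are small additions, and the printf lines only recreate the sample files:

```shell
# Recreate the sample files, including the extra line starting with 88.
printf '%s\n' 1 3 5 6 8 > file1
printf '%s\t%s\n' 1 a 2 b 3 c 4 d 5 e 6 f 7 g 8 h 88 h > file2

awk '
BEGIN {
    # Read file1 once, before any line of file2 is processed.
    # getline returns 1 for each line read, 0 at end of file and -1 on
    # error, so testing "> 0" also stops the loop if file1 is unreadable.
    while ((getline < "file1") > 0)
        a[$1] = 1          # remember column 1 of file1 as an array key
    close("file1")
}
# Main rule, run for every line of file2: print the line only when its
# entire first field was stored above. 88 is not a key, so it is skipped.
a[$1] == 1 { print $0 }
' file2 > file3
cat file3
```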


To output the lines of file2 whose column 1 value does not appear in file1:

awk -F " " 'BEGIN { while ((getline < "file1") > 0) a[$1] = 1 } a[$1] != 1 { print $0 }' file2 > file3
nano file3
2       b
4       d
7       g
88      h
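
A common alternative idiom, not used above, passes both files to awk as arguments and tells them apart with NR and FNR instead of getline. This is a minimal sketch that assumes file1 is not empty (with an empty file1 the NR == FNR test would also be true for file2). The array name seen and the output name file4 are arbitrary:

```shell
# Recreate the sample files, including the extra line starting with 88.
printf '%s\n' 1 3 5 6 8 > file1
printf '%s\t%s\n' 1 a 2 b 3 c 4 d 5 e 6 f 7 g 8 h 88 h > file2

# While awk reads the first file, NR (lines read overall) equals FNR (lines
# read in the current file): store column 1 as a key and skip to the next line.
# For file2 the condition is false, so only the membership test runs.
awk 'NR == FNR { seen[$1]; next } $1 in seen' file1 file2 > file3

# Negating the test keeps the rows of file2 whose column 1 is not in file1.
awk 'NR == FNR { seen[$1]; next } !($1 in seen)' file1 file2 > file4
cat file3 file4
```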

2 comments:

  1. Hi, I just came across this blog, and I found it is nice to know the grep ($ cat file1) trick, but the join command in linux can solve the same question:)

    see here:
    http://crazyhottommy.blogspot.com/2013/05/linux-command-join.html?view=sidebar

  2. Here's another way:
    http://theunixshell.blogspot.in/2012/12/i-have-two-files-file-1-contains-3.html
