Sunday, May 14, 2017

Unique on one column and keep maximum value from another column in Unix

If you have a file, e.g.

head id_merge_ens_length.txt
ENSMUSG00000095309 924
ENSMUSG00000000126 3318
ENSMUSG00000000126 3320
ENSMUSG00000086196 1381
ENSMUSG00000086196 2127
ENSMUSG00000054418 1784
ENSMUSG00000095268 1960
ENSMUSG00000082399 480
ENSMUSG00000097090 1650
ENSMUSG00000020063 3353
And want to unique on first column while keeping the maximum on second column, first use sort to sort on column 1 followed by reverse numerical sort on column 2:

sort -k1,1 -k2,2nr file.txt | head
ENSMUSG00000000001 3262
ENSMUSG00000000003 902
ENSMUSG00000000003 697
ENSMUSG00000000028 2143
ENSMUSG00000000028 1747
ENSMUSG00000000028 832
ENSMUSG00000000031 2286
ENSMUSG00000000031 1853
ENSMUSG00000000031 935
ENSMUSG00000000031 817
Now, to select the column with maximum value for each unique entry from column one, use awk. Use awk '!a[$1]++' as a short version of awk '{if(! a[$1]){print; a[$1]++}}' meaning if the current first field ($1) is not present in the a array, which will happen if it is a unique field, print the line and add the 1st field to the a array. Next time we see the same field, it will be already in the array and it will not be printed due to the condition if(!a[$1]).

sort -k1,1 -k2,2nr file.txt | awk '!a[$1]++' | head
ENSMUSG00000000001 3262
ENSMUSG00000000003 902
ENSMUSG00000000028 2143
ENSMUSG00000000031 2286
ENSMUSG00000000037 4847
ENSMUSG00000000049 1190
ENSMUSG00000000056 4395
ENSMUSG00000000058 2733
ENSMUSG00000000078 4217
ENSMUSG00000000085 3544

1 comment:

  1. i had to use this procedure for one of my custom thesis writing and i was so confused as to what to do about it. and your post really guides me through it all. thank you so much for posting this