Monday, December 22, 2014

Find all files in a folder with a certain name

Find all files on the system with the name 1x56_peak_gwas_ld_li_eur.txt

sudo find / -name 1x56_peak_gwas_ld_li_eur.txt

Find all files in your home directory with the name 1x56_peak_gwas_ld_li_eur.txt

find ~ -name 1x56_peak_gwas_ld_li_eur.txt
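
A slightly more defensive variant, as a minimal sketch: quote the name so the shell hands it to find untouched, add -type f to match regular files only, and send "Permission denied" messages to /dev/null. The scratch directory and the variable names (demo_dir, found, n_txt) are made up for this demo.

```shell
# Build a throwaway directory tree (hypothetical, only for this demo).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/a/b"
touch "$demo_dir/a/b/1x56_peak_gwas_ld_li_eur.txt" "$demo_dir/a/other.txt"

# Quote the name so the shell passes it to find unchanged,
# match regular files only, and hide "Permission denied" messages.
found=$(find "$demo_dir" -type f -name '1x56_peak_gwas_ld_li_eur.txt' 2>/dev/null)
echo "$found"

# A quoted wildcard matches every file with the same suffix.
n_txt=$(find "$demo_dir" -type f -name '*.txt' 2>/dev/null | wc -l)
echo "$n_txt"
```

Quoting matters most with wildcards: an unquoted *.txt would be expanded by the shell in the current directory before find ever sees it.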

Collapsing snp-pair association table in GWAS-ChIPSeq comparison

If you have a file with SNP pair associations (e.g. SNPs from GWAS studies and SNPs found within chip-seq peaks) and you want to select the pairs that meet a specified linkage disequilibrium threshold, for example r2>0.7, you need to filter the table, which can be done with awk:

http://milospjanic.blogspot.com/2014/09/delete-all-rows-with-column-n-contains.html
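
The linked post is not reproduced here. As a rough sketch of such a filter, assuming a tab-delimited table with r2 in the 6th column (as in the example file shown further down in this post), something like the following should do. The three rows are copied from that example file; the scratch file and the variable names are only for the demo.

```shell
# Three rows copied from the example file, written to a scratch file.
tmp=$(mktemp)
printf '38356410\t38461119\tCEU\trs13132853\trs10004195\t0.751\t0.126\t2.26\t383\tHelicobacter pylori serologic status\n' > "$tmp"
printf '38461119\t38621564\tCEU\trs10004195\trs1060582\t0.66\t0.033\t0.58\t384\tHelicobacter pylori serologic status\n' >> "$tmp"
printf '38461119\t38621612\tCEU\trs10004195\trs10011235\t1.0\t0.023\t0.21\t384\tHelicobacter pylori serologic status\n' >> "$tmp"

# Keep only rows whose 6th tab-separated field (r2) is greater than 0.7.
kept=$(awk -F'\t' '$6 > 0.7' "$tmp")
echo "$kept"
```

Setting -F'\t' matters here because the last column (the trait name) contains spaces.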

However, even with this selection we are still overestimating the association between SNPs, since e.g. SNPs from GWAS may be in high LD with each other and each of these will associate with high LD to a single SNP from chip-seq peaks. Conversely, the SNPs from chip-seq peak/peaks may be in high LD with each other and correlate with high LD to a single GWAS SNP.

This may lead to overestimation of the association between GWAS phenotypes and the transcription factor probed with chip-seq.

To collapse the table and keep only a single r2 value (e.g. the maximum) for each chip-seq SNP and each GWAS SNP, use the code below.

In this example we will use the file r2_from_hapmap_1x56_peak_gwas_ld_li_eur that has SNP pairs from GWAS catalog and chip-seq experiment for a transcription factor.

Let's read the beginning of the file first.

head r2_from_hapmap_1x56_peak_gwas_ld_li_eur.txt
38356410        38461119        CEU     rs13132853      rs10004195      0.751   0.126   2.26    383     Helicobacter pylori serologic status
38461119        38621564        CEU     rs10004195      rs1060582       0.66    0.033   0.58    384     Helicobacter pylori serologic status
38461119        38621612        CEU     rs10004195      rs10011235      1.0     0.023   0.21    384     Helicobacter pylori serologic status
142225023       142416549       CEU     rs10007052      rs354834        0.082   0.0020  0.04    1422    Chronic obstructive pulmonary disease-related biomarkers
61257852        61411881        CEU     rs2013326       rs1000778       0.121   0.015   0.27    612     Sphingolipid levels
61411881        61446981        CEU     rs1000778       rs12420625      1.0     0.0040  0.14    614     Sphingolipid levels
61411881        61447283        CEU     rs1000778       rs741887        0.34    0.049   0.78    614     Sphingolipid levels
61411881        61447477        CEU     rs1000778       rs7927548       0.356   0.0070  0.15    614     Sphingolipid levels
61411881        61479221        CEU     rs1000778       rs1109748       0.162   0.0010  0.01    614     Sphingolipid levels
61411881        61479414        CEU     rs1000778       rs195165        0.139   0.0     0.01    614     Sphingolipid levels

The 4th column has the SNP id for GWAS SNPs and the 5th column has the SNP id for chip-seq SNPs. The 6th column contains the r2 value for each row, i.e. each SNP pair.

The following R code collapses the table and saves the result with either one SNP column collapsed (test2 for GWAS SNPs, test3 for chip-seq SNPs) or both (test4). The 6th column must first be converted to numeric with transform(test, V6 = as.numeric(as.character(V6))). aggregate then does the collapsing, but it keeps only the grouping column and column 6, so merge joins the aggregated data back to the original data to recover all the original columns.

# Read the table (no header row) and keep the first 10 columns
test <- read.delim("r2_from_hapmap_1x56_peak_gwas_ld_li_eur.txt", header=FALSE)
test <- test[,1:10]
# Make sure r2 (V6) is numeric
test <- transform(test, V6 = as.numeric(as.character(V6)))

# Collapse on GWAS SNP (V4): keep the maximum r2 per SNP, then merge back to recover all columns
test2 <- merge(aggregate(V6~V4, data=test, max), test, all.x=TRUE)
write.table(file="GWAS_collapse_2", test2, sep="\t", quote=FALSE)

# Collapse on chip-seq SNP (V5) in the same way
test3 <- merge(aggregate(V6~V5, data=test, max), test, all.x=TRUE)
write.table(file="chip-seq_snps_collapse_2", test3, sep="\t", quote=FALSE)

# Collapse on both: start from the GWAS-collapsed table and collapse it on V5
test4 <- merge(aggregate(V6~V5, data=test2, max), test2, all.x=TRUE)
write.table(file="ALL_collapse_2", test4, sep="\t", quote=FALSE)

Let's save the beginning of the file into a test file as an example of culling the data.

head r2_from_hapmap_1x56_peak_gwas_ld_li_eur.txt > r2_from_hapmap_1x56_peak_gwas_ld_li_eur.txt_test

test <- read.delim("r2_from_hapmap_1x56_peak_gwas_ld_li_eur.txt_test", header=F)
test <- test[,1:10]
test
          V1        V2  V3         V4         V5    V6    V7   V8   V9
1   38356410  38461119 CEU rs13132853 rs10004195 0.751 0.126 2.26  383
2   38461119  38621564 CEU rs10004195  rs1060582 0.660 0.033 0.58  384
3   38461119  38621612 CEU rs10004195 rs10011235 1.000 0.023 0.21  384
4  142225023 142416549 CEU rs10007052   rs354834 0.082 0.002 0.04 1422
5   61257852  61411881 CEU  rs2013326  rs1000778 0.121 0.015 0.27  612
6   61411881  61446981 CEU  rs1000778 rs12420625 1.000 0.004 0.14  614
7   61411881  61447283 CEU  rs1000778   rs741887 0.340 0.049 0.78  614
8   61411881  61447477 CEU  rs1000778  rs7927548 0.356 0.007 0.15  614
9   61411881  61479221 CEU  rs1000778  rs1109748 0.162 0.001 0.01  614
10  61411881  61479414 CEU  rs1000778   rs195165 0.139 0.000 0.01  614
                                                        V10
1                      Helicobacter pylori serologic status
2                      Helicobacter pylori serologic status
3                      Helicobacter pylori serologic status
4  Chronic obstructive pulmonary disease-related biomarkers
5                                       Sphingolipid levels
6                                       Sphingolipid levels
7                                       Sphingolipid levels
8                                       Sphingolipid levels
9                                       Sphingolipid levels
10                                      Sphingolipid levels

test <- transform(test, V6 = as.numeric(as.character(V6)))
test2<-merge(aggregate(V6~V4, data=test, max), test, all.x=T)
test2
          V4    V6        V1        V2  V3         V5    V7   V8   V9
1 rs10004195 1.000  38461119  38621612 CEU rs10011235 0.023 0.21  384
2 rs10007052 0.082 142225023 142416549 CEU   rs354834 0.002 0.04 1422
3  rs1000778 1.000  61411881  61446981 CEU rs12420625 0.004 0.14  614
4 rs13132853 0.751  38356410  38461119 CEU rs10004195 0.126 2.26  383
5  rs2013326 0.121  61257852  61411881 CEU  rs1000778 0.015 0.27  612
                                                       V10
1                     Helicobacter pylori serologic status
2 Chronic obstructive pulmonary disease-related biomarkers
3                                      Sphingolipid levels
4                     Helicobacter pylori serologic status
5                                      Sphingolipid levels

What we see here is that we collapsed the data and kept only unique SNP identifiers in the 4th column. SNP rs1000778 was culled from 5 entries to one and SNP rs10004195 was culled from 2 entries to one, and only the pairs with maximum r2 values were kept.

In this code we have overlooked one important thing that gives rise to multiple pairs with the same SNP id in the output. What happens if multiple pairs from one GWAS SNP (or one chip-seq SNP) have the same maximum r2 value? They will all be kept: aggregate returns only a single V4-V6 pair, but when that pair is merged with the original data it matches every row that has the same SNP id and the same r2 value.

For example, let's grep one SNP, save the output to a new file, and repeat the steps from above:

grep "rs12418204" r2_from_hapmap_1x56_peak_gwas_ld_li_eur.txt
71927771        72074657        CEU     rs12418204      rs11824205      1.0     0.0060  0.23    719     Optic disc size (cup)
71927771        72074848        CEU     rs12418204      rs2291288       1.0     0.0     0.02    719     Optic disc size (cup)
71927771        72091837        CEU     rs12418204      rs3765105       0.461   0.01    0.18    719     Optic disc size (cup)
71927771        72092977        CEU     rs12418204      rs2306615       1.0     0.0080  0.31    719     Optic disc size (cup)
71927771        72124606        CEU     rs12418204      rs12808507      0.171   0.0020  0.04    719     Optic disc size (cup)

grep "rs12418204" r2_from_hapmap_1x56_peak_gwas_ld_li_eur.txt > r2_from_hapmap_1x56_peak_gwas_ld_li_eur.txt_test2

test <- read.delim("r2_from_hapmap_1x56_peak_gwas_ld_li_eur.txt_test2", header=F)
test <- test[,1:10]
test
        V1       V2  V3         V4         V5    V6    V7   V8  V9
1 71927771 72074657 CEU rs12418204 rs11824205 1.000 0.006 0.23 719
2 71927771 72074848 CEU rs12418204  rs2291288 1.000 0.000 0.02 719
3 71927771 72091837 CEU rs12418204  rs3765105 0.461 0.010 0.18 719
4 71927771 72092977 CEU rs12418204  rs2306615 1.000 0.008 0.31 719
5 71927771 72124606 CEU rs12418204 rs12808507 0.171 0.002 0.04 719
                    V10
1 Optic disc size (cup)
2 Optic disc size (cup)
3 Optic disc size (cup)
4 Optic disc size (cup)
5 Optic disc size (cup)

test <- transform(test, V6 = as.numeric(as.character(V6)))
test2<-merge(aggregate(V6~V4, data=test, max), test, all.x=T)
test2
          V4 V6       V1       V2  V3         V5    V7   V8  V9
1 rs12418204  1 71927771 72074657 CEU rs11824205 0.006 0.23 719
2 rs12418204  1 71927771 72074848 CEU  rs2291288 0.000 0.02 719
3 rs12418204  1 71927771 72092977 CEU  rs2306615 0.008 0.31 719
                    V10
1 Optic disc size (cup)
2 Optic disc size (cup)
3 Optic disc size (cup)

What we can see is that all pairs with the maximum r2 value of 1 were kept (3 rows).

To keep only one of them (i.e. the first entry) we can quickly collapse the output files in Unix with sort -u.
Write the output to a file in R:

write.table(file="test_file", test2, sep="\t", quote=F)

In bash write:

grep -v "V1" test_file > test_file_removeheader
sort -u -k2,2 test_file_removeheader > test_file_sortuniq
cat test_file_sortuniq

1       rs12418204      1       71927771        72074657        CEU     rs11824205      0.006   0.23    719     Optic disc size (cup)

So to finish the original data set, do the same thing with the ALL_collapse_2 file:

grep -v "V1" ALL_collapse_2 > ALL_collapse_2_removeheader
sort -u -k2,2 ALL_collapse_2_removeheader > ALL_collapse_2_sortuniq
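
As an aside, the tie problem can also be handled entirely in the shell. This is not the R approach used above but a swapped-in alternative, sketched under the assumption that the input is the original tab-delimited table (GWAS SNP in column 4, r2 in column 6): sort by SNP and then by r2 in descending order, and let awk keep the first row it sees for each SNP. The rows below are copied from the examples above and trimmed to the first 6 columns for brevity; the scratch file and variable names are only for the demo.

```shell
# Rows from the examples above, trimmed to columns 1-6 (trimming is only for brevity).
tmp=$(mktemp)
printf '71927771\t72074657\tCEU\trs12418204\trs11824205\t1.0\n'  > "$tmp"
printf '71927771\t72074848\tCEU\trs12418204\trs2291288\t1.0\n'  >> "$tmp"
printf '71927771\t72091837\tCEU\trs12418204\trs3765105\t0.461\n' >> "$tmp"
printf '71927771\t72092977\tCEU\trs12418204\trs2306615\t1.0\n'  >> "$tmp"
printf '38461119\t38621564\tCEU\trs10004195\trs1060582\t0.66\n' >> "$tmp"
printf '38461119\t38621612\tCEU\trs10004195\trs10011235\t1.0\n' >> "$tmp"

tab=$(printf '\t')

# Sort by GWAS SNP (column 4), then by r2 (column 6) from high to low.
# awk prints a row only the first time its column-4 value appears,
# so exactly one row with the maximum r2 survives per SNP, ties included.
collapsed=$(LC_ALL=C sort -t "$tab" -k4,4 -k6,6nr "$tmp" | awk -F'\t' '!seen[$4]++')
echo "$collapsed"
```

To collapse on the chip-seq SNP instead, the same idea should work with column 5 in place of column 4 in both the sort key and the awk expression.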

Wednesday, December 17, 2014

Transferring a folder with rsync - example of a content sync vs complete folder transfer

Make test folder with 10 empty files:

mkdir test
touch test/test{1..10}

Use rsync to transfer the test folder to a server. Options: -a (archive) implies -r, so it syncs recursively and preserves symbolic links, modification times, and permissions; -v is verbose; -z compresses data in transit to reduce network transfer; -P combines --progress and --partial, where the first shows a progress bar for each transfer and the second allows interrupted transfers to be resumed; -c compares files by checksum rather than by size and modification time; -h prints numbers in human-readable form:

rsync -chavzP --stats test/ mpjanic@zoran.stanford.edu:~/test/
mpjanic@zoran.stanford.edu's password: 
building file list ... 
11 files to consider
created directory /home/mpjanic/test
./
test1
           0 100%    0.00kB/s    0:00:00 (xfer#1, to-check=9/11)
test10
           0 100%    0.00kB/s    0:00:00 (xfer#2, to-check=8/11)
test2
           0 100%    0.00kB/s    0:00:00 (xfer#3, to-check=7/11)
test3
           0 100%    0.00kB/s    0:00:00 (xfer#4, to-check=6/11)
test4
           0 100%    0.00kB/s    0:00:00 (xfer#5, to-check=5/11)
test5
           0 100%    0.00kB/s    0:00:00 (xfer#6, to-check=4/11)
test6
           0 100%    0.00kB/s    0:00:00 (xfer#7, to-check=3/11)
test7
           0 100%    0.00kB/s    0:00:00 (xfer#8, to-check=2/11)
test8
           0 100%    0.00kB/s    0:00:00 (xfer#9, to-check=1/11)
test9
           0 100%    0.00kB/s    0:00:00 (xfer#10, to-check=0/11)

Number of files: 11
Number of files transferred: 10
Total file size: 0 bytes
Total transferred file size: 0 bytes
Literal data: 0 bytes
Matched data: 0 bytes
File list size: 305
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 717
Total bytes received: 246

sent 717 bytes  received 246 bytes  214.00 bytes/sec
total size is 0  speedup is 0.00

The previous command created the test folder on the server and synced its contents with the test folder on our computer. Note that it is the trailing slash on the source (test/) that tells rsync to transfer the contents of the folder; the destination was given as ~/test/.
The same can be done without the trailing slash, in which case the whole folder is transferred rather than its contents, so the destination in the rsync command has to be the server's home folder into which the test folder will be placed, i.e. ~

rsync -chavzP --stats test mpjanic@zoran.stanford.edu:~
mpjanic@zoran.stanford.edu's password: 
building file list ... 
11 files to consider
test/
test/test1
           0 100%    0.00kB/s    0:00:00 (xfer#1, to-check=9/11)
test/test10
           0 100%    0.00kB/s    0:00:00 (xfer#2, to-check=8/11)
test/test2
           0 100%    0.00kB/s    0:00:00 (xfer#3, to-check=7/11)
test/test3
           0 100%    0.00kB/s    0:00:00 (xfer#4, to-check=6/11)
test/test4
           0 100%    0.00kB/s    0:00:00 (xfer#5, to-check=5/11)
test/test5
           0 100%    0.00kB/s    0:00:00 (xfer#6, to-check=4/11)
test/test6
           0 100%    0.00kB/s    0:00:00 (xfer#7, to-check=3/11)
test/test7
           0 100%    0.00kB/s    0:00:00 (xfer#8, to-check=2/11)
test/test8
           0 100%    0.00kB/s    0:00:00 (xfer#9, to-check=1/11)
test/test9
           0 100%    0.00kB/s    0:00:00 (xfer#10, to-check=0/11)

Number of files: 11
Number of files transferred: 10
Total file size: 0 bytes
Total transferred file size: 0 bytes
Literal data: 0 bytes
Matched data: 0 bytes
File list size: 310
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 722
Total bytes received: 246

sent 722 bytes  received 246 bytes  176.00 bytes/sec
total size is 0  speedup is 0.00

If you then delete the first 5 files on the server (from within ~/test):

rm test{1..5}

With the first command you will transfer only the files missing:

rsync -chavzP --stats test/ mpjanic@zoran.stanford.edu:~/test/
mpjanic@zoran.stanford.edu's password: 
building file list ... 
11 files to consider
./
test1
           0 100%    0.00kB/s    0:00:00 (xfer#1, to-check=9/11)
test2
           0 100%    0.00kB/s    0:00:00 (xfer#2, to-check=7/11)
test3
           0 100%    0.00kB/s    0:00:00 (xfer#3, to-check=6/11)
test4
           0 100%    0.00kB/s    0:00:00 (xfer#4, to-check=5/11)
test5
           0 100%    0.00kB/s    0:00:00 (xfer#5, to-check=4/11)

Number of files: 11
Number of files transferred: 5
Total file size: 0 bytes
Total transferred file size: 0 bytes
Literal data: 0 bytes
Matched data: 0 bytes
File list size: 305
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 522
Total bytes received: 136

sent 522 bytes  received 136 bytes  188.00 bytes/sec
total size is 0  speedup is 0.00

The second command will also recognize that the folder it is transferring already exists and will transfer only the 5 missing files:

rsync -chavzP --stats test mpjanic@zoran.stanford.edu:~
mpjanic@zoran.stanford.edu's password: 
building file list ... 
11 files to consider
test/
test/test1
           0 100%    0.00kB/s    0:00:00 (xfer#1, to-check=9/11)
test/test2
           0 100%    0.00kB/s    0:00:00 (xfer#2, to-check=7/11)
test/test3
           0 100%    0.00kB/s    0:00:00 (xfer#3, to-check=6/11)
test/test4
           0 100%    0.00kB/s    0:00:00 (xfer#4, to-check=5/11)
test/test5
           0 100%    0.00kB/s    0:00:00 (xfer#5, to-check=4/11)

Number of files: 11
Number of files transferred: 5
Total file size: 0 bytes
Total transferred file size: 0 bytes
Literal data: 0 bytes
Matched data: 0 bytes
File list size: 310
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 527
Total bytes received: 136

sent 527 bytes  received 136 bytes  147.33 bytes/sec
total size is 0  speedup is 0.00

However, if you intend to sync the folder contents as in the first example but forget the slash after the source folder name, rsync will treat the whole folder as the item to transfer and will copy the complete test folder into the test folder on the server. Try to avoid this error.

rsync -chavzP --stats test mpjanic@zoran.stanford.edu:~/test/
mpjanic@zoran.stanford.edu's password: 
building file list ... 
11 files to consider
test/
test/test1
           0 100%    0.00kB/s    0:00:00 (xfer#1, to-check=9/11)
test/test10
           0 100%    0.00kB/s    0:00:00 (xfer#2, to-check=8/11)
test/test2
           0 100%    0.00kB/s    0:00:00 (xfer#3, to-check=7/11)
test/test3
           0 100%    0.00kB/s    0:00:00 (xfer#4, to-check=6/11)
test/test4
           0 100%    0.00kB/s    0:00:00 (xfer#5, to-check=5/11)
test/test5
           0 100%    0.00kB/s    0:00:00 (xfer#6, to-check=4/11)
test/test6
           0 100%    0.00kB/s    0:00:00 (xfer#7, to-check=3/11)
test/test7
           0 100%    0.00kB/s    0:00:00 (xfer#8, to-check=2/11)
test/test8
           0 100%    0.00kB/s    0:00:00 (xfer#9, to-check=1/11)
test/test9
           0 100%    0.00kB/s    0:00:00 (xfer#10, to-check=0/11)

Number of files: 11
Number of files transferred: 10
Total file size: 0 bytes
Total transferred file size: 0 bytes
Literal data: 0 bytes
Matched data: 0 bytes
File list size: 310
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 722
Total bytes received: 246

sent 722 bytes  received 246 bytes  276.57 bytes/sec
total size is 0  speedup is 0.00

Now the contents of the test folder on the server are test6-10 files plus the complete test folder transferred:

mpjanic@zoran:~/test$ ls -l
total 4
drwxr-xr-x 2 mpjanic mpjanic 4096 Dec 17 14:51 test
-rw-r--r-- 1 mpjanic mpjanic    0 Dec 17 14:51 test10
-rw-r--r-- 1 mpjanic mpjanic    0 Dec 17 14:51 test6
-rw-r--r-- 1 mpjanic mpjanic    0 Dec 17 14:51 test7
-rw-r--r-- 1 mpjanic mpjanic    0 Dec 17 14:51 test8
-rw-r--r-- 1 mpjanic mpjanic    0 Dec 17 14:51 test9
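
The trailing-slash behaviour can be tried safely on one machine before touching a server. A minimal sketch, assuming rsync is installed; the directories under the scratch folder (src, dst_contents, dst_folder) are hypothetical stand-ins for the local computer and the server.

```shell
# Local stand-ins for the computer and the server (hypothetical paths).
work=$(mktemp -d)
mkdir -p "$work/src/test" "$work/dst_contents" "$work/dst_folder"
touch "$work/src/test/test1" "$work/src/test/test2"

if command -v rsync >/dev/null 2>&1; then
    # Trailing slash on the source: copy the CONTENTS of test into the destination.
    rsync -a "$work/src/test/" "$work/dst_contents/"
    # No trailing slash: copy the folder itself, creating dst_folder/test.
    rsync -a "$work/src/test" "$work/dst_folder/"
    ls "$work/dst_contents"   # the files test1 and test2
    ls "$work/dst_folder"     # the folder test
fi
```

Adding -n (--dry-run) to any rsync command shows what would be transferred without copying anything, which is a cheap way to catch a missing slash.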

Tuesday, December 16, 2014

How to download specific folder using wget

If you need to download a complete folder from an HTTP server using wget, try the -r option to recursively copy the contents of the folder you specify. If the server requires a login (HTTP authentication, whether over http or https), use the flags --http-user and --http-password to specify the username and password.

wget -r --http-user=username --http-password=password https://server.address/folder1/folder2

In this example, folder2 within folder1 on the https server is being transferred.

However, this may result in data from the parent folder or folders being copied as well. To prevent this, use the flag --no-parent, which stops wget from ascending to parent folders.

wget -r --http-user=username --http-password=password --no-parent https://server.address/folder1/folder2/
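
A few more wget options are often useful with recursive downloads. The sketch below wraps them in a hypothetical helper, fetch_folder (the function name and the DRY_RUN switch are made up for this example), and only prints the command, since server.address is a placeholder. It uses --user with --ask-password so that the password is typed at a prompt rather than left in the shell history; --cut-dirs=1 is specific to the folder1/folder2 layout of the example.

```shell
# Hypothetical helper: $1 = folder URL (keep the trailing slash), $2 = user name.
fetch_folder() {
    # -r              recurse into the folder
    # --no-parent     never ascend above the requested folder
    # -nH             do not create a server.address/ directory locally
    # --cut-dirs=1    drop the leading folder1/ component from saved paths
    # --ask-password  prompt for the password instead of passing it as an argument
    # With DRY_RUN set, the command is printed instead of executed.
    ${DRY_RUN:+echo} wget -r --no-parent -nH --cut-dirs=1 \
        --user="$2" --ask-password "$1"
}

# Print the command for review; drop DRY_RUN=1 to actually run it.
cmd=$(DRY_RUN=1 fetch_folder 'https://server.address/folder1/folder2/' username)
echo "$cmd"
```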

Tuesday, December 9, 2014

How to paste and cut in one command

If you have two files that you want to paste side by side, use paste command.

file1:
chr1    11128   11519   MACS_peak_1     530.61
chr1    89268   89360   MACS_peak_2     58.63
chr1    153496  153625  MACS_peak_3     52.19
chr1    545624  545758  MACS_peak_4     63.70
chr1    564601  565447  MACS_peak_5     80.25

file2:
chr1    566050  566363  MACS_peak_6     120.67
chr1    567242  568254  MACS_peak_7     212.45
chr1    569057  570300  MACS_peak_8     169.08
chr1    704763  704818  MACS_peak_9     93.35
chr1    724126  724259  MACS_peak_10    58.44

paste file1 file2
chr1    11128   11519   MACS_peak_1     530.61  chr1    566050  566363  MACS_peak_6     120.67
chr1    89268   89360   MACS_peak_2     58.63   chr1    567242  568254  MACS_peak_7     212.45
chr1    153496  153625  MACS_peak_3     52.19   chr1    569057  570300  MACS_peak_8     169.08
chr1    545624  545758  MACS_peak_4     63.70   chr1    704763  704818  MACS_peak_9     93.35
chr1    564601  565447  MACS_peak_5     80.25   chr1    724126  724259  MACS_peak_10    58.44

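If you want, in the same command, to cut columns 1,2,3,5 from file2 and then paste, a paste|cut pipe doesn't work here: cut is given file2 as an argument, so it ignores whatever paste sends through the pipe and prints only the columns from file2: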

paste file1 | cut -f1,2,3,5 file2
chr1    566050  566363  120.67
chr1    567242  568254  212.45
chr1    569057  570300  169.08
chr1    704763  704818  93.35
chr1    724126  724259  58.44

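However, the command below, which uses bash process substitution <(...) to hand the output of cut to paste as if it were a file, works just fine: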

paste file1 <(cut -f1,2,3,5 file2)
chr1    11128   11519   MACS_peak_1     530.61  chr1    566050  566363  120.67
chr1    89268   89360   MACS_peak_2     58.63   chr1    567242  568254  212.45
chr1    153496  153625  MACS_peak_3     52.19   chr1    569057  570300  169.08
chr1    545624  545758  MACS_peak_4     63.70   chr1    704763  704818  93.35
chr1    564601  565447  MACS_peak_5     80.25   chr1    724126  724259  58.44

So if you want to cut from both files and then paste, type:

paste <(cut -f1,2,3,5 file1) <(cut -f1,2,3,5 file2)
chr1    11128   11519   530.61  chr1    566050  566363  120.67
chr1    89268   89360   58.63   chr1    567242  568254  212.45
chr1    153496  153625  52.19   chr1    569057  570300  169.08
chr1    545624  545758  63.70   chr1    704763  704818  93.35
chr1    564601  565447  80.25   chr1    724126  724259  58.44
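
Process substitution is a bash feature. In shells that lack it, a pipe can still be made to work by telling paste to read one of its inputs from standard input with "-". A minimal sketch, using the first row of each example file; the scratch directory and variable names are only for the demo.

```shell
# First row of each example file, written to a scratch directory.
dir=$(mktemp -d)
printf 'chr1\t11128\t11519\tMACS_peak_1\t530.61\n' > "$dir/file1"
printf 'chr1\t566050\t566363\tMACS_peak_6\t120.67\n' > "$dir/file2"

# cut writes the selected columns of file2 to the pipe;
# "-" tells paste to read its second input from standard input.
joined=$(cut -f1,2,3,5 "$dir/file2" | paste "$dir/file1" -)
echo "$joined"
```

This covers the case where only one of the two files needs cutting; with two cut commands, process substitution or temporary files are still needed.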

Sourcing R script does not print output to the console

If you want to source an R script and print its output, use echo=TRUE:

source("script.R", echo=TRUE)

Wednesday, December 3, 2014

How to sort bed files for bedtools input

In some cases you will need to sort your bed file to be able to use it with bedtools.

If you use bedtools sort you will not get karyotype order (chr1, chr2, chr3); instead you will get chr1, chr10, chr11 etc.

You will get the same output with the Unix sort command:

sort -k1,1 -k2,2n file > file_sorted

You would still get chr1, chr10, chr11 etc. as output:

...
chr1    246168412       246168944
chr1    247070790       247071135
chr10   363179  363606
chr10   2970376 2970831
chr10   3087334 3087998
chr10   3511405 3511734
...

The trick is to use the -V (--version-sort) option of GNU sort, which enables natural sorting of numbers within text:

sort -k1,1V -k2,2n file > file_sorted

...
chr1    246168412       246168944
chr1    247070790       247071135
chr2    1595719 1596411
chr2    1629102 1629748
chr2    1635289 1635633
chr2    1735080 1736335
...

It is also necessary to sort the genome file with chromosome sizes using the same command. If you leave this file unsorted, bedtools may give you an error. Bedtools ships several genome size files in its genomes folder, and these are unsorted, so you should run:

sort -k1,1V -k2,2n human.hg19.genome > human.hg19.genome_sorted
sort -k1,1V -k2,2n human.hg18.genome > human.hg18.genome_sorted
sort -k1,1V -k2,2n human.hg38.genome > human.hg38.genome_sorted

etc.
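
The difference between the two sort orders is easy to see on a handful of lines. A minimal sketch, assuming GNU sort (the V key modifier is not part of POSIX); the five rows are taken from the listings above, deliberately shuffled, and the variable names are only for the demo.

```shell
# Five rows from the listings above, in shuffled order.
bed=$(mktemp)
printf 'chr10\t363179\t363606\n'       > "$bed"
printf 'chr2\t1629102\t1629748\n'     >> "$bed"
printf 'chr1\t247070790\t247071135\n' >> "$bed"
printf 'chr2\t1595719\t1596411\n'     >> "$bed"
printf 'chr1\t246168412\t246168944\n' >> "$bed"

# Plain sort on column 1: chr10 lands between chr1 and chr2.
lexical=$(LC_ALL=C sort -k1,1 -k2,2n "$bed" | cut -f1 | uniq | paste -s -d ' ' -)
# Version sort on column 1: chr1, chr2, chr10.
natural=$(LC_ALL=C sort -k1,1V -k2,2n "$bed" | cut -f1 | uniq | paste -s -d ' ' -)

echo "$lexical"   # chr1 chr10 chr2
echo "$natural"   # chr1 chr2 chr10
```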

Tuesday, December 2, 2014

Monday, December 1, 2014

How to concatenate side by side two files

If you need to concatenate two files side by side, use the paste command.
-d '\t' sets tab as the delimiter (tab is also paste's default, so the option can be omitted).

The following command pastes two files side by side with tab as the delimiter and cuts columns 1,2,4 into the file conc.txt:

paste -d'\t' file1 file2 | cut -f1,2,4 > conc.txt

E.g.

file1

atp-binding     2.3281287823771484
blocked amino end       7.743558776167471
compositionally biased region:Poly-Lys  4.604155374887082
compositionally biased region:Ser-rich  3.331241830065359
disease mutation        1.9403512039170954
disulfide bond  1.674089840106158
disulfide bond  1.6242758946817313
domain:PH       4.978121581497109
endoplasmic reticulum   2.331394040136443
extracellular matrix    3.941396444854259
glycoprotein    1.5948598745418259
glycosylation site:N-linked (GlcNAc...) 1.5429886170985712
GO:0000166~nucleotide binding   1.6648241884322064
GO:0001882~nucleoside binding   1.8304477780284232

file2 

atp-binding     1.7131363573311138
blocked amino end       11.728395061728394
compositionally biased region:Poly-Lys  3.4320987654320985
compositionally biased region:Ser-rich  4.290123456790123
disease mutation        5.4453262786596115
disulfide bond  1.9547325102880657
disulfide bond  1.4300411522633742
domain:PH       3.3000949667616335
endoplasmic reticulum   2.2805212620027433
extracellular matrix    3.9094650205761314
glycoprotein    1.9108059370231654
glycosylation site:N-linked (GlcNAc...) 1.4389233954451346
GO:0000166~nucleotide binding   1.465270684371808
GO:0001882~nucleoside binding   1.6341991341991342

conc.txt

atp-binding     2.3281287823771484      1.7131363573311138
blocked amino end       7.743558776167471       11.728395061728394
compositionally biased region:Poly-Lys  4.604155374887082       3.4320987654320985
compositionally biased region:Ser-rich  3.331241830065359       4.290123456790123
disease mutation        1.9403512039170954      5.4453262786596115
disulfide bond  1.674089840106158       1.9547325102880657
disulfide bond  1.6242758946817313      1.4300411522633742
domain:PH       4.978121581497109       3.3000949667616335
endoplasmic reticulum   2.331394040136443       2.2805212620027433
extracellular matrix    3.941396444854259       3.9094650205761314
glycoprotein    1.5948598745418259      1.9108059370231654
glycosylation site:N-linked (GlcNAc...) 1.5429886170985712      1.4389233954451346
GO:0000166~nucleotide binding   1.6648241884322064      1.465270684371808
GO:0001882~nucleoside binding   1.8304477780284232      1.6341991341991342
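
One thing to keep in mind is that paste pairs rows by line number, not by label, so dropping column 3 is only safe when both files list the same labels in the same order. A minimal sketch of a sanity check before the cut, using the first two rows of each example file; the scratch directory and variable names are only for the demo.

```shell
# First two rows of each example file.
dir=$(mktemp -d)
printf 'atp-binding\t2.3281287823771484\nblocked amino end\t7.743558776167471\n' > "$dir/file1"
printf 'atp-binding\t1.7131363573311138\nblocked amino end\t11.728395061728394\n' > "$dir/file2"

# After pasting, the columns are: 1 label, 2 value from file1,
# 3 label again, 4 value from file2. Count rows whose two labels differ.
mismatches=$(paste "$dir/file1" "$dir/file2" | awk -F'\t' '$1 != $3' | wc -l)
echo "$mismatches"

# Only drop the duplicated label column if every row lined up.
if [ "$mismatches" -eq 0 ]; then
    paste "$dir/file1" "$dir/file2" | cut -f1,2,4 > "$dir/conc.txt"
    cat "$dir/conc.txt"
fi
```

Labels that contain spaces stay intact because both cut and awk -F'\t' split on tabs only.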

Intersect two files using sort and uniq commands

If you have two lists that you want to intersect, use sort and uniq commands to easily perform this operation.

For example, file1 contains:

ZXDB
ZYG11A
ZYG11B
ZYX
ZZZ3

file2 contains:

ZSWIM7
ZSWIM8
ZWILCH
ZXDB
ZYG11A
ZYG11B

Command:

sort file1 file2 | uniq -d

will print the strings that are common to the two files (sort takes both files as input and uniq -d prints only duplicated lines). This assumes that neither file contains the same line twice.

ZXDB
ZYG11A
ZYG11B

On the other hand, the command:

sort file1 file2 | uniq -u

will print only the lines that occur once after sorting the two files together (i.e. lines present in just one of the two files):

ZSWIM7
ZSWIM8
ZWILCH
ZYX
ZZZ3
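
The uniq -u output does not say which file each line came from. When that matters, comm is a handy alternative: it compares two sorted files and can print the lines found only in the first, only in the second, or in both. A minimal sketch with the two example lists; the scratch directory and variable names are only for the demo.

```shell
# The two example lists.
dir=$(mktemp -d)
printf 'ZXDB\nZYG11A\nZYG11B\nZYX\nZZZ3\n' > "$dir/file1"
printf 'ZSWIM7\nZSWIM8\nZWILCH\nZXDB\nZYG11A\nZYG11B\n' > "$dir/file2"

# comm needs sorted input; sort -u also removes duplicates within each file.
LC_ALL=C sort -u "$dir/file1" > "$dir/file1.sorted"
LC_ALL=C sort -u "$dir/file2" > "$dir/file2.sorted"

# -12 suppresses columns 1 and 2, leaving lines common to both files.
both=$(LC_ALL=C comm -12 "$dir/file1.sorted" "$dir/file2.sorted")
# -23 leaves lines found only in file1; -13 leaves lines found only in file2.
only1=$(LC_ALL=C comm -23 "$dir/file1.sorted" "$dir/file2.sorted")
only2=$(LC_ALL=C comm -13 "$dir/file1.sorted" "$dir/file2.sorted")

echo "$both"
echo "$only1"
echo "$only2"
```

Because each file is de-duplicated first, this variant also gives the right answer when a list contains the same entry more than once.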