Friday, July 31, 2015

Finding differences in two files on specific columns using awk - example of comparing two SNP data files

Lets say that you have two files and you want to see if entries from one column are present in another column in another file and vise versa.

head IBD-SNAPResults.txt.snps
rs12103
rs12142199
rs11590283
rs35675666
rs3766606
rs17523802
rs34157438
rs7517357
rs12740409
rs34124834


head IBD-SNAPResults.txt.snps.pos
chr1 1245367 1245368 rs11590283
chr1 1247493 1247494 rs12103
chr1 1249186 1249187 rs12142199
chr1 7989021 7989022 rs9658012
chr1 7997182 7997183 rs7545687
chr1 8014567 8014568 rs35731977
chr1 8021739 8021740 rs17523802
chr1 8021972 8021973 rs35675666
chr1 8022196 8022197 rs3766606
chr1 8023585 8023586 rs34157438


Columns that you want to compare are $1 in file 1 and $4 in file2. Using awk we can load entire column 1 into hash h and check the presence of hash values in file2 using column 4 entries as keys for the hash.


Check if there are entries from column 4 in file 2 that are not present in column 1 in file1.

awk 'NR==FNR {h[$1] = $1; next} {if(!h[$4]) print$0}' IBD-SNAPResults.txt.snps IBD-SNAPResults.txt.snps.pos


No output means that there are no entries present in file 1 that are not present in file 2.


Check if there are entries from column 1 in file 1 that are not present in column 4 in file2.


awk 'NR==FNR {h[$4] = $4; next} {if(!h[$1]) print$0}' IBD-SNAPResults.txt.snps.pos IBD-SNAPResults.txt.snps

rs115258758
WARNING
WARNING
rs114188880
rs115853148
rs67626681
rs63368160
rs74848575
rs79569409
rs41313148

...

In conclusion, there are entries present in file 1 col 1 not present in file 2 col 4.

Thursday, July 23, 2015

Reading Unix variables within awk

If you want to use Unix variables within the awk code use "'" (double quote - single quote - double quote) to enclose the variable.

In the following parsing code, Unix variable $chr is called within awk using double-single-double quotes:

for chr in $(seq 1 22) ; do
echo "parsing $chr"
awk -F"\t" '$1 == "'"$chr"'" { print $2"\t"$3 }' hg18_avg_occ.standard > chr"$chr"
done


The code will loop $chr variable from 1 to 22, test if column one equals $chr within awk, and print columns 2 and 3 to a file containing $chr in its name. Note that $chr contains double-single-double quotes within awk but not when output is sent to a file chr"$chr" (where single quotes are sufficient).
 
head  hg18_avg_occ.standard
1       1       0.011
1       2       0.0197
1       3       0.0281
1       4       0.041
1       5       0.0573
1       6       0.0773
1       7       0.0917
1       8       0.103
1       9       0.112
1       10      0.124


head chr1
1       0.011
2       0.0197
3       0.0281
4       0.041
5       0.0573
6       0.0773
7       0.0917
8       0.103
9       0.112
10      0.124


Alexa rankings of genomics blogs

Alexa rankings of this blog and some top genomics blogs (that appear in Google search when you search for 'genomics blog').





Filter lines that have sum above or below

Similarly to the previous post you can use awk code to print lines with the sum above or below a certain value.

Above 1000

awk '{for(i=1;i<=NF;i++) t+=$i; if(t>=1000) print$0; t=0}' file.txt> file_filtered.txt

Below 1000

awk '{for(i=1;i<=NF;i++) t+=$i; if(t<=1000) print$0; t=0}' file.txt> file_filtered.txt

Wednesday, July 22, 2015

Plot all lines that have certain sum using awk code

To calculate the sum of all fields for each line use NF variable to define a for loop, than sum all fields and keep it in a t variable. If t is not equal to 0, print the whole line with print$0.  Then set back t to 0 for the next line.

awk '{for(i=1;i<=NF;i++) t+=$i; if(t!=0) print$0; t=0}' file.txt > file.txt.filtered


To print all lines that have sum of 100, similarly use code:

awk '{for(i=1;i<=NF;i++) t+=$i; if(t==100) print$0; t=0}' file.txt > file.txt.filtered

Tuesday, July 21, 2015

Get command from history in bash

To get a command you typed in the past use history:

history | grep 'ln -s'
 2679  ln -s /home/towerraid/tina/150313_NS500418_0108_AH5Y5YBGXX/Data/Intensities/BaseCalls/C1_S1_L001_L004_R1_R2_001/Pass2_1/Aligned.out.sam 90715018.Aligned.out.sam
 2681  ln -s /home/towerraid/tina/150313_NS500418_0108_AH5Y5YBGXX/Data/Intensities/BaseCalls/C2_S1_L001_L004_R1_R2_001/Pass2_2/Aligned.out.sam 317155.2.Aligned.out.sam
 2683  ln -s /home/towerraid/tina/150313_NS500418_0108_AH5Y5YBGXX/Data/Intensities/BaseCalls/C3_S1_L001_L004_R1_R2_001/Pass2_2/Aligned.out.sam 7103002.3.Aligned.out.sam
 2685  ln -s /home/towerraid/tina/150313_NS500418_0108_AH5Y5YBGXX/Data/Intensities/BaseCalls/C3_S1_L001_L004_R1_R2_001/Pass2_3/Aligned.out.sam 7103002.3.Aligned.out.sam
 2687  ln -s /home/towerraid/tina/150313_NS500418_0108_AH5Y5YBGXX/Data/Intensities/BaseCalls/C3_S1_L001_L004_R1_R2_001/Pass2_3/Aligned.out.sam 7103002.3.Aligned.out.sam
 2689  ln -s /home/towerraid/tina/150313_NS500418_0108_AH5Y5YBGXX/Data/Intensities/BaseCalls/C4_S1_L001_L004_R1_R2_001/Pass2_4/Aligned.out.sam 1042702.4.Aligned.out.sam
 2691  ln -s /home/towerraid/tina/150313_NS500418_0108_AH5Y5YBGXX/Data/Intensities/BaseCalls/C5_S1_L001_L004_R1_R2_001/Pass2_5/Aligned.out.sam 313605.4.Aligned.out.sam
 2693  ln -s /home/towerraid/tina/150313_NS500418_0108_AH5Y5YBGXX/Data/Intensities/BaseCalls/C6_S1_L001_L004_R1_R2_001/Pass2_6/Aligned.out.sam 24635.2.Aligned.out.sam
 2694  ln -s /home/towerraid/tina/150313_NS500418_0108_AH5Y5YBGXX/Data/Intensities/BaseCalls/C7_S1_L001_L004_R1_R2_001/Pass2_7/Aligned.out.sam 2108.4.Aligned.out.sam
 2696  ln -s /home/towerraid/tina/150313_NS500418_0108_AH5Y5YBGXX/Data/Intensities/BaseCalls/C8_S1_L001_L004_R1_R2_001/Pass2_8/Aligned.out.sam 1448.1.Aligned.out.sam

... 

Removing files 0-N days old in Unix

If you want to remove files that were created on that day from a folder use find in combination with xargs:

To list all files that are 1 day old:

find . -ctime -1 -type f | xargs ls -l

To remove them verbosely:

find . -ctime -1 -type f | xargs rm -v

removed ‘./30_S14_L001_R2_001.fastq.gz’
removed ‘./15_S7_L001_R1_001.fastq.gz’
removed ‘./32_S16_L001_R1_001.fastq.gz’
removed ‘./14_S6_L001_R2_001.fastq.gz’
removed ‘./29_S13_L001_R2_001.fastq.gz’
removed ‘./23_S11_L001_R2_001.fastq.gz’
removed ‘./32_S16_L001_R2_001.fastq.gz’
removed ‘./21_S9_L001_R1_001.fastq.gz’
removed ‘./22_S10_L001_R1_001.fastq.gz’
...



To remove files 0-4 days old 


find . -ctime -4 -type f | xargs rm -v

Tuesday, July 14, 2015

How to rotate labels from horizontal to vertical position in R

In R sometimes your labels may not be visible, and you would need to make them vertical to fit in the graph.

E.g. here is one plot that does not fit all the labels.

par(mfrow=c(1,2))
boxplot(GWAS,ylim=c(0,3.5))
boxplot(CAD,ylim=c(0,3.5))


To make the labels visible use las=2 to plot them vertically.
par(mfrow=c(1,2))
boxplot(GWAS,ylim=c(0,3.5),las=2)
boxplot(CAD,ylim=c(0,3.5),las=2)



However, not all the labels are visible here as well.
To make the labels more visible set the margins using mar:

par(mfrow=c(1,2))
par(mar=c(7,5,1,1))
boxplot(GWAS,ylim=c(0,3.5),las=2)

boxplot(CAD,ylim=c(0,3.5),las=2)