Sunday, May 14, 2017

Unique on one column and keep maximum value from another column in Unix

If you have a file, e.g.

head id_merge_ens_length.txt
ENSMUSG00000095309 924
ENSMUSG00000000126 3318
ENSMUSG00000000126 3320
ENSMUSG00000086196 1381
ENSMUSG00000086196 2127
ENSMUSG00000054418 1784
ENSMUSG00000095268 1960
ENSMUSG00000082399 480
ENSMUSG00000097090 1650
ENSMUSG00000020063 3353
And want to unique on first column while keeping the maximum on second column, first use sort to sort on column 1 followed by reverse numerical sort on column 2:

sort -k1,1 -k2,2nr file.txt | head
ENSMUSG00000000001 3262
ENSMUSG00000000003 902
ENSMUSG00000000003 697
ENSMUSG00000000028 2143
ENSMUSG00000000028 1747
ENSMUSG00000000028 832
ENSMUSG00000000031 2286
ENSMUSG00000000031 1853
ENSMUSG00000000031 935
ENSMUSG00000000031 817
Now, to select the column with maximum value for each unique entry from column one, use awk. Use awk '!a[$1]++' as a short version of awk '{if(! a[$1]){print; a[$1]++}}' meaning if the current first field ($1) is not present in the a array, which will happen if it is a unique field, print the line and add the 1st field to the a array. Next time we see the same field, it will be already in the array and it will not be printed due to the condition if(!a[$1]).

sort -k1,1 -k2,2nr file.txt | awk '!a[$1]++' | head
ENSMUSG00000000001 3262
ENSMUSG00000000003 902
ENSMUSG00000000028 2143
ENSMUSG00000000031 2286
ENSMUSG00000000037 4847
ENSMUSG00000000049 1190
ENSMUSG00000000056 4395
ENSMUSG00000000058 2733
ENSMUSG00000000078 4217
ENSMUSG00000000085 3544

Thursday, May 11, 2017

Defining and calculating genome coverage, depth and breadth of coverage

Average estimation of the whole genome coverage in a sequencing assay (i.e. depth of coverage) is calculated with the formula:

coverage = (read count * read length ) / total genome size
Similarly you can calculate the estimation of the coverage for each gene or gene locus:

coverage of the gene= (gene read count * read length ) / gene size 
In reality of the sequencing experiment, distribution of the mapped reads is uneven and in additon not all portions of the reads will map to the genome, therefore for each nucleotide you will calculate specific read coverage (per base read coverage), for example using bedtools genomecov:

bedtools genomecov -ibam file.bam -g my.genome -d | head

chr1  6  0
chr1  7  0
chr1  8  0
chr1  9  0
chr1  10 0
chr1  11 1
chr1  12 1
chr1  13 1
chr1  14 1
chr1  15 1
On the other side, breadth of coverage is a different term than depth of coverage and relates to the proportion of the genome that is covered with reads. Both breadth and depth of coverage correlate to the sequencing depth i.e. the number of reads generated in the sequencing experiment.

For details see publication: