Wednesday, December 3, 2014

How to sort bed files for bedtools input

In some cases you will need to sort your bed file before you can use it with bedtools.

If you use bedtools sort, you will not get karyotype order (chr1, chr2, chr3); instead you will get chr1, chr10, chr11, etc., because chromosome names are compared as text.

You get the same result with the Unix sort command:

sort -k1,1 -k2,2n file > file_sorted

You would still get chr1, chr10, chr11 etc. as output:

...
chr1    246168412       246168944
chr1    247070790       247071135
chr10   363179  363606
chr10   2970376 2970831
chr10   3087334 3087998
chr10   3511405 3511734
...

The trick is to use the -V (--version-sort) option of GNU sort, which sorts numbers embedded in text in their natural order:

sort -k1,1V -k2,2n file > file_sorted

...
chr1    246168412       246168944
chr1    247070790       247071135
chr2    1595719 1596411
chr2    1629102 1629748
chr2    1635289 1635633
chr2    1735080 1736335
...
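
To see the difference on a small scale, here is a minimal sketch that reuses four lines from the listings above. It assumes GNU sort (the -V option is not available in every sort implementation), and the demo file names are made up for illustration.

```shell
# Build a tiny tab-separated bed file from lines shown in the post.
# demo.bed and the other demo_* names are hypothetical.
printf 'chr10\t363179\t363606\n'        >  demo.bed
printf 'chr2\t1595719\t1596411\n'       >> demo.bed
printf 'chr1\t247070790\t247071135\n'   >> demo.bed
printf 'chr1\t246168412\t246168944\n'   >> demo.bed

# Plain text sort on column 1: chr10 ends up before chr2.
sort -k1,1 -k2,2n demo.bed > demo_lexical.bed

# Version sort on column 1, numeric sort on the start position.
sort -k1,1V -k2,2n demo.bed > demo_natural.bed

# Show the chromosome order of each result.
cut -f1 demo_lexical.bed | uniq   # chr1, chr10, chr2
cut -f1 demo_natural.bed | uniq   # chr1, chr2, chr10
```

Within each chromosome the second key (-k2,2n) still orders the intervals by start position, so only the chromosome order changes between the two results.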

You also need to sort the genome file with the chromosome sizes, using the same command. If you leave this file unsorted, bedtools may give you an error. Bedtools ships several genome size files in its genomes folder, and these are unsorted, so you should run:

sort -k1,1V -k2,2n human.hg19.genome > human.hg19.genome_sorted
sort -k1,1V -k2,2n human.hg18.genome > human.hg18.genome_sorted
sort -k1,1V -k2,2n human.hg38.genome > human.hg38.genome_sorted

etc.
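
If you are not sure whether a file is already in this order, one possible check is the -c option of sort, which only tests the order and writes nothing. The sketch below assumes GNU sort; demo.genome and its sizes are made-up placeholders, not values from a real genome file.

```shell
# Hypothetical two-line genome file: chromosome name, size (placeholder sizes).
printf 'chr10\t1000\nchr2\t2000\n' > demo.genome

# Sort it the same way as the bed file.
sort -k1,1V -k2,2n demo.genome > demo.genome_sorted

# sort -c exits with status 0 if the input is already in the requested order.
if sort -c -k1,1V -k2,2n demo.genome_sorted 2>/dev/null; then
    echo "demo.genome_sorted is in version-sort order"
else
    echo "demo.genome_sorted is NOT in version-sort order"
fi
```

Running the same check on the unsorted demo.genome takes the else branch, because chr10 comes before chr2 in that file.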
