Thursday, August 18, 2011

Transcriptome assembly from RNA-Seq data using annotated reference transcripts

Review of the paper:

Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. 2011 Jun 21. Roberts A, Pimentel H, Trapnell C, Pachter L.
Department of Computer Science, UC Berkeley, Berkeley, CA.

http://www.ncbi.nlm.nih.gov/pubmed/21697122

Assembling a transcriptome using only a reference genome (for mapping the sequenced reads), without prior reference transcript annotation, suffers from several problems.

First, lowly expressed genes have low sequence coverage, so assembling the transcripts that originate from such genes is difficult and error-prone.

Second, the exact positions of the 5' and 3' ends of transcripts are sometimes difficult to establish, possibly because of insufficient sequence coverage at the ends (5' ends in particular are covered by fewer reads if poly-A selection was included in the sequencing protocol).

The picture shows an example of inconsistent 5' end detection for the highly expressed GAPDH gene in several human RNA-Seq samples. The correct position of the 5' end could not be established, and each transcript would be called a unique transcript if the data from these RNA-Seq experiments were pooled.
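
To make the pooling problem concrete, here is a minimal Python sketch. The exon coordinates and sample names are made up for illustration (they are not real GAPDH coordinates), and comparing intron chains is just one common way to see that the three assemblies differ only at the 5' end; it is not presented as the paper's method.

```python
from collections import defaultdict

# Hypothetical assemblies of the same gene from three samples.
# Each transcript is a list of (start, end) exons; coordinates are invented.
assemblies = {
    "sample_1": [(100, 200), (300, 400), (500, 650)],
    "sample_2": [(120, 200), (300, 400), (500, 650)],
    "sample_3": [(135, 200), (300, 400), (500, 650)],
}


def intron_chain(exons):
    """Return the introns as (end of exon i, start of exon i+1) pairs."""
    exons = sorted(exons)
    return tuple((a_end, b_start) for (_, a_end), (b_start, _) in zip(exons, exons[1:]))


# Comparing full exon coordinates: every sample yields a "unique" transcript,
# because only the first exon start (the 5' end) differs.
distinct_by_exons = {tuple(sorted(exons)) for exons in assemblies.values()}

# Comparing intron chains only: the three transcripts fall into one group.
groups = defaultdict(list)
for name, exons in assemblies.items():
    groups[intron_chain(exons)].append(name)

print(len(distinct_by_exons))  # 3
print(len(groups))             # 1
```

The three transcripts share every splice junction, so the only thing separating them is the uncertain 5' boundary.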


Third, assemblers may output several transcript fragments (transfrags) that actually originate from a single transcript, simply because the connecting reads needed to assemble the transfrags into a single transcript are missing.

In this paper, the authors suggest using a reference annotation of transcripts to correct for these issues. However, this use of a reference is different from simply taking already annotated transcripts and calculating their sequence coverage. The method uses the reference only to correct for the known problems. In other words, novel RNA transcripts will still be detected in addition to already annotated ones.
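
The general idea can be illustrated with a toy Python sketch. It is not the RABT algorithm from the paper: it works on whole genomic spans only, ignores exon structure and strand, and the function names, the `ref_tx_A` identifier and all coordinates are hypothetical. It only shows the two behaviours described above: fragments that fall on the same annotated transcript are joined, and fragments with no annotation are kept as novel candidates.

```python
def overlaps(a, b):
    """True if two closed intervals (start, end) share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def merge_with_reference(transfrags, reference):
    """Join transfrags overlapping the same reference transcript.

    transfrags: list of (start, end) spans (hypothetical assembler output)
    reference:  dict of transcript name -> (start, end) span (hypothetical annotation)
    Returns (joined, novel): joined maps a reference name to the combined span
    of its transfrags; novel lists transfrags that overlap no reference.
    """
    joined = {}
    novel = []
    for frag in sorted(transfrags):
        hits = [name for name, span in reference.items() if overlaps(frag, span)]
        if hits:
            name = hits[0]  # toy rule: assign to the first overlapping reference
            start, end = joined.get(name, frag)
            joined[name] = (min(start, frag[0]), max(end, frag[1]))
        else:
            novel.append(frag)  # no annotation here: keep as a novel candidate
    return joined, novel


reference = {"ref_tx_A": (1000, 5000)}
transfrags = [(1000, 1800), (2500, 3200), (4100, 5000), (9000, 9700)]

joined, novel = merge_with_reference(transfrags, reference)
print(joined)  # {'ref_tx_A': (1000, 5000)}
print(novel)   # [(9000, 9700)]
```

The three fragments without connecting reads end up as one transcript because the annotation bridges the gaps, while the fragment at 9000-9700 survives as a novel candidate.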

The correction seems to work quite well for the example given in the paper. The Cufflinks assembler called four incomplete transfrags originating from two actual transcripts. The authors' RABT assembler produced two complete transcripts from this locus.

Altogether, the authors' RABT assembly, when applied to human brain tissue, found 70,241 transfrags (transcripts) and 36,494 gene loci. 36,494 transfrags and 15,504 genes were novel. On average, each gene had 1.92 transfrags (transcripts).
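
The average follows directly from the two totals, as this short Python check shows. The variable names are mine; the numbers are the ones quoted above.

```python
# Totals quoted above for the human brain RABT assembly.
total_transfrags = 70_241
total_gene_loci = 36_494

# Average number of transfrags (transcripts) per gene locus.
transfrags_per_gene = total_transfrags / total_gene_loci
print(f"{transfrags_per_gene:.2f}")  # 1.92
```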
