Pickrell JK, Pai AA, Gilad Y, Pritchard JK (2010) Noisy Splicing Drives mRNA Isoform Diversity in Human Cells. PLoS Genet 6(12): e1001236. doi:10.1371/journal.pgen.1001236
In this paper, the authors describe novel, unannotated exon-exon junctions that they discovered in human lymphoblastoid cell lines using RNA sequencing.
They report 392,612 putative splice junctions and claim that many of them define functional alternative RNA splice variants, but they also speculate that some represent transcriptional noise or errors of the cell's splicing machinery.
Of the 392,612 splice junctions, 306,606 (about 78%) had the canonical GT-AG or GC-AG dinucleotides at the ends of their introns.
The authors did not use any prior knowledge of exon-intron boundaries when splitting the reads, yet the majority of split reads show the GT-AG pattern. This indicates that most of the split reads reflect genuine splicing events rather than alignment artifacts.
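The canonical-site check described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the 0-based half-open coordinate convention, and the toy sequence are all assumptions made for illustration.

```python
# Hypothetical helper names; coordinates are assumed 0-based, half-open.
CANONICAL = {"GT-AG", "GC-AG"}
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def splice_motif(reference: str, intron_start: int, intron_end: int,
                 strand: str = "+") -> str:
    """Return the donor-acceptor dinucleotides of an intron, e.g. 'GT-AG'.

    For a minus-strand transcript the intron is reverse-complemented first,
    so the motif is always reported in transcript orientation.
    """
    intron = reference[intron_start:intron_end].upper()
    if len(intron) < 4:
        raise ValueError("intron must be at least 4 bases long")
    if strand == "-":
        intron = reverse_complement(intron)
    return f"{intron[:2]}-{intron[-2:]}"


def is_canonical(reference: str, intron_start: int, intron_end: int,
                 strand: str = "+") -> bool:
    """True if the intron ends match GT-AG or GC-AG."""
    return splice_motif(reference, intron_start, intron_end, strand) in CANONICAL


# Toy example: 6-base exon, 13-base intron, 6-base exon.
toy_reference = "AAACAG" + "GTAAGTTTTTCAG" + "GGATCC"
toy_motif = splice_motif(toy_reference, 6, 19)
```

Because a junction called without annotation has no inherent reason to land on GT-AG, the share of junctions passing such a check gives a rough, global indication of how many calls are genuine.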
An interesting observation is that, when compared with already annotated splice sites, the putative splice sites show a periodic distribution: more putative sites maintain the coding frame than disrupt it, so the distribution of putative sites peaks at every third base position from the annotated splice site. This suggests that such putative splice sites are not just transcriptional noise but may be functional, and that they could have been exposed to selection and eventually fixed during human evolution.
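The periodicity argument reduces to a simple test on the distance between each putative site and its nearest annotated site: an offset that is a multiple of 3 keeps the reading frame. The sketch below is a hypothetical illustration with made-up offsets, not data from the paper; if offsets were unrelated to the frame, roughly one third would be expected to be multiples of 3.

```python
def frame_preserving_fraction(offsets):
    """Fraction of non-zero offsets (in bases) that are multiples of 3.

    `offsets` holds signed distances from each putative splice site to the
    nearest annotated site; zero offsets (the annotated site itself) are
    ignored.
    """
    shifted = [d for d in offsets if d != 0]
    if not shifted:
        raise ValueError("no non-zero offsets to evaluate")
    preserving = sum(1 for d in shifted if d % 3 == 0)
    return preserving / len(shifted)


# Toy offsets, invented for illustration only.
toy_offsets = [3, -3, 6, 1, -2, 9, 0]
toy_fraction = frame_preserving_fraction(toy_offsets)
```

A fraction clearly above one third, as in this toy input, is the kind of excess that produces the peaks at every third position.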
Of the 306,606 junctions, 50% were not present in gene models from UCSC, Ensembl, Vega, and RefSeq or in ESTs from GenBank. However, these novel junctions account for only 1.7% of all junction-spanning reads, so they likely represent rare RNA variants in the cell.
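The gap between the share of junctions that are novel and the share of reads that support them is easy to see in a toy calculation. The junction keys, read counts, and function name below are invented for illustration and chosen only so that the two fractions differ sharply.

```python
def novelty_summary(read_counts, annotated):
    """Return (fraction of junctions that are novel,
               fraction of junction-spanning reads supporting novel junctions).

    `read_counts` maps a junction key to its number of spanning reads;
    `annotated` is the set of junction keys found in existing annotation.
    """
    novel = [j for j in read_counts if j not in annotated]
    total_reads = sum(read_counts.values())
    novel_reads = sum(read_counts[j] for j in novel)
    return len(novel) / len(read_counts), novel_reads / total_reads


# Toy data: junctions keyed by (chromosome, intron_start, intron_end).
toy_counts = {
    ("chr1", 100, 200): 480,
    ("chr1", 300, 400): 500,
    ("chr1", 100, 203): 12,
    ("chr1", 310, 400): 8,
}
toy_annotated = {("chr1", 100, 200), ("chr1", 300, 400)}
junction_fraction, read_fraction = novelty_summary(toy_counts, toy_annotated)
```

Here half of the junctions are novel but they carry only 2% of the reads, the same qualitative pattern as the 50% versus 1.7% reported in the paper.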
One important issue for this analysis, however, is the contribution of mapping errors to the dataset. It is still possible that some of the spliced reads are actually mapping errors. Because mapping tools are allowed to use mismatches during alignment (even at the very edge of a read), a read may be split and then mapped slightly incorrectly, creating a junction that does not exist. Thus, although the global estimates may be fairly reliable, individual junctions should be treated with caution and checked, for example, for mismatches against the reference.
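One way to make that per-junction check concrete is to count mismatches within a few bases of the split point. The sketch below assumes a read aligned as two ungapped blocks on the forward strand with 0-based coordinates; the function name, its parameters, and the default window of 5 bases are hypothetical choices, not part of the paper's method.

```python
def mismatches_near_junction(read, reference, left_start, left_len,
                             right_start, window=5):
    """Count mismatches within `window` bases on either side of a split.

    The read is assumed to align as two ungapped blocks:
    read[:left_len] at reference[left_start:], and
    read[left_len:] at reference[right_start:].
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    left_read, right_read = read[:left_len], read[left_len:]
    left_ref = reference[left_start:left_start + left_len]
    right_ref = reference[right_start:right_start + len(right_read)]
    if len(left_ref) != len(left_read) or len(right_ref) != len(right_read):
        raise ValueError("aligned block runs past the end of the reference")
    # Compare the last `window` bases before the junction ...
    left_mm = sum(a != b for a, b in zip(left_read[-window:], left_ref[-window:]))
    # ... and the first `window` bases after it.
    right_mm = sum(a != b for a, b in zip(right_read[:window], right_ref[:window]))
    return left_mm + right_mm


# Toy example: the read spans a 13-base intron at reference positions 6..19.
toy_reference = "AAACAG" + "GTAAGTTTTTCAG" + "GGATCC"
clean_hits = mismatches_near_junction("ACAGGGAT", toy_reference, 2, 4, 19)
edge_hits = mismatches_near_junction("ACATGGAT", toy_reference, 2, 4, 19, window=2)
```

A junction supported only by reads with mismatches right at the split point would be a natural candidate for closer inspection.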