Authors: Lucia Puchbauer, Ralf Zimmer and Fabian Birzele
Abstract
Motivation: RNA sequencing is becoming the standard technology for expression analysis and promises not only to profile gene but also transcript expression levels. Several methods have been proposed to compute transcript levels from RNASeq data and the results of such analyses can deepen our understanding of the function of alternative splicing in biological systems. However, a critical assessment of the performance of those methods has not been performed so far. Results: In this study, we compare seven methods for transcript quantification with respect to different measurements and potential influence factors like read coverage, read sampling and sequencing bias on simulated and real sequencing data. We show that even for the best methods, transcript level estimates are often far away from the truth. Thus, results of any qualified method need to be handled with care when it comes to interpret the results in the biological context. Further, by using two simple consensus approaches we can significantly improve prediction accuracy on several measures. Conclusions: Transcript levels estimated by currently available tools can be significantly wrong especially when transcripts mixtures are complex and many transcripts of a gene are expressed in a sample. Interesting cases and splicing events should always be visually inspected to confirm differential splicing events.
Download
All simulated reads used in the performance evaluation can be downloaded formatted as .fastq with the corresponding .info files:Alternatively our read simulator can be downloaded:
USAGE: java -cp simulator.jar simulator/ngs/rnaseq/splicing/RNASeqSimulator [options] Required Parameters: -g GTF file containing the annotation of the genes to use in the simulation. -fa cDNA sequences of the genes transcripts. -fi index file for the specified fasta file for fast access. This file be created if not existent. -o prefix for output files. Example prefix: /home/workspace/sampleID, example output file generated: /home/workspace/sampleID.info -l file containing a list of genes to use in the simulation. It has to be one gene ID per line. Options: -b bias to be simulated. Possible values: uniform, random, 3prime5prime, combined [ default: uniform ] -c coverage. [ default: 50 ] -r read length. In the case of paired end reads this is length of one mate. [ default: 50 ] -p flag for paired-end read simulation. -e sequencing error rate. [ default: 0.01 ] -x minimal positional probability for drawing a read. Regarded for random and combined bias. [ default: 0.5 ] -m white space separated list of isoform mixtures. Format:: . The isoform mixtures can be enclosed by " or '. Example: -m "50:100|0" "50:09|10" [ default: 25:100|0 25:90|10 25:75|25 25:60|40 ]
Description of the .info format
In the .info files the read information for all genes used is stored. The read information is stored per transcript and the transcripts are grouped by their genes. One entry is started by stating gene ID, composite gene length, total number of reads for the gene and isoform mixture model used for the gene. The composite gene length is the number of a gene's exonic positions. A transcript description line is marked by the TRANSCRIPT tag and contains the following columns:TRANSCRIPT [Transcript ID] [Transcript length] [Reads drawn from transcript]
Example:
>ENSG00000001497_X COMPOSITE_LENGTH 5601 TOTAL_NUMBER_OF_READS 2800 SELECTED_MODEL 40|30|20|10 TRANSCRIPT ENST00000312391_X 2356 1120 TRANSCRIPT ENST00000374811_X 2439 840 TRANSCRIPT ENST00000469091_X 447 560 TRANSCRIPT ENST00000374804_X 2261 280 TRANSCRIPT ENST00000484069_X 5205 0 TRANSCRIPT ENST00000374807_X 2386 0