[BioC] DEXSeq - too many exons in gene

António Domingues amjdomingues at gmail.com
Thu Feb 6 18:01:10 CET 2014


Hi Bioconductors,

I came across something odd in DEXSeq: a gene that appears to have 
more exons in the final DEXSeq output than the annotation suggests. For 
gene ENSMUSG00000027854 (UCSC screenshot attached), the annotation 
suggests 3 exons in a flattened gene model. However, the DEXSeq results 
list 13 exons (shown here as the output of htseq-count):

grep ENSMUSG00000027854 htseq_count_out.txt
ENSMUSG00000027854:001	0
ENSMUSG00000027854:002	6
ENSMUSG00000027854:003	18
ENSMUSG00000027854:004	0
ENSMUSG00000027854:005	0
ENSMUSG00000027854:006	86
ENSMUSG00000027854:007	0
ENSMUSG00000027854:008	113
ENSMUSG00000027854:009	52
ENSMUSG00000027854:010	76
ENSMUSG00000027854:011	0
ENSMUSG00000027854:012	310
ENSMUSG00000027854:013	554

This comes from the annotation created with:
dexseq_prepare_annotation.py mm10_ensGene.gtf mm10_ensGene.gff

grep ENSMUSG00000027854 ../../data/gtf/mm10_ensGene.gff
chr3	mm10_ensGene.gtf	aggregate_gene	102995728	103003914	.	+	.	gene_id "ENSMUSG00000027854"
chr3	mm10_ensGene.gtf	exonic_part	102995728	102995729	.	+	.	transcripts "ENSMUST00000029447"; exonic_part_number "001"; gene_id "ENSMUSG00000027854"
chr3	mm10_ensGene.gtf	exonic_part	102995730	102995794	.	+	.	transcripts "ENSMUST00000029447+ENSMUST00000151065"; exonic_part_number "002"; gene_id "ENSMUSG00000027854"
chr3	mm10_ensGene.gtf	exonic_part	102995795	102995967	.	+	.	transcripts "ENSMUST00000151065+ENSMUST00000029447+ENSMUST00000119450"; exonic_part_number "003"; gene_id "ENSMUSG00000027854"
chr3	mm10_ensGene.gtf	exonic_part	102995968	102996048	.	+	.	transcripts "ENSMUST00000151065"; exonic_part_number "004"; gene_id "ENSMUSG00000027854"
chr3	mm10_ensGene.gtf	exonic_part	102996049	102996155	.	+	.	transcripts "ENSMUST00000151065+ENSMUST00000137332"; exonic_part_number "005"; gene_id "ENSMUSG00000027854"
chr3	mm10_ensGene.gtf	exonic_part	102996156	102996261	.	+	.	transcripts "ENSMUST00000029447+ENSMUST00000137332+ENSMUST00000151065"; exonic_part_number "006"; gene_id "ENSMUSG00000027854"
chr3	mm10_ensGene.gtf	exonic_part	102996262	102997242	.	+	.	transcripts "ENSMUST00000151065"; exonic_part_number "007"; gene_id "ENSMUSG00000027854"
chr3	mm10_ensGene.gtf	exonic_part	102997243	102997351	.	+	.	transcripts "ENSMUST00000029447+ENSMUST00000137332+ENSMUST00000151065"; exonic_part_number "008"; gene_id "ENSMUSG00000027854"
chr3	mm10_ensGene.gtf	exonic_part	102997352	102997385	.	+	.	transcripts "ENSMUST00000029447+ENSMUST00000151065"; exonic_part_number "009"; gene_id "ENSMUSG00000027854"
chr3	mm10_ensGene.gtf	exonic_part	102998490	102998603	.	+	.	transcripts "ENSMUST00000151065+ENSMUST00000029447+ENSMUST00000119450"; exonic_part_number "010"; gene_id "ENSMUSG00000027854"
chr3	mm10_ensGene.gtf	exonic_part	102998604	102999251	.	+	.	transcripts "ENSMUST00000151065"; exonic_part_number "011"; gene_id "ENSMUSG00000027854"
chr3	mm10_ensGene.gtf	exonic_part	103001708	103002194	.	+	.	transcripts "ENSMUST00000029447+ENSMUST00000119450"; exonic_part_number "012"; gene_id "ENSMUSG00000027854"
chr3	mm10_ensGene.gtf	exonic_part	103002195	103003914	.	+	.	transcripts "ENSMUST00000029447"; exonic_part_number "013"; gene_id "ENSMUSG00000027854"
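
As a sanity check on the coordinates listed above, here is a minimal Python sketch that computes the length of each exonic part and groups parts that directly abut one another. The coordinates are copied from the grep output; the helper names part_lengths and contiguous_runs are made up for illustration, and GFF coordinates are assumed to be 1-based and inclusive.

```python
# Exonic parts copied from the grep output above: (part number, start, end).
# Assumption: GFF coordinates are 1-based and inclusive, so length = end - start + 1.
PARTS = [
    ("001", 102995728, 102995729),
    ("002", 102995730, 102995794),
    ("003", 102995795, 102995967),
    ("004", 102995968, 102996048),
    ("005", 102996049, 102996155),
    ("006", 102996156, 102996261),
    ("007", 102996262, 102997242),
    ("008", 102997243, 102997351),
    ("009", 102997352, 102997385),
    ("010", 102998490, 102998603),
    ("011", 102998604, 102999251),
    ("012", 103001708, 103002194),
    ("013", 103002195, 103003914),
]


def part_lengths(parts):
    """Map each part number to its length in bases (hypothetical helper)."""
    return {number: end - start + 1 for number, start, end in parts}


def contiguous_runs(parts):
    """Group part numbers into runs in which each part starts one base
    after the previous part ends (hypothetical helper)."""
    runs = []
    prev_end = None
    for number, start, end in parts:
        if prev_end is not None and start == prev_end + 1:
            runs[-1].append(number)
        else:
            runs.append([number])
        prev_end = end
    return runs


lengths = part_lengths(PARTS)
runs = contiguous_runs(PARTS)
print("length of part 001:", lengths["001"])
for run in runs:
    print("contiguous run:", run[0], "to", run[-1])
```

With these coordinates, part 001 comes out at 2 bases, and parts 001 to 009 form one uninterrupted run, followed by 010 to 011 and 012 to 013.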

Exon 001 is only 2 bases long (?), and exons 001 to 004 are contiguous. 
As far as I am aware, the DEXSeq model should have flattened all of 
these into a single "exon". Is this correct? Is the error coming from 
the GTF? (The gene's annotation in the GTF is at the end of this message.)
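
One possible explanation, sketched under an assumption: if the flattening step cuts the gene at every exon start and end across all transcripts, rather than merging overlapping exons into one, then the exon records in the GTF at the end of this message would produce many small pieces. The Python below is only an illustration of that idea; it is not the code of dexseq_prepare_annotation.py, and the function name split_into_parts is hypothetical.

```python
# Exon records copied from the GTF at the end of this message:
# (transcript_id, start, end), 1-based inclusive coordinates.
EXONS = [
    ("ENSMUST00000029447", 102995728, 102995967),
    ("ENSMUST00000029447", 102996156, 102996261),
    ("ENSMUST00000029447", 102997243, 102997385),
    ("ENSMUST00000029447", 102998490, 102998603),
    ("ENSMUST00000029447", 103001708, 103003914),
    ("ENSMUST00000151065", 102995730, 102997385),
    ("ENSMUST00000151065", 102998490, 102999251),
    ("ENSMUST00000119450", 102995795, 102995967),
    ("ENSMUST00000119450", 102998490, 102998603),
    ("ENSMUST00000119450", 103001708, 103002194),
    ("ENSMUST00000137332", 102996049, 102996261),
    ("ENSMUST00000137332", 102997243, 102997351),
]


def split_into_parts(exons):
    """Cut at every exon boundary and keep the pieces covered by an exon.

    Assumption for illustration: a piece begins at an exon start or one
    base after an exon end. Returns (start, end, transcripts) tuples.
    """
    cuts = sorted({s for _, s, _ in exons} | {e + 1 for _, _, e in exons})
    parts = []
    for left, right in zip(cuts, cuts[1:]):
        start, end = left, right - 1
        # No exon boundary lies inside a piece, so each exon either
        # covers the whole piece or none of it.
        transcripts = {t for t, s, e in exons if s <= start and end <= e}
        if transcripts:  # pieces covered by no exon are introns
            parts.append((start, end, transcripts))
    return parts


parts = split_into_parts(EXONS)
for i, (start, end, transcripts) in enumerate(parts, 1):
    print(f"{i:03d}", start, end, "+".join(sorted(transcripts)))
```

With the GTF records for this gene, this yields 13 pieces, the same number as in the flattened GFF.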

This is especially concerning for me because I want to select the first 
and last exon of each gene, using the exon ranking from DEXSeq, for 
further analysis.
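
To make the intended selection concrete, here is a minimal Python sketch of picking the first and last exonic part per gene from a flattened GFF. It assumes each record is on a single tab-separated line with the attribute layout shown above, sorts parts by start coordinate, and reverses the order for genes on the "-" strand. The function name first_and_last_parts is hypothetical, and whether DEXSeq's own part numbering follows the strand is not checked here.

```python
import re
from collections import defaultdict


def first_and_last_parts(gff_lines):
    """Return {gene_id: (first_part, last_part)} in transcription order.

    Each part is (exonic_part_number, start, end). Hypothetical helper:
    parts are ordered by start coordinate, reversed for '-' strand genes.
    """
    by_gene = defaultdict(list)
    strands = {}
    for line in gff_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exonic_part":
            continue  # skip aggregate_gene and malformed lines
        # Attributes look like: key "value"; key "value"
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
        gene = attrs["gene_id"]
        by_gene[gene].append(
            (attrs["exonic_part_number"], int(fields[3]), int(fields[4]))
        )
        strands[gene] = fields[6]
    result = {}
    for gene, parts in by_gene.items():
        parts.sort(key=lambda p: p[1])
        if strands[gene] == "-":
            parts.reverse()
        result[gene] = (parts[0], parts[-1])
    return result


# Three records from the flattened GFF above, written as single lines.
example = [
    'chr3\tmm10_ensGene.gtf\taggregate_gene\t102995728\t103003914\t.\t+\t.\tgene_id "ENSMUSG00000027854"',
    'chr3\tmm10_ensGene.gtf\texonic_part\t102995728\t102995729\t.\t+\t.\ttranscripts "ENSMUST00000029447"; exonic_part_number "001"; gene_id "ENSMUSG00000027854"',
    'chr3\tmm10_ensGene.gtf\texonic_part\t103002195\t103003914\t.\t+\t.\ttranscripts "ENSMUST00000029447"; exonic_part_number "013"; gene_id "ENSMUSG00000027854"',
]
picked = first_and_last_parts(example)
print(picked["ENSMUSG00000027854"])
```

A selection like this depends directly on how the parts are split, which is why a 2-base first part matters for the downstream analysis.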


Thanks,
António

 > sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] grDevices datasets  stats     graphics  utils     methods   base

other attached packages:
 [1] DEXSeq_1.4.0           GenomicFeatures_1.10.2 GenomicRanges_1.10.5
 [4] IRanges_1.16.6         data.table_1.8.9       stringr_0.6.2
 [7] ggplot2_0.9.3.1        AnnotationDbi_1.20.2   Biobase_2.18.0
[10] BiocGenerics_0.4.0

loaded via a namespace (and not attached):
 [1] BSgenome_1.26.1    Biostrings_2.26.3  DBI_0.2-5          MASS_7.3-23
 [5] RColorBrewer_1.0-5 RCurl_1.95-4.1     RSQLite_0.11.2     Rsamtools_1.10.2
 [9] XML_3.98-1.1       biomaRt_2.14.0     bitops_1.0-6       colorspace_1.2-4
[13] dichromat_2.0-0    digest_0.6.3       grid_2.15.2        gtable_0.1.2
[17] hwriter_1.3        labeling_0.2       munsell_0.4.2      parallel_2.15.2
[21] plyr_1.8           proto_0.3-10       reshape2_1.2.2     rtracklayer_1.18.1
[25] scales_0.2.3       statmod_1.4.17     stats4_2.15.2      tools_2.15.2
[29] zlibbioc_1.4.0



grep ENSMUSG00000027854 ../../data/gtf/mm10_ensGene.gtf
chr3	ensGene	exon	102995728	102995967	.	+	.	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "1"; exon_id "ENSMUST00000029447.1"; gene_name "ENSMUSG00000027854";
chr3	ensGene	CDS	102995809	102995967	.	+	0	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "1"; exon_id "ENSMUST00000029447.1"; gene_name "ENSMUSG00000027854";
chr3	ensGene	exon	102996156	102996261	.	+	.	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "2"; exon_id "ENSMUST00000029447.2"; gene_name "ENSMUSG00000027854";
chr3	ensGene	CDS	102996156	102996261	.	+	0	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "2"; exon_id "ENSMUST00000029447.2"; gene_name "ENSMUSG00000027854";
chr3	ensGene	exon	102997243	102997385	.	+	.	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "3"; exon_id "ENSMUST00000029447.3"; gene_name "ENSMUSG00000027854";
chr3	ensGene	CDS	102997243	102997385	.	+	2	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "3"; exon_id "ENSMUST00000029447.3"; gene_name "ENSMUSG00000027854";
chr3	ensGene	exon	102998490	102998603	.	+	.	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "4"; exon_id "ENSMUST00000029447.4"; gene_name "ENSMUSG00000027854";
chr3	ensGene	CDS	102998490	102998603	.	+	0	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "4"; exon_id "ENSMUST00000029447.4"; gene_name "ENSMUSG00000027854";
chr3	ensGene	exon	103001708	103003914	.	+	.	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "5"; exon_id "ENSMUST00000029447.5"; gene_name "ENSMUSG00000027854";
chr3	ensGene	CDS	103001708	103001806	.	+	0	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "5"; exon_id "ENSMUST00000029447.5"; gene_name "ENSMUSG00000027854";
chr3	ensGene	start_codon	102995809	102995811	.	+	0	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "1"; exon_id "ENSMUST00000029447.1"; gene_name "ENSMUSG00000027854";
chr3	ensGene	stop_codon	103001807	103001809	.	+	0	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "1"; exon_id "ENSMUST00000029447.1"; gene_name "ENSMUSG00000027854";
chr3	ensGene	exon	102995730	102997385	.	+	.	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000151065"; exon_number "1"; exon_id "ENSMUST00000151065.1"; gene_name "ENSMUSG00000027854";
chr3	ensGene	exon	102998490	102999251	.	+	.	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000151065"; exon_number "2"; exon_id "ENSMUST00000151065.2"; gene_name "ENSMUSG00000027854";
chr3	ensGene	exon	102995795	102995967	.	+	.	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "1"; exon_id "ENSMUST00000119450.1"; gene_name "ENSMUSG00000027854";
chr3	ensGene	CDS	102995809	102995967	.	+	0	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "1"; exon_id "ENSMUST00000119450.1"; gene_name "ENSMUSG00000027854";
chr3	ensGene	exon	102998490	102998603	.	+	.	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "2"; exon_id "ENSMUST00000119450.2"; gene_name "ENSMUSG00000027854";
chr3	ensGene	CDS	102998490	102998603	.	+	0	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "2"; exon_id "ENSMUST00000119450.2"; gene_name "ENSMUSG00000027854";
chr3	ensGene	exon	103001708	103002194	.	+	.	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "3"; exon_id "ENSMUST00000119450.3"; gene_name "ENSMUSG00000027854";
chr3	ensGene	CDS	103001708	103001806	.	+	0	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "3"; exon_id "ENSMUST00000119450.3"; gene_name "ENSMUSG00000027854";
chr3	ensGene	start_codon	102995809	102995811	.	+	0	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "1"; exon_id "ENSMUST00000119450.1"; gene_name "ENSMUSG00000027854";
chr3	ensGene	stop_codon	103001807	103001809	.	+	0	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "1"; exon_id "ENSMUST00000119450.1"; gene_name "ENSMUSG00000027854";
chr3	ensGene	exon	102996049	102996261	.	+	.	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000137332"; exon_number "1"; exon_id "ENSMUST00000137332.1"; gene_name "ENSMUSG00000027854";
chr3	ensGene	exon	102997243	102997351	.	+	.	gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000137332"; exon_number "2"; exon_id "ENSMUST00000137332.2"; gene_name "ENSMUSG00000027854";


-- 
António Miguel de Jesus Domingues, PhD
Postdoctoral researcher
Deep Sequencing Group - SFB655
Biotechnology Center (Biotec)
Technische Universität Dresden
Fetscherstraße 105
01307 Dresden

Phone: +49 (351) 458 82362
Email: antonio.domingues(at)biotec.tu-dresden.de
--
The Unbearable Lightness of Molecular Biology
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Internal_tranbscript.pdf
Type: application/pdf
Size: 8751 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20140206/c58f158d/attachment.pdf>

