[BioC] Recommended gene model for DESeq
Assaf Gordon
gordon at cshl.edu
Thu Apr 5 19:00:47 CEST 2012
Thank you for the quick answer.
A side-note regarding UCSC's GTF export:
Simon Anders wrote, On 04/05/2012 05:11 AM:
>
> A subtle but very annoying problem is this:
>
> According to the GTF file standard specs, an "exon" or "CDS" line in
> a GTF file shlould have two attributes labelled "gene_id" and
> "transcript_id". Exon lines describing different transcripts of the
> same gene should hence have different transcript IDs but the same
> gene ID. Ensembl honors this rule while GTF files downloaded from the
> UCSC Table browser just have the transcript ID repeated as gene ID,
> which defeats the purpose, of course. A script won't notice that two
> transcripts belong to the same gene because they do not have the same
> gene ID.
>
> Possible fixes: (i) Use GTF files from Ensembl. (ii) Ask the UCSC
> people to fix the issue. (iii) Fix a UCSC GTF file by removing from
> the gene_id attibute the number behind the last dot. Hence, taking
> e.g., this line from the UCSC GTF file for hg19:
>
The UCSC Table Browser indeed produces unsuitable GTF files, it's a known issue and they're not going to fix it (technically: the format conversion code uses a BED-like structure for all formats, and BED has only one ID per record).
The correct way to export usable GFF files directly from UCSC:
1. Download a program called "genePredToGtf" from here:
http://hgdownload.cse.ucsc.edu/admin/exe/
or compile it from source ( http://genome.ucsc.edu/admin/git.html ) if you're feeling lucky.
2. Create the following file in your home directory:
$ cat ~/.hg.conf
db.host=genome-mysql.cse.ucsc.edu
db.user=genomep
db.password=password
# the file's permissions must be user-only
$ chmod 0600 ~/.hg.conf
3. run "genePredToGtf" with any organism and any table that is in "genePred" format:
## mm9/RefSeq Genes
$ genePredToGtf mm9 refGene refGene.gtf
## hg19/Ensemble Genes
$ genePredToGtf hg19 ensGene ensGene.gtf
## hg18/UCSC Known Genes
$ genePredToGtf hg18 knownGene knownGene.gtf
This will save "refGene.gtf" with all the required attributes (gene_id, gene_name transcript_id, exon_number).
(but still, per your explanation, not directly usable with DESeq because of multiple isoforms per gene).
-gordon
More information about the Bioconductor
mailing list