[BioC] How to combine different annotation (chromosomic, genic) when using easyRNASeq (was easyRNASeq error)

Nicolas Delhomme delhomme at embl.de
Fri Mar 9 14:28:22 CET 2012


Hi Wade,

Here is a description on how to deal with the problem you reported in the thread: "easyRNASeq error". I just paste the R code below, check the comments for the explanation of the different steps. I'll add a section in the vignette about this.

Cheers,

Nico


## libs
library(easyRNASeq) ## >= 1.1.10

## set wd
setwd("Your working dir")

## read in your data
## this is inefficient in comparison to a bam file, a you load as well the whole sequences in memory
aln <- readAligned("data",type="SolexaExport",pattern="*.txt.gz")

## free some mem!
gc()

## the names are different from what we expect (UCSC standards), as they have an additional .fa extension
levels(chromosome(aln)) ## faster than a table on the same content: table(chromosome(aln))

## they are different from what biomaRt will return as well, as these are 1:19, X, Y and MT plus others
## therefore it will be a problem from easyRNASeq as you have seen when you want to use biomaRt as an annotation source
## for this easyRNASeq requires the name of the organism, which circumvent the "custom" chromosome name map by-pass

## The solution is to fetch the annotation first
obj <- fetchAnnotation(new('RNAseq',
                           organismName="Mmusculus"
                           ),
                       method="biomaRt")

## the annotation are in the genomicAnnotation slot
gAnnot <- genomicAnnotation(obj)

## shall we subset to the chromosome we're interested in
## there are only 1181 NT contigs, so let's keep them
length(grep("NT_",space(gAnnot)))

## rename the chromosomes in the annotation
## that will ease our chr.map table generation
## note that it would not be necessary
names(gAnnot) <- paste("chr",names(gAnnot),".fa",sep="")

## run easyRNASeq using the local annotation
## since we are within a script, I'll directly use the gAnnot object.
## However as you will see this will raise double counting warnings.
## double counting reads is not what ones want, that's why the better way
## would be to rework the obtained annotation to remove/clarify these cases
## and save the obtained annotation as an rda file on your file system
## save(gAnnot,file="myAnnotation.rda")
## easyRNASeq can use such an rda file as annotation as well

## first I need some space
rm(aln,obj)
gc()

## get the chromosome sizes
## Again, note that the use of the BSgenome
## is not mandatory. It's just easy as they are
## available in Bioconductor. Typing in your own
## chromosome size list is as valid.
library(BSgenome.Mmusculus.UCSC.mm9)

## note that this might change to a vector soon, which is more intuitive
## it's historicaly due to the RangedData API
chr.sizes<- as.list(seqlengths(Mmusculus))

## create the chr.map
chr.map <- data.frame(from=paste("chr",c(1:19,"X","Y"),".fa",sep=""),to=paste("chr",c(1:19,"X","Y"),sep=""))

## get the geneModels count within an RNAseq object
## the advantage of using an RNAseq object is that
## you can access the geneModels annotation
## and tailor it to your needs as explained above

## Here we have several solutions
## using the entire set of annotation we got
## We need to define a set of Filter to ensure
## that the data is read properly
## The export file contains all the reads, so the one that do not pass the chastity filter have to be removed.
## In addition, some of the other reads are for internal QC, and they have no position. For that reason, we need
## to filter those out too.
## Reading in aln data is more resource exhausting that reading in bam files, as we are
## reading in the sequence and quality information as well, whereas we do not need them.

## As you will see, this generates a lot of warnings,
## because of the differing annotations
rnaSeq <- easyRNASeq(filesDirectory="data",
                     organism="custom",
                     chr.map=chr.map,
                     chr.sizes=chr.sizes,
                     filter=compose(
                       naPositionFilter(),
                       chastityFilter()),
                     readLength=50L,
                     annotationMethod="env",
                     annotationObject=gAnnot,
                     format="aln",
                     count="genes",
                     summarization= "geneModels",
                     filenames="1-Feb_ATCACG_L003_R1_001_export.txt.gz",
                     outputFormat="RNAseq",
                     nbCore=2
                     )

## one way to reduce the warnings is to select the chromosome we are interested in
## that's waht we do by adding the chr.sel selector
rnaSeq <- easyRNASeq(filesDirectory="data",
                     organism="custom",
                     chr.map=chr.map,
                     chr.sizes=chr.sizes,
                     chr.sel=chr.map$from,
                     filter=compose(
                       naPositionFilter(),
                       chastityFilter()),
                     readLength=50L,
                     annotationMethod="env",
                     annotationObject=gAnnot,
                     format="aln",
                     count="genes",
                     summarization= "geneModels",
                     filenames="1-Feb_ATCACG_L003_R1_001_export.txt.gz",
                     outputFormat="RNAseq",
                     nbCore=2
                     )

## To further reduce the warnings, we can manipulate the RangedData object
## to remove the unnecessary annotation. It's quite straightforward.
sel <- grep("NT_",names(gAnnot))
gAnnot <- RangedData(ranges=ranges(gAnnot)[-sel,],values=values(gAnnot)[-sel,])
colnames(gAnnot) <- gsub("values\\.","",colnames(gAnnot))

## This will only generate two warnings, one that could be easily dealt with (the chrMT)
## The other one is about what was discussed previously. There's a need to adapt the annotation
## to avoid potential double read counting.
rnaSeq <- easyRNASeq(filesDirectory="data",
                     organism="custom",
                     chr.map=chr.map,
                     chr.sizes=chr.sizes,
                     chr.sel=chr.map$from,
                     filter=compose(
                       naPositionFilter(),
                       chastityFilter()),
                     readLength=50L,
                     annotationMethod="env",
                     annotationObject=gAnnot,
                     format="aln",
                     count="genes",
                     summarization= "geneModels",
                     filenames="1-Feb_ATCACG_L003_R1_001_export.txt.gz",
                     outputFormat="RNAseq",
                     nbCore=2
                     )

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany



More information about the Bioconductor mailing list