[BioC] easyRNASeq error
Nicolas Delhomme
delhomme at embl.de
Thu Mar 1 17:28:13 CET 2012
Dear Wade,
Thanks for the nice words, that's good to hear! Although I have to say it got me a little worried, because the more people praise, the more difficult the questions to come ;-)
I added the bioconductor mailing list in Cc, so that it benefits others.
The first error, I was surprised to see. From the parallel package documentation, I did not expect it, but it got well extended since I started using it and is now obvious that the MakeForkCluster would not work on windows. As I don't have access to any Windows machine, I would like to give you the code that I'll write to replace that per email (off list) so that you can try it and tell me if it fixes the issue. Would that be OK?
For the second error, you are correct. Note that it is NOT necessary to use the BSgenome for easyRNASeq, that's just that as they are available in Bioconductor it's easy to use them for exemplifying, but they are not mandatory. I need to state that more explicitly in the vignette.
Back to your issue, I'm expecting in the alignment file and in the chr.sizes object chromosome names that follow UCSC conventions, e.g. chr1 to chr19 and chrX, chrY and chrM for the M.musculus for example. If it's not what I get, I try to "guess" them but obviously, since every sequencing facility has its own convention, it is not a straightforward business. For that reason, I've implemented a pass-by. To use it, we need to do the following
1) create a data.frame that contains the mapping from your different input. As you're using biomaRt as an annotationMethod, we will have three sets of chromosome names:
a) biomaRT: 1:19,"X","Y"
b) your alignment file: chr1.fa to chr19.fa, chrX.fa, chrY.fa, etc...
c) the chromosome file: chr1 to chr19, chrX, chrY, etc...
that we need to combine together. Since c) follows the UCSC convention, we'll use that as our standard. I'm not sure what you want to do about the NM, QC, RM and splice_sites-auto.fa, so I ignore them. From the name, you probably don't want to ignore that last one, but I'm not sure how you can use it. Here we go
chr.map <- data.frame(from=c(1:19,"X","Y",paste("chr",c(1:19,"X","Y"),".fa",sep="")),to=rep(paste("chr",c(1:19,"X","Y"),sep=""),2))
2) rerun you command with the chr.map argument and a "custom" organism argument
rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir,
organism="custom",
chr.sizes=chr.sizes,
filter=myfilt,
readLength=50L,
annotationMethod="biomaRt",
format="aln",
count="exons",
pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz$",
outputFormat="RNAseq",
chr.map=chr.map
)
Note that I haven't tried that myself, I always had a least two set of chromosomes in agreement, so it might fail. If so, it would be great if I could get access to an excerpt of your data to try it out.
Finally, note that there's another way to by-pass this issue, if you are not using biomaRt but a local annotation source (gff, gtf, rda, etc...). If this file would contain the same chromosome names as your alignment file, you could do the following:
1) change the chromosome name in the chr.sizes to add a ".fa" at the end
chr.sizes <- as.list(seqlengths(Mmusculus))
names(chr.sizes) <- paste(names(chr.sizes),"fa",sep=".")
2) rerun your command with that new chr.sizes
and precise validity.check=FALSE in the easyRNASeq command
rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir,
organism="Mmusculus",
chr.sizes=chr.sizes,
filter=myfilt,
readLength=50L,
annotationMethod="biomaRt",
format="aln",
count="exons",
pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz$",
outputFormat="RNAseq",
validity.check=FALSE
)
You need to be careful about what you're doing there, since you shut off all the consistency check on the chromosome naming.
Cheers,
Nico
---------------------------------------------------------------
Nicolas Delhomme
Genome Biology Computational Support
European Molecular Biology Laboratory
Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------
On 29 Feb 2012, at 22:07, Davis, Wade wrote:
> Dear Nicolas,
> First , I want to thank you for the hard work you have put in making the easyRNASeq package. I am very familiar with R and BioConductor, and NGS statistical models. But I have been learning over the past month how to deal with raw NGS for the first time, and the number of different ways and packages is a bit overwhelming. Fortunately, I discovered your package, and it has simplified things greatly and helped me understand the workflow better to getting to a count table. So once again, thank you! I am sure you will have many users in the coming months.
>
> I have 2 questions. The first relates to the error below which seems to come from the summarization method I am using, which must involve trying to do some work in parallel. However, since the default nbCore is 1, I am surprised it still tries to use multiple processers (I am on a 8 core machine). Is there anything I can do about this? I really want to use the geneModels..
> My second question follows the output below.
>
> Thanks for your help and time!
>
> J. Wade Davis, PhD
> Associate Professor, Department of Health Management and Informatics
> Co-Director, Biostatistics and Research Design Unit
> 187 Galena DC 018.0
> University of Missouri
> Columbia, MO 65212
> Phone: (573) 882-0770
> Fax: (573) 884-4196
> www.biostat.missouri.edu
>
>
>
> > # run this code only the first time this way to get the annotations;
> > # run the subsequent code for future use....
> > rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir,
> + organism="Mmusculus",
> + chr.sizes=chr.sizes,
> + filter=myfilt,
> + readLength=50L,
> + annotationMethod="biomaRt",
> + # format="SolexaExport",
> + format="aln",
> + #withId=TRUE,
> + count="genes",
> + summarization= "geneModels",
> + #count="exons", #generates error message
> + #filenames=fastqfilenames[1],
> + pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz$",
> + outputFormat="RNAseq"
> + )
> Checking arguments...
> Fetching annotations...
> Computing gene models...
> Error in makeForkCluster(nnodes = nbCore) :
> fork clusters are not supported on Windows
> > sessionInfo()
> R Under development (unstable) (2012-02-28 r58513)
> Platform: x86_64-pc-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] RnaSeqTutorial_0.0.7 BSgenome.Dmelanogaster.UCSC.dm3_1.3.17 BSgenome.Mmusculus.UCSC.mm9_1.3.17 easyRNASeq_1.1.7
> [5] ShortRead_1.13.13 latticeExtra_0.6-20 RColorBrewer_1.0-5 Rsamtools_1.7.37
> [9] DESeq_1.7.6 locfit_1.5-6 lattice_0.20-0 akima_0.5-7
> [13] BSgenome_1.23.2 GenomicRanges_1.7.28 Biostrings_2.23.6 IRanges_1.13.26
> [17] edgeR_2.5.9 limma_3.11.15 biomaRt_2.11.1 Biobase_2.15.3
> [21] BiocGenerics_0.1.7 genomeIntervals_1.11.0 intervals_0.13.3
>
> loaded via a namespace (and not attached):
> [1] annotate_1.33.2 AnnotationDbi_1.17.22 bitops_1.0-4.1 DBI_0.2-5 genefilter_1.37.0 geneplotter_1.33.1 grid_2.15.0
> [8] hwriter_1.3 RCurl_1.91-1.1 RSQLite_0.11.1 splines_2.15.0 survival_2.36-12 tools_2.15.0 XML_3.9-4.1
> [15] xtable_1.7-0 zlibbioc_1.1.1
> > library(parallel)
> > ?makeForkClsuter
> No documentation for ‘makeForkClsuter’ in specified packages and libraries:
> you could try ‘??makeForkClsuter’
>
>
> My second question is about a different error message, which seems very close to one you dealt with on the mailing list a few weeks ago, but the person never said if it was resolved:
>
> http://thread.gmane.org/gmane.science.biology.informatics.conductor/38841/focus=38858
>
> Here is my code and the error:
>
>
> rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir,
> + organism="Mmusculus",
> + chr.sizes=chr.sizes,
> + filter=myfilt,
> + readLength=50L,
> + annotationMethod="biomaRt",
> + format="aln",
> + count="exons",
> + pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz$",
> + outputFormat="RNAseq"
> + )
> Checking arguments...
> Fetching annotations...
> Summarizing counts...
> Processing 1-Feb_ATCACG_L003_R1_001_export.txt.gz
> Error in FUN(c("chr1.fa", "chr10.fa", "chr11.fa", "chr12.fa", "chr13.fa", :
> (list) object cannot be coerced to type 'integer'
> In addition: Warning message:
> In easyRNASeq(filesDirectory = extdataDir, organism = "Mmusculus", :
> There are 429316 features/exons defined in your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want?
>
> What easyRNASeq is using as my chromosome names is different from my Illumina file (see below). I’m sure this is the source of my error, I just don’t know how to change the names and fix this. The file I am dealing with from the sequencing core is aligned I am told as such “The alignment was run in a splice-aware fashion using as input the mm9 build of the mouse genome from UCSC as provided by Illumina.”
>
>
> From my Illumina export/fastQ file read in using ReadAlign:
>
>
> table(chromosome(aln))
>
> chr1.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa
> 285307 221943 395265 115402 108108 127673
> chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr2.fa
> 159310 143801 201871 83037 156612 304649
> chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa
> 168498 222071 234236 229869 297021 194308
> chr9.fa chrX.fa chrY.fa NM QC RM
> 264232 124942 3 1009813 5877 1074323
> splice_sites-auto.fa
> 804227
>
>
>
>
> as.list(seqlengths(Mmusculus))
>
>
> $chr1
> [1] 197195432
>
> $chr2
> [1] 181748087
>
> $chr3
> [1] 159599783
>
> $chr4
> [1] 155630120
>
> $chr5
> [1] 152537259
>
> $chr6
> [1] 149517037
>
> $chr7
> [1] 152524553
>
> $chr8
> [1] 131738871
>
> $chr9
> [1] 124076172
>
> $chr10
> [1] 129993255
>
> $chr11
> [1] 121843856
>
> $chr12
> [1] 121257530
>
> $chr13
> [1] 120284312
>
> $chr14
> [1] 125194864
>
> $chr15
> [1] 103494974
>
> $chr16
> [1] 98319150
>
> $chr17
> [1] 95272651
>
> $chr18
> [1] 90772031
>
> $chr19
> [1] 61342430
>
> $chrX
> [1] 166650296
>
> $chrY
> [1] 15902555
>
> $chrM
> [1] 16299
>
> $chr1_random
> [1] 1231697
>
> $chr3_random
> [1] 41899
>
> $chr4_random
> [1] 160594
>
> $chr5_random
> [1] 357350
>
> $chr7_random
> [1] 362490
>
> $chr8_random
> [1] 849593
>
> $chr9_random
> [1] 449403
>
> $chr13_random
> [1] 400311
>
> $chr16_random
> [1] 3994
>
> $chr17_random
> [1] 628739
>
> $chrX_random
> [1] 1785075
>
> $chrY_random
> [1] 58682461
>
> $chrUn_random
> [1] 5900358
>
>
>
>
More information about the Bioconductor
mailing list