[BioC] easyRNASeq error

Nicolas Delhomme delhomme at embl.de
Thu Mar 1 17:28:13 CET 2012


Dear Wade,

Thanks for the nice words, that's good to hear! Although I have to say it got me a little worried, because the more people praise, the more difficult the questions to come ;-)

I added the bioconductor mailing list in Cc, so that it benefits others.

The first error, I was surprised to see. From the parallel package documentation, I did not expect it, but it got well extended since I started using it and is now obvious that the MakeForkCluster would not work on windows. As I don't have access to any Windows machine, I would like to give you the code that I'll write to replace that per email (off list) so that you can try it and tell me if it fixes the issue. Would that be OK?


For the second error, you are correct. Note that it is NOT necessary to use the BSgenome for easyRNASeq, that's just that as they are available in Bioconductor it's easy to use them for exemplifying, but they are not mandatory. I need to state that more explicitly in the vignette. 
Back to your issue, I'm expecting in the alignment file and in the chr.sizes object chromosome names that follow UCSC conventions, e.g. chr1 to chr19 and chrX, chrY and chrM for the M.musculus for example. If it's not what I get, I try to "guess" them but obviously, since every sequencing facility has its own convention, it is not a straightforward business. For that reason, I've implemented a pass-by. To use it, we need to do the following

1) create a data.frame that contains the mapping from your different input. As you're using biomaRt as an annotationMethod, we will have three sets of chromosome names:

a) biomaRT: 1:19,"X","Y"
b) your alignment file: chr1.fa to chr19.fa, chrX.fa, chrY.fa, etc...
c) the chromosome file: chr1 to chr19, chrX, chrY, etc...

that we need to combine together. Since c) follows the UCSC convention, we'll use that as our standard. I'm not sure what you want to do about the NM, QC, RM and splice_sites-auto.fa, so I ignore them. From the name, you probably don't want to ignore that last one, but I'm not sure how you can use it. Here we go

chr.map <- data.frame(from=c(1:19,"X","Y",paste("chr",c(1:19,"X","Y"),".fa",sep="")),to=rep(paste("chr",c(1:19,"X","Y"),sep=""),2))

2) rerun you command with the chr.map argument and a "custom" organism argument
rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir,
                      organism="custom",
                      chr.sizes=chr.sizes,
                      filter=myfilt,
                      readLength=50L,
                      annotationMethod="biomaRt",
                       format="aln",
                      count="exons",
                      pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz$",
                      outputFormat="RNAseq",
			chr.map=chr.map
                      )

Note that I haven't tried that myself, I always had a least two set of chromosomes in agreement, so it might fail. If so, it would be great if I could get access to an excerpt of your data to try it out.

Finally, note that there's another way to by-pass this issue, if you are not  using biomaRt but a local annotation source (gff, gtf, rda, etc...). If this file would contain the same chromosome names as your alignment file, you could do the following: 

1) change the chromosome name in the chr.sizes to add  a ".fa" at the end
chr.sizes <- as.list(seqlengths(Mmusculus))
names(chr.sizes) <- paste(names(chr.sizes),"fa",sep=".")

2) rerun your command with that new chr.sizes
 and precise validity.check=FALSE in the easyRNASeq command

rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir,
                      organism="Mmusculus",
                      chr.sizes=chr.sizes,
                      filter=myfilt,
                      readLength=50L,
                      annotationMethod="biomaRt",
                       format="aln",
                      count="exons",
                      pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz$",
                      outputFormat="RNAseq",
			validity.check=FALSE
                      )

You need to be careful about what you're doing there, since you shut off all the consistency check on the chromosome naming.


Cheers,

Nico


---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------





On 29 Feb 2012, at 22:07, Davis, Wade wrote:

> Dear Nicolas,
> First , I want to thank you for the hard work you have put in making the easyRNASeq package. I am very familiar with R and BioConductor, and NGS statistical models. But I have been learning over the past month how to deal with raw NGS for the first time, and the number of different ways and packages is a bit overwhelming. Fortunately, I discovered your package, and it has simplified things greatly and helped me understand the workflow better to getting to a count table. So once again, thank you! I am sure you will have many users in the coming months.
>  
> I have 2 questions. The first relates to the error below which seems to come from the summarization method I am using, which must involve trying to do some work in parallel. However, since the default nbCore is 1, I am surprised it still tries to use multiple processers (I am on a 8 core machine). Is there anything I can do about this? I really want to use the geneModels..
> My second question follows the output below.
>  
> Thanks for your help and time!
>  
> J. Wade Davis, PhD
> Associate Professor, Department of Health Management and Informatics
> Co-Director, Biostatistics and Research Design Unit
> 187 Galena DC 018.0
> University of Missouri
> Columbia, MO 65212
> Phone: (573) 882-0770
> Fax: (573) 884-4196
> www.biostat.missouri.edu
>  
>  
>  
> > # run this code only the first time this way to get the annotations;
> > # run the subsequent code for future use....
> > rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir,
> +                      organism="Mmusculus",
> +                      chr.sizes=chr.sizes,
> +                      filter=myfilt,
> +                      readLength=50L,
> +                      annotationMethod="biomaRt",
> +                     # format="SolexaExport",
> +                       format="aln",
> +                      #withId=TRUE,
> +                      count="genes",
> +                      summarization= "geneModels",
> +                      #count="exons", #generates error message
> +                      #filenames=fastqfilenames[1],
> +                      pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz$",
> +                      outputFormat="RNAseq"
> +                      )
> Checking arguments...
> Fetching annotations...
> Computing gene models...
> Error in makeForkCluster(nnodes = nbCore) :
>   fork clusters are not supported on Windows
> > sessionInfo()
> R Under development (unstable) (2012-02-28 r58513)
> Platform: x86_64-pc-mingw32/x64 (64-bit)
>  
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                         
> [5] LC_TIME=English_United States.1252   
>  
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base    
>  
> other attached packages:
> [1] RnaSeqTutorial_0.0.7                   BSgenome.Dmelanogaster.UCSC.dm3_1.3.17 BSgenome.Mmusculus.UCSC.mm9_1.3.17     easyRNASeq_1.1.7                     
>  [5] ShortRead_1.13.13                      latticeExtra_0.6-20                    RColorBrewer_1.0-5                     Rsamtools_1.7.37                     
>  [9] DESeq_1.7.6                            locfit_1.5-6                           lattice_0.20-0                         akima_0.5-7                           
> [13] BSgenome_1.23.2                        GenomicRanges_1.7.28                   Biostrings_2.23.6                      IRanges_1.13.26                      
> [17] edgeR_2.5.9                            limma_3.11.15                          biomaRt_2.11.1                         Biobase_2.15.3                       
> [21] BiocGenerics_0.1.7                     genomeIntervals_1.11.0                 intervals_0.13.3                     
>  
> loaded via a namespace (and not attached):
> [1] annotate_1.33.2       AnnotationDbi_1.17.22 bitops_1.0-4.1        DBI_0.2-5             genefilter_1.37.0     geneplotter_1.33.1    grid_2.15.0         
>  [8] hwriter_1.3           RCurl_1.91-1.1        RSQLite_0.11.1        splines_2.15.0        survival_2.36-12      tools_2.15.0          XML_3.9-4.1         
> [15] xtable_1.7-0          zlibbioc_1.1.1      
> > library(parallel)
> > ?makeForkClsuter
> No documentation for ‘makeForkClsuter’ in specified packages and libraries:
> you could try ‘??makeForkClsuter’
>  
>  
> My second question is about a different error message, which seems very close to one you dealt with on the mailing list a few weeks ago, but the person never said if it was resolved:
>  
> http://thread.gmane.org/gmane.science.biology.informatics.conductor/38841/focus=38858
>  
> Here is my code and the error:
>  
>  
> rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir,
> +                      organism="Mmusculus",
> +                      chr.sizes=chr.sizes,
> +                      filter=myfilt,
> +                      readLength=50L,
> +                      annotationMethod="biomaRt",
> +                       format="aln",
> +                      count="exons",
> +                      pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz$",
> +                      outputFormat="RNAseq"
> +                      )
> Checking arguments...
> Fetching annotations...
> Summarizing counts...
> Processing 1-Feb_ATCACG_L003_R1_001_export.txt.gz
> Error in FUN(c("chr1.fa", "chr10.fa", "chr11.fa", "chr12.fa", "chr13.fa",  :
>   (list) object cannot be coerced to type 'integer'
> In addition: Warning message:
> In easyRNASeq(filesDirectory = extdataDir, organism = "Mmusculus",  :
>   There are 429316 features/exons defined in your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want?
>  
> What easyRNASeq is using as my chromosome names is different from my Illumina file (see below). I’m sure this is the source of my error, I just don’t know how to change the names and fix this. The file I am dealing with from the sequencing core is aligned I am told as such “The alignment was run in a splice-aware fashion using as input the mm9 build of the mouse genome from UCSC as provided by Illumina.”
>  
>  
> From my Illumina export/fastQ file read in using ReadAlign:
>  
>  
> table(chromosome(aln))
>  
>              chr1.fa             chr10.fa             chr11.fa             chr12.fa             chr13.fa             chr14.fa
>               285307               221943               395265               115402               108108               127673
>             chr15.fa             chr16.fa             chr17.fa             chr18.fa             chr19.fa              chr2.fa
>               159310               143801               201871                83037               156612               304649
>              chr3.fa              chr4.fa              chr5.fa              chr6.fa              chr7.fa              chr8.fa
>               168498               222071               234236               229869               297021               194308
>              chr9.fa              chrX.fa              chrY.fa                   NM                   QC                   RM
>               264232               124942                    3              1009813                 5877              1074323
> splice_sites-auto.fa
>               804227
>  
>  
>  
>  
> as.list(seqlengths(Mmusculus))
>  
>  
> $chr1
> [1] 197195432
>  
> $chr2
> [1] 181748087
>  
> $chr3
> [1] 159599783
>  
> $chr4
> [1] 155630120
>  
> $chr5
> [1] 152537259
>  
> $chr6
> [1] 149517037
>  
> $chr7
> [1] 152524553
>  
> $chr8
> [1] 131738871
>  
> $chr9
> [1] 124076172
>  
> $chr10
> [1] 129993255
>  
> $chr11
> [1] 121843856
>  
> $chr12
> [1] 121257530
>  
> $chr13
> [1] 120284312
>  
> $chr14
> [1] 125194864
>  
> $chr15
> [1] 103494974
>  
> $chr16
> [1] 98319150
>  
> $chr17
> [1] 95272651
>  
> $chr18
> [1] 90772031
>  
> $chr19
> [1] 61342430
>  
> $chrX
> [1] 166650296
>  
> $chrY
> [1] 15902555
>  
> $chrM
> [1] 16299
>  
> $chr1_random
> [1] 1231697
>  
> $chr3_random
> [1] 41899
>  
> $chr4_random
> [1] 160594
>  
> $chr5_random
> [1] 357350
>  
> $chr7_random
> [1] 362490
>  
> $chr8_random
> [1] 849593
>  
> $chr9_random
> [1] 449403
>  
> $chr13_random
> [1] 400311
>  
> $chr16_random
> [1] 3994
>  
> $chr17_random
> [1] 628739
>  
> $chrX_random
> [1] 1785075
>  
> $chrY_random
> [1] 58682461
>  
> $chrUn_random
> [1] 5900358
>  
>  
>  
>  



More information about the Bioconductor mailing list