[Bioc-sig-seq] getSeq with space names as factors vs characters

Tue Nov 16 03:32:40 CET 2010

sorry - I'd meant to include this too:

 > sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BSgenome.Mmusculus.UCSC.mm9_1.3.16 BSgenome_1.18.1
[3] Biostrings_2.18.0                  GenomicRanges_1.2.1
[5] IRanges_1.8.2

loaded via a namespace (and not attached):
[1] Biobase_2.10.0 tools_2.12.0

Janet

On Nov 15, 2010, at 6:30 PM, Janet Young wrote:

> Hi,
>
> I just updated R to 2.12.0 and BioC to the corresponding latest
> version.
>
> I've found some new maybe weird behavior in getSeq (Biostrings)  
> that's causing a little chaos for me using my code with the updated  
> BioC.  I think I can find a workaround but am also hoping getSeq  
> might be fixable fairly easily?
>
> Here's my issue: I'm using getSeq to extract multiple sequences at  
> once from the mouse genome, specifying coordinates using RangedData  
> objects. That works OK if I use the whole RangedData object, but  
> weird things start to happen if I just use subsets of the RangedData  
> object (something to do with factors versus characters for space  
> names, perhaps, or the function is getting confused with GRanges vs  
> RangedData?).
>
> library(BSgenome.Mmusculus.UCSC.mm9)
> library(IRanges)
>
> tempRD <- RangedData(IRanges(start=c(10000001,10000001),
>                              end=c(10000051,10000051)),
>                      space=c("chr1","chr2"))
>
> #### simple getSeq looks good
> getSeq(Mmusculus,tempRD)
> [1] "CTCTTACGTTTTATTCCCTCTTTATCTCAGCTTAGATCAGGGTAAACTTTC"
> [2] "AGGCCAACTTTTAGAGGTTGGCTCTCTCCTTCAATTGCATGTCCAGGGAGC"
>
> ### but if I subset the RangedData it doesn't look so good - I'd
> ### like the following command to give me just one sequence for the
> ### first region specified in tempRD, but instead it gives me that
> ### first region two times
> getSeq(Mmusculus,tempRD[1,])
> [1] "CTCTTACGTTTTATTCCCTCTTTATCTCAGCTTAGATCAGGGTAAACTTTC"
> [2] "CTCTTACGTTTTATTCCCTCTTTATCTCAGCTTAGATCAGGGTAAACTTTC"
>
> ### also if I have unused space names I get an error
>
> tempRD3 <- RangedData(IRanges(start=c(10000001,10000001,10000001),
>                               end=c(10000051,10000051,10000051)),
>                       space=as.character(c("chr1","chr2","chr3")))
>
> ######
> tempRD4 <- tempRD3[1:2,]
>
> getSeq(Mmusculus,tempRD4)
>
> Error in validObject(.Object) :
>  invalid class "GRanges" object: slot lengths are not all equal
> In addition: Warning message:
> In newCompressedList("CompressedSplitDataFrameList", x, splitFactor = f,  :
>  data length is not a multiple of split variable
>
> ### one possible workaround - get rid of the unused space name
> tempRD5 <- RangedData(IRanges(start(tempRD4),end(tempRD4)),
>                       space=as.character(space(tempRD4)))
> getSeq(Mmusculus,tempRD5)   #### now this works
>
> #############
>
> Hope that all makes some sense - thanks very much,
>
> Janet
>
>
>
> -------------------------------------------------------------------
>
> Dr. Janet Young (Trask lab)
>
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Avenue N., C3-168,
> P.O. Box 19024, Seattle, WA 98109-1024, USA.
>
> tel: (206) 667 1471 fax: (206) 667 6524
> email: jayoung  ...at...  fhcrc.org
>
> http://www.fhcrc.org/labs/trask/
>
> -------------------------------------------------------------------
>
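
For anyone reading this thread later: the "factors versus characters" guess above matches a general base-R behavior, sketched below. This is only an illustration with plain factors, not a diagnosis of getSeq itself. It assumes (and this is not confirmed anywhere in the thread) that the space names are tracked in a factor-like way, so that subsetting keeps a space name even when no ranges remain on it. The variable names are made up for the example.

```r
# Base-R illustration only; no Bioconductor packages are needed.
# Hypothetical stand-in for the space names of a three-range object.
spaces <- factor(c("chr1", "chr2", "chr3"))

# Subsetting a factor keeps every original level,
# including the now-unused "chr3".
subsetSpaces <- spaces[1:2]
levels(subsetSpaces)   # "chr1" "chr2" "chr3"

# Round-tripping through character rebuilds the factor with only the
# levels that are actually in use, which is what the as.character()
# workaround in the message above does for the space names.
rebuilt <- factor(as.character(subsetSpaces))
levels(rebuilt)        # "chr1" "chr2"
```

If that assumption holds, the number of space names and the number of ranges would disagree after subsetting, which would at least be consistent with the "slot lengths are not all equal" error quoted above.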