[Bioc-sig-seq] as.data.frame on GRanges object with DNAStringSet in values
Hervé Pagès
hpages at fhcrc.org
Thu Jun 16 01:26:40 CEST 2011
On 11-06-15 03:38 PM, Michael Lawrence wrote:
>
>
> 2011/6/15 Hervé Pagès <hpages at fhcrc.org>
>
> Hi Michael, Janet,
>
> I just added an "as.vector" method for XStringSet objects to
> Biostrings 2.21.6:
>
> > library(Biostrings)
> > x <- DNAStringSet(c("aaatg", "gt"))
> > as.vector(x)
> [1] "AAATG" "GT"
>
> But that doesn't solve Janet's problem:
>
> > df <- DataFrame(id=c("ID1", "ID2"), seqs=x)
> > df
> DataFrame with 2 rows and 2 columns
>            id           seqs
>   <character> <DNAStringSet>
> 1         ID1          AAATG
> 2         ID2             GT
> > as.data.frame(df)
>
> Error in as.data.frame.default(y, optional = TRUE, ...) :
>   cannot coerce class 'structure("DNAStringSet", package =
>   "Biostrings")' into a data.frame
>
> Michael?
>
>
> Well, sorry for that. I just added a coercion from Vector to data.frame
> through as.vector, so this works.
Thanks!
> But someone might add a coercion from
> List to data.frame that would treat the elements as columns. Would this
> make sense?
Hard to tell. Maybe sometimes it would make sense, but sometimes it
definitely does not (e.g. DNAStringSet).
> AtomicList to data.frame does something even stranger: it
> creates a two column data frame with the unlisted values and
> names/indices rep'd out as a factor. Actually, that's kind of cool,
> since usually one does not have a list with equal element lengths, but
> it's somewhat unintuitive. But why does it apply only to AtomicList?
Glad you brought this up.

For the record, "as.vector" also unrolls an AtomicList:

  > as.vector(IntegerList(1:4, 0:-2))
  [1]  1  2  3  4  0 -1 -2

IMO, we should not do things like that, because:

  1) The same can be achieved with unlist():

       > unlist(IntegerList(1:4, 0:-2))
       [1]  1  2  3  4  0 -1 -2

  2) It's totally unintuitive to use as.vector for unlisting
     a list (as.vector on a standard list does not do that).

  3) There is a strong expectation that as.vector() will preserve
     the length of its input.

So I propose to deprecate those "as.vector" and "as.data.frame"
methods for AtomicList objects.
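
To illustrate points 2) and 3) with plain base R, here is a minimal
sketch. It uses an ordinary named list rather than an AtomicList, so it
only shows the base behaviour that the AtomicList methods depart from.
The variable names are made up for the example. as.vector() leaves a
standard list as a list of the same length, unlist() is the explicit way
to flatten it, and stack() gives a long two-column data frame that is
similar in spirit to what as.data.frame does on an AtomicList.

```r
# An ordinary named list with elements of unequal length
# (a stand-in for an AtomicList; no Bioconductor packages needed).
x <- list(a = 1:4, b = 0:-2)

# as.vector() on a standard list does not unlist it: the result is
# still a list and its length is unchanged.
v <- as.vector(x)
is.list(v)
length(v) == length(x)

# unlist() is the explicit way to flatten the list.
flat <- unlist(x, use.names = FALSE)
flat

# stack() builds a two-column data frame: the unlisted values, plus a
# factor ("ind") recording which list element each value came from.
long <- stack(x)
long
```

So base R already keeps the two operations apart: flattening has to be
requested explicitly, and the long-format data frame has its own verb.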
H.
> Anyway, given the special correspondence between a XStringSet and a
> character vector, we could always add an as.data.frame method for
> XStringSet, just to make sure stuff behaves as expected.
>
> Thanks,
> H.
>
>
> > sessionInfo()
> R version 2.14.0 Under development (unstable) (2011-05-30 r56024)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
> [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] Biostrings_2.21.6 IRanges_1.11.10
>
>
>
> On 11-06-15 12:49 PM, Janet Young wrote:
>
> yes - as.character seems a good choice, I think
>
> thanks,
>
> Janet
>
> On Jun 15, 2011, at 12:46 PM, Michael Lawrence wrote:
>
> So you would expect that the DNAStringSet is converted to a
> character vector? DNAStringSet (technically XStringSet) then
> just needs an as.vector method that delegates to as.character.
>
> Michael
>
>
> On Wed, Jun 15, 2011 at 12:37 PM, Janet Young <jayoung at fhcrc.org> wrote:
> Hi there,
>
> I'm trying to use as.data.frame on a GRanges object. On
> regular GRanges objects it works fine but I have some
> objects that contain a DNAStringSet in the values column,
> which isn't built in to the as.data.frame method. Is it
> possible to add the ability to coerce the DNAStringSet too,
> please?
>
> Here's some code that demonstrates the issue:
>
> ################
> library(GenomicRanges)
> library(Biostrings)
>
> gr1<-
> GRanges(seqnames=rep("chr1",3),ranges=IRanges(start=c(1,101,201),width=50),strand=c("+","-","+"),
> genenames=c("seq1","seq2","seq3") )
>
> as.data.frame(gr1)
> # works
>
> gr2<- gr1
> values(gr2)[,"myseqs"]<- DNAStringSet(c ("AACGTG",
> "ACGGTGGTGTT", "GAGGCTG"))
>
> as.data.frame(gr2)
> # Error in as.data.frame.default(y, optional = TRUE, ...) :
> #   cannot coerce class 'structure("DNAStringSet", package =
> #   "Biostrings")' into a data.frame
> ################
>
> and here's sessionInfo() output:
>
> R version 2.13.0 (2011-04-13)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets
> methods base
>
> other attached packages:
> [1] Biostrings_2.20.1 GenomicRanges_1.4.6 IRanges_1.10.4
>
> ################
>
>
> You might wonder why I'm storing sequences in the GRanges
> values - in my real data they're sequencing reads that have
> mapped back to that region, but I'm still curious to
> maintain the sequence itself (for the moment) because it's
> not always identical to the underlying genomic sequence of
> that region (investigating mapping issues).
>
> (and my desire to use as.data.frame relates to a suggestion
> from Herve to let me work around some issues with the
> identical function)
>
> thanks,
>
> Janet
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
>