[Bioc-sig-seq] "identical" on PhredQuality objects - I'm confused
Janet Young
jayoung at fhcrc.org
Wed Jun 15 21:23:45 CEST 2011
Hi,
Marcus - thanks for the self-contained example. I probably should have included something like that too!
and Herve - thank you VERY much - that's helpful. I'm beginning to understand this, but I have a little way to go still - I'm amazed at the subtleties there are this stuff.
I'm glad you explained the false negatives too: just this morning I saw one of those for the first time, and I would have spent a long time worrying about it if you hadn't explained...
I'll go with your ugly workaround (nothing arong with ugly for a biologist like me). That has brought up another question (as.data.frame on GRanges), but I'll send a separate email about that, as it's a separate issue.
Janet
On Jun 15, 2011, at 10:45 AM, Hervé Pagès wrote:
> Hi Janet,
>
> This is because identical() is not the appropriate tool when comparing
> objects that contain external pointers. It will sometimes report that
> different objects are identical, and sometimes report that identical
> objects are different.
>
> With different objects:
>
> > library(Biostrings)
> > x <- DNAString("AC")
> > y <- DNAString("GG")
> > identical(x, y)
> [1] TRUE
>
> With "identical" objects:
>
> > z <- subseq(DNAString("TGG"), start=2)
> > z
> 2-letter "DNAString" instance
> seq: GG
> > identical(y, z)
> [1] FALSE
>
> The 2nd one (false negative) is a fair one: even if 'y' and 'z'
> represent the same object (conceptually), they actually have different
> internal representations:
>
> > y at shared
> SharedRaw of length 2 (data starting at address 0x4833bf0)
> > y at offset
> [1] 0
> > z at shared
> SharedRaw of length 3 (data starting at address 0x4780350)
> > z at offset
> [1] 1
>
> Since identical() performs low-level comparison, it will report that
> the objects are different.
>
> For the 1st one (false positive), you should blame identical() for
> reporting that 2 external pointers are *always* identical:
>
> > xp1 <- .Call("externalptr_new", PACKAGE="IRanges")
> > xp2 <- .Call("externalptr_new", PACKAGE="IRanges")
> > IRanges:::address(xp1)
> [1] "0x2e20178"
> > IRanges:::address(xp2)
> [1] "0x2e20530"
> > identical(xp1, xp2)
> [1] TRUE
>
> Actually, the addresses shouldn't matter. What should matter is the
> data that the 2 external pointers are pointing at. But identical()
> doesn't try to "follow" the pointers in order to get to the data
> they are pointing at. It just reports that 2 external pointers are
> identical! Like here, where the 2 external pointers are pointing to
> data that differ in *value*:
>
> > identical(x at shared@xp, y at shared@xp)
> [1] TRUE
>
> One could try to make a case on the R-devel mailing list to change
> the behavior of identical() on external pointers, but that wouldn't
> solve the 2nd issue (false positives). This one could be addressed
> by having a new generic e.g. equal() with methods for Vector objects
> and their derivations that would know how to reliably compare 2
> objects. Most of the time, it would just do the same as identical(),
> but for objects that contain external pointers (like DNAString,
> DNAStringSet, PhredQuality, etc...), it would use implement a
> specialized logic.
>
> Any suggestion on this would be appreciated.
>
> In the meantime, the ugly workaround that I often use is to apply
> as.character on the objects I want to compare:
>
> > identical(as.character(x), as.character(y))
> [1] FALSE
> > identical(as.character(y), as.character(z))
> [1] TRUE
>
> Cheers,
> H.
>
> On 11-06-14 06:58 PM, Janet Young wrote:
>> Hi there,
>>
>> I'm investigating a mistake I made when running BWA (a little embarrassing, but it happens). I had initially failed to use the -I flag on BWA to tell it to convert our Illumina ASCII-64 qualities appropriately. Now I'm taking a look to see how much differnce it makes in our downstream analysis.
>>
>> Anyway, that's not why I'm writing - while delving into this I've found something unexpected about the "identical" function, and I'd love to understand that better if possible.
>>
>> I used scanBam to read in my two BWA output files - with and without the "-I" option when I ran BWA (same input seqs, everything else the same). As I expected, with the -I option BWA converted the qualities (my lane1read2_bwaWithI_bamscan object), and in the other it didn't (my lane1read2_bwaNotI_bamscan object). That makes sense to me - here's how the quals look after reading BWA output in with scanBam.
>>
>> ### when I use the -I option with BWA:
>>> head(lane1read2_bwaWithI_bamscan[[1]][["qual"]])
>> A PhredQuality instance of length 6
>> width seq
>> [1] 50 HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHFHHHHFHFF
>> [2] 36 GFFEEHHFHHHHHHHEFEEEHHHHHFHHHEEFEE at E
>> [3] 50 ##################################################
>> [4] 31 HHHHHHHHHHHHHHHHHHHHHHEHHHHFHFH
>> [5] 24 HHHHHHHHHHHHHHDFHHHHHHHH
>> [6] 50 FFFFCD=8BGEBHGHHHHEHGFGHHDHEHFHEHHHHGHHHHHHHHHGHHH
>>
>> ### when I don't use the -I option with BWA:
>>> head(lane1read2_bwaNotI_bamscan[[1]][["qual"]])
>> A PhredQuality instance of length 6
>> width seq
>> [1] 50 gggggggggggggggggggggggggggggggggggggggggeggggegee
>> [2] 36 feeddggegggggggdedddgggggegggddedd_d
>> [3] 50 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>> [4] 31 ggggggggggggggggggggggdggggegeg
>> [5] 24 ggggggggggggggcegggggggg
>> [6] 50 eeeebc\Wafdagfggggdgfefggcgdgegdggggfgggggggggfggg
>>
>>
>> The part I don't understand is when I use identical to test whether those two qual objects are the same, it says they are identical:
>>
>>> identical( lane1read2_bwaWithI_bamscan[[1]][["qual"]], lane1read2_bwaNotI_bamscan[[1]][["qual"]])
>> [1] TRUE
>>
>> In one sense they're the same (if you convert from Illumina qualities to normal Phred qualities), but how did identical know to do that? I might be misunderstanding identical - I thought that EVERYTHING about the object had to be identical.
>>
>> I want to understand this as it seems like a dangerous misunderstanding - I use identical a fair amount to make sure I'm not screwing things up.
>>
>> Really I should probably have converted the "qual" item in my scamBam-generated list to a SolexaQuality object to keep track of the fact they're offset by 64 and not 33, but that's OK. At this point I just want to take a look at what's changed about the BWA output when I use that -I flag.
>>
>> Any thoughts?
>>
>> thanks,
>>
>> Janet
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
More information about the Bioc-sig-sequencing
mailing list