[Bioc-sig-seq] GRanges, failure assigning chromosome lengths

Tue Apr 27 16:39:11 CEST 2010

On Tue, Apr 27, 2010 at 10:33 AM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
> Patrick,
>
> In my opinion, there is no perfect solution, only compromise solution.
>
> There is a group of java-based programs called Picard. It is a package
> designed to work with SAM/BAM tool. It is good and well established.
> In my opinion, it is 'the' sam/bam toolbox.
>
> In that package, the problem of DNA circularity is also present. They
> way Picard deals with this problem is by providing an option to
> override the over-the-bound check. As long as the user knows that some
> DNAs are circular, everything is OK.
>
> The BAM header contains the chromosome lengths and yet it does hold
> reads that lie beyond the sequence boundaries.

Actually, the edge case is general as alignments, even on linear
chromosomes, may extend beyond the end of the chromosome, I believe.
In the best case, these alignments are clipped (in CIGAR terms), but I
don't know that all aligners are doing that appropriately.

Sean

> I vote for an overriding option rather than an infrastructure overhaul.
>
> Thank you,
>
> Ivan
>
>
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
>
>
>
> On Mon, Apr 26, 2010 at 6:51 PM, Patrick Aboyoun <paboyoun at fhcrc.org> wrote:
>> Ivan,
>> The chr4 and chr4_random type issue is easily solved by adding the length of
>> chr?_random to chr? for each chromosome. Perhaps we can even add an argument
>> to seqlengths,BSgenome-method to perform that collapsing.
>>
>> The case of circular DNA (e.g. mitochondrial) is a bit trickier since our
>> infrastructure assumes linear DNA. To address that case we would first have
>> to create a linearized DNA coordinate system and then remap all those
>> alignments that fall into the extended region. That is unless people really
>> want to work in circular coordinates, which would involve numerous changes
>> to the infrastructure.
>>
>> I am open to any suggestions.
>>
>>
>> Patrick
>>
>>
>>
>> On 4/26/10 2:17 PM, Ivan Gregoretti wrote:
>>>
>>> Hi Patrick.
>>>
>>> Sure. Here is the output. To the best of my understanding, most
>>> aligners assume that, for instance, 'ch4_random' is just a linear
>>> continuation of 'chr4'.
>>>
>>> Thank you,
>>>
>>> Ivan
>>>
>>>
>>>
>>>>
>>>> range(A)
>>>>
>>>
>>> GRanges with 67 ranges and 0 elementMetadata values
>>>          seqnames               ranges strand   |
>>>             <Rle>             <IRanges>   <Rle>    |
>>>  [1]         chr1 [3000311, 197195093]      +   |
>>>  [2]         chr1 [2999944, 197195305]      -   |
>>>  [3]        chr10 [3000180, 129992771]      +   |
>>>  [4]        chr10 [3000092, 129993250]      -   |
>>>  [5]        chr11 [3000006, 121843794]      +   |
>>>  [6]        chr11 [3000010, 121843708]      -   |
>>>  [7]        chr12 [3000095, 121257390]      +   |
>>>  [8]        chr12 [3000144, 121257410]      -   |
>>>  [9]        chr13 [3000434, 120284396]      +   |
>>>  ...          ...                  ...    ... ...
>>> [59] chrUn_random [    -21,   5900319]      -   |
>>> [60]         chrX [3000042, 166627000]      +   |
>>> [61]         chrX [2999944, 166650266]      -   |
>>> [62]  chrX_random [    326,   1051594]      +   |
>>> [63]  chrX_random [    229,   1309438]      -   |
>>> [64]         chrY [    150,   2902454]      +   |
>>> [65]         chrY [     -2,   2902181]      -   |
>>> [66]  chrY_random [   7539,  58497416]      +   |
>>> [67]  chrY_random [   1247,  58489658]      -   |
>>>
>>> seqlengths
>>>          chr1        chr10        chr11 ...         chrY  chrY_random
>>>            NA           NA           NA ...           NA           NA
>>>
>>>
>>>
>>>>
>>>> range(A[seqnames(A)=='chrM'])
>>>>
>>>
>>> GRanges with 2 ranges and 0 elementMetadata values
>>>     seqnames       ranges strand |
>>>        <Rle>     <IRanges>   <Rle>  |
>>> [1]     chrM [  4, 16387]      + |
>>> [2]     chrM [-82, 16298]      - |
>>>
>>>
>>>
>>>>
>>>> range(B)
>>>>
>>>
>>> GRanges with 65 ranges and 0 elementMetadata values
>>>          seqnames               ranges strand   |
>>>             <Rle>             <IRanges>   <Rle>    |
>>>  [1]         chr1 [3000506, 197194916]      +   |
>>>  [2]         chr1 [3000977, 197194382]      -   |
>>>  [3]        chr10 [3000626, 129989507]      +   |
>>>  [4]        chr10 [3000643, 129988346]      -   |
>>>  [5]        chr11 [3000057, 121843783]      +   |
>>>  [6]        chr11 [3000010, 121843699]      -   |
>>>  [7]        chr12 [3001370, 121256864]      +   |
>>>  [8]        chr12 [3000135, 121257132]      -   |
>>>  [9]        chr13 [3000103, 120284379]      +   |
>>>  ...          ...                  ...    ... ...
>>> [57] chrUn_random [    125,   5900272]      -   |
>>> [58]         chrX [3001345, 166614519]      +   |
>>> [59]         chrX [3000031, 166628394]      -   |
>>> [60]  chrX_random [   1177,   1051588]      +   |
>>> [61]  chrX_random [    254,   1051506]      -   |
>>> [62]         chrY [    141,   2902458]      +   |
>>> [63]         chrY [    744,   2901128]      -   |
>>> [64]  chrY_random [  39555,  58472198]      +   |
>>> [65]  chrY_random [  71505,  58487987]      -   |
>>>
>>> seqlengths
>>>          chr1        chr10        chr11 ...         chrY  chrY_random
>>>            NA           NA           NA ...           NA           NA
>>>
>>>
>>>
>>>>
>>>> range(B[seqnames(B)=='chrM'])
>>>>
>>>
>>> GRanges with 2 ranges and 0 elementMetadata values
>>>     seqnames       ranges strand |
>>>        <Rle>     <IRanges>   <Rle>  |
>>> [1]     chrM [  5, 16385]      + |
>>> [2]     chrM [-77, 16299]      - |
>>>
>>>
>>>
>>>>
>>>> range(C)
>>>>
>>>
>>> GRanges with 67 ranges and 0 elementMetadata values
>>>          seqnames               ranges strand   |
>>>             <Rle>             <IRanges>   <Rle>    |
>>>  [1]         chr1 [3000125, 197195029]      +   |
>>>  [2]         chr1 [2999981, 197194997]      -   |
>>>  [3]        chr10 [3000098, 129993293]      +   |
>>>  [4]        chr10 [3000105, 129993219]      -   |
>>>  [5]        chr11 [3000007, 121843789]      +   |
>>>  [6]        chr11 [3000010, 121843707]      -   |
>>>  [7]        chr12 [3000113, 121257571]      +   |
>>>  [8]        chr12 [3000011, 121257521]      -   |
>>>  [9]        chr13 [3000163, 120284384]      +   |
>>>  ...          ...                  ...    ... ...
>>> [59] chrUn_random [    -75,   5900330]      -   |
>>> [60]         chrX [3000056, 166650160]      +   |
>>> [61]         chrX [2999951, 166650076]      -   |
>>> [62]  chrX_random [    327,   1200747]      +   |
>>> [63]  chrX_random [    236,   1508524]      -   |
>>> [64]         chrY [     38,   2902563]      +   |
>>> [65]         chrY [    -80,   2902221]      -   |
>>> [66]  chrY_random [   5699,  58492209]      +   |
>>> [67]  chrY_random [   7154,  58486839]      -   |
>>>
>>> seqlengths
>>>          chr1        chr10        chr11 ...         chrY  chrY_random
>>>            NA           NA           NA ...           NA           NA
>>>
>>>
>>>
>>>>
>>>> range(C[seqnames(C)=='chrM'])
>>>>
>>>
>>> GRanges with 2 ranges and 0 elementMetadata values
>>>     seqnames       ranges strand |
>>>        <Rle>     <IRanges>   <Rle>  |
>>> [1]     chrM [  1, 16387]      + |
>>> [2]     chrM [-81, 16332]      - |
>>>
>>>
>>> library(BSgenome.Mmusculus.UCSC.mm9)
>>>
>>>>
>>>> length(Mmusculus$chrM)
>>>>
>>>
>>> [1] 16299
>>>
>>>
>>>
>>> Ivan Gregoretti, PhD
>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>> National Institutes of Health
>>>
>>>
>>>
>>> On Mon, Apr 26, 2010 at 4:35 PM, Patrick Aboyoun<paboyoun at fhcrc.org>
>>>  wrote:
>>>
>>>>
>>>> Ivan,
>>>> Could you provide me with the results of
>>>>
>>>> range(Z)
>>>>
>>>> for all three of the GRanges objects that seqlengths<- throws and error
>>>> on?
>>>> I would like to see if there is some adjustment we can make to the
>>>> seqlengths<- function that will resolve your issue.
>>>>
>>>> Thanks,
>>>> Patrick
>>>>
>>>>
>>>> On 4/26/10 12:57 PM, Ivan Gregoretti wrote:
>>>>
>>>>>
>>>>> Hi Patrick,
>>>>>
>>>>> You are correct. The validity check is rejecting the input. The
>>>>> problem with that is that I have three set, all of them failing perhap
>>>>> because one or two tags are out of bounds.
>>>>>
>>>>> This tags are coming straight from the Illumina sequencer.  There is
>>>>> no manipulation. Perhaps it is complaining because the validity check
>>>>> does not know that Mitochondrial DNA is circular.
>>>>>
>>>>> Bottom line, anybody attempting to load the data as GRanges from a
>>>>> high coverage tag set will find this seqlengths() rejection.
>>>>>
>>>>> Is there any way to override it or should I just refrain from
>>>>> assigning chromosome lengths? The online vignette does not mention if
>>>>> there is a switch to the validity check.
>>>>>
>>>>> Thank you,
>>>>>
>>>>> Ivan
>>>>>
>>>>>
>>>>>
>>>>> Ivan Gregoretti, PhD
>>>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>>>> National Institutes of Health
>>>>> 5 Memorial Dr, Building 5, Room 205.
>>>>> Bethesda, MD 20892. USA.
>>>>> Phone: 1-301-496-1592
>>>>> Fax: 1-301-496-9878
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Apr 26, 2010 at 3:42 PM, Patrick Aboyoun<paboyoun at fhcrc.org>
>>>>>  wrote:
>>>>>
>>>>>
>>>>>>
>>>>>> Ivan,
>>>>>> As you probably already realized, the first error you encountered was
>>>>>> do
>>>>>> to
>>>>>> a misuse of the seqlengths function since function objects (e.g.
>>>>>> BSgenome)
>>>>>> have no sequence lengths.
>>>>>>
>>>>>> # Here is the problem
>>>>>> seqlengths(Z)<- seqlengths(BSgenome)[names(seqlengths(Z))]
>>>>>> Error in function (classes, fdef, mtable) :
>>>>>> unable to find an inherited method for function "seqlengths", for
>>>>>> signature "function"
>>>>>>
>>>>>> The second error is the result of a failed validity check on the
>>>>>> modified
>>>>>> object. All ranges stored in a GRanges object must be between 1 and
>>>>>> seqlengths(object) if the seqlengths information is non-NAs.
>>>>>>
>>>>>> # Here is a second attempt that also fails
>>>>>> seqlengths(Z)<- seqlengths(Mmusculus)[names(seqlengths(Z))]
>>>>>> Error in validObject(.Object) :
>>>>>> invalid class "GRanges" object: slot 'ranges' contains values
>>>>>> outside of sequence bounds
>>>>>>
>>>>>> Compare the results of range(Z), which returns the min start and max
>>>>>> end
>>>>>> for
>>>>>> each of the seqnames in Z, and compare it with
>>>>>> seqlengths(Mmusculus)[names(seqlengths(Z))]. This should provide you
>>>>>> with
>>>>>> some insight as to which ranges are out of bounds. Perhaps your
>>>>>> intervals
>>>>>> are 0-based instead of 1-based?
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> Patrick
>>>>>>
>>>>>>
>>>>>> On 4/26/10 11:08 AM, Ivan Gregoretti wrote:
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Hello listers,
>>>>>>>
>>>>>>> Is anybody having trouble assigning seqlengths() to a GRanges instance
>>>>>>> with the new GenomicRanges version?
>>>>>>>
>>>>>>>
>>>>>>> This morning I upgraded my GenomicRanges from 0.0.9 to 0.1.17 and
>>>>>>> since then I am unable to assign chromosome lengths to any of my tag
>>>>>>> sets from my Illumina 36 nucleotide sequences.
>>>>>>>
>>>>>>> On Friday this worked. Let me show you how it complains now:
>>>>>>>
>>>>>>> Z<- import('millionsoftags.bed.gz', 'bed')
>>>>>>>
>>>>>>> Z<- as(Z, 'GRanges')
>>>>>>>
>>>>>>> class(Z)
>>>>>>> [1] "GRanges"
>>>>>>> attr(,"package")
>>>>>>> [1] "GenomicRanges"
>>>>>>>
>>>>>>> Z
>>>>>>> GRanges with 23293177 ranges and 2 elementMetadata values
>>>>>>>               seqnames               ranges strand   |
>>>>>>>                  <Rle>                 <IRanges>       <Rle>        |
>>>>>>>        [1]        chr1   [3000506, 3000541]      +   |
>>>>>>>        [2]        chr1   [3001061, 3001096]      -   |
>>>>>>>        [3]        chr1   [3001075, 3001110]      -   |
>>>>>>>        [4]        chr1   [3001098, 3001133]      +   |
>>>>>>>        [5]        chr1   [3001310, 3001345]      +   |
>>>>>>>        [6]        chr1   [3001559, 3001594]      +   |
>>>>>>>        [7]        chr1   [3001603, 3001638]      +   |
>>>>>>>        [8]        chr1   [3001603, 3001638]      +   |
>>>>>>>        [9]        chr1   [3001609, 3001644]      -   |
>>>>>>>        ...         ...                  ...    ... ...
>>>>>>> [23293169] chrY_random [58402685, 58402720]      +   |
>>>>>>> [23293170] chrY_random [58403358, 58403393]      +   |
>>>>>>> [23293171] chrY_random [58406154, 58406189]      +   |
>>>>>>> [23293172] chrY_random [58411077, 58411112]      -   |
>>>>>>> [23293173] chrY_random [58430677, 58430712]      +   |
>>>>>>> [23293174] chrY_random [58435117, 58435152]      -   |
>>>>>>> [23293175] chrY_random [58472079, 58472114]      +   |
>>>>>>> [23293176] chrY_random [58483725, 58483760]      -   |
>>>>>>> [23293177] chrY_random [58487952, 58487987]      -   |
>>>>>>>                                   name     score
>>>>>>>                            <character>      <numeric>
>>>>>>>        [1]  HWI-EAS179_1:7:39:506:1302        96
>>>>>>>        [2]  HWI-EAS179_1:2:69:562:1539       119
>>>>>>>        [3]  HWI-EAS179_1:8:28:1327:394       119
>>>>>>>        [4]   HWI-EAS179_1:7:96:619:454       119
>>>>>>>        [5] HWI-EAS179_49:3:4:1219:1729       119
>>>>>>>        [6]  HWI-EAS179_49:3:88:949:558       118
>>>>>>>        [7] HWI-EAS179_1:7:60:1151:1790       119
>>>>>>>        [8]  HWI-EAS179_1:7:61:1586:147       114
>>>>>>>        [9]   HWI-EAS179_1:7:55:813:365       106
>>>>>>>        ...                         ...       ...
>>>>>>> [23293169] HWI-EAS179_1:7:49:1416:1573        17
>>>>>>> [23293170]  HWI-EAS179_1:8:25:405:1723        59
>>>>>>> [23293171] HWI-EAS179_1:7:75:1366:1224        25
>>>>>>> [23293172]    HWI-EAS179_1:2:5:1338:80         5
>>>>>>> [23293173]  HWI-EAS179_49:3:13:151:166        83
>>>>>>> [23293174] HWI-EAS179_49:3:29:1091:472         6
>>>>>>> [23293175]  HWI-EAS179_1:2:69:1424:733        17
>>>>>>> [23293176]  HWI-EAS179_1:7:16:945:1051        25
>>>>>>> [23293177] HWI-EAS179_1:7:74:1117:1801        14
>>>>>>>
>>>>>>> seqlengths
>>>>>>>          chr1        chr10        chr11 ...         chrY  chrY_random
>>>>>>>            NA           NA           NA ...           NA           NA
>>>>>>>
>>>>>>> # Here is the problem
>>>>>>> seqlengths(Z)<- seqlengths(BSgenome)[names(seqlengths(Z))]
>>>>>>> Error in function (classes, fdef, mtable)  :
>>>>>>>   unable to find an inherited method for function "seqlengths", for
>>>>>>> signature "function"
>>>>>>>
>>>>>>> # Here is a second attempt that also fails
>>>>>>> seqlengths(Z)<- seqlengths(Mmusculus)[names(seqlengths(Z))]
>>>>>>> Error in validObject(.Object) :
>>>>>>>   invalid class "GRanges" object: slot 'ranges' contains values
>>>>>>> outside of sequence bounds
>>>>>>>
>>>>>>>
>>>>>>> As you can see, I haven't had the chance to mess the data.
>>>>>>> Any idea how to circumvent this problem?
>>>>>>>
>>>>>>> Thank you,
>>>>>>>
>>>>>>> Ivan
>>>>>>>
>>>>>>>
>>>>>>> sessionInfo()
>>>>>>> R version 2.12.0 Under development (unstable) (2010-03-25 r51410)
>>>>>>> x86_64-unknown-linux-gnu
>>>>>>>
>>>>>>> locale:
>>>>>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>>>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>>
>>>>>>> attached base packages:
>>>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>>
>>>>>>> other attached packages:
>>>>>>> [1] BSgenome.Mmusculus.UCSC.mm9_1.3.16 BSgenome_1.15.21
>>>>>>> [3] Biostrings_2.15.27                 GenomicRanges_0.1.17
>>>>>>> [5] IRanges_1.5.79                     rtracklayer_1.7.12
>>>>>>> [7] RCurl_1.4-1                        bitops_1.0-4.1
>>>>>>>
>>>>>>> loaded via a namespace (and not attached):
>>>>>>> [1] Biobase_2.7.6 tools_2.12.0  XML_2.8-1
>>>>>>>
>>>>>>>
>>>>>>> Ivan Gregoretti, PhD
>>>>>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>>>>>> National Institutes of Health
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioc-sig-sequencing mailing list
>>>>>>> Bioc-sig-sequencing at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>