[Bioc-sig-seq] GRanges, failure assigning chromosome lengths
Sean Davis
seandavi at gmail.com
Tue Apr 27 16:39:11 CEST 2010
On Tue, Apr 27, 2010 at 10:33 AM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
> Patrick,
>
> In my opinion, there is no perfect solution, only compromise solution.
>
> There is a group of java-based programs called Picard. It is a package
> designed to work with SAM/BAM tool. It is good and well established.
> In my opinion, it is 'the' sam/bam toolbox.
>
> In that package, the problem of DNA circularity is also present. They
> way Picard deals with this problem is by providing an option to
> override the over-the-bound check. As long as the user knows that some
> DNAs are circular, everything is OK.
>
> The BAM header contains the chromosome lengths and yet it does hold
> reads that lie beyond the sequence boundaries.
Actually, the edge case is general as alignments, even on linear
chromosomes, may extend beyond the end of the chromosome, I believe.
In the best case, these alignments are clipped (in CIGAR terms), but I
don't know that all aligners are doing that appropriately.
Sean
> I vote for an overriding option rather than an infrastructure overhaul.
>
> Thank you,
>
> Ivan
>
>
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
>
>
>
> On Mon, Apr 26, 2010 at 6:51 PM, Patrick Aboyoun <paboyoun at fhcrc.org> wrote:
>> Ivan,
>> The chr4 and chr4_random type issue is easily solved by adding the length of
>> chr?_random to chr? for each chromosome. Perhaps we can even add an argument
>> to seqlengths,BSgenome-method to perform that collapsing.
>>
>> The case of circular DNA (e.g. mitochondrial) is a bit trickier since our
>> infrastructure assumes linear DNA. To address that case we would first have
>> to create a linearized DNA coordinate system and then remap all those
>> alignments that fall into the extended region. That is unless people really
>> want to work in circular coordinates, which would involve numerous changes
>> to the infrastructure.
>>
>> I am open to any suggestions.
>>
>>
>> Patrick
>>
>>
>>
>> On 4/26/10 2:17 PM, Ivan Gregoretti wrote:
>>>
>>> Hi Patrick.
>>>
>>> Sure. Here is the output. To the best of my understanding, most
>>> aligners assume that, for instance, 'ch4_random' is just a linear
>>> continuation of 'chr4'.
>>>
>>> Thank you,
>>>
>>> Ivan
>>>
>>>
>>>
>>>>
>>>> range(A)
>>>>
>>>
>>> GRanges with 67 ranges and 0 elementMetadata values
>>> seqnames ranges strand |
>>> <Rle> <IRanges> <Rle> |
>>> [1] chr1 [3000311, 197195093] + |
>>> [2] chr1 [2999944, 197195305] - |
>>> [3] chr10 [3000180, 129992771] + |
>>> [4] chr10 [3000092, 129993250] - |
>>> [5] chr11 [3000006, 121843794] + |
>>> [6] chr11 [3000010, 121843708] - |
>>> [7] chr12 [3000095, 121257390] + |
>>> [8] chr12 [3000144, 121257410] - |
>>> [9] chr13 [3000434, 120284396] + |
>>> ... ... ... ... ...
>>> [59] chrUn_random [ -21, 5900319] - |
>>> [60] chrX [3000042, 166627000] + |
>>> [61] chrX [2999944, 166650266] - |
>>> [62] chrX_random [ 326, 1051594] + |
>>> [63] chrX_random [ 229, 1309438] - |
>>> [64] chrY [ 150, 2902454] + |
>>> [65] chrY [ -2, 2902181] - |
>>> [66] chrY_random [ 7539, 58497416] + |
>>> [67] chrY_random [ 1247, 58489658] - |
>>>
>>> seqlengths
>>> chr1 chr10 chr11 ... chrY chrY_random
>>> NA NA NA ... NA NA
>>>
>>>
>>>
>>>>
>>>> range(A[seqnames(A)=='chrM'])
>>>>
>>>
>>> GRanges with 2 ranges and 0 elementMetadata values
>>> seqnames ranges strand |
>>> <Rle> <IRanges> <Rle> |
>>> [1] chrM [ 4, 16387] + |
>>> [2] chrM [-82, 16298] - |
>>>
>>>
>>>
>>>>
>>>> range(B)
>>>>
>>>
>>> GRanges with 65 ranges and 0 elementMetadata values
>>> seqnames ranges strand |
>>> <Rle> <IRanges> <Rle> |
>>> [1] chr1 [3000506, 197194916] + |
>>> [2] chr1 [3000977, 197194382] - |
>>> [3] chr10 [3000626, 129989507] + |
>>> [4] chr10 [3000643, 129988346] - |
>>> [5] chr11 [3000057, 121843783] + |
>>> [6] chr11 [3000010, 121843699] - |
>>> [7] chr12 [3001370, 121256864] + |
>>> [8] chr12 [3000135, 121257132] - |
>>> [9] chr13 [3000103, 120284379] + |
>>> ... ... ... ... ...
>>> [57] chrUn_random [ 125, 5900272] - |
>>> [58] chrX [3001345, 166614519] + |
>>> [59] chrX [3000031, 166628394] - |
>>> [60] chrX_random [ 1177, 1051588] + |
>>> [61] chrX_random [ 254, 1051506] - |
>>> [62] chrY [ 141, 2902458] + |
>>> [63] chrY [ 744, 2901128] - |
>>> [64] chrY_random [ 39555, 58472198] + |
>>> [65] chrY_random [ 71505, 58487987] - |
>>>
>>> seqlengths
>>> chr1 chr10 chr11 ... chrY chrY_random
>>> NA NA NA ... NA NA
>>>
>>>
>>>
>>>>
>>>> range(B[seqnames(B)=='chrM'])
>>>>
>>>
>>> GRanges with 2 ranges and 0 elementMetadata values
>>> seqnames ranges strand |
>>> <Rle> <IRanges> <Rle> |
>>> [1] chrM [ 5, 16385] + |
>>> [2] chrM [-77, 16299] - |
>>>
>>>
>>>
>>>>
>>>> range(C)
>>>>
>>>
>>> GRanges with 67 ranges and 0 elementMetadata values
>>> seqnames ranges strand |
>>> <Rle> <IRanges> <Rle> |
>>> [1] chr1 [3000125, 197195029] + |
>>> [2] chr1 [2999981, 197194997] - |
>>> [3] chr10 [3000098, 129993293] + |
>>> [4] chr10 [3000105, 129993219] - |
>>> [5] chr11 [3000007, 121843789] + |
>>> [6] chr11 [3000010, 121843707] - |
>>> [7] chr12 [3000113, 121257571] + |
>>> [8] chr12 [3000011, 121257521] - |
>>> [9] chr13 [3000163, 120284384] + |
>>> ... ... ... ... ...
>>> [59] chrUn_random [ -75, 5900330] - |
>>> [60] chrX [3000056, 166650160] + |
>>> [61] chrX [2999951, 166650076] - |
>>> [62] chrX_random [ 327, 1200747] + |
>>> [63] chrX_random [ 236, 1508524] - |
>>> [64] chrY [ 38, 2902563] + |
>>> [65] chrY [ -80, 2902221] - |
>>> [66] chrY_random [ 5699, 58492209] + |
>>> [67] chrY_random [ 7154, 58486839] - |
>>>
>>> seqlengths
>>> chr1 chr10 chr11 ... chrY chrY_random
>>> NA NA NA ... NA NA
>>>
>>>
>>>
>>>>
>>>> range(C[seqnames(C)=='chrM'])
>>>>
>>>
>>> GRanges with 2 ranges and 0 elementMetadata values
>>> seqnames ranges strand |
>>> <Rle> <IRanges> <Rle> |
>>> [1] chrM [ 1, 16387] + |
>>> [2] chrM [-81, 16332] - |
>>>
>>>
>>> library(BSgenome.Mmusculus.UCSC.mm9)
>>>
>>>>
>>>> length(Mmusculus$chrM)
>>>>
>>>
>>> [1] 16299
>>>
>>>
>>>
>>> Ivan Gregoretti, PhD
>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>> National Institutes of Health
>>>
>>>
>>>
>>> On Mon, Apr 26, 2010 at 4:35 PM, Patrick Aboyoun<paboyoun at fhcrc.org>
>>> wrote:
>>>
>>>>
>>>> Ivan,
>>>> Could you provide me with the results of
>>>>
>>>> range(Z)
>>>>
>>>> for all three of the GRanges objects that seqlengths<- throws and error
>>>> on?
>>>> I would like to see if there is some adjustment we can make to the
>>>> seqlengths<- function that will resolve your issue.
>>>>
>>>> Thanks,
>>>> Patrick
>>>>
>>>>
>>>> On 4/26/10 12:57 PM, Ivan Gregoretti wrote:
>>>>
>>>>>
>>>>> Hi Patrick,
>>>>>
>>>>> You are correct. The validity check is rejecting the input. The
>>>>> problem with that is that I have three set, all of them failing perhap
>>>>> because one or two tags are out of bounds.
>>>>>
>>>>> This tags are coming straight from the Illumina sequencer. There is
>>>>> no manipulation. Perhaps it is complaining because the validity check
>>>>> does not know that Mitochondrial DNA is circular.
>>>>>
>>>>> Bottom line, anybody attempting to load the data as GRanges from a
>>>>> high coverage tag set will find this seqlengths() rejection.
>>>>>
>>>>> Is there any way to override it or should I just refrain from
>>>>> assigning chromosome lengths? The online vignette does not mention if
>>>>> there is a switch to the validity check.
>>>>>
>>>>> Thank you,
>>>>>
>>>>> Ivan
>>>>>
>>>>>
>>>>>
>>>>> Ivan Gregoretti, PhD
>>>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>>>> National Institutes of Health
>>>>> 5 Memorial Dr, Building 5, Room 205.
>>>>> Bethesda, MD 20892. USA.
>>>>> Phone: 1-301-496-1592
>>>>> Fax: 1-301-496-9878
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Apr 26, 2010 at 3:42 PM, Patrick Aboyoun<paboyoun at fhcrc.org>
>>>>> wrote:
>>>>>
>>>>>
>>>>>>
>>>>>> Ivan,
>>>>>> As you probably already realized, the first error you encountered was
>>>>>> do
>>>>>> to
>>>>>> a misuse of the seqlengths function since function objects (e.g.
>>>>>> BSgenome)
>>>>>> have no sequence lengths.
>>>>>>
>>>>>> # Here is the problem
>>>>>> seqlengths(Z)<- seqlengths(BSgenome)[names(seqlengths(Z))]
>>>>>> Error in function (classes, fdef, mtable) :
>>>>>> unable to find an inherited method for function "seqlengths", for
>>>>>> signature "function"
>>>>>>
>>>>>> The second error is the result of a failed validity check on the
>>>>>> modified
>>>>>> object. All ranges stored in a GRanges object must be between 1 and
>>>>>> seqlengths(object) if the seqlengths information is non-NAs.
>>>>>>
>>>>>> # Here is a second attempt that also fails
>>>>>> seqlengths(Z)<- seqlengths(Mmusculus)[names(seqlengths(Z))]
>>>>>> Error in validObject(.Object) :
>>>>>> invalid class "GRanges" object: slot 'ranges' contains values
>>>>>> outside of sequence bounds
>>>>>>
>>>>>> Compare the results of range(Z), which returns the min start and max
>>>>>> end
>>>>>> for
>>>>>> each of the seqnames in Z, and compare it with
>>>>>> seqlengths(Mmusculus)[names(seqlengths(Z))]. This should provide you
>>>>>> with
>>>>>> some insight as to which ranges are out of bounds. Perhaps your
>>>>>> intervals
>>>>>> are 0-based instead of 1-based?
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> Patrick
>>>>>>
>>>>>>
>>>>>> On 4/26/10 11:08 AM, Ivan Gregoretti wrote:
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Hello listers,
>>>>>>>
>>>>>>> Is anybody having trouble assigning seqlengths() to a GRanges instance
>>>>>>> with the new GenomicRanges version?
>>>>>>>
>>>>>>>
>>>>>>> This morning I upgraded my GenomicRanges from 0.0.9 to 0.1.17 and
>>>>>>> since then I am unable to assign chromosome lengths to any of my tag
>>>>>>> sets from my Illumina 36 nucleotide sequences.
>>>>>>>
>>>>>>> On Friday this worked. Let me show you how it complains now:
>>>>>>>
>>>>>>> Z<- import('millionsoftags.bed.gz', 'bed')
>>>>>>>
>>>>>>> Z<- as(Z, 'GRanges')
>>>>>>>
>>>>>>> class(Z)
>>>>>>> [1] "GRanges"
>>>>>>> attr(,"package")
>>>>>>> [1] "GenomicRanges"
>>>>>>>
>>>>>>> Z
>>>>>>> GRanges with 23293177 ranges and 2 elementMetadata values
>>>>>>> seqnames ranges strand |
>>>>>>> <Rle> <IRanges> <Rle> |
>>>>>>> [1] chr1 [3000506, 3000541] + |
>>>>>>> [2] chr1 [3001061, 3001096] - |
>>>>>>> [3] chr1 [3001075, 3001110] - |
>>>>>>> [4] chr1 [3001098, 3001133] + |
>>>>>>> [5] chr1 [3001310, 3001345] + |
>>>>>>> [6] chr1 [3001559, 3001594] + |
>>>>>>> [7] chr1 [3001603, 3001638] + |
>>>>>>> [8] chr1 [3001603, 3001638] + |
>>>>>>> [9] chr1 [3001609, 3001644] - |
>>>>>>> ... ... ... ... ...
>>>>>>> [23293169] chrY_random [58402685, 58402720] + |
>>>>>>> [23293170] chrY_random [58403358, 58403393] + |
>>>>>>> [23293171] chrY_random [58406154, 58406189] + |
>>>>>>> [23293172] chrY_random [58411077, 58411112] - |
>>>>>>> [23293173] chrY_random [58430677, 58430712] + |
>>>>>>> [23293174] chrY_random [58435117, 58435152] - |
>>>>>>> [23293175] chrY_random [58472079, 58472114] + |
>>>>>>> [23293176] chrY_random [58483725, 58483760] - |
>>>>>>> [23293177] chrY_random [58487952, 58487987] - |
>>>>>>> name score
>>>>>>> <character> <numeric>
>>>>>>> [1] HWI-EAS179_1:7:39:506:1302 96
>>>>>>> [2] HWI-EAS179_1:2:69:562:1539 119
>>>>>>> [3] HWI-EAS179_1:8:28:1327:394 119
>>>>>>> [4] HWI-EAS179_1:7:96:619:454 119
>>>>>>> [5] HWI-EAS179_49:3:4:1219:1729 119
>>>>>>> [6] HWI-EAS179_49:3:88:949:558 118
>>>>>>> [7] HWI-EAS179_1:7:60:1151:1790 119
>>>>>>> [8] HWI-EAS179_1:7:61:1586:147 114
>>>>>>> [9] HWI-EAS179_1:7:55:813:365 106
>>>>>>> ... ... ...
>>>>>>> [23293169] HWI-EAS179_1:7:49:1416:1573 17
>>>>>>> [23293170] HWI-EAS179_1:8:25:405:1723 59
>>>>>>> [23293171] HWI-EAS179_1:7:75:1366:1224 25
>>>>>>> [23293172] HWI-EAS179_1:2:5:1338:80 5
>>>>>>> [23293173] HWI-EAS179_49:3:13:151:166 83
>>>>>>> [23293174] HWI-EAS179_49:3:29:1091:472 6
>>>>>>> [23293175] HWI-EAS179_1:2:69:1424:733 17
>>>>>>> [23293176] HWI-EAS179_1:7:16:945:1051 25
>>>>>>> [23293177] HWI-EAS179_1:7:74:1117:1801 14
>>>>>>>
>>>>>>> seqlengths
>>>>>>> chr1 chr10 chr11 ... chrY chrY_random
>>>>>>> NA NA NA ... NA NA
>>>>>>>
>>>>>>> # Here is the problem
>>>>>>> seqlengths(Z)<- seqlengths(BSgenome)[names(seqlengths(Z))]
>>>>>>> Error in function (classes, fdef, mtable) :
>>>>>>> unable to find an inherited method for function "seqlengths", for
>>>>>>> signature "function"
>>>>>>>
>>>>>>> # Here is a second attempt that also fails
>>>>>>> seqlengths(Z)<- seqlengths(Mmusculus)[names(seqlengths(Z))]
>>>>>>> Error in validObject(.Object) :
>>>>>>> invalid class "GRanges" object: slot 'ranges' contains values
>>>>>>> outside of sequence bounds
>>>>>>>
>>>>>>>
>>>>>>> As you can see, I haven't had the chance to mess the data.
>>>>>>> Any idea how to circumvent this problem?
>>>>>>>
>>>>>>> Thank you,
>>>>>>>
>>>>>>> Ivan
>>>>>>>
>>>>>>>
>>>>>>> sessionInfo()
>>>>>>> R version 2.12.0 Under development (unstable) (2010-03-25 r51410)
>>>>>>> x86_64-unknown-linux-gnu
>>>>>>>
>>>>>>> locale:
>>>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>>>>>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
>>>>>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>>
>>>>>>> attached base packages:
>>>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>>>
>>>>>>> other attached packages:
>>>>>>> [1] BSgenome.Mmusculus.UCSC.mm9_1.3.16 BSgenome_1.15.21
>>>>>>> [3] Biostrings_2.15.27 GenomicRanges_0.1.17
>>>>>>> [5] IRanges_1.5.79 rtracklayer_1.7.12
>>>>>>> [7] RCurl_1.4-1 bitops_1.0-4.1
>>>>>>>
>>>>>>> loaded via a namespace (and not attached):
>>>>>>> [1] Biobase_2.7.6 tools_2.12.0 XML_2.8-1
>>>>>>>
>>>>>>>
>>>>>>> Ivan Gregoretti, PhD
>>>>>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>>>>>> National Institutes of Health
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioc-sig-sequencing mailing list
>>>>>>> Bioc-sig-sequencing at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
More information about the Bioc-sig-sequencing
mailing list