[Bioc-sig-seq] GRanges, failure assigning chromosome lengths
Ivan Gregoretti
ivangreg at gmail.com
Tue Apr 27 16:33:15 CEST 2010
Patrick,
In my opinion, there is no perfect solution, only compromise solution.
There is a group of java-based programs called Picard. It is a package
designed to work with SAM/BAM tool. It is good and well established.
In my opinion, it is 'the' sam/bam toolbox.
In that package, the problem of DNA circularity is also present. They
way Picard deals with this problem is by providing an option to
override the over-the-bound check. As long as the user knows that some
DNAs are circular, everything is OK.
The BAM header contains the chromosome lengths and yet it does hold
reads that lie beyond the sequence boundaries.
I vote for an overriding option rather than an infrastructure overhaul.
Thank you,
Ivan
Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
On Mon, Apr 26, 2010 at 6:51 PM, Patrick Aboyoun <paboyoun at fhcrc.org> wrote:
> Ivan,
> The chr4 and chr4_random type issue is easily solved by adding the length of
> chr?_random to chr? for each chromosome. Perhaps we can even add an argument
> to seqlengths,BSgenome-method to perform that collapsing.
>
> The case of circular DNA (e.g. mitochondrial) is a bit trickier since our
> infrastructure assumes linear DNA. To address that case we would first have
> to create a linearized DNA coordinate system and then remap all those
> alignments that fall into the extended region. That is unless people really
> want to work in circular coordinates, which would involve numerous changes
> to the infrastructure.
>
> I am open to any suggestions.
>
>
> Patrick
>
>
>
> On 4/26/10 2:17 PM, Ivan Gregoretti wrote:
>>
>> Hi Patrick.
>>
>> Sure. Here is the output. To the best of my understanding, most
>> aligners assume that, for instance, 'ch4_random' is just a linear
>> continuation of 'chr4'.
>>
>> Thank you,
>>
>> Ivan
>>
>>
>>
>>>
>>> range(A)
>>>
>>
>> GRanges with 67 ranges and 0 elementMetadata values
>> seqnames ranges strand |
>> <Rle> <IRanges> <Rle> |
>> [1] chr1 [3000311, 197195093] + |
>> [2] chr1 [2999944, 197195305] - |
>> [3] chr10 [3000180, 129992771] + |
>> [4] chr10 [3000092, 129993250] - |
>> [5] chr11 [3000006, 121843794] + |
>> [6] chr11 [3000010, 121843708] - |
>> [7] chr12 [3000095, 121257390] + |
>> [8] chr12 [3000144, 121257410] - |
>> [9] chr13 [3000434, 120284396] + |
>> ... ... ... ... ...
>> [59] chrUn_random [ -21, 5900319] - |
>> [60] chrX [3000042, 166627000] + |
>> [61] chrX [2999944, 166650266] - |
>> [62] chrX_random [ 326, 1051594] + |
>> [63] chrX_random [ 229, 1309438] - |
>> [64] chrY [ 150, 2902454] + |
>> [65] chrY [ -2, 2902181] - |
>> [66] chrY_random [ 7539, 58497416] + |
>> [67] chrY_random [ 1247, 58489658] - |
>>
>> seqlengths
>> chr1 chr10 chr11 ... chrY chrY_random
>> NA NA NA ... NA NA
>>
>>
>>
>>>
>>> range(A[seqnames(A)=='chrM'])
>>>
>>
>> GRanges with 2 ranges and 0 elementMetadata values
>> seqnames ranges strand |
>> <Rle> <IRanges> <Rle> |
>> [1] chrM [ 4, 16387] + |
>> [2] chrM [-82, 16298] - |
>>
>>
>>
>>>
>>> range(B)
>>>
>>
>> GRanges with 65 ranges and 0 elementMetadata values
>> seqnames ranges strand |
>> <Rle> <IRanges> <Rle> |
>> [1] chr1 [3000506, 197194916] + |
>> [2] chr1 [3000977, 197194382] - |
>> [3] chr10 [3000626, 129989507] + |
>> [4] chr10 [3000643, 129988346] - |
>> [5] chr11 [3000057, 121843783] + |
>> [6] chr11 [3000010, 121843699] - |
>> [7] chr12 [3001370, 121256864] + |
>> [8] chr12 [3000135, 121257132] - |
>> [9] chr13 [3000103, 120284379] + |
>> ... ... ... ... ...
>> [57] chrUn_random [ 125, 5900272] - |
>> [58] chrX [3001345, 166614519] + |
>> [59] chrX [3000031, 166628394] - |
>> [60] chrX_random [ 1177, 1051588] + |
>> [61] chrX_random [ 254, 1051506] - |
>> [62] chrY [ 141, 2902458] + |
>> [63] chrY [ 744, 2901128] - |
>> [64] chrY_random [ 39555, 58472198] + |
>> [65] chrY_random [ 71505, 58487987] - |
>>
>> seqlengths
>> chr1 chr10 chr11 ... chrY chrY_random
>> NA NA NA ... NA NA
>>
>>
>>
>>>
>>> range(B[seqnames(B)=='chrM'])
>>>
>>
>> GRanges with 2 ranges and 0 elementMetadata values
>> seqnames ranges strand |
>> <Rle> <IRanges> <Rle> |
>> [1] chrM [ 5, 16385] + |
>> [2] chrM [-77, 16299] - |
>>
>>
>>
>>>
>>> range(C)
>>>
>>
>> GRanges with 67 ranges and 0 elementMetadata values
>> seqnames ranges strand |
>> <Rle> <IRanges> <Rle> |
>> [1] chr1 [3000125, 197195029] + |
>> [2] chr1 [2999981, 197194997] - |
>> [3] chr10 [3000098, 129993293] + |
>> [4] chr10 [3000105, 129993219] - |
>> [5] chr11 [3000007, 121843789] + |
>> [6] chr11 [3000010, 121843707] - |
>> [7] chr12 [3000113, 121257571] + |
>> [8] chr12 [3000011, 121257521] - |
>> [9] chr13 [3000163, 120284384] + |
>> ... ... ... ... ...
>> [59] chrUn_random [ -75, 5900330] - |
>> [60] chrX [3000056, 166650160] + |
>> [61] chrX [2999951, 166650076] - |
>> [62] chrX_random [ 327, 1200747] + |
>> [63] chrX_random [ 236, 1508524] - |
>> [64] chrY [ 38, 2902563] + |
>> [65] chrY [ -80, 2902221] - |
>> [66] chrY_random [ 5699, 58492209] + |
>> [67] chrY_random [ 7154, 58486839] - |
>>
>> seqlengths
>> chr1 chr10 chr11 ... chrY chrY_random
>> NA NA NA ... NA NA
>>
>>
>>
>>>
>>> range(C[seqnames(C)=='chrM'])
>>>
>>
>> GRanges with 2 ranges and 0 elementMetadata values
>> seqnames ranges strand |
>> <Rle> <IRanges> <Rle> |
>> [1] chrM [ 1, 16387] + |
>> [2] chrM [-81, 16332] - |
>>
>>
>> library(BSgenome.Mmusculus.UCSC.mm9)
>>
>>>
>>> length(Mmusculus$chrM)
>>>
>>
>> [1] 16299
>>
>>
>>
>> Ivan Gregoretti, PhD
>> National Institute of Diabetes and Digestive and Kidney Diseases
>> National Institutes of Health
>>
>>
>>
>> On Mon, Apr 26, 2010 at 4:35 PM, Patrick Aboyoun<paboyoun at fhcrc.org>
>> wrote:
>>
>>>
>>> Ivan,
>>> Could you provide me with the results of
>>>
>>> range(Z)
>>>
>>> for all three of the GRanges objects that seqlengths<- throws and error
>>> on?
>>> I would like to see if there is some adjustment we can make to the
>>> seqlengths<- function that will resolve your issue.
>>>
>>> Thanks,
>>> Patrick
>>>
>>>
>>> On 4/26/10 12:57 PM, Ivan Gregoretti wrote:
>>>
>>>>
>>>> Hi Patrick,
>>>>
>>>> You are correct. The validity check is rejecting the input. The
>>>> problem with that is that I have three set, all of them failing perhap
>>>> because one or two tags are out of bounds.
>>>>
>>>> This tags are coming straight from the Illumina sequencer. There is
>>>> no manipulation. Perhaps it is complaining because the validity check
>>>> does not know that Mitochondrial DNA is circular.
>>>>
>>>> Bottom line, anybody attempting to load the data as GRanges from a
>>>> high coverage tag set will find this seqlengths() rejection.
>>>>
>>>> Is there any way to override it or should I just refrain from
>>>> assigning chromosome lengths? The online vignette does not mention if
>>>> there is a switch to the validity check.
>>>>
>>>> Thank you,
>>>>
>>>> Ivan
>>>>
>>>>
>>>>
>>>> Ivan Gregoretti, PhD
>>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>>> National Institutes of Health
>>>> 5 Memorial Dr, Building 5, Room 205.
>>>> Bethesda, MD 20892. USA.
>>>> Phone: 1-301-496-1592
>>>> Fax: 1-301-496-9878
>>>>
>>>>
>>>>
>>>> On Mon, Apr 26, 2010 at 3:42 PM, Patrick Aboyoun<paboyoun at fhcrc.org>
>>>> wrote:
>>>>
>>>>
>>>>>
>>>>> Ivan,
>>>>> As you probably already realized, the first error you encountered was
>>>>> do
>>>>> to
>>>>> a misuse of the seqlengths function since function objects (e.g.
>>>>> BSgenome)
>>>>> have no sequence lengths.
>>>>>
>>>>> # Here is the problem
>>>>> seqlengths(Z)<- seqlengths(BSgenome)[names(seqlengths(Z))]
>>>>> Error in function (classes, fdef, mtable) :
>>>>> unable to find an inherited method for function "seqlengths", for
>>>>> signature "function"
>>>>>
>>>>> The second error is the result of a failed validity check on the
>>>>> modified
>>>>> object. All ranges stored in a GRanges object must be between 1 and
>>>>> seqlengths(object) if the seqlengths information is non-NAs.
>>>>>
>>>>> # Here is a second attempt that also fails
>>>>> seqlengths(Z)<- seqlengths(Mmusculus)[names(seqlengths(Z))]
>>>>> Error in validObject(.Object) :
>>>>> invalid class "GRanges" object: slot 'ranges' contains values
>>>>> outside of sequence bounds
>>>>>
>>>>> Compare the results of range(Z), which returns the min start and max
>>>>> end
>>>>> for
>>>>> each of the seqnames in Z, and compare it with
>>>>> seqlengths(Mmusculus)[names(seqlengths(Z))]. This should provide you
>>>>> with
>>>>> some insight as to which ranges are out of bounds. Perhaps your
>>>>> intervals
>>>>> are 0-based instead of 1-based?
>>>>>
>>>>>
>>>>> Cheers,
>>>>> Patrick
>>>>>
>>>>>
>>>>> On 4/26/10 11:08 AM, Ivan Gregoretti wrote:
>>>>>
>>>>>
>>>>>>
>>>>>> Hello listers,
>>>>>>
>>>>>> Is anybody having trouble assigning seqlengths() to a GRanges instance
>>>>>> with the new GenomicRanges version?
>>>>>>
>>>>>>
>>>>>> This morning I upgraded my GenomicRanges from 0.0.9 to 0.1.17 and
>>>>>> since then I am unable to assign chromosome lengths to any of my tag
>>>>>> sets from my Illumina 36 nucleotide sequences.
>>>>>>
>>>>>> On Friday this worked. Let me show you how it complains now:
>>>>>>
>>>>>> Z<- import('millionsoftags.bed.gz', 'bed')
>>>>>>
>>>>>> Z<- as(Z, 'GRanges')
>>>>>>
>>>>>> class(Z)
>>>>>> [1] "GRanges"
>>>>>> attr(,"package")
>>>>>> [1] "GenomicRanges"
>>>>>>
>>>>>> Z
>>>>>> GRanges with 23293177 ranges and 2 elementMetadata values
>>>>>> seqnames ranges strand |
>>>>>> <Rle> <IRanges> <Rle> |
>>>>>> [1] chr1 [3000506, 3000541] + |
>>>>>> [2] chr1 [3001061, 3001096] - |
>>>>>> [3] chr1 [3001075, 3001110] - |
>>>>>> [4] chr1 [3001098, 3001133] + |
>>>>>> [5] chr1 [3001310, 3001345] + |
>>>>>> [6] chr1 [3001559, 3001594] + |
>>>>>> [7] chr1 [3001603, 3001638] + |
>>>>>> [8] chr1 [3001603, 3001638] + |
>>>>>> [9] chr1 [3001609, 3001644] - |
>>>>>> ... ... ... ... ...
>>>>>> [23293169] chrY_random [58402685, 58402720] + |
>>>>>> [23293170] chrY_random [58403358, 58403393] + |
>>>>>> [23293171] chrY_random [58406154, 58406189] + |
>>>>>> [23293172] chrY_random [58411077, 58411112] - |
>>>>>> [23293173] chrY_random [58430677, 58430712] + |
>>>>>> [23293174] chrY_random [58435117, 58435152] - |
>>>>>> [23293175] chrY_random [58472079, 58472114] + |
>>>>>> [23293176] chrY_random [58483725, 58483760] - |
>>>>>> [23293177] chrY_random [58487952, 58487987] - |
>>>>>> name score
>>>>>> <character> <numeric>
>>>>>> [1] HWI-EAS179_1:7:39:506:1302 96
>>>>>> [2] HWI-EAS179_1:2:69:562:1539 119
>>>>>> [3] HWI-EAS179_1:8:28:1327:394 119
>>>>>> [4] HWI-EAS179_1:7:96:619:454 119
>>>>>> [5] HWI-EAS179_49:3:4:1219:1729 119
>>>>>> [6] HWI-EAS179_49:3:88:949:558 118
>>>>>> [7] HWI-EAS179_1:7:60:1151:1790 119
>>>>>> [8] HWI-EAS179_1:7:61:1586:147 114
>>>>>> [9] HWI-EAS179_1:7:55:813:365 106
>>>>>> ... ... ...
>>>>>> [23293169] HWI-EAS179_1:7:49:1416:1573 17
>>>>>> [23293170] HWI-EAS179_1:8:25:405:1723 59
>>>>>> [23293171] HWI-EAS179_1:7:75:1366:1224 25
>>>>>> [23293172] HWI-EAS179_1:2:5:1338:80 5
>>>>>> [23293173] HWI-EAS179_49:3:13:151:166 83
>>>>>> [23293174] HWI-EAS179_49:3:29:1091:472 6
>>>>>> [23293175] HWI-EAS179_1:2:69:1424:733 17
>>>>>> [23293176] HWI-EAS179_1:7:16:945:1051 25
>>>>>> [23293177] HWI-EAS179_1:7:74:1117:1801 14
>>>>>>
>>>>>> seqlengths
>>>>>> chr1 chr10 chr11 ... chrY chrY_random
>>>>>> NA NA NA ... NA NA
>>>>>>
>>>>>> # Here is the problem
>>>>>> seqlengths(Z)<- seqlengths(BSgenome)[names(seqlengths(Z))]
>>>>>> Error in function (classes, fdef, mtable) :
>>>>>> unable to find an inherited method for function "seqlengths", for
>>>>>> signature "function"
>>>>>>
>>>>>> # Here is a second attempt that also fails
>>>>>> seqlengths(Z)<- seqlengths(Mmusculus)[names(seqlengths(Z))]
>>>>>> Error in validObject(.Object) :
>>>>>> invalid class "GRanges" object: slot 'ranges' contains values
>>>>>> outside of sequence bounds
>>>>>>
>>>>>>
>>>>>> As you can see, I haven't had the chance to mess the data.
>>>>>> Any idea how to circumvent this problem?
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>> Ivan
>>>>>>
>>>>>>
>>>>>> sessionInfo()
>>>>>> R version 2.12.0 Under development (unstable) (2010-03-25 r51410)
>>>>>> x86_64-unknown-linux-gnu
>>>>>>
>>>>>> locale:
>>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>>>>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
>>>>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] BSgenome.Mmusculus.UCSC.mm9_1.3.16 BSgenome_1.15.21
>>>>>> [3] Biostrings_2.15.27 GenomicRanges_0.1.17
>>>>>> [5] IRanges_1.5.79 rtracklayer_1.7.12
>>>>>> [7] RCurl_1.4-1 bitops_1.0-4.1
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] Biobase_2.7.6 tools_2.12.0 XML_2.8-1
>>>>>>
>>>>>>
>>>>>> Ivan Gregoretti, PhD
>>>>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>>>>> National Institutes of Health
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-sig-sequencing mailing list
>>>>>> Bioc-sig-sequencing at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>
>>>
>
>
More information about the Bioc-sig-sequencing
mailing list