[Bioc-sig-seq] trimLRPatterns; matching pattern from pos -1 gives solveUserSEW error
Harris A. Jaffee
hj at jhu.edu
Wed Feb 16 21:22:46 CET 2011
This will be a whole lot of detail in lieu of a final fix (proposal
below).
I don't believe the authors, certainly not I, envisioned your
situation, which I think I can
abstract correctly here:
library(Biostrings)
trimLRPatterns(Rpattern="CA", subject="A", max.Rmismatch=1) # same
as max.Rmismatch=c(-1,1)
To have loaded ShortRead was irrelevant and perhaps confusing.
Now, what should happen with the call above, from your point of view?
The first test is for the whole Rpattern "ending at the right end of
the subject", thus extending
off the left end of the subject by 1 letter as you say. For a match,
we need to allow for at least
1 error, and that is sufficient because the rest of the Rpattern is
the same as the subject. Note
that this 1 goes in the *last* element of the max.Rmismatch vector.
So, this test should succeed,
the entire subject should get trimmed, and your result should be "".
To gain insight, you have
showMethods("trimLRPatterns", includeDefs=TRUE)
That might lead you to make your trimLRPatterns call under
debug(Biostrings:::.XStringSet.trimLRPatterns)
The abort
Error in solveUserSEW(width(x), start = start, end = end, width =
width) :
solving row 1: 'allow.nonnarrowing' is FALSE and the supplied
start (0) is < 1
clearly occurs in the call
narrow(subject, start=0, end=-1)
at the end of .XStringSet.trimLRPatterns, but who is at fault?
Well, Biostrings:::.computeTrimEnd() returned the -1, violating this
comment:
### 'subject' must be an XStringSet object of length != 0.
### Returns an integer vector of the same length as 'subject' where
the i-th
### value is guaranteed to be >= 0 and <= width(subject)[i].
I think -1 is correct, and the comment is false [see my claim about
the authors].
Then, just before the call to narrow(), we have this
in .XStringSet.trimLRPatterns:
## For those invalid ranges where 'start > end + 1L', we
arbitrarily
## decide to set the 'start' to 'end + 1' (another reasonable
choice
## would have been to set the 'end' to 'start - 1').
With end=-1, start=1 (from .computeTrimStart, correctly) gets changed
to end+1 = 0.
In this situation, the other choice would have fixed your problem. I
don't think it
would create any other problems.
-Harris
On Feb 16, 2011, at 8:12 AM, Ludo Pagie wrote:
> Hi all,
>
> I'm trimming reads using an Rpattern which in some cases extends on
> the left side. If the pattern starts at exactly position -1
> trimLRPattern throws an error. If the pattern starts further
> 'upstream' it returns a zero length sequence, as expected:
>
> library(ShortRead)
> # create a pattern
> fragment <- paste(sample(c('A','C','G','T'),
> 10,replace=TRUE),collapse='')
> # create some reads based on the pattern; for different reads the
> # pattern extends either on the left, the right, or both sides
> reads <- substring(fragment,1:4,7:10)
>
> # trim all reads; all reads should match the pattern fully and be
> # trimmed from start to end
> trimLRPatterns(Rpattern=fragment, subject=DNAStringSet(reads),
> max.Rmismatch=0.5, ranges=TRUE, Rfixed=FALSE)
>
> # IRanges of length 4
> # start end width
> # [1] 1 0 0
> # [2] 0 -1 0
> # [3] -1 -2 0
> # [4] -2 -3 0
>
> # if I want the reads to be trimmed right away an error is thrown for
> # the second read
> trimLRPatterns(Rpattern=fragment, subject=DNAStringSet(reads[c
> (1,3,4)]), max.Rmismatch=0.5, ranges=FALSE)
> # A DNAStringSet instance of length 3
> # width seq
> # [1] 0
> # [2] 0
> # [3] 0
>
> trimLRPatterns(Rpattern=fragment, subject=DNAStringSet(reads[2]),
> max.Rmismatch=0.5, ranges=FALSE)
>
> # Error in solveUserSEW(width(x), start = start, end = end, width =
> # width) :
> # solving row 1: 'allow.nonnarrowing' is FALSE and the supplied
> start
> # (0) is < 1
>
>
> > sessionInfo()
> R version 2.12.0 (2010-10-15)
> Platform: i686-pc-linux-gnu (32-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] ShortRead_1.8.2 Rsamtools_1.2.1 lattice_0.19-13
> [4] Biostrings_2.18.2 GenomicRanges_1.2.2 IRanges_1.8.7
> [7] multtest_2.6.0 Biobase_2.10.0
>
> loaded via a namespace (and not attached):
> [1] grid_2.12.0 hwriter_1.3 MASS_7.3-9 splines_2.12.0
> [5] survival_2.36-2 tools_2.12.0
>
>
>
> Where is this error coming from?
>
> Ludo
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
More information about the Bioc-sig-sequencing
mailing list