[Bioc-sig-seq] trimLRPatterns; matching pattern from pos -1 gives solveUserSEW error

Wed Feb 16 21:22:46 CET 2011

This will be a whole lot of detail in lieu of a final fix (proposal  
below).

I don't believe the authors, certainly not I, envisioned your  
situation, which I think I can
abstract correctly here:

	library(Biostrings)
	trimLRPatterns(Rpattern="CA", subject="A", max.Rmismatch=1)	# same  
as max.Rmismatch=c(-1,1)

To have loaded ShortRead was irrelevant and perhaps confusing.

Now, what should happen with the call above, from your point of view?

The first test is for the whole Rpattern "ending at the right end of  
the subject", thus extending
off the left end of the subject by 1 letter as you say.  For a match,  
we need to allow for at least
1 error, and that is sufficient because the rest of the Rpattern is  
the same as the subject.  Note
that this 1 goes in the *last* element of the max.Rmismatch vector.   
So, this test should succeed,
the entire subject should get trimmed, and your result should be "".

To gain insight, you have

	showMethods("trimLRPatterns", includeDefs=TRUE)

That might lead you to make your trimLRPatterns call under

	debug(Biostrings:::.XStringSet.trimLRPatterns)

The abort

   Error in solveUserSEW(width(x), start = start, end = end, width =  
width) :
   solving row 1: 'allow.nonnarrowing' is FALSE and the supplied  
start (0) is < 1

clearly occurs in the call

	narrow(subject, start=0, end=-1)

at the end of .XStringSet.trimLRPatterns, but who is at fault?

Well, Biostrings:::.computeTrimEnd() returned the -1, violating this  
comment:

	### 'subject' must be an XStringSet object of length != 0.
	### Returns an integer vector of the same length as 'subject' where  
the i-th
	### value is guaranteed to be >= 0 and <= width(subject)[i].

I think -1 is correct, and the comment is false [see my claim about  
the authors].

Then, just before the call to narrow(), we have this  
in .XStringSet.trimLRPatterns:

     ## For those invalid ranges where 'start > end + 1L', we  
arbitrarily
     ## decide to set the 'start' to 'end + 1' (another reasonable  
choice
     ## would have been to set the 'end' to 'start - 1').

With end=-1, start=1 (from .computeTrimStart, correctly) gets changed  
to end+1 = 0.

In this situation, the other choice would have fixed your problem.  I  
don't think it
would create any other problems.

-Harris

On Feb 16, 2011, at 8:12 AM, Ludo Pagie wrote:

> Hi all,
>
> I'm trimming reads using an Rpattern which in some cases extends on  
> the left side. If the pattern starts at exactly position -1  
> trimLRPattern throws an error. If the pattern starts further  
> 'upstream' it returns a zero length sequence, as expected:
>
> library(ShortRead)
> # create a pattern
> fragment <- paste(sample(c('A','C','G','T'), 
> 10,replace=TRUE),collapse='')
> # create some reads based on the pattern; for different reads the
> # pattern extends either on the left, the right, or both sides
> reads <- substring(fragment,1:4,7:10)
>
> # trim all reads; all reads should match the pattern fully and be
> # trimmed from start to end
> trimLRPatterns(Rpattern=fragment, subject=DNAStringSet(reads),  
> max.Rmismatch=0.5, ranges=TRUE, Rfixed=FALSE)
>
> # IRanges of length 4
> #     start end width
> # [1]     1   0     0
> # [2]     0  -1     0
> # [3]    -1  -2     0
> # [4]    -2  -3     0
>
> # if I want the reads to be trimmed right away an error is thrown for
> # the second read
> trimLRPatterns(Rpattern=fragment, subject=DNAStringSet(reads[c 
> (1,3,4)]), max.Rmismatch=0.5, ranges=FALSE)
> #   A DNAStringSet instance of length 3
> #     width seq
> # [1]     0
> # [2]     0
> # [3]     0
>
> trimLRPatterns(Rpattern=fragment, subject=DNAStringSet(reads[2]),  
> max.Rmismatch=0.5, ranges=FALSE)
>
> # Error in solveUserSEW(width(x), start = start, end = end, width =
> # width) :
> #   solving row 1: 'allow.nonnarrowing' is FALSE and the supplied  
> start
> # (0) is < 1
>
>
> > sessionInfo()
> R version 2.12.0 (2010-10-15)
> Platform: i686-pc-linux-gnu (32-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] ShortRead_1.8.2     Rsamtools_1.2.1     lattice_0.19-13
> [4] Biostrings_2.18.2   GenomicRanges_1.2.2 IRanges_1.8.7
> [7] multtest_2.6.0      Biobase_2.10.0
>
> loaded via a namespace (and not attached):
> [1] grid_2.12.0     hwriter_1.3     MASS_7.3-9      splines_2.12.0
> [5] survival_2.36-2 tools_2.12.0
>
>
>
> Where is this error coming from?
>
> Ludo
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing