[BioC] issue with Views on PairwiseAlignmentsSingleSubject?

Sat Oct 6 01:43:41 CEST 2012

Hi Janet,

Yes, this is clearly a bug. Thanks for reporting it. Just to make the
problem even more apparent:

   subject1 <- paste(rep("N", 60), collapse="")
   subject2 <- 
"GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC"
   subject <- DNAStringSet(c(subject1, subject2))

   pattern <- DNAStringSet(c("ATATAAAAATTAATT", "CTTCCATTTCG"))

   myAln <- pairwiseAlignment(pattern, subject[2])

Then:

   > myAln
   Global PairwiseAlignmentsSingleSubject (1 of 2)
   pattern:  [1] ATATAAAAAT-TAATT
   subject: [39] ATATAAAAATATAATT
   score: -180.2737

   > Views(myAln)
     Views on a 120-letter DNAString subject
   subject: 
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...AAATATATAAATATATAAAAATATAATTTTCATC
   views:
       start end width
   [1]    39  54    16 [NNNNNNNNNNNNNNNN]
   [2]     1  19    19 [NNNNNNNNNNNNNNNNNNN]

A fix is coming (in Biostrings 2.26.1 and 2.27.2). With this fix:

   > Views(myAln)
     Views on a 60-letter DNAString subject
   subject: GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC
   views:
       start end width
   [1]    39  54    16 [ATATAAAAATATAATT]
   [2]     1  19    19 [GTTTCACTACTTCCTTTCG]

And looking closely at 'myAln' with writePairwiseAlignments() will
confirm those views:

 > writePairwiseAlignments(myAln)
########################################
# Program: Biostrings (version 2.26.0), a Bioconductor package
# Rundate: Fri Oct  5 16:15:47 2012
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: P1
# 2: S1
# Matrix: NA
# Gap_penalty: 14.0
# Extend_penalty: 4.0
#
# Length: 60
# Identity:      15/60 (25.0%)
# Similarity:    NA/60 (NA%)
# Gaps:          45/60 (75.0%)
# Score: -180.2737
#
#
#=======================================

P1                 1 --------------------------------------ATATAAAAAT-T 
     11
                                                            |||||||||| |
S1                 1 GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATAT 
     50

P1                12 AATT------     15
                      ||||
S1                51 AATTTTCATC     60

#=======================================
#
# Aligned_sequences: 2
# 1: P2
# 2: S1
# Matrix: NA
# Gap_penalty: 14.0
# Extend_penalty: 4.0
#
# Length: 60
# Identity:       9/60 (15.0%)
# Similarity:    NA/60 (NA%)
# Gaps:          49/60 (81.7%)
# Score: -209.9628
#
#
#=======================================

P2                 1 CTTCCA--------TTTCG------------------------------- 
     11
                       || ||        |||||
S1                 1 GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATAT 
     50

P2                12 ----------     11

S1                51 AATTTTCATC     60

#---------------------------------------
#---------------------------------------

More below...

On 10/05/2012 03:31 PM, Janet Young wrote:
> Hi there,
>
> I've been using Biostrings' pairwiseAlignment for various things.  I just saw the Views feature described in  help("PairwiseAlignments-class") and thought I'd give it a try, but it's not behaving as I'd expect in cases where my subject sequence is one sequence selected from a DNAStringSet (rather than starting with a DNAString).
>
> I think I found a bug (?) although I could also be misunderstanding what Views is supposed to do.  I updated to the new Bioc devel and new R today, but had the same issue before updating.
>
> I think the code chunk below will explain - does it make sense?
>
> thanks very much,
>
> Janet
>
>
>
> library(Biostrings)
>
> ## example seqs (taken from ?pairwiseAlignment)
> s1 <- DNAString("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG")
> s2 <- DNAString("GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC")
>
> ### make some sequence sets
> s1a <- DNAStringSet(c(as.character(s1),as.character(s1)))
> s2a <- DNAStringSet(c(as.character(s2),as.character(s2)))
>
>
> ### align
> myAln <- pairwiseAlignment ( s1a[1], s2a[1])
>
>
> ### I think the next line of code should give me a Views on s1a[1], but instead it gives me Views on all the seqs of s1a concatenated together
> Views(myAln)
>
> ##### and, an aside, it would seem intuitive to me that the code below should work to make a stringset of two sequences, but instead it concetenates them to one sequence
> s1b <- DNAStringSet(c(s1,s1))

Combining (with c()) 2 vectors of n1 and n2 elements, respectively,
is expected to return a vector of n1 + n2 elements (where the
elements of the 2nd vector have been placed after the elements of
the 1st vector), and of the same class as the original vectors.

In a DNAString object, the elements are the letters and the length
of the object is the number of letters:

   > x <- DNAString("AAAAACTTA")
   > x
     9-letter "DNAString" instance
   seq: AAAAACTTA
   > length(x)
   [1] 9
   > x[6:7]
     2-letter "DNAString" instance
   seq: CT

So doing c() on DNAString objects does:

   > c(x, DNAString("NN"))
     11-letter "DNAString" instance
   seq: AAAAACTTANN

In a DNAStringSet object, the elements are the sequences (represented
as DNAString objects) and the length of the object is the number of
sequences:

   > dna <- DNAStringSet("AAAAACTTA")
   > dna
     A DNAStringSet instance of length 1
       width seq
   [1]     9 AAAAACTTA
   > length(dna)
   [1] 1

So doing c() on DNAStringSet objects does:

   > dna2 <- c(dna, dna)
   > dna2
     A DNAStringSet instance of length 2
       width seq
   [1]     9 AAAAACTTA
   [2]     9 AAAAACTTA
   > length(dna2)
   [1] 2
   > dna2[[1]]
     9-letter "DNAString" instance
   seq: AAAAACTTA
   > dna2[[2]]
     9-letter "DNAString" instance
   seq: AAAAACTTA
   > dna2[[3]]
   Error in checkAndTranslateDbleBracketSubscript(x, i) :
     subscript out of bounds

So if you had made 's1' a DNAStringSet instead of a DNAString, you
would have gotten what you wanted:

   s1 <- 
DNAStringSet("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG")
   s1b <- c(s1,s1)

For those reasons a DNAStringSet object should be seen as the analog
of an ordinary character vector. But not DNAString. There is actually
no good analogy for DNAString in R ordinary objects. The closest to
it is maybe a raw vector: a DNAString object can also be seen as a
vector of bytes.

HTH,
H.

>
> #########
> sessionInfo()
>
> R Under development (unstable) (2012-10-03 r60868)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] Biostrings_2.27.1  IRanges_1.17.0     BiocGenerics_0.5.0
>
> loaded via a namespace (and not attached):
> [1] parallel_2.16.0 stats4_2.16.0
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319