[BioC] issue with Views on PairwiseAlignmentsSingleSubject?
Hervé Pagès
hpages at fhcrc.org
Sat Oct 6 01:43:41 CEST 2012
Hi Janet,
Yes, this is clearly a bug. Thanks for reporting it. Just to make the
problem even more apparent:
subject1 <- paste(rep("N", 60), collapse="")
subject2 <-
"GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC"
subject <- DNAStringSet(c(subject1, subject2))
pattern <- DNAStringSet(c("ATATAAAAATTAATT", "CTTCCATTTCG"))
myAln <- pairwiseAlignment(pattern, subject[2])
Then:
> myAln
Global PairwiseAlignmentsSingleSubject (1 of 2)
pattern: [1] ATATAAAAAT-TAATT
subject: [39] ATATAAAAATATAATT
score: -180.2737
> Views(myAln)
Views on a 120-letter DNAString subject
subject:
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...AAATATATAAATATATAAAAATATAATTTTCATC
views:
start end width
[1] 39 54 16 [NNNNNNNNNNNNNNNN]
[2] 1 19 19 [NNNNNNNNNNNNNNNNNNN]
A fix is coming (in Biostrings 2.26.1 and 2.27.2). With this fix:
> Views(myAln)
Views on a 60-letter DNAString subject
subject: GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC
views:
start end width
[1] 39 54 16 [ATATAAAAATATAATT]
[2] 1 19 19 [GTTTCACTACTTCCTTTCG]
And looking closely at 'myAln' with writePairwiseAlignments() will
confirm those views:
> writePairwiseAlignments(myAln)
########################################
# Program: Biostrings (version 2.26.0), a Bioconductor package
# Rundate: Fri Oct 5 16:15:47 2012
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: P1
# 2: S1
# Matrix: NA
# Gap_penalty: 14.0
# Extend_penalty: 4.0
#
# Length: 60
# Identity: 15/60 (25.0%)
# Similarity: NA/60 (NA%)
# Gaps: 45/60 (75.0%)
# Score: -180.2737
#
#
#=======================================
P1 1 --------------------------------------ATATAAAAAT-T
11
|||||||||| |
S1 1 GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATAT
50
P1 12 AATT------ 15
||||
S1 51 AATTTTCATC 60
#=======================================
#
# Aligned_sequences: 2
# 1: P2
# 2: S1
# Matrix: NA
# Gap_penalty: 14.0
# Extend_penalty: 4.0
#
# Length: 60
# Identity: 9/60 (15.0%)
# Similarity: NA/60 (NA%)
# Gaps: 49/60 (81.7%)
# Score: -209.9628
#
#
#=======================================
P2 1 CTTCCA--------TTTCG-------------------------------
11
|| || |||||
S1 1 GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATAT
50
P2 12 ---------- 11
S1 51 AATTTTCATC 60
#---------------------------------------
#---------------------------------------
More below...
On 10/05/2012 03:31 PM, Janet Young wrote:
> Hi there,
>
> I've been using Biostrings' pairwiseAlignment for various things. I just saw the Views feature described in help("PairwiseAlignments-class") and thought I'd give it a try, but it's not behaving as I'd expect in cases where my subject sequence is one sequence selected from a DNAStringSet (rather than starting with a DNAString).
>
> I think I found a bug (?) although I could also be misunderstanding what Views is supposed to do. I updated to the new Bioc devel and new R today, but had the same issue before updating.
>
> I think the code chunk below will explain - does it make sense?
>
> thanks very much,
>
> Janet
>
>
>
> library(Biostrings)
>
> ## example seqs (taken from ?pairwiseAlignment)
> s1 <- DNAString("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG")
> s2 <- DNAString("GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC")
>
> ### make some sequence sets
> s1a <- DNAStringSet(c(as.character(s1),as.character(s1)))
> s2a <- DNAStringSet(c(as.character(s2),as.character(s2)))
>
>
> ### align
> myAln <- pairwiseAlignment ( s1a[1], s2a[1])
>
>
> ### I think the next line of code should give me a Views on s1a[1], but instead it gives me Views on all the seqs of s1a concatenated together
> Views(myAln)
>
> ##### and, an aside, it would seem intuitive to me that the code below should work to make a stringset of two sequences, but instead it concetenates them to one sequence
> s1b <- DNAStringSet(c(s1,s1))
Combining (with c()) 2 vectors of n1 and n2 elements, respectively,
is expected to return a vector of n1 + n2 elements (where the
elements of the 2nd vector have been placed after the elements of
the 1st vector), and of the same class as the original vectors.
In a DNAString object, the elements are the letters and the length
of the object is the number of letters:
> x <- DNAString("AAAAACTTA")
> x
9-letter "DNAString" instance
seq: AAAAACTTA
> length(x)
[1] 9
> x[6:7]
2-letter "DNAString" instance
seq: CT
So doing c() on DNAString objects does:
> c(x, DNAString("NN"))
11-letter "DNAString" instance
seq: AAAAACTTANN
In a DNAStringSet object, the elements are the sequences (represented
as DNAString objects) and the length of the object is the number of
sequences:
> dna <- DNAStringSet("AAAAACTTA")
> dna
A DNAStringSet instance of length 1
width seq
[1] 9 AAAAACTTA
> length(dna)
[1] 1
So doing c() on DNAStringSet objects does:
> dna2 <- c(dna, dna)
> dna2
A DNAStringSet instance of length 2
width seq
[1] 9 AAAAACTTA
[2] 9 AAAAACTTA
> length(dna2)
[1] 2
> dna2[[1]]
9-letter "DNAString" instance
seq: AAAAACTTA
> dna2[[2]]
9-letter "DNAString" instance
seq: AAAAACTTA
> dna2[[3]]
Error in checkAndTranslateDbleBracketSubscript(x, i) :
subscript out of bounds
So if you had made 's1' a DNAStringSet instead of a DNAString, you
would have gotten what you wanted:
s1 <-
DNAStringSet("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG")
s1b <- c(s1,s1)
For those reasons a DNAStringSet object should be seen as the analog
of an ordinary character vector. But not DNAString. There is actually
no good analogy for DNAString in R ordinary objects. The closest to
it is maybe a raw vector: a DNAString object can also be seen as a
vector of bytes.
HTH,
H.
>
> #########
> sessionInfo()
>
> R Under development (unstable) (2012-10-03 r60868)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] Biostrings_2.27.1 IRanges_1.17.0 BiocGenerics_0.5.0
>
> loaded via a namespace (and not attached):
> [1] parallel_2.16.0 stats4_2.16.0
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list