[Bioc-sig-seq] Biostrings: problem to access indel-details form pairwiseAlignment()
Wolfgang Raffelsberger
wraff at igbmc.fr
Fri Jul 24 16:35:40 CEST 2009
Patrick,
thank for your quick response !
For a while things went well but now I got across a case where I have
difficulty interpreting the number(s) given for start-positions of
deletions. This is when there is a preceding insertion before a deletion
occurs/ gets encountered.
Of course I could correct this, ie shift the values myself according to
the number preceding inserted nucletides, but I posted this since I'm
not sure if you're aware of this special case/problem.
Or maybe you have some other command/way to access (ie extract) the
sequence(s) deleted ?
(Note that the equivalent approach for extracting insertions works
without any problems :))
Wolfgang
> suppressMessages(library(Biostrings))
> mySubject <- ref1 <- DNAString("GGGATAGTGACTTCAGGATCCAGCTCTTCGCCTGGC")
> myPattern <- samp1 <-
DNAStringSet(c("GGGATAGTGACTACAGGATCCAGCTCTTCGCCTGGC","ATAGTGACGTAGGATCACAGCTCTTCGCCGGC","TTGGGAGTGACTTGCAGGATCCAGTCTTCGCCTGGCAT"))
> ## 1st has a mutation; 2nd: {5'del+ mut(g)+ del(C)+ del(T)+ ins(A)}
; 3rd: {5'ins + del(TA)+ ins(G)+ del(C)+ 3'ins}
> ## multiple alignment of test-case (gapOpening= -4, gapExtension= -2)
> ## GGGATAGTGACTT-CAGGATC-CAGCTCTTCGCCTGGC .. ref = subject
> ## GGGATAGTGACTa-CAGGATC-CAGCTCTTCGCCTGGC .. (1) mutation (T
-> A)
> ## ATAGTGACgT--AGGATCACAGCTCTTCGCC-GGC .. (2) 5'del +
mut(g) + del(C)+ ins(A)+ del(T)
> ## ttGGGA--GTGACTTGCAGGATC-CAG-TCTTCGCCTGGCat .. (3) 5'ins +
del(TA)+ ins(G)+ del(C)+ 3'ins
>
> align <- pairwiseAlignment(myPattern,mySubject,gapOpening= -4,
gapExtension= -2)
>
> nindel(align) # no of indels per seq
An object of class "InDel"
Slot "insertion":
Length WidthSum
[1,] 0 0
[2,] 1 1
[3,] 1 1
Slot "deletion":
Length WidthSum
[1,] 0 0
[2,] 2 2
[3,] 2 3
> (delStart <- start(indel(pattern(align))))
[1] 11 30 5 23
> (delEnd <- end(indel(pattern(align))))
[1] 11 30 6 23
>
> (seqNo_wDel <- rep.int(1:length(align),
elementLengths(indel(pattern(align))))) # sequence-number for each
instance of deletion
[1] 2 2 3 3
>
> for(i in 1:length(seqNo_wDel))
print(substr(as.character(unaligned(subject(align[seqNo_wDel[i]]))),delStart[i],delEnd[i]))
[1] "C"
[1] "G"
[1] "TA"
[1] "G"
## however the "true" deleted ones are : C, T, TA, C => problem with
deletions after preceding insertion (ie ech 2nd case)
> for(i in 1:length(seqNo_wDel)) print(substr(as.character(
aligned(pattern(align[seqNo_wDel[i]]))),delStart[i],delEnd[i]))
[1] "-"
[1] "C"
[1] "--"
[1] "A"
> ## ==> problem with pôsition of deletions after preceding insertion
(ie ech 2nd case)
>
> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-07-22 r48979)
x86_64-unknown-linux-gnu
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.13.27 IRanges_1.3.42
loaded via a namespace (and not attached):
[1] Biobase_2.5.4
>
Patrick Aboyoun a écrit :
> Wolfgang,
> The IRanges infrastructure has evolved greatly from BioC 2.4/R-2.9 to
> BioC 2.5/R-2.10 and I highly recommend you use accessor functions,
> rather than direct slot access to obtain the information you are
> looking for. For example, the IRangesList class definition has changed
> greatly and the @elements slot is no longer present.
>
> Question: how to transform an IRangesList object to a list of IRanges
> object
> Answer:
> Short answer, use as.list as in as.list(indel(subject(align))). Long
> answer, this is typically not necessary and not desired given that
> IRangesList has many methods defined for them. See man pages for
> IRangesList, RangesList, and Sequence (if you are on BioC 2.5/R-2.10).
> If you can tell me what operations you would like to perform, I can
> point you in the right direction.
>
> Question: how to the the start, width, and end locations of a pairwise
> alignment.
> Answer: use the start, width and end methods.
> > start(subject(align))
> [1] 1 1 4
> > width(subject(align))
> [1] 24 24 19
> > end(subject(align))
> [1] 24 24 22
>
>
> Patrick
>
>
> Wolfgang Raffelsberger wrote:
>> Patrick,
>>
>> thank you very much for your quick and helpful answer !
>>
>> Yes, using :
>> > align <- pairwiseAlignment(samp1,ref1)
>> > indel(subject(align))
>>
>> I'm about to get what I'm looking for. Now my question is, which
>> commands are (will be) availabel for mining an IRangesList-object.
>> Most of all I'm interested in what would correspond to getting :
>>
>> > indel(subject(align))@elements
>> > subject(align)@range at start
>> > subject(align)@range at witdth # in fact, so far I can do without
>> this one
>>
>> (unless you think the @elements, and @range won't change in the
>> future ...)
>> With these elements I manage now to extract the very nucleotides that
>> were inserted/deleted.
>>
>> Wolfgang
>>
>>
>> Patrick Aboyoun a écrit :
>>> Wolfgang,
>>> Below is code that retrieves the indel locations you are looking
>>> for. I like your attempts at using indel, insertion, and deletion
>>> for PairwiseAlignment objects and I'll add the methods for
>>> PairwiseAlignment objects to BioC 2.5 (devel) shortly using the
>>> conventions that I specify below.
>>>
>>> > suppressMessages(library(Biostrings))
>>> > ref1 <- DNAString("GGGATACTTCACCAGCTCCCTGGC") # my pattern
>>> > samp1 <-
>>> DNAStringSet(c("GGGATACTACACCAGCTCCCTGGC","GGGATACTTACACCAGCTCCCTGGC","ATACTTCACCAGCTCCCTG"))
>>>
>>> > # 1st has a mutation, 2nd has an insertion, the 3rd is simply
>>> shorter ...
>>> >
>>> > align <- pairwiseAlignment(samp1,ref1)
>>> >
>>> > nindel(align)
>>> An object of class “InDel”
>>> Slot "insertion":
>>> Length WidthSum
>>> [1,] 0 0
>>> [2,] 1 1
>>> [3,] 0 0
>>>
>>> Slot "deletion":
>>> Length WidthSum
>>> [1,] 0 0
>>> [2,] 0 0
>>> [3,] 0 0
>>>
>>> > deletions <- indel(pattern(align))
>>> > deletions
>>> CompressedIRangesList: 3 elements
>>> > insertions <- indel(subject(align))
>>> > insertions
>>> CompressedIRangesList: 3 elements
>>> > insertions[[2]]
>>> IRanges instance:
>>> start end width
>>> [1] 10 10 1
>>> > sessionInfo()
>>> R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
>>> i386-apple-darwin9.7.0
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] Biostrings_2.13.26 IRanges_1.3.41
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.5.4
>>>
>>>
>>> Wolfgang Raffelsberger wrote:
>>>> Dear list,
>>>>
>>>> previously I've been extracting indel-information from sequences
>>>> aligned by the Biostrings function pairwiseAlignment(), which is
>>>> probably not the best way since the class
>>>> 'PairwiseAlignedFixedSubject' has evoled & changed and my old code
>>>> won't work any more. Now trying to use the library-provided
>>>> functions to access the information/details about indels (ie their
>>>> localization on the pattern and possibly the indel sequence ).
>>>> However, I can't find a function to extract this information, that
>>>> is (to the best of my knowledge) part of the aligned object.
>>>>
>>>> ## here an example :
>>>> library(Biostrings)
>>>> ref1 <- DNAString("GGGATACTTCACCAGCTCCCTGGC") # my pattern
>>>> samp1 <-
>>>> DNAStringSet(c("GGGATACTACACCAGCTCCCTGGC","GGGATACTTACACCAGCTCCCTGGC","ATACTTCACCAGCTCCCTG"))
>>>>
>>>> # 1st has a mutation, 2nd has an insertion, the 3rd is simply
>>>> shorter ...
>>>>
>>>> align <- pairwiseAlignment(samp1,ref1)
>>>>
>>>> nindel(align) # insertion was found properly but I can't see at
>>>> which nt position the indel was found (neither if it's an insertion
>>>> or deletion)
>>>> indel(align) # Error in function (classes, fdef, mtable) unable to
>>>> find an inherited method for function...
>>>> insertion(align) # Error in function (classes, fdef, mtable) unable
>>>> to find an inherited method for function ...
>>>> deletion(align) # neither ...
>>>> ?AlignedXStringSet # says under 'Accessor methods' that indel()
>>>> exists ..
>>>>
>>>> ## ideally I'd be looking for something like
>>>> mismatchTable(align) # but addressing indels ...
>>>>
>>>>
>>>> ## for completeness :
>>>> > sessionInfo()
>>>> R version 2.9.1 (2009-06-26)
>>>> i386-pc-mingw32
>>>>
>>>> locale:
>>>> LC_COLLATE=French_France.1252;LC_CTYPE=French_France.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=C;LC_TIME=French_France.1252
>>>>
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods base
>>>> other attached packages:
>>>> [1] ShortRead_1.2.1 lattice_0.17-25 BSgenome_1.12.3
>>>> Biostrings_2.12.7 IRanges_1.2.3
>>>> loaded via a namespace (and not attached):
>>>> [1] Biobase_2.4.1 grid_2.9.1 hwriter_1.1
>>>>
>>>> Thank's in advance,
>>>> Wolfgang Raffelsberger
>>>>
>>>>
>
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Wolfgang Raffelsberger, PhD
Laboratoire de BioInformatique et Génomique Intégratives
CNRS UMR7104, IGBMC,
1 rue Laurent Fries, 67404 Illkirch Strasbourg, France
Tel (+33) 388 65 3300 Fax (+33) 388 65 3276
wolfgang.raffelsberger (at) igbmc.fr
More information about the Bioc-sig-sequencing
mailing list