[Bioc-sig-seq] Biostrings: problem to access indel-details form pairwiseAlignment()

Fri Jul 24 16:52:02 CEST 2009

Wolfgang,
If I'm understanding the situation correctly, the 
start(indel(pattern())) operation returns the positions in the pattern 
where there is a gap and the length of those gaps, the 
start(indel(subject())) function returns the positions in the subject 
where there is a gap and the length of those gap and what you would like 
to do is map this information back to the subject to see where the 
deletions occurred. I don't recall adding a function to do that, but I 
can see it being useful. Here is a somewhat kludgey method of getting 
the deletion locations via manipulation of output of the aligned 
function (there may be a cleaner way of doing this with existing tools, 
but this is the first idea that came to mind):

 > aligned(align)
 A DNAStringSet instance of length 3
   width seq
[1]    36 GGGATAGTGACTACAGGATCCAGCTCTTCGCCTGGC
[2]    36 ---ATAGTGACGT-AGGATCCAGCTCTTCGCC-GGC
[3]    36 GGGA--GTGACTTCAGGATCCAG-TCTTCGCCTGGC

 > deleteRanges <- vmatchPattern("-", aligned(align))
 > deleteRanges
MIndex object of length 3
 > deleteRanges <- as(deleteRanges, "CompressedIRangesList")
 > deleteRanges
CompressedIRangesList: 3 elements
 > deleteRanges <- lapply(deleteRanges, asNormalIRanges)
 > deleteRanges
[[1]]
NormalIRanges instance:
[1] start end   width
<0 rows> (or 0-length row.names)

[[2]]
NormalIRanges instance:
   start end width
[1]     1   3     3
[2]    14  14     1
[3]    33  33     1

[[3]]
NormalIRanges instance:
   start end width
[1]     5   6     2
[2]    24  24     1

 > for(i in 1:3) print(Views(mySubject, deleteRanges[[i]]))
 Views on a 36-letter DNAString subject
subject: GGGATAGTGACTTCAGGATCCAGCTCTTCGCCTGGC
views: NONE
 Views on a 36-letter DNAString subject
subject: GGGATAGTGACTTCAGGATCCAGCTCTTCGCCTGGC
views:
   start end width
[1]     1   3     3 [GGG]
[2]    14  14     1 [C]
[3]    33  33     1 [T]
 Views on a 36-letter DNAString subject
subject: GGGATAGTGACTTCAGGATCCAGCTCTTCGCCTGGC
views:
   start end width
[1]     5   6     2 [TA]
[2]    24  24     1 [C]

Wolfgang Raffelsberger wrote:
> Patrick,
>
> thank for your quick response !
> For a while things went well but now I got across a case where I have 
> difficulty interpreting the number(s) given for start-positions of 
> deletions. This is when there is a preceding insertion before a 
> deletion occurs/ gets encountered.
> Of course I could correct this, ie shift the values myself according 
> to the number preceding inserted nucletides, but I posted this since 
> I'm not sure if you're aware of this special case/problem.
> Or maybe you have some other command/way to access (ie extract) the 
> sequence(s) deleted ?
>
> (Note that the equivalent approach for extracting insertions works 
> without any problems :))
>
> Wolfgang
>
> > suppressMessages(library(Biostrings))
> > mySubject <- ref1 <- DNAString("GGGATAGTGACTTCAGGATCCAGCTCTTCGCCTGGC")
> > myPattern <-  samp1 <- 
> DNAStringSet(c("GGGATAGTGACTACAGGATCCAGCTCTTCGCCTGGC","ATAGTGACGTAGGATCACAGCTCTTCGCCGGC","TTGGGAGTGACTTGCAGGATCCAGTCTTCGCCTGGCAT")) 
>
> > ## 1st has a mutation;  2nd: {5'del+ mut(g)+ del(C)+ del(T)+ ins(A)} 
> ; 3rd: {5'ins + del(TA)+ ins(G)+ del(C)+ 3'ins}
> > ##  multiple alignment of test-case (gapOpening= -4, gapExtension= -2)
> > ##     GGGATAGTGACTT-CAGGATC-CAGCTCTTCGCCTGGC    .. ref = subject
> > ##     GGGATAGTGACTa-CAGGATC-CAGCTCTTCGCCTGGC    .. (1) mutation (T 
> -> A)
> > ##        ATAGTGACgT--AGGATCACAGCTCTTCGCC-GGC    .. (2) 5'del + 
> mut(g) + del(C)+ ins(A)+ del(T)
> > ##   ttGGGA--GTGACTTGCAGGATC-CAG-TCTTCGCCTGGCat  .. (3) 5'ins + 
> del(TA)+ ins(G)+ del(C)+ 3'ins
> >
> > align <- pairwiseAlignment(myPattern,mySubject,gapOpening= -4, 
> gapExtension= -2)
> >
> > nindel(align)                             # no of indels per seq
> An object of class "InDel"
> Slot "insertion":
>    Length WidthSum
> [1,]      0        0
> [2,]      1        1
> [3,]      1        1
>
> Slot "deletion":
>    Length WidthSum
> [1,]      0        0
> [2,]      2        2
> [3,]      2        3
>
> > (delStart <- start(indel(pattern(align))))
> [1] 11 30  5 23
> > (delEnd <- end(indel(pattern(align))))
> [1] 11 30  6 23
> >
> > (seqNo_wDel <- rep.int(1:length(align), 
> elementLengths(indel(pattern(align)))))  # sequence-number for each 
> instance of deletion
> [1] 2 2 3 3
> >
> > for(i in 1:length(seqNo_wDel)) 
> print(substr(as.character(unaligned(subject(align[seqNo_wDel[i]]))),delStart[i],delEnd[i])) 
>
> [1] "C"
> [1] "G"
> [1] "TA"
> [1] "G"
> ## however the "true" deleted ones are : C, T, TA, C   => problem with 
> deletions after preceding insertion (ie ech 2nd case)
> > for(i in 1:length(seqNo_wDel)) print(substr(as.character(  
> aligned(pattern(align[seqNo_wDel[i]]))),delStart[i],delEnd[i]))
> [1] "-"
> [1] "C"
> [1] "--"
> [1] "A"
> > ## ==> problem with pôsition of deletions after preceding insertion 
> (ie ech 2nd case)
> >
> > sessionInfo()
> R version 2.10.0 Under development (unstable) (2009-07-22 r48979)
> x86_64-unknown-linux-gnu
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base   
> other attached packages:
> [1] Biostrings_2.13.27 IRanges_1.3.42  loaded via a namespace (and not 
> attached):
> [1] Biobase_2.5.4
> >
>
>
>
>
> Patrick Aboyoun a écrit :
>> Wolfgang,
>> The IRanges infrastructure has evolved greatly from BioC 2.4/R-2.9 to 
>> BioC 2.5/R-2.10 and I highly recommend you use accessor functions, 
>> rather than direct slot access to obtain the information you are 
>> looking for. For example, the IRangesList class definition has 
>> changed greatly and the @elements slot is no longer present.
>>
>> Question: how to transform an IRangesList object to a list of IRanges 
>> object
>> Answer:
>> Short answer, use as.list as in as.list(indel(subject(align))). Long 
>> answer, this is typically not necessary and not desired given that 
>> IRangesList has many methods defined for them. See man pages for 
>> IRangesList, RangesList, and Sequence (if you are on BioC 
>> 2.5/R-2.10). If you can tell me what operations you would like to 
>> perform, I can point you in the right direction.
>>
>> Question: how to the the start, width, and end locations of a 
>> pairwise alignment.
>> Answer: use the start, width and end methods.
>> > start(subject(align))
>> [1] 1 1 4
>> > width(subject(align))
>> [1] 24 24 19
>> > end(subject(align))
>> [1] 24 24 22
>>
>>
>> Patrick
>>
>>
>> Wolfgang Raffelsberger wrote:
>>> Patrick,
>>>
>>> thank you very much for your quick and helpful answer !
>>>
>>> Yes, using  :
>>> > align <- pairwiseAlignment(samp1,ref1)
>>> > indel(subject(align))
>>>
>>> I'm about to get what I'm looking for.  Now my question is, which 
>>> commands are (will be) availabel for mining an IRangesList-object.
>>> Most of all I'm interested in what would correspond to getting :
>>>
>>> > indel(subject(align))@elements
>>> > subject(align)@range at start
>>> > subject(align)@range at witdth   # in fact, so far I can do without 
>>> this one
>>>
>>> (unless you think the @elements, and  @range won't change in the 
>>> future ...)
>>> With these elements I manage now to extract the very nucleotides 
>>> that were inserted/deleted.
>>>
>>> Wolfgang
>>>
>>>
>>> Patrick Aboyoun a écrit :
>>>> Wolfgang,
>>>> Below is code that retrieves the indel locations you are looking 
>>>> for. I like your attempts at using indel, insertion, and deletion 
>>>> for PairwiseAlignment objects and I'll add the methods for 
>>>> PairwiseAlignment objects to BioC 2.5 (devel) shortly using the 
>>>> conventions that I specify below.
>>>>
>>>> > suppressMessages(library(Biostrings))
>>>> > ref1 <- DNAString("GGGATACTTCACCAGCTCCCTGGC") # my pattern
>>>> > samp1 <- 
>>>> DNAStringSet(c("GGGATACTACACCAGCTCCCTGGC","GGGATACTTACACCAGCTCCCTGGC","ATACTTCACCAGCTCCCTG")) 
>>>>
>>>> > # 1st has a mutation, 2nd has an insertion, the 3rd is simply 
>>>> shorter ...
>>>> >
>>>> > align <- pairwiseAlignment(samp1,ref1)
>>>> >
>>>> > nindel(align)
>>>> An object of class “InDel”
>>>> Slot "insertion":
>>>> Length WidthSum
>>>> [1,] 0 0
>>>> [2,] 1 1
>>>> [3,] 0 0
>>>>
>>>> Slot "deletion":
>>>> Length WidthSum
>>>> [1,] 0 0
>>>> [2,] 0 0
>>>> [3,] 0 0
>>>>
>>>> > deletions <- indel(pattern(align))
>>>> > deletions
>>>> CompressedIRangesList: 3 elements
>>>> > insertions <- indel(subject(align))
>>>> > insertions
>>>> CompressedIRangesList: 3 elements
>>>> > insertions[[2]]
>>>> IRanges instance:
>>>> start end width
>>>> [1] 10 10 1
>>>> > sessionInfo()
>>>> R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
>>>> i386-apple-darwin9.7.0
>>>>
>>>> locale:
>>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods base
>>>>
>>>> other attached packages:
>>>> [1] Biostrings_2.13.26 IRanges_1.3.41
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] Biobase_2.5.4
>>>>
>>>>
>>>> Wolfgang Raffelsberger wrote:
>>>>> Dear list,
>>>>>
>>>>> previously I've been extracting indel-information from sequences 
>>>>> aligned by the Biostrings function pairwiseAlignment(), which is 
>>>>> probably not the best way since the class 
>>>>> 'PairwiseAlignedFixedSubject' has evoled & changed and my old code 
>>>>> won't work any more. Now trying to use the library-provided 
>>>>> functions to access the information/details about indels (ie their 
>>>>> localization on the pattern and possibly the indel sequence ). 
>>>>> However, I can't find a function to extract this information, that 
>>>>> is (to the best of my knowledge) part of the aligned object.
>>>>>
>>>>> ## here an example :
>>>>> library(Biostrings)
>>>>> ref1 <- DNAString("GGGATACTTCACCAGCTCCCTGGC") # my pattern
>>>>> samp1 <- 
>>>>> DNAStringSet(c("GGGATACTACACCAGCTCCCTGGC","GGGATACTTACACCAGCTCCCTGGC","ATACTTCACCAGCTCCCTG")) 
>>>>>
>>>>> # 1st has a mutation, 2nd has an insertion, the 3rd is simply 
>>>>> shorter ...
>>>>>
>>>>> align <- pairwiseAlignment(samp1,ref1)
>>>>>
>>>>> nindel(align) # insertion was found properly but I can't see at 
>>>>> which nt position the indel was found (neither if it's an 
>>>>> insertion or deletion)
>>>>> indel(align) # Error in function (classes, fdef, mtable) unable to 
>>>>> find an inherited method for function...
>>>>> insertion(align) # Error in function (classes, fdef, mtable) 
>>>>> unable to find an inherited method for function ...
>>>>> deletion(align) # neither ...
>>>>> ?AlignedXStringSet # says under 'Accessor methods' that indel() 
>>>>> exists ..
>>>>>
>>>>> ## ideally I'd be looking for something like
>>>>> mismatchTable(align) # but addressing indels ...
>>>>>
>>>>>
>>>>> ## for completeness :
>>>>> > sessionInfo()
>>>>> R version 2.9.1 (2009-06-26)
>>>>> i386-pc-mingw32
>>>>>
>>>>> locale:
>>>>> LC_COLLATE=French_France.1252;LC_CTYPE=French_France.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=C;LC_TIME=French_France.1252 
>>>>>
>>>>>
>>>>> attached base packages:
>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>> other attached packages:
>>>>> [1] ShortRead_1.2.1 lattice_0.17-25 BSgenome_1.12.3 
>>>>> Biostrings_2.12.7 IRanges_1.2.3
>>>>> loaded via a namespace (and not attached):
>>>>> [1] Biobase_2.4.1 grid_2.9.1 hwriter_1.1
>>>>>
>>>>> Thank's in advance,
>>>>> Wolfgang Raffelsberger
>>>>>
>>>>>
>>
>
> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> Wolfgang Raffelsberger, PhD
> Laboratoire de BioInformatique et Génomique Intégratives
> CNRS UMR7104, IGBMC,  1 rue Laurent Fries,  67404 Illkirch  
> Strasbourg,  France
> Tel (+33) 388 65 3300         Fax (+33) 388 65 3276
> wolfgang.raffelsberger (at) igbmc.fr
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing