[Bioc-sig-seq] Biostrings: problem to access indel-details form pairwiseAlignment()
Patrick Aboyoun
paboyoun at fhcrc.org
Fri Jul 24 16:52:02 CEST 2009
Wolfgang,
If I'm understanding the situation correctly, the
start(indel(pattern())) operation returns the positions in the pattern
where there is a gap and the length of those gaps, the
start(indel(subject())) function returns the positions in the subject
where there is a gap and the length of those gap and what you would like
to do is map this information back to the subject to see where the
deletions occurred. I don't recall adding a function to do that, but I
can see it being useful. Here is a somewhat kludgey method of getting
the deletion locations via manipulation of output of the aligned
function (there may be a cleaner way of doing this with existing tools,
but this is the first idea that came to mind):
> aligned(align)
A DNAStringSet instance of length 3
width seq
[1] 36 GGGATAGTGACTACAGGATCCAGCTCTTCGCCTGGC
[2] 36 ---ATAGTGACGT-AGGATCCAGCTCTTCGCC-GGC
[3] 36 GGGA--GTGACTTCAGGATCCAG-TCTTCGCCTGGC
> deleteRanges <- vmatchPattern("-", aligned(align))
> deleteRanges
MIndex object of length 3
> deleteRanges <- as(deleteRanges, "CompressedIRangesList")
> deleteRanges
CompressedIRangesList: 3 elements
> deleteRanges <- lapply(deleteRanges, asNormalIRanges)
> deleteRanges
[[1]]
NormalIRanges instance:
[1] start end width
<0 rows> (or 0-length row.names)
[[2]]
NormalIRanges instance:
start end width
[1] 1 3 3
[2] 14 14 1
[3] 33 33 1
[[3]]
NormalIRanges instance:
start end width
[1] 5 6 2
[2] 24 24 1
> for(i in 1:3) print(Views(mySubject, deleteRanges[[i]]))
Views on a 36-letter DNAString subject
subject: GGGATAGTGACTTCAGGATCCAGCTCTTCGCCTGGC
views: NONE
Views on a 36-letter DNAString subject
subject: GGGATAGTGACTTCAGGATCCAGCTCTTCGCCTGGC
views:
start end width
[1] 1 3 3 [GGG]
[2] 14 14 1 [C]
[3] 33 33 1 [T]
Views on a 36-letter DNAString subject
subject: GGGATAGTGACTTCAGGATCCAGCTCTTCGCCTGGC
views:
start end width
[1] 5 6 2 [TA]
[2] 24 24 1 [C]
Wolfgang Raffelsberger wrote:
> Patrick,
>
> thank for your quick response !
> For a while things went well but now I got across a case where I have
> difficulty interpreting the number(s) given for start-positions of
> deletions. This is when there is a preceding insertion before a
> deletion occurs/ gets encountered.
> Of course I could correct this, ie shift the values myself according
> to the number preceding inserted nucletides, but I posted this since
> I'm not sure if you're aware of this special case/problem.
> Or maybe you have some other command/way to access (ie extract) the
> sequence(s) deleted ?
>
> (Note that the equivalent approach for extracting insertions works
> without any problems :))
>
> Wolfgang
>
> > suppressMessages(library(Biostrings))
> > mySubject <- ref1 <- DNAString("GGGATAGTGACTTCAGGATCCAGCTCTTCGCCTGGC")
> > myPattern <- samp1 <-
> DNAStringSet(c("GGGATAGTGACTACAGGATCCAGCTCTTCGCCTGGC","ATAGTGACGTAGGATCACAGCTCTTCGCCGGC","TTGGGAGTGACTTGCAGGATCCAGTCTTCGCCTGGCAT"))
>
> > ## 1st has a mutation; 2nd: {5'del+ mut(g)+ del(C)+ del(T)+ ins(A)}
> ; 3rd: {5'ins + del(TA)+ ins(G)+ del(C)+ 3'ins}
> > ## multiple alignment of test-case (gapOpening= -4, gapExtension= -2)
> > ## GGGATAGTGACTT-CAGGATC-CAGCTCTTCGCCTGGC .. ref = subject
> > ## GGGATAGTGACTa-CAGGATC-CAGCTCTTCGCCTGGC .. (1) mutation (T
> -> A)
> > ## ATAGTGACgT--AGGATCACAGCTCTTCGCC-GGC .. (2) 5'del +
> mut(g) + del(C)+ ins(A)+ del(T)
> > ## ttGGGA--GTGACTTGCAGGATC-CAG-TCTTCGCCTGGCat .. (3) 5'ins +
> del(TA)+ ins(G)+ del(C)+ 3'ins
> >
> > align <- pairwiseAlignment(myPattern,mySubject,gapOpening= -4,
> gapExtension= -2)
> >
> > nindel(align) # no of indels per seq
> An object of class "InDel"
> Slot "insertion":
> Length WidthSum
> [1,] 0 0
> [2,] 1 1
> [3,] 1 1
>
> Slot "deletion":
> Length WidthSum
> [1,] 0 0
> [2,] 2 2
> [3,] 2 3
>
> > (delStart <- start(indel(pattern(align))))
> [1] 11 30 5 23
> > (delEnd <- end(indel(pattern(align))))
> [1] 11 30 6 23
> >
> > (seqNo_wDel <- rep.int(1:length(align),
> elementLengths(indel(pattern(align))))) # sequence-number for each
> instance of deletion
> [1] 2 2 3 3
> >
> > for(i in 1:length(seqNo_wDel))
> print(substr(as.character(unaligned(subject(align[seqNo_wDel[i]]))),delStart[i],delEnd[i]))
>
> [1] "C"
> [1] "G"
> [1] "TA"
> [1] "G"
> ## however the "true" deleted ones are : C, T, TA, C => problem with
> deletions after preceding insertion (ie ech 2nd case)
> > for(i in 1:length(seqNo_wDel)) print(substr(as.character(
> aligned(pattern(align[seqNo_wDel[i]]))),delStart[i],delEnd[i]))
> [1] "-"
> [1] "C"
> [1] "--"
> [1] "A"
> > ## ==> problem with pôsition of deletions after preceding insertion
> (ie ech 2nd case)
> >
> > sessionInfo()
> R version 2.10.0 Under development (unstable) (2009-07-22 r48979)
> x86_64-unknown-linux-gnu
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] Biostrings_2.13.27 IRanges_1.3.42 loaded via a namespace (and not
> attached):
> [1] Biobase_2.5.4
> >
>
>
>
>
> Patrick Aboyoun a écrit :
>> Wolfgang,
>> The IRanges infrastructure has evolved greatly from BioC 2.4/R-2.9 to
>> BioC 2.5/R-2.10 and I highly recommend you use accessor functions,
>> rather than direct slot access to obtain the information you are
>> looking for. For example, the IRangesList class definition has
>> changed greatly and the @elements slot is no longer present.
>>
>> Question: how to transform an IRangesList object to a list of IRanges
>> object
>> Answer:
>> Short answer, use as.list as in as.list(indel(subject(align))). Long
>> answer, this is typically not necessary and not desired given that
>> IRangesList has many methods defined for them. See man pages for
>> IRangesList, RangesList, and Sequence (if you are on BioC
>> 2.5/R-2.10). If you can tell me what operations you would like to
>> perform, I can point you in the right direction.
>>
>> Question: how to the the start, width, and end locations of a
>> pairwise alignment.
>> Answer: use the start, width and end methods.
>> > start(subject(align))
>> [1] 1 1 4
>> > width(subject(align))
>> [1] 24 24 19
>> > end(subject(align))
>> [1] 24 24 22
>>
>>
>> Patrick
>>
>>
>> Wolfgang Raffelsberger wrote:
>>> Patrick,
>>>
>>> thank you very much for your quick and helpful answer !
>>>
>>> Yes, using :
>>> > align <- pairwiseAlignment(samp1,ref1)
>>> > indel(subject(align))
>>>
>>> I'm about to get what I'm looking for. Now my question is, which
>>> commands are (will be) availabel for mining an IRangesList-object.
>>> Most of all I'm interested in what would correspond to getting :
>>>
>>> > indel(subject(align))@elements
>>> > subject(align)@range at start
>>> > subject(align)@range at witdth # in fact, so far I can do without
>>> this one
>>>
>>> (unless you think the @elements, and @range won't change in the
>>> future ...)
>>> With these elements I manage now to extract the very nucleotides
>>> that were inserted/deleted.
>>>
>>> Wolfgang
>>>
>>>
>>> Patrick Aboyoun a écrit :
>>>> Wolfgang,
>>>> Below is code that retrieves the indel locations you are looking
>>>> for. I like your attempts at using indel, insertion, and deletion
>>>> for PairwiseAlignment objects and I'll add the methods for
>>>> PairwiseAlignment objects to BioC 2.5 (devel) shortly using the
>>>> conventions that I specify below.
>>>>
>>>> > suppressMessages(library(Biostrings))
>>>> > ref1 <- DNAString("GGGATACTTCACCAGCTCCCTGGC") # my pattern
>>>> > samp1 <-
>>>> DNAStringSet(c("GGGATACTACACCAGCTCCCTGGC","GGGATACTTACACCAGCTCCCTGGC","ATACTTCACCAGCTCCCTG"))
>>>>
>>>> > # 1st has a mutation, 2nd has an insertion, the 3rd is simply
>>>> shorter ...
>>>> >
>>>> > align <- pairwiseAlignment(samp1,ref1)
>>>> >
>>>> > nindel(align)
>>>> An object of class “InDel”
>>>> Slot "insertion":
>>>> Length WidthSum
>>>> [1,] 0 0
>>>> [2,] 1 1
>>>> [3,] 0 0
>>>>
>>>> Slot "deletion":
>>>> Length WidthSum
>>>> [1,] 0 0
>>>> [2,] 0 0
>>>> [3,] 0 0
>>>>
>>>> > deletions <- indel(pattern(align))
>>>> > deletions
>>>> CompressedIRangesList: 3 elements
>>>> > insertions <- indel(subject(align))
>>>> > insertions
>>>> CompressedIRangesList: 3 elements
>>>> > insertions[[2]]
>>>> IRanges instance:
>>>> start end width
>>>> [1] 10 10 1
>>>> > sessionInfo()
>>>> R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
>>>> i386-apple-darwin9.7.0
>>>>
>>>> locale:
>>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods base
>>>>
>>>> other attached packages:
>>>> [1] Biostrings_2.13.26 IRanges_1.3.41
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] Biobase_2.5.4
>>>>
>>>>
>>>> Wolfgang Raffelsberger wrote:
>>>>> Dear list,
>>>>>
>>>>> previously I've been extracting indel-information from sequences
>>>>> aligned by the Biostrings function pairwiseAlignment(), which is
>>>>> probably not the best way since the class
>>>>> 'PairwiseAlignedFixedSubject' has evoled & changed and my old code
>>>>> won't work any more. Now trying to use the library-provided
>>>>> functions to access the information/details about indels (ie their
>>>>> localization on the pattern and possibly the indel sequence ).
>>>>> However, I can't find a function to extract this information, that
>>>>> is (to the best of my knowledge) part of the aligned object.
>>>>>
>>>>> ## here an example :
>>>>> library(Biostrings)
>>>>> ref1 <- DNAString("GGGATACTTCACCAGCTCCCTGGC") # my pattern
>>>>> samp1 <-
>>>>> DNAStringSet(c("GGGATACTACACCAGCTCCCTGGC","GGGATACTTACACCAGCTCCCTGGC","ATACTTCACCAGCTCCCTG"))
>>>>>
>>>>> # 1st has a mutation, 2nd has an insertion, the 3rd is simply
>>>>> shorter ...
>>>>>
>>>>> align <- pairwiseAlignment(samp1,ref1)
>>>>>
>>>>> nindel(align) # insertion was found properly but I can't see at
>>>>> which nt position the indel was found (neither if it's an
>>>>> insertion or deletion)
>>>>> indel(align) # Error in function (classes, fdef, mtable) unable to
>>>>> find an inherited method for function...
>>>>> insertion(align) # Error in function (classes, fdef, mtable)
>>>>> unable to find an inherited method for function ...
>>>>> deletion(align) # neither ...
>>>>> ?AlignedXStringSet # says under 'Accessor methods' that indel()
>>>>> exists ..
>>>>>
>>>>> ## ideally I'd be looking for something like
>>>>> mismatchTable(align) # but addressing indels ...
>>>>>
>>>>>
>>>>> ## for completeness :
>>>>> > sessionInfo()
>>>>> R version 2.9.1 (2009-06-26)
>>>>> i386-pc-mingw32
>>>>>
>>>>> locale:
>>>>> LC_COLLATE=French_France.1252;LC_CTYPE=French_France.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=C;LC_TIME=French_France.1252
>>>>>
>>>>>
>>>>> attached base packages:
>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>> other attached packages:
>>>>> [1] ShortRead_1.2.1 lattice_0.17-25 BSgenome_1.12.3
>>>>> Biostrings_2.12.7 IRanges_1.2.3
>>>>> loaded via a namespace (and not attached):
>>>>> [1] Biobase_2.4.1 grid_2.9.1 hwriter_1.1
>>>>>
>>>>> Thank's in advance,
>>>>> Wolfgang Raffelsberger
>>>>>
>>>>>
>>
>
> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> Wolfgang Raffelsberger, PhD
> Laboratoire de BioInformatique et Génomique Intégratives
> CNRS UMR7104, IGBMC, 1 rue Laurent Fries, 67404 Illkirch
> Strasbourg, France
> Tel (+33) 388 65 3300 Fax (+33) 388 65 3276
> wolfgang.raffelsberger (at) igbmc.fr
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
More information about the Bioc-sig-sequencing
mailing list