[Bioc-sig-seq] SNP counting (Biostrings ?) question
Wolfgang Raffelsberger
wraff at titus.u-strasbg.fr
Mon Nov 3 13:35:30 CET 2008
Dear Patrick,
thank's for your response. After looking in the nice functions that you
pointed out now I have three new questions:
1) Using commands like mismatchTable or consensusMatrix, is there a way
to I tell/recall which of my sample-sequences has given alterations in
the pairwise alignments. This might correspond to a list for each line
of mismatchSummary(alignments)$subject tells which of the
sample-sequence contributed to the effect counted in
mismatchSummary(alignments)$subject$Count . Or are there some other
ways/commands allowing to do so ?
2) I noticed that I get the same paire-wise alignment results (even same
score) when using +5 insted of -5 for gapOpening etc. Why bother about
writing values as negative if they're taken as abs() ?
3) I don't see insertions displayed with consensusMatrix(), ie the "+"
lines stays empty while deletions ("-") are counted correctly (and I
didn't get insertion() to work). My example sample-sequences no 5 & 6
have insertions.
Thank's in advance,
Wolfgang
library(Biostrings)
mat <- matrix(-3,nc=5,nr=5)
diag(mat) <- 1
mat[,5] <- 0
mat[5,] <- 0
rownames(mat) <- colnames(mat) <- c("A","C","G","T","N")
ref <- DNAString("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAA")
samp <-
DNAStringSet(c("CTTCtCCAGCTCCCTGGCGGTAAGTTGA","ACTTCtCCAGCTagCTGnnGGTAAGTTGA",
"TTCACCAGCTCCCTGCGGGTAAGTT","TTCACGCTcCTGGGTAAGGATCA",
"cACTTCACCAGCTCCCTGGCGGTAAGT","nnnnTTCtCCAGCTCCCccTGGCGatngGTAAGTTGATCtcgt"))
# samp: no1: single mut, no2: mult mut, no3: just shorter, no4:
deletions, no5: insertion, no6: mult insertions
alignments <- pairwiseAlignment(samp, ref, substitutionMatrix=mat,
gapOpening= -5, gapExtension= -2, scoreOnly=FALSE, type="global")
# test if abs() values for gapOpening change anything
alignments2 <- pairwiseAlignment(samp, ref, substitutionMatrix=mat,
gapOpening= 5, gapExtension= 2, scoreOnly=FALSE, type="global")
identical(alignments,alignments2) # is TRUE
# check for deletions
consensusMatrix(alignments, baseOnly = F)[c(1:4,15:17),]
# the last line "+" doesn't show/count any insertions while deletions
"-" are counted OK
> sessionInfo()
R version 2.8.0 (2008-10-20)
i386-pc-mingw32
locale:
LC_COLLATE=French_France.1252;LC_CTYPE=French_France.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=C;LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.10.0 IRanges_1.0.1
loaded via a namespace (and not attached):
[1] grid_2.8.0 lattice_0.17-15 Matrix_0.999375-16
Patrick Aboyoun a écrit :
> Wolfgang,
> I'm glad to hear you are finding the Biostrings package to be useful.
> The pairwiseAlignment function has some helper functions that would be
> useful to a SNP analysis. In particular, you may want to take a look
> at the mismatchTable, mismatchSummary, summary, and consensusMatrix
> functions to see if they provide the summaries you are looking for. I
> also recommend you leverage the vectorized capabilities of the
> pairwiseAlignment function since it will be easier to maintain your
> code, provide faster execution time, and make it easier to summarize
> your results. I have modified your code slightly to show you what you
> can get "out of the box" from the Biostrings package without writing
> too much code:
>
>
> library(Biostrings)
> # the substitution matrix
> mat <- matrix(-3, nrow = 5, ncol = 5)
> diag(mat) <- 1
> mat[,5] <- 0; mat[5,] <- 0
> rownames(mat) <- colnames(mat) <- DNA_ALPHABET[c(1:4,15)]
>
> # the consensus reference sequence
> ref <- DNAString("ACTTCACCAGCTCCCTGGC")
> # the sample sequences; the 3rd one has no mutations, it's simply
> shorter ...
> samp <-
>
> DNAStringSet(c("CTTCTCCAGCTCCCTGG","ACTTCTCCAGCTACCTGG","TTCACCAGCTCCCTG"))
>
>
> # vectorized alignments using the ref as the "subject"
> alignments <-
> pairwiseAlignment(samp, ref, substitutionMatrix=mat,
> gapOpening= -5, gapExtension= -2, scoreOnly=FALSE,
> type="global")
>
> # summaries useful for SNP analysis
> summary(alignments)
> mismatchSummary(alignments)
> mismatchTable(alignments)
> consensusMatrix(alignments, baseOnly = TRUE)
>
>
>
> Patrick
>
>
> Wolfgang Raffelsberger wrote:
>> Hi Herve,
>>
>> thank you for your helpful answer.
>> Initially I was wondering if there are already some functions/packages
>> directly for my tasks available, as Bioconducter and CRAN are very
>> strong resources.
>>
>> Since your Biostrings package provides all the basic functionalities,
>> I've built a few (additional) functions (on top of Biostrings) doing the
>> job.
>> To be more precise, yes, I'm interested in detecting all alterations
>> (mutations+ insertions + deletions). However, within the project's
>> biological context mutations are the prime focus. Right now I'm rather
>> in an exploratory/pilot phase for the nucleotide-alteration/SNP
>> counting, so I'll probably improve & expand some of my quickly
>> written code.
>>
>> For the sequence alignments I prefer global, since I'd rather want the
>> 'best' match possible (ie the default setting with pairwiseAlignment()
>> ). Concerning the choice of (custom) substitution matrices I'm not very
>> decided. Right now I'm using very generic ones (like in the Biostrings
>> vignette). Recently I've learned that there are several 'versions' of
>> the Needleman-Wunsch algorithm (ie various improvements to the original
>> one), so I'm not any more surprised of seeing different scores from
>> different implementations...
>> Anyway, once it's clear which sample/target (ie sequencer output) should
>> be compared with which reference sequence (ie the consenus genomic
>> sequence of the very region to be tested) I don't use the score any more
>> and luckily my test-data contain few completely unexpected 'alien'
>> sequences. For this initial step now I use matchPattern() - thank's for
>> your hint- where I primarily need to test if sequences have to be read
>> as inverse/complement.
>> Again, after a bit more digging I found the corresponding functions and
>> I'm quite happy with the performance !
>>
>> At the core of SNP/alterations counting I dissect the pairwise alignment
>> and record what kind of changes have happened (at this level I noted
>> some changes to the new BioC 2.3 implementation ...see also code below).
>> Here I use (so far) strsplit() to make two vectors with individual
>> nucleotides (and gaps) to extract the positions&changes from each of the
>> lines of the pairwise alignment (see code below). I guess there may be
>> other more elegant/faster ways to do this...
>> As I'm primarily interested in 'simple mutations', so far I took a
>> shortcut consisting in simply looking at the first of inserted
>> nucleotide in cases of insertions (some additional lines not part of the
>> code below). Maybe later I'll get back to expand this ... After all I
>> end up with a simple matrix where each row corresponds to a position of
>> the reference sequence and the columns contain counts of the
>> events/alterations observed (where at least I'm able to tell if any
>> insertions have happened at all) where I can run classical statistical
>> tests.
>>
>> Cheers,
>> Wolfgang
>>
>> # here the way I 'parse' pairwise alignments :
>>
>> library(Biostrings)
>> mat <- matrix(-3, nrow = 5, ncol = 5)
>> diag(mat) <- 1
>> mat[,5] <- 0; mat[5,] <- 0
>> rownames(mat) <- colnames(mat) <- DNA_ALPHABET[c(1:4,15)]
>>
>> ref <- DNAString("ACTTCACCAGCTCCCTGGC") # the consensus
>> reference sequence
>> samp <-
>> DNAStringSet(c("CTTCTCCAGCTCCCTGG","ACTTCTCCAGCTACCTGG","TTCACCAGCTCCCTG"))
>>
>> # the 3rd one has no mutations, it's simply shorter ...
>> # here (simplified) I already know that my sequence is oriented
>> correctly
>> alignRes <- matrix(nr=nchar(ref),nc=length(samp))
>> for(i in 1:length(samp)) {
>> x <- pairwiseAlignment(as.character(ref), samp[[i]],
>> substitutionMatrix= mat, gapOpening= -5, gapExtension= -2,scoreOnly=F,
>> type="global")
>> cutsamp <- unlist(strsplit(as.character(x at subject),""))
>> cutref <- unlist(strsplit(as.character(x at pattern),""))
>> if(nchar(ref) != length(cutref)) {
>> origRefNt <- unlist(strsplit(as.character(ref),""))
>> offS <- 1
>> while(origRefNt[1+offS] != cutref[1]) offS <- offS+1
>> }
>> out <- rep("Z",nchar(ref)) # use
>> "Z" for (default) constant
>> if( !identical(cutsamp,cutref)) {
>> mutPos <- !(cutref== cutsamp)
>> out[mutPos+offS] <- cutsamp[mutPos+offS]
>> if(sum(cutref=="-")>0) out <- out[-which(cutref=="-")] # remove
>> gaps in reference
>> # gaps could/can be retained at this stage only to a limited degree
>> by additional subsitutions
>> }
>> alignRes[,i] <- out
>> }
>>
>>> sessionInfo()
>> R version 2.8.0 (2008-10-20)
>> i386-pc-mingw32
>>
>> locale:
>> LC_COLLATE=French_France.1252;LC_CTYPE=French_France.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=C;LC_TIME=French_France.1252
>>
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] Biostrings_2.10.0 IRanges_1.0.1
>>
>> loaded via a namespace (and not attached):
>> [1] grid_2.8.0 lattice_0.17-15 Matrix_0.999375-16
>>
>>
>>
>> Herve Pages a écrit :
>>> Hi Wolfgang,
>>>
>>> If I understand correctly you want to compare a collection of "sample"
>>> sequences against a "reference" sequence.
>>>
>>> There are many ways to "compare" 2 sequences and to get a score from
>>> this
>>> comparison: Do you want to allow indels or just substitutions (aka
>>> replacements)?
>>> Do you want the alignment to be global or local? Do you want the
>>> score to be
>>> determined by a custom substitution matrix or just to reflect the "edit
>>> distance" between the 2 sequences (i.e. the minimum number of
>>> replacements
>>> + insertions + deletions required to transform sequence 1 into
>>> sequence 2).
>>>
>>> If you are trying to get the edit distance between the sample sequences
>>> and the reference sequence then you might want to look at this post:
>>>
>>>
>>> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2008-July/000067.html
>>>
>>>
>>> Note that the 'type' argument is omitted in this post so
>>> pairwiseAlignment()
>>> is used to perform "global" alignments, that is, the score that is
>>> returned
>>> indicates the min nb of edit operations needed to transform the
>>> *complete*
>>> sequences in 'pattern' (your "sample" sequences) into the *complete*
>>> 'subject'
>>> (your "reference" sequence).
>>> One nice property of "global" alignment is that it is a symetric
>>> operation on
>>> the 2 sequences to align. You can take advantage of this by
>>> inverting the roles
>>> of the "sample" and the "reference" sequences so 'samp' can be the
>>> pattern
>>> and 'ref' the subject. Then you can benefit from the fact that
>>> pairwiseAlignment() is vectorized on the pattern.
>>>
>>> But then you say that you would primarily be interested in finding
>>> where
>>> a given nucleotide differs from the query. This suggests that you
>>> are in fact
>>> looking for a much simpler form of alignment where only mismatches
>>> (i.e.
>>> replacements) are allowed but no indels. Also you say that your
>>> sample-sequences may start or end slightly later/earlier.
>>> This means that you can use the matchPattern() function for your
>>> problem.
>>>
>>> > m2 <- matchPattern(samp[[2]], ref, max.mismatch=3)
>>> > m2
>>> Views on a 19-letter DNAString subject
>>> subject: ACTTCACCAGCTCCCTGGC
>>> views:
>>> start end width
>>> [1] 1 18 18 [ACTTCACCAGCTCCCTGG]
>>>
>>> Then use the mismatch() function to get the position(s) of the
>>> mismatche(s):
>>>
>>> > mismatch(samp[[2]], m2)[[1]]
>>> [1] 6 13
>>>
>>> Cheers,
>>> H.
>>>
>>>
>>> Wolfgang Raffelsberger wrote:
>>>
>>>> Dear list,
>>>>
>>>> I would like to count the occurrence of (mostly single) nucleotide
>>>> polymorphisms from nucleotide sequences.
>>>> I got across the Biostrings package and pairwiseAlignment() that
>>>> allows
>>>> me to get closer to what I want but
>>>> 1) I noticed that the score produced from pairwiseAlignment() is quite
>>>> different to other implementations of the Needlaman-Wunsch alogorithm
>>>> (eg in EMBOSS)
>>>> 2) the score is not directly the information I 'm looking for since
>>>> it's
>>>> a mixture of the gaps & mismatches (and I don't see if/how one could
>>>> modify that).
>>>>
>>>> However, I would primarily be interested in finding where a given
>>>> nucleotide differs from the query (from a pairwise alignment) to some
>>>> statistics on them, ie at which position I get which other element
>>>> instead. Note, that my sample-sequences may start or end slightly
>>>> later/earlier.
>>>> Any suggestions ?
>>>>
>>>> Sample code might look like (of course, my real sequences are
>>>> longer ...):
>>>>
>>>> ref <- DNAString("ACTTCACCAGCTCCCTGGC")
>>>> samp <-
>>>> DNAStringSet(c("CTTCTCCAGCTCCCTGG","ACTTCTCCAGCTACCTGG","TTCACCAGCTCCCTG"))
>>>> # the 3rd one has no mutations, it's simply shorter ...
>>>> pairwiseAlignment(ref, samp[[1]], substitutionMatrix = mat,
>>>> gapOpening
>>>> = -5, gapExtension = -2)
>>>> alignScores <- numeric()
>>>> for(i in 1:3) alignScores[i] <- pairwiseAlignment(ref, samp[[i]],
>>>> substitutionMatrix = mat, gapOpening = -5, gapExtension = -2,
>>>> scoreOnly=T)
>>>> alignScores # the 3rd sequence without mismatches gets worst score
>>>>
>>>>
>>>> (Based on a previous post on BioC) I just subscribed to
>>>> bioc-sig-sequencing at r-project.org, but I don't know if I don't
>>>> mange to
>>>> search the previous mail archives (on http://search.gmane.org/)
>>>> since I
>>>> keep getting (general) Bioconductor messages.
>>>>
>>>> Thank's in advance,
>>>> Wolfgang
>>>>
>>>>
>>>> By the way, if that matters, I'm (still) running R-2.7.2
>>>>
>>>>> sessionInfo()
>>>>>
>>>> R version 2.7.2 (2008-08-25)
>>>> i386-pc-mingw32
>>>>
>>>> locale:
>>>> LC_COLLATE=French_France.1252;LC_CTYPE=French_France.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=C;LC_TIME=French_France.1252
>>>>
>>>>
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices datasets tcltk utils
>>>> methods base other attached packages:
>>>> [1] Biostrings_2.8.18 svSocket_0.9-5 svIO_0.9-5
>>>> R2HTML_1.59 svMisc_0.9-5 svIDE_0.9-5 loaded via a
>>>> namespace (and not attached):
>>>> [1] tools_2.7.2
>>>>
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Wolfgang Raffelsberger, PhD
Laboratoire de BioInformatique et Génomique Intégratives
CNRS UMR7104, IGBMC
1 rue Laurent Fries, 67404 Illkirch Strasbourg, France
Tel (+33) 388 65 3300 Fax (+33) 388 65 3276
wolfgang.raffelsberger at igbmc.fr
More information about the Bioc-sig-sequencing
mailing list