[Bioc-sig-seq] SNP counting (Biostrings ?) question

Fri Oct 24 21:24:09 CEST 2008

Hi Wolfgang,

If I understand correctly you want to compare a collection of "sample"
sequences against a "reference" sequence.

There are many ways to "compare" 2 sequences and to get a score from this
comparison: Do you want to allow indels or just substitutions (aka replacements)?
Do you want the alignment to be global or local? Do you want the score to be
determined by a custom substitution matrix or just to reflect the "edit
distance" between the 2 sequences (i.e. the minimum number of replacements
+ insertions + deletions required to transform sequence 1 into sequence 2).

If you are trying to get the edit distance between the sample sequences
and the reference sequence then you might want to look at this post:

  https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2008-July/000067.html

Note that the 'type' argument is omitted in this post so pairwiseAlignment()
is used to perform "global" alignments, that is, the score that is returned
indicates the min nb of edit operations needed to transform the *complete*
sequences in 'pattern' (your "sample" sequences) into the *complete* 'subject'
(your "reference" sequence).
One nice property of "global" alignment is that it is a symetric operation on
the 2 sequences to align. You can take advantage of this by inverting the roles
of the "sample" and the "reference" sequences so 'samp' can be the pattern
and 'ref' the subject. Then you can benefit from the fact that
pairwiseAlignment() is vectorized on the pattern.

But then you say that you would primarily be interested in finding where
a given nucleotide differs from the query. This suggests that you are in fact
looking for a much simpler form of alignment where only mismatches (i.e.
replacements) are allowed but no indels. Also you say that your
sample-sequences may start or end slightly later/earlier.
This means that you can use the matchPattern() function for your problem.

  > m2 <- matchPattern(samp[[2]], ref, max.mismatch=3)
  > m2
    Views on a 19-letter DNAString subject
  subject: ACTTCACCAGCTCCCTGGC
  views:
      start end width
  [1]     1  18    18 [ACTTCACCAGCTCCCTGG]

Then use the mismatch() function to get the position(s) of the mismatche(s):

  > mismatch(samp[[2]], m2)[[1]]
  [1]  6 13

Cheers,
H.

Wolfgang Raffelsberger wrote:
> Dear list,
> 
> I would like to count the occurrence of (mostly single) nucleotide
> polymorphisms from nucleotide sequences.
> I got across the Biostrings package and pairwiseAlignment() that allows
> me to get closer to what I want but
> 1) I noticed that the score produced from pairwiseAlignment() is quite
> different to other implementations of the Needlaman-Wunsch alogorithm
> (eg in EMBOSS)
> 2) the score is not directly the information I 'm looking for since it's
> a mixture of the gaps & mismatches (and I don't see if/how one could
> modify that).
> 
> However, I would primarily be interested in finding where a given
> nucleotide differs from the query (from a pairwise alignment) to some
> statistics on them, ie at which position I get which other element
> instead. Note, that my sample-sequences may start or end slightly
> later/earlier.
> Any suggestions ?
> 
> Sample code might look like (of course, my real sequences are longer ...):
> 
> ref <- DNAString("ACTTCACCAGCTCCCTGGC")
> samp <-
> DNAStringSet(c("CTTCTCCAGCTCCCTGG","ACTTCTCCAGCTACCTGG","TTCACCAGCTCCCTG"))  
> # the 3rd one has no mutations, it's simply shorter ...
>  pairwiseAlignment(ref, samp[[1]], substitutionMatrix = mat, gapOpening
> = -5, gapExtension = -2)
> alignScores <- numeric()
> for(i in 1:3) alignScores[i] <- pairwiseAlignment(ref, samp[[i]],
> substitutionMatrix = mat, gapOpening = -5, gapExtension = -2, scoreOnly=T)
> alignScores     # the 3rd sequence without mismatches gets worst score
> 
> 
> (Based on a previous post on BioC) I just subscribed to
> bioc-sig-sequencing at r-project.org, but I don't know if I don't mange to
> search the previous mail archives (on http://search.gmane.org/) since I
> keep getting (general) Bioconductor messages.
> 
> Thank's in advance,
> Wolfgang
> 
> 
> By the way, if that matters, I'm (still) running R-2.7.2
>> sessionInfo()
> R version 2.7.2 (2008-08-25)
> i386-pc-mingw32
> 
> locale:
> LC_COLLATE=French_France.1252;LC_CTYPE=French_France.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=C;LC_TIME=French_France.1252
> 
> 
> attached base packages:
> [1] stats     graphics  grDevices datasets  tcltk     utils    
> methods   base   
> other attached packages:
> [1] Biostrings_2.8.18 svSocket_0.9-5    svIO_0.9-5       
> R2HTML_1.59       svMisc_0.9-5      svIDE_0.9-5    
> loaded via a namespace (and not attached):
> [1] tools_2.7.2
> 
> 
> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> Wolfgang Raffelsberger, PhD
> Laboratoire de BioInformatique et Génomique Intégratives
> CNRS UMR7104, IGBMC 1 rue Laurent Fries,  67404 Illkirch  Strasbourg, 
> France
> Tel (+33) 388 65 3300         Fax (+33) 388 65 3276
> wolfgang.raffelsberger (at) igbmc.fr
> 
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing