[Bioc-sig-seq] SNP counting (Biostrings ?) question
Wolfgang Raffelsberger
wraff at titus.u-strasbg.fr
Thu Oct 23 19:24:48 CEST 2008
Dear list,
I would like to count the occurrences of (mostly single) nucleotide
polymorphisms in nucleotide sequences.
I came across the Biostrings package and pairwiseAlignment(), which gets
me closer to what I want, but:
1) I noticed that the score produced by pairwiseAlignment() is quite
different from that of other implementations of the Needleman-Wunsch
algorithm (e.g. in EMBOSS)
2) the score is not directly the information I'm looking for, since it
is a mixture of gaps & mismatches (and I don't see if/how one could
modify that).
However, I would primarily be interested in finding where a given
nucleotide differs from the query (in a pairwise alignment), in order to
compute some statistics on these differences, i.e. at which position I
get which other element instead. Note that my sample sequences may start
or end slightly later/earlier.
Any suggestions?
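The per-position tabulation described above can be sketched in base R,
independently of Biostrings. This is only a minimal sketch under strong
assumptions: the helper name find_substitutions is hypothetical, it does
an ungapped sliding comparison rather than a Needleman-Wunsch alignment,
and it assumes each sample contains substitutions only (no indels) and is
no longer than the reference.

```r
# Hypothetical helper (not part of Biostrings): ungapped comparison of a
# sample against a reference, trying every start offset in the reference.
# ASSUMPTION: substitutions only, and the sample is not longer than ref.
find_substitutions <- function(ref, samp) {
  r <- strsplit(ref, "")[[1]]
  s <- strsplit(samp, "")[[1]]
  stopifnot(length(s) <= length(r))
  offsets <- 0:(length(r) - length(s))
  # number of mismatching positions for each candidate offset
  n_mis <- sapply(offsets, function(o) sum(r[o + seq_along(s)] != s))
  best <- offsets[which.min(n_mis)]
  ref_pos <- best + seq_along(s)
  differs <- r[ref_pos] != s
  data.frame(ref_pos   = ref_pos[differs],
             ref_base  = r[ref_pos][differs],
             samp_base = s[differs],
             stringsAsFactors = FALSE)
}

ref  <- "ACTTCACCAGCTCCCTGGC"
samp <- c("CTTCTCCAGCTCCCTGG", "ACTTCTCCAGCTACCTGG", "TTCACCAGCTCCCTG")

# one table of differences per sample, then stacked into a single table
subs     <- lapply(samp, find_substitutions, ref = ref)
all_subs <- do.call(rbind, subs)

# count each substitution, labelled as <ref base><ref position><sample base>
counts <- table(paste0(all_subs$ref_base, all_subs$ref_pos,
                       all_subs$samp_base))
counts
```

For the three example sequences in this block the table reports A6T
twice and C13A once, and the shorter third sequence contributes no rows.
Sequences with insertions or deletions would need a real alignment first.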
Sample code might look like (of course, my real sequences are longer ...):
library(Biostrings)
ref <- DNAString("ACTTCACCAGCTCCCTGGC")
samp <- DNAStringSet(c("CTTCTCCAGCTCCCTGG", "ACTTCTCCAGCTACCTGG",
                       "TTCACCAGCTCCCTG"))
# the 3rd one has no mutations, it's simply shorter ...
# 'mat' below is an assumed example scoring (match = 1, mismatch = -3)
mat <- nucleotideSubstitutionMatrix(match = 1, mismatch = -3, baseOnly = TRUE)
pairwiseAlignment(ref, samp[[1]], substitutionMatrix = mat,
                  gapOpening = -5, gapExtension = -2)
alignScores <- numeric(length(samp))
for (i in seq_along(samp)) {
  alignScores[i] <- pairwiseAlignment(ref, samp[[i]], substitutionMatrix = mat,
                                      gapOpening = -5, gapExtension = -2,
                                      scoreOnly = TRUE)
}
alignScores  # the 3rd sequence without mismatches gets worst score
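One way to see why a shorter, mutation-free sequence can get the worst
global alignment score is to split the score into its parts. The sketch
below is a hypothetical base-R helper (score_parts is not a Biostrings
function). The aligned strings are written by hand, match = 1 and
mismatch = -3 are assumed scoring values, and the gap convention (a gap
of length n costs gap_open + n * gap_extend) is an assumption that should
be checked against the documentation of each tool being compared.

```r
# Hypothetical helper (not part of Biostrings): split a global alignment
# score into match, mismatch and gap components. Both inputs are aligned
# strings of equal length that use "-" for gap positions.
score_parts <- function(aligned_ref, aligned_samp,
                        match = 1, mismatch = -3,
                        gap_open = 5, gap_extend = 2) {
  r <- strsplit(aligned_ref, "")[[1]]
  s <- strsplit(aligned_samp, "")[[1]]
  stopifnot(length(r) == length(s))
  gap <- r == "-" | s == "-"
  n_match    <- sum(!gap & r == s)
  n_mismatch <- sum(!gap & r != s)
  # a gap run starts where a gap position is not preceded by another one
  # (simplification: adjacent gaps in different sequences count as one run)
  n_gap_runs <- sum(gap & !c(FALSE, head(gap, -1)))
  n_gap_pos  <- sum(gap)
  # ASSUMPTION: a gap of length n costs gap_open + n * gap_extend
  gap_cost <- n_gap_runs * gap_open + n_gap_pos * gap_extend
  c(matches = n_match, mismatches = n_mismatch,
    gap_runs = n_gap_runs, gap_positions = n_gap_pos,
    score = n_match * match + n_mismatch * mismatch - gap_cost)
}

# 3rd sample: no mismatches, but two end gaps of length 2
p3 <- score_parts("ACTTCACCAGCTCCCTGGC", "--TTCACCAGCTCCCTG--")
# 2nd sample: two mismatches, one end gap of length 1
p2 <- score_parts("ACTTCACCAGCTCCCTGGC", "ACTTCTCCAGCTACCTGG-")
rbind(third = p3, second = p2)
```

Under these assumptions the third sequence scores 15 - 18 = -3, because
its two end gaps are penalized, while the second scores 16 - 6 - 7 = 3
despite its two mismatches. A different gap-cost convention between two
tools would shift these totals as well.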
Following a previous post on BioC, I just subscribed to
bioc-sig-sequencing at r-project.org, but I have not managed to search
its mail archives (on http://search.gmane.org/), since I keep getting
(general) Bioconductor messages.
Thanks in advance,
Wolfgang
By the way, if that matters, I'm (still) running R-2.7.2
> sessionInfo()
R version 2.7.2 (2008-08-25)
i386-pc-mingw32
locale:
LC_COLLATE=French_France.1252;LC_CTYPE=French_France.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=C;LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices datasets tcltk utils methods base
other attached packages:
[1] Biostrings_2.8.18 svSocket_0.9-5 svIO_0.9-5 R2HTML_1.59 svMisc_0.9-5 svIDE_0.9-5
loaded via a namespace (and not attached):
[1] tools_2.7.2
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Wolfgang Raffelsberger, PhD
Laboratoire de BioInformatique et Génomique Intégratives
CNRS UMR7104, IGBMC
1 rue Laurent Fries, 67404 Illkirch Strasbourg, France
Tel (+33) 388 65 3300 Fax (+33) 388 65 3276
wolfgang.raffelsberger (at) igbmc.fr