[Bioc-sig-seq] Bioc short read directions

Wed Apr 2 15:31:57 CEST 2008

Hi Stephen --

Some details on PDict are in the Biostrings package, especially

?PDict
?matchPDict
?"matchPDict-inexact"

PDict is a 'dictionary' of reads. It allows for a 'trusted band'. The trusted
band is a (constant-width) region (e.g., the first 12 nucleotides of each read
in the PDict) that the user has special trust in, e.g., as a highly reliable
part of the sequence.

matchPDict / countPDict process the PDict by matching exactly the trusted band,
and allowing mismatches (NOT indels) outside the trusted band. 

A little more under the hood, a PDict is an implementation of an Aho-Corasick
automaton. The entire dictionary is processed in a single pass along the
'subject' (e.g., reference genome); I think the processing time for exact
matches is on the order of the length of the subject + the width (not number of
entries!) of the trusted band. The current strategy to allow for mismatches is
more brute-force, scanning read and subject from exact matches for the trusted
band. Variants on the PDict / matchPDict theme are still being developed, e.g.,
tailored for SNP identification.

Again, this is Herve Page's work.

Hope that helps,

Martin

Quoting Stephen Henderson <s.henderson at ucl.ac.uk>:

> Hi
> This all soungs very promising Im very impressed by the speed of these
> operations.
>  
> Can you give a little more detail about the underlying structure of PDict and
> how it 'matches' sequences?
> 
>  
> ________________________________
> 
> From: bioc-sig-sequencing-bounces at r-project.org on behalf of Martin Morgan
> Sent: Wed 02/04/2008 04:20
> To: bioc-sig-sequencing at r-project.org
> Subject: [Bioc-sig-seq] Bioc short read directions
> 
> 
> 
> Short-readers!
> 
> I want to take this opportunity to provide an update on the directions
> we've initiated for short reads. It would be great to get your feedback,
> and to hear of your own efforts.
> 
> WARNING: This software is in development, and is far from final. In
> particular, functions in the BiostringsCinterfaceDemo package are NOT
> meant to be final or for use in other packages; they're here to
> demonstrate 'proof-of-concept' and to illustrate how users can access
> the Biostrings package from C code. I'll indicate which functions are
> from BiostringsCinterfaceDemo below. Expect a short-read 'base' package
> to materialize in the devel branch of Bioconductor in the not too
> distant future.
> 
> WARNING: The data used to illustrate functionality are not meant to be
> indiciative of the quality of data produced by Solexa; they are
> generally 'first runs' that present a number of interesting challenges
> for interpretation.
> 
> HARDWARE AND SOFTWARE: The following include timing and object size
> measurements. The machine being used is fast, but we're not doing
> anything fancy to, e.g., exploit multiple processors. The machine has a
> very large amount of memory; we used about 10 GB below, looking at three
> different data sets. The following uses the R-2-7-branch (this is
> different from R-devel). The Biostrings and BiostringsCinterfaceDemo
> packages are updated very regularly, so be prepared for broken or
> outdated functions.
> 
> Herve Pages is responsible for the clever code; I am just a scribe.
> 
> Ok, first a convenience function to print out 'size' in megabytes,
> 'cause objects are large!
> 
>  > mb <- function(sz) noquote(sprintf("%.1f MB", sz / (1024^2)))
> 
> 
> * Starters...
> 
> We load the BiostringsCinterfaceDemo, which requires Biostrings. Both of
> these need to be from the 'development' branch of Bioconductor. Both are
> changing rapidly, and should be obtained from svn and updated regularly
> (http://wiki.fhcrc.org/bioc/DeveloperPage,
> http://wiki.fhcrc.org/bioc/SvnHowTo).
> 
>  > library(BiostringsCinterfaceDemo)
> 
> 
> * I/O, DNAStringSet, and alphabetFrequency
> 
> We next read in a fasta file derived from a lane of solexa reads. Here's
> what the data looks like:
> 
>  > readLines(fastaFile, 10)
>   [1] ">5_1_102_368"
>   [2] "TAAGAGGTTTAAATTTTCTTCAGGTCAGTATTCTTT"
>   [3] ">5_1_120_254"
>   [4] "TTAATTCGTAAACAAGCAGTAGTAATTCCTGCTTTT"
>   [5] ">5_1_110_385"
>   [6] "GCTAATTTGCCTACTAACCAAGAACTTGATTTCTTC"
>   [7] ">5_1_118_88"
>   [8] "GTTTGGAGTGATACTGACCGCTCTCGTGGTCGTCGC"
>   [9] ">5_1_113_327"
> [10] "GCTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC"
> 
> This is a single lane from a Solexa training run; the data are not
> filtered, and the run is not meant to be representative in terms of
> quality or other characteristics. The DNA used for the reads is from
> phage phiX-174. Here's how we read it in (countLines and readSolexaFastA
> are in BiostringsCinterfaceDemo)
> 
>  > countLines(fastaFile)
> s_5.fasta
>   18955056
>  > system.time({
> +     seqa <- readSolexaFastA(fastaFile)
> + }, gcFirst=TRUE)
>     user  system elapsed
>    67.48    2.08   69.68
>  > mb(object.size(seqa))
> [1] 397.7 MB
>  > seqa
>    A DNAStringSet instance of length 9477528
>            width seq
>        [1]    36 TAAGAGGTTTAAATTTTCTTCAGGTCAGTATTCTTT
>        [2]    36 TTAATTCGTAAACAAGCAGTAGTAATTCCTGCTTTT
>        [3]    36 GCTAATTTGCCTACTAACCAAGAACTTGATTTCTTC
>        [4]    36 GTTTGGAGTGATACTGACCGCTCTCGTGGTCGTCGC
>        [5]    36 GCTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC
>        [6]    36 TGACCCTCAGCAATCTTAAACTTCTTAGACGAATCA
>        [7]    36 GCTGGTTCTCACTTCTGTTACTCCAGCTTCTTCGGC
>        [8]    36 TTTAGGTGTCTGTAAAACAGGTGCCGAAGAAGCTGG
>        [9]    36 GGTCTGTTGAACACGACCAGAAAACTGGCCTAACGA
>        ...   ... ...
> [9477520]    36 TACGCAGTTTTGCCGTATACTCGTTGTTCTGACTCT
> [9477521]    36 TATACCCCCCCTCCTACTTGTGCTGTTTCTCATGTT
> [9477522]    36 CAGGTTGTTTCTGTTGGTGCTGATATTTCTTTTTTT
> [9477523]    36 GTCTTCCTTGCTTGTCAGATTGGTCGTCTTATTACC
> [9477524]    36 ATACGAAAGACCAGGTATATGCACAAAATGAGTTGC
> [9477525]    36 ACCACAAACGCGCTCGTTTATGCTTGCCTCTATTAC
> [9477526]    36 ------------------------------------
> [9477527]    36 CCAGCAAGGAAGCCAAGATGGGAAAGGTCATGCGGC
> [9477528]    36 CATTGTTGACCACCTACATACCACAGACGAGCACCT
> 
> It takes just over a minute to read in the nearly 9.5 million reads. The
> reads are stored efficiently, without the overhead of R character
> strings. The data structure (a DNAStringSet, from Biostrings and
> therefore stable) will not copy the large data, but instead contains
> 'views' into it.
> 
> A basic question is about the nucleotides present in the reads.
> alphabetFrequency (from Biostrings) scans all the sequences and tallies
> nucleotides
> 
>  > system.time({
> +     alf1 <- alphabetFrequency(seqa, collapse=TRUE, freq=TRUE)
> + }, gcFirst=TRUE)
>     user  system elapsed
>    0.612   0.000   0.612
>  > alf1
>       A      C      G      T      M      R      W      S      Y      K
> 0.2449 0.2119 0.2201 0.3030 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
>       V      H      D      B      N      -
> 0.0000 0.0000 0.0000 0.0000 0.0000 0.0201
> 
> (bases are recored with IUPAC symbols, see ?IUPAC_CODE_MAP). This
> executes very efficiently. A variant produces a matrix with rows
> corresponding to reads and columns to bases.
> 
>  > alf <- alphabetFrequency(seqa, baseOnly=TRUE)
>  > dim(alf)
> [1] 9477528       5
>  > head(alf)
>        A  C  G  T other
> [1,]  9  4  6 17     0
> [2,] 11  6  5 14     0
> [3,] 10  9  4 13     0
> [4,]  4  9 12 11     0
> [5,]  6  6 11 13     0
> [6,] 12 10  4 10     0
> 
> This can be remarkably useful. For instance, to select just the 'clean'
> sequences (those without ambiguous base calls), one can
> 
>  > cleanSeqs <- seqa[alf[,"other"]==0]
>  > length(seqa)
> [1] 9477528
>  > length(cleanSeqs)
> [1] 9207292
> 
> This creates a new DNAStringSet with just the clean sequences. It
> executes very quickly, because the DNAStringSet is a view into the
> original. The memory associated with the reads themselves is not copied.
> here is the alphabetFrequency of the 'clean' reads.
> 
>  > cleanAlf <- alphabetFrequency(cleanSeqs, baseOnly=TRUE)
> 
> Again this is very useful, for instance the distribution of GC content
> among clean reads is
> 
>  > plot(density(rowSums(cleanAlf[,c("G", "C")]) / rowSums(cleanAlf)))
> 
> 
> * PDict, countPDict, matchPDict
> 
> A 'PDict' (defined in Biostrings) is a dictionary-like structure that
> can be used for very efficient exact- and partially-exact matching
> algorithms. To illustrate, we'll use data from about a million reads of
> the Solexa BAC cloning vector. These reads again come from an early run
> on the Solexa instrumentation here, and results should not be taken to
> be representative of performance.
> 
> We read and clean the sequences as above, resulting in
> 
>  > length(cleanSeqs)
> [1] 923680
> 
> We then create a PDict from our DNAStringSet with
> 
>  > system.time({
> +     pDict <- PDict(cleanSeqs)
> + }, gcFirst=TRUE)
>     user  system elapsed
>     1.09    0.00    1.10
>  > pDict
> 923680-pattern constant width PDict object of width 25 (patterns have no
> names)
>  > mb(object.size(pDict))
> [1] 160.4 MB
> 
> This is created quickly. It is a larger object, but the size allows fast
> searches. Here we'll use Biostrings readFASTA to read in the sequence to
> which the data are to be aligned.
> 
>  > bac <- read.DNAStringSet(bacFile, "fasta")[[1]]
> Read 2479 items
>  > length(bac)
> [1] 173427
> 
> This is a BAC clone. We'll match our pDict to the BAC subject, finding
> all EXACT matches;
> 
>  > system.time({
> +     counts <- countPDict(pDict, bac)
> + }, gcFirst=FALSE)
>     user  system elapsed
>    0.200   0.048   0.268
>  > length(counts)
> [1] 923680
>  > table(counts)[1:5]/sum(table(counts))
> counts
>        0       1       2       3       4
> 0.53954 0.42738 0.01136 0.00528 0.00334
> 
> This is very fast, partly because the subject against which the PDict is
> being matched is short. A more realistic use case is to match against a
> genome. The integration with BSgenome packages is very smooth. To match
> our pDict against human chromosome 6 (where the BAC cloning vector comes
> from), we can load the appropriate BSgenome package and chromosome
> 
>  > library(BSgenome.Hsapiens.UCSC.hg18)
> 
>  > Hsapiens[["chr6"]]
>    170899992-letter "DNAString" instance
> seq:
> NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
> 
> (The N's represent the chromsome telomeres, whose seuqences is not
> available). We look for exact matches
> 
>  > system.time({
> +     hcount <- countPDict(pDict, Hsapiens[["chr6"]])
> + }, gcFirst=TRUE)
>     user  system elapsed
>     24.2     0.0    24.2
>  > table(hcount)[1:5] / sum(table(hcount))
> hcount
>        0       1       2       3       4
> 0.50466 0.32996 0.02043 0.00959 0.00761
> 
> About 1/2 the sequences exactly match one or more locations on
> chromosome 6. Some sequences match many times, though the reason (in
> this case, anyway) is not too surprising:
> 
>  > max(hcount)
> [1] 8286
>  > maxIdx = which(hcount==max(hcount))
>  > cleanSeqs[maxIdx]
>    A DNAStringSet instance of length 70
>       width seq
>   [1]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>   [2]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>   [3]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>   [4]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>   [5]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>   [6]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>   [7]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>   [8]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>   [9]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>   ...   ... ...
> [62]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
> [63]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
> [64]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
> [65]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
> [66]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
> [67]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
> [68]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
> [69]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
> [70]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
> 
> (that these are identical can be checked with
> unique(as.character(cleanSeqs[maxIdx])))
> 
> The current implementation of PDict allows for a 'trusted band' of
> nucleotides that need to match exactly, allowing for approximate matches
> in the remaining nucleotides. Here we trust the first 12 bases, and
> allow up to 3 mismatches. Also, we use the more informative matchPDict,
> which allows us to find the positions of all matches
> 
>  > trusted <- PDict(cleanSeqs, tb.end=12)
>  > system.time({
> +     mmMatch <- matchPDict(trusted, Hsapiens[[6]], max.mismatch=3)
> + }, gcFirst=TRUE)
>     user  system elapsed
>   195.58    3.85  199.44
>  > table(cut(countIndex(mmMatch), c(0, 10^(0:5)), right=FALSE))
> 
>          [0,1)        [1,10)      [10,100)   [100,1e+03) [1e+03,1e+04)
>         377233        337806         72983         66322         60818
> [1e+04,1e+05)
>           8518
> 
> Execution time increases as the stringency of the match decreases. PDict
> facilities do not yet incorporate quality scores, but filtering results
> based on quality (qualities are discussed further below) represents a
> natural direction for development.
> 
> 
> 
> * alphabetByCycle
> 
> alphabetByCycle uses a small C function in BiostringsCinterfaceDemo,
> alphabet_by_cycle. It and the read* functions in this package illustrate
> how to access DNAStringSet objects at the C level. alphabetByCycle is a
> matrix that tallies nucleotide use per cycle
> 
>  > system.time({
> +     abc <- alphabetByCycle(seqa)
> + })
>     user  system elapsed
>     2.68    0.00    2.68
> 
> Again this can be quite useful. For instance, we can find out the number
> of bases that were not called, as a function of cycle
> 
>  > abc["-",]
>   [1]  22181 108829 123173 180382 225091 225055 216787 208538 208881
> [10] 213104 148936 142966 141030 148163 178747 204304 211538 211303
> [19] 213722 211167 208021 208715 165441 158359 147110 151462 204781
> [28] 221008 223922 221171 227622 226936 232232 236606 242387 241718
> 
> and the number of 'T' nucleotides as a function of cycle
> 
>  > abc["T",] / colSums(abc[1:4,])
>   [1] 0.286 0.292 0.292 0.284 0.292 0.280 0.293 0.297 0.299 0.287 0.290
> [12] 0.294 0.294 0.299 0.300 0.301 0.304 0.305 0.306 0.309 0.310 0.311
> [23] 0.312 0.316 0.315 0.319 0.320 0.322 0.326 0.328 0.331 0.335 0.339
> [34] 0.342 0.346 0.358
> 
> That's quite a striking increase after cycle 25!
> 
> 
> * Qualities
> 
> Solexa reads have quality scores associated with each base call. These
> are summarized in files formatted like:
> 
>  > readLines(fastqFile, n=8)
> [1] "@HWI-EAS88_1_1_1_1001_499"
> [2] "GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT"
> [3] "+HWI-EAS88_1_1_1_1001_499"
> [4] "]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS"
> [5] "@HWI-EAS88_1_1_1_898_392"
> [6] "GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC"
> [7] "+HWI-EAS88_1_1_1_898_392"
> [8] "]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK"
> 
> A record consists of four lines. The first and third lines are
> identifiers (repeated), the second line is the sequence, the fourth line
> an ASCII character representing the score (']' is good, Z is better than
> A). Here we read a quality file into a data structure defined in
> BiostringsCinterfaceDemo designed to coordinate the sequence, name, and
> quality information.
> 
>  > system.time({
> +     seqq <- readSolexaFastQ(fastqFile)
> + }, gcFirst=TRUE)
>     user  system elapsed
>   17.533   0.417  17.969
>  > mb(object.size(seqq))
> [1] 254.5 MB
>  > seqq
> class: SolexaSequenceQ
> length: 2218237
> 
>  > sequences(seqq)[1:5]
>    A DNAStringSet instance of length 5
>      width seq
> [1]    36 GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT
> [2]    36 GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC
> [3]    36 GCGGTGGTCTATAGTGTTATTAATATCAATTTGGGT
> [4]    36 GTTACCATGATGTTATTTCTTCATTTGGAGGTAAAA
> [5]    36 GTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTT
>  > names(seqq)[1:5]
>    A BStringSet instance of length 5
>      width seq
> [1]    24 HWI-EAS88_1_1_1_1001_499
> [2]    23 HWI-EAS88_1_1_1_898_392
> [3]    23 HWI-EAS88_1_1_1_922_465
> [4]    23 HWI-EAS88_1_1_1_895_493
> [5]    23 HWI-EAS88_1_1_1_953_493
>  > scores(seqq)[1:5]
>    A BStringSet instance of length 5
>      width seq
> [1]    36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS
> [2]    36 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK
> [3]    36 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUXEFLA
> [4]    36 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZTQLOA
> [5]    36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]MJUJVLSS
> 
> The object returned by readSolexaFastQ is an S4 object with three slots.
> Each slot contains a XStringSet, where 'X' is DNA for sequences, and
> 'BString' for names and scores. This represents one way of structuring
> quality data; the S4 class coordinates subsetting, and provides
> (read-only) accessors to the underlying objects. A likely addition to
> this class as it matures is the inclusion of lane-specific phenotype
> (sample) information, much as an ExpressionSet coordinates sample and
> expression values.
> 
> We can gain some basic insight into the sequences as before, e.g.,
> 
>  > abc <- alphabetByCycle(sequences(seqq))
>  > abc["N",]
>   [1]     0     0     0     0     0     0     0     0     0     0     0
> [12]     0  1213  1631  1155  1240   721   418  8503   526  6493   703
> [23] 14999   718  1623   737   243 40704   811   590  1964   961   809
> [34]   910   477   208
> 
> Solexa provides files with quality information after filtering reads
> based on 'purity', a measure that precludes uncertain bases (IUPAC code
> 'N') from a user-specified region (the first 12 cycles, by default).
> 
>  > abc["T",] / colSums(abc[1:4,])
>   [1] 0.298 0.290 0.292 0.290 0.295 0.279 0.292 0.298 0.301 0.293 0.295
> [12] 0.299 0.296 0.293 0.294 0.294 0.293 0.295 0.294 0.297 0.297 0.299
> [23] 0.298 0.303 0.302 0.307 0.312 0.312 0.321 0.331 0.334 0.351 0.361
> [34] 0.374 0.380 0.423
> 
> We can also summarize quality information by cycle, using an alphabet
> that reflects the encoded scores:
> 
>  > alphabet <- sapply(as.raw(32:93), rawToChar)
>  > abcScore <- alphabetByCycle(scores(seqq), alphabet=alphabet)
>  >
>  > rowSums(abcScore)
>                  !        "        #        $        %        &        '
>         0        0        0        0        0        0        0        0
>         (        )        *        +        ,        -        .        /
>         0        0        0        0        0        0        0        0
>         0        1        2        3        4        5        6        7
>         0        0        0        0        0        0        0        0
>         8        9        :        ;        <        =        >        ?
>         0        0        0        0        0        0        0        0
>         @        A        B        C        D        E        F        G
>         0  1367337        0  2035064        0  1006372   819775        0
>         H        I        J        K        L        M        N        O
>   1926112   116395  1614109   356496   517731  1602298   578172  2016009
>         P        Q        R        S        T        U        V        W
>   3043393   401196  2152160  1553511  2149231   254108  3896558   232409
>         X        Y        Z        [       \\        ]
>    706452  4353706  1292309   732210        0 45133419
>  > abcScore[34:62,c(1:4, 33:36)]
>        [,1]    [,2]    [,3]    [,4]   [,5]   [,6]   [,7]   [,8]
> A        1    1788    1798       8 260944 299631 332500 354717
> B        0       0       0       0      0      0      0      0
> C       59   36426   12705     227     39 131268 140461 154643
> D        0       0       0       0      0      0      0      0
> E      133   13391    4033     180 120135      0      0      0
> F        0       0       0       0      0 259549 272238 287988
> G        0       0       0       0      0      0      0      0
> H      272    8815    2933     357 242741 234088 239917 239157
> I        0       0       0       0 116395      0      0      0
> J      472    4678    1656     586      0 195991 198114 187535
> K        5      59      25      11 110052  84345  83744  77353
> L        0       0       0       0 102438 144459 142049 128785
> M       75     171     121      92  93945 119209 116492 103765
> N      653    2992    1232     769  85940  91924  89084  79246
> O     1227    2719    1754    1576 175403 187339 176997 160100
> P    65834   57830   60855   65863  61910      0      0      0
> Q        0       0       0       0 236290      0      0      0
> R     3211    4482    4321    4215      0      0      0      0
> S        0       0       0       0  68378 470434 426641 444948
> T     4444    5373    5868    5637      0      0      0      0
> U        0       0       0       0      0      0      0      0
> V     9146   10270   11922   11547 543627      0      0      0
> W        0       0       0       0      0      0      0      0
> X        0       0       0       0      0      0      0      0
> Y    28363   29080   34093   32873      0      0      0      0
> Z        0       0       0       0      0      0      0      0
> [        0       0       0       0      0      0      0      0
> \\       0       0       0       0      0      0      0      0
> ]  2104342 2040163 2074921 2094296      0      0      0      0
> 
> Output from the last line shows how scores decrease from the first four
> cycles to the last four. Standard R and Biostrings commands can be used
> to ask many interesting questions, such as an overall quality score of
> reads (e.g., summing the scores of individual nucleotides) and the
> relationship between sequence characteristics (e.g., frequency of 'T')
> and read quality.
> 
>  > atgc <- alphabetFrequency(sequences(seqq), baseOnly=TRUE)
>  > qscore <- alphabetFrequency(scores(seqq))
>  > dim(qscore)
> [1] 2218237     256
>  > mb(object.size(qscore))
> [1] 2166.2 MB
> 
>  > quality <- colSums(t(qscore) * 1:ncol(qscore))
>  > plot(density(quality))   # small secondary peak at low quality
>  > scores(seqq)[which(quality<2925)]
>    A BStringSet instance of length 87696
>          width seq
>      [1]    36 PPPPPPPPPPPPPEPPPPPOPPMOOPPOMMMPOOOJ
>      [2]    36 PPPPPPPPPPPPPPPPPPPPHPPPOPPMOPPPNKMA
>      [3]    36 PPPPPPPPPPPPPPPPPOPPPPPPEPMPPPMPOFAF
>      [4]    36 PPPPPPPPPPPPPPPPPPPPOPPPPPPPPPOPOOHK
>      [5]    36 PPPPPPPPPPPPPPPPPPPPPPPOPPOPOPPMKMLF
>      [6]    36 PPPPPPPPPPPPPPPPOPPPPPCPPPPHOOPPOOOO
>      [7]    36 PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPOOKO
>      [8]    36 PPPPPPPPPPPPPPPPPPPOPPPPPPPPJPOPOMAO
>      [9]    36 PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPOOOL
>      ...   ... ...
> [87688]    36 PPPPPPPPMPOPCOOCOOJEOMMCCOEEHCCMAAAA
> [87689]    36 PTYOR]NTVVVJOHJCCPMEHCMHOJCECCCCAACA
> [87690]    36 Y]]]NTVYYT]TPNRPNRYHCTECOHCHHCECHAAC
> [87691]    36 YPRVNVYVPYHORCOCOPEPECJCCHCCJECHAAAA
> [87692]    36 PPPPPPPPPPPPPHPPOCPOPPOPCHMMPPPEKHOO
> [87693]    36 JJHPTTJYV]]]]JJCJCJJTCCOVCMHMCCOAAAA
> [87694]    36 PPPPPPPPPPPPPPPPPOPPPJPPPCPOHHCMOCAF
> [87695]    36 TYR]R]TN]YOEPTNERPPTVTCTVCCCRHOMIFAC
> [87696]    36 YRRYJVVTPHVPPECHPNCMCCJCEECOCCCCAAAA
>  >
>  > t <- atgc[,"T"] / rowSums(atgc[,1:4])
>  > cor(t, quality)
> [1] 0.154
> 
> All of these operations are quick enough to perform in an interactive
> session; the qscore is a large matrix (it can be made smaller by
> choosing bounds that reflect allowable scores, e.g., 32:127), and its
> transposition is relatively expensive.
> 
> A final point to remember is that R stores a matrix m as a vector of
> length nrow(m) * ncol(m). R has an internal limit on the size of a
> vector equal to 2^32-1, so the maximum number of reads whose scores can
> be represented by alphabetFrequency is less than 2^32 / 256, or about 16
> million reads; this number of reads might be approached in a single
> Solexa lane; a simple solution is to divide the DNAStringSet into pieces
> that are processed separately.
> 
> 
> 
> I hope that the forgoing provides some indication of where we stand at
> the moment. Again, it would be great to have feedback, and to hear of
> other efforts. And again, the programming credit goes to Herve Pages.
> 
> Martin
> 
> The obligatory sessionInfo, plus some stats on processing this document
> (referencing, in the end, three different data sets).
> 
>  > sessionInfo()
> R version 2.7.0 alpha (2008-03-28 r44972)
> x86_64-unknown-linux-gnu
> 
> locale:
>
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] tools     stats     graphics  grDevices utils     datasets
> [7] methods   base
> 
> other attached packages:
> [1] BiostringsCinterfaceDemo_0.1.2
> [2] BSgenome.Hsapiens.UCSC.hg18_1.3.2
> [3] BSgenome_1.7.4
> [4] Biostrings_2.7.41
> [5] Biobase_1.99.4
>  > gc()
>             used   (Mb) gc trigger  (Mb) max used  (Mb)
> Ncells 9.37e+05   50.1   4.87e+06   260 1.93e+07  1033
> Vcells 6.88e+08 5252.6   1.31e+09 10031 1.31e+09  9998
>  > proc.time()
>     user  system elapsed
>    356.7    16.6   378.5
> 
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> 
> 
> 
> **********************************************************************
> This email and any files transmitted with it are confi...{{dropped:12}}