[Bioc-sig-seq] Bioc short read directions

Martin Morgan mtmorgan at fhcrc.org
Wed Apr 2 05:20:09 CEST 2008


Short-readers!

I want to take this opportunity to provide an update on the directions 
we've initiated for short reads. It would be great to get your feedback, 
and to hear of your own efforts.

WARNING: This software is in development, and is far from final. In 
particular, functions in the BiostringsCinterfaceDemo package are NOT 
meant to be final or for use in other packages; they're here to 
demonstrate 'proof-of-concept' and to illustrate how users can access 
the Biostrings package from C code. I'll indicate which functions are 
from BiostringsCinterfaceDemo below. Expect a short-read 'base' package 
to materialize in the devel branch of Bioconductor in the not too 
distant future.

WARNING: The data used to illustrate functionality are not meant to be 
indiciative of the quality of data produced by Solexa; they are 
generally 'first runs' that present a number of interesting challenges 
for interpretation.

HARDWARE AND SOFTWARE: The following include timing and object size 
measurements. The machine being used is fast, but we're not doing 
anything fancy to, e.g., exploit multiple processors. The machine has a 
very large amount of memory; we used about 10 GB below, looking at three 
different data sets. The following uses the R-2-7-branch (this is 
different from R-devel). The Biostrings and BiostringsCinterfaceDemo 
packages are updated very regularly, so be prepared for broken or 
outdated functions.

Herve Pages is responsible for the clever code; I am just a scribe.

Ok, first a convenience function to print out 'size' in megabytes, 
'cause objects are large!

 > mb <- function(sz) noquote(sprintf("%.1f MB", sz / (1024^2)))


* Starters...

We load the BiostringsCinterfaceDemo, which requires Biostrings. Both of 
these need to be from the 'development' branch of Bioconductor. Both are 
changing rapidly, and should be obtained from svn and updated regularly 
(http://wiki.fhcrc.org/bioc/DeveloperPage, 
http://wiki.fhcrc.org/bioc/SvnHowTo).

 > library(BiostringsCinterfaceDemo)


* I/O, DNAStringSet, and alphabetFrequency

We next read in a fasta file derived from a lane of solexa reads. Here's 
what the data looks like:

 > readLines(fastaFile, 10)
  [1] ">5_1_102_368"
  [2] "TAAGAGGTTTAAATTTTCTTCAGGTCAGTATTCTTT"
  [3] ">5_1_120_254"
  [4] "TTAATTCGTAAACAAGCAGTAGTAATTCCTGCTTTT"
  [5] ">5_1_110_385"
  [6] "GCTAATTTGCCTACTAACCAAGAACTTGATTTCTTC"
  [7] ">5_1_118_88"
  [8] "GTTTGGAGTGATACTGACCGCTCTCGTGGTCGTCGC"
  [9] ">5_1_113_327"
[10] "GCTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC"

This is a single lane from a Solexa training run; the data are not 
filtered, and the run is not meant to be representative in terms of 
quality or other characteristics. The DNA used for the reads is from 
phage phiX-174. Here's how we read it in (countLines and readSolexaFastA 
are in BiostringsCinterfaceDemo)

 > countLines(fastaFile)
s_5.fasta
  18955056
 > system.time({
+     seqa <- readSolexaFastA(fastaFile)
+ }, gcFirst=TRUE)
    user  system elapsed
   67.48    2.08   69.68
 > mb(object.size(seqa))
[1] 397.7 MB
 > seqa
   A DNAStringSet instance of length 9477528
           width seq
       [1]    36 TAAGAGGTTTAAATTTTCTTCAGGTCAGTATTCTTT
       [2]    36 TTAATTCGTAAACAAGCAGTAGTAATTCCTGCTTTT
       [3]    36 GCTAATTTGCCTACTAACCAAGAACTTGATTTCTTC
       [4]    36 GTTTGGAGTGATACTGACCGCTCTCGTGGTCGTCGC
       [5]    36 GCTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC
       [6]    36 TGACCCTCAGCAATCTTAAACTTCTTAGACGAATCA
       [7]    36 GCTGGTTCTCACTTCTGTTACTCCAGCTTCTTCGGC
       [8]    36 TTTAGGTGTCTGTAAAACAGGTGCCGAAGAAGCTGG
       [9]    36 GGTCTGTTGAACACGACCAGAAAACTGGCCTAACGA
       ...   ... ...
[9477520]    36 TACGCAGTTTTGCCGTATACTCGTTGTTCTGACTCT
[9477521]    36 TATACCCCCCCTCCTACTTGTGCTGTTTCTCATGTT
[9477522]    36 CAGGTTGTTTCTGTTGGTGCTGATATTTCTTTTTTT
[9477523]    36 GTCTTCCTTGCTTGTCAGATTGGTCGTCTTATTACC
[9477524]    36 ATACGAAAGACCAGGTATATGCACAAAATGAGTTGC
[9477525]    36 ACCACAAACGCGCTCGTTTATGCTTGCCTCTATTAC
[9477526]    36 ------------------------------------
[9477527]    36 CCAGCAAGGAAGCCAAGATGGGAAAGGTCATGCGGC
[9477528]    36 CATTGTTGACCACCTACATACCACAGACGAGCACCT

It takes just over a minute to read in the nearly 9.5 million reads. The 
reads are stored efficiently, without the overhead of R character 
strings. The data structure (a DNAStringSet, from Biostrings and 
therefore stable) will not copy the large data, but instead contains 
'views' into it.

A basic question is about the nucleotides present in the reads. 
alphabetFrequency (from Biostrings) scans all the sequences and tallies 
nucleotides

 > system.time({
+     alf1 <- alphabetFrequency(seqa, collapse=TRUE, freq=TRUE)
+ }, gcFirst=TRUE)
    user  system elapsed
   0.612   0.000   0.612
 > alf1
      A      C      G      T      M      R      W      S      Y      K
0.2449 0.2119 0.2201 0.3030 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
      V      H      D      B      N      -
0.0000 0.0000 0.0000 0.0000 0.0000 0.0201

(bases are recored with IUPAC symbols, see ?IUPAC_CODE_MAP). This 
executes very efficiently. A variant produces a matrix with rows 
corresponding to reads and columns to bases.

 > alf <- alphabetFrequency(seqa, baseOnly=TRUE)
 > dim(alf)
[1] 9477528       5
 > head(alf)
       A  C  G  T other
[1,]  9  4  6 17     0
[2,] 11  6  5 14     0
[3,] 10  9  4 13     0
[4,]  4  9 12 11     0
[5,]  6  6 11 13     0
[6,] 12 10  4 10     0

This can be remarkably useful. For instance, to select just the 'clean' 
sequences (those without ambiguous base calls), one can

 > cleanSeqs <- seqa[alf[,"other"]==0]
 > length(seqa)
[1] 9477528
 > length(cleanSeqs)
[1] 9207292

This creates a new DNAStringSet with just the clean sequences. It 
executes very quickly, because the DNAStringSet is a view into the 
original. The memory associated with the reads themselves is not copied. 
here is the alphabetFrequency of the 'clean' reads.

 > cleanAlf <- alphabetFrequency(cleanSeqs, baseOnly=TRUE)

Again this is very useful, for instance the distribution of GC content 
among clean reads is

 > plot(density(rowSums(cleanAlf[,c("G", "C")]) / rowSums(cleanAlf)))


* PDict, countPDict, matchPDict

A 'PDict' (defined in Biostrings) is a dictionary-like structure that 
can be used for very efficient exact- and partially-exact matching 
algorithms. To illustrate, we'll use data from about a million reads of 
the Solexa BAC cloning vector. These reads again come from an early run 
on the Solexa instrumentation here, and results should not be taken to 
be representative of performance.

We read and clean the sequences as above, resulting in

 > length(cleanSeqs)
[1] 923680

We then create a PDict from our DNAStringSet with

 > system.time({
+     pDict <- PDict(cleanSeqs)
+ }, gcFirst=TRUE)
    user  system elapsed
    1.09    0.00    1.10
 > pDict
923680-pattern constant width PDict object of width 25 (patterns have no 
names)
 > mb(object.size(pDict))
[1] 160.4 MB

This is created quickly. It is a larger object, but the size allows fast 
searches. Here we'll use Biostrings readFASTA to read in the sequence to 
which the data are to be aligned.

 > bac <- read.DNAStringSet(bacFile, "fasta")[[1]]
Read 2479 items
 > length(bac)
[1] 173427

This is a BAC clone. We'll match our pDict to the BAC subject, finding 
all EXACT matches;

 > system.time({
+     counts <- countPDict(pDict, bac)
+ }, gcFirst=FALSE)
    user  system elapsed
   0.200   0.048   0.268
 > length(counts)
[1] 923680
 > table(counts)[1:5]/sum(table(counts))
counts
       0       1       2       3       4
0.53954 0.42738 0.01136 0.00528 0.00334

This is very fast, partly because the subject against which the PDict is 
being matched is short. A more realistic use case is to match against a 
genome. The integration with BSgenome packages is very smooth. To match 
our pDict against human chromosome 6 (where the BAC cloning vector comes 
from), we can load the appropriate BSgenome package and chromosome

 > library(BSgenome.Hsapiens.UCSC.hg18)

 > Hsapiens[["chr6"]]
   170899992-letter "DNAString" instance
seq: 
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

(The N's represent the chromsome telomeres, whose seuqences is not 
available). We look for exact matches

 > system.time({
+     hcount <- countPDict(pDict, Hsapiens[["chr6"]])
+ }, gcFirst=TRUE)
    user  system elapsed
    24.2     0.0    24.2
 > table(hcount)[1:5] / sum(table(hcount))
hcount
       0       1       2       3       4
0.50466 0.32996 0.02043 0.00959 0.00761

About 1/2 the sequences exactly match one or more locations on 
chromosome 6. Some sequences match many times, though the reason (in 
this case, anyway) is not too surprising:

 > max(hcount)
[1] 8286
 > maxIdx = which(hcount==max(hcount))
 > cleanSeqs[maxIdx]
   A DNAStringSet instance of length 70
      width seq
  [1]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
  [2]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
  [3]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
  [4]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
  [5]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
  [6]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
  [7]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
  [8]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
  [9]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
  ...   ... ...
[62]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
[63]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
[64]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
[65]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
[66]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
[67]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
[68]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
[69]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
[70]    25 TTTTTTTTTTTTTTTTTTTTTTTTT

(that these are identical can be checked with 
unique(as.character(cleanSeqs[maxIdx])))

The current implementation of PDict allows for a 'trusted band' of 
nucleotides that need to match exactly, allowing for approximate matches 
in the remaining nucleotides. Here we trust the first 12 bases, and 
allow up to 3 mismatches. Also, we use the more informative matchPDict, 
which allows us to find the positions of all matches

 > trusted <- PDict(cleanSeqs, tb.end=12)
 > system.time({
+     mmMatch <- matchPDict(trusted, Hsapiens[[6]], max.mismatch=3)
+ }, gcFirst=TRUE)
    user  system elapsed
  195.58    3.85  199.44
 > table(cut(countIndex(mmMatch), c(0, 10^(0:5)), right=FALSE))

         [0,1)        [1,10)      [10,100)   [100,1e+03) [1e+03,1e+04)
        377233        337806         72983         66322         60818
[1e+04,1e+05)
          8518

Execution time increases as the stringency of the match decreases. PDict 
facilities do not yet incorporate quality scores, but filtering results 
based on quality (qualities are discussed further below) represents a 
natural direction for development.



* alphabetByCycle

alphabetByCycle uses a small C function in BiostringsCinterfaceDemo, 
alphabet_by_cycle. It and the read* functions in this package illustrate 
how to access DNAStringSet objects at the C level. alphabetByCycle is a 
matrix that tallies nucleotide use per cycle

 > system.time({
+     abc <- alphabetByCycle(seqa)
+ })
    user  system elapsed
    2.68    0.00    2.68

Again this can be quite useful. For instance, we can find out the number 
of bases that were not called, as a function of cycle

 > abc["-",]
  [1]  22181 108829 123173 180382 225091 225055 216787 208538 208881
[10] 213104 148936 142966 141030 148163 178747 204304 211538 211303
[19] 213722 211167 208021 208715 165441 158359 147110 151462 204781
[28] 221008 223922 221171 227622 226936 232232 236606 242387 241718

and the number of 'T' nucleotides as a function of cycle

 > abc["T",] / colSums(abc[1:4,])
  [1] 0.286 0.292 0.292 0.284 0.292 0.280 0.293 0.297 0.299 0.287 0.290
[12] 0.294 0.294 0.299 0.300 0.301 0.304 0.305 0.306 0.309 0.310 0.311
[23] 0.312 0.316 0.315 0.319 0.320 0.322 0.326 0.328 0.331 0.335 0.339
[34] 0.342 0.346 0.358

That's quite a striking increase after cycle 25!


* Qualities

Solexa reads have quality scores associated with each base call. These 
are summarized in files formatted like:

 > readLines(fastqFile, n=8)
[1] "@HWI-EAS88_1_1_1_1001_499"
[2] "GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT"
[3] "+HWI-EAS88_1_1_1_1001_499"
[4] "]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS"
[5] "@HWI-EAS88_1_1_1_898_392"
[6] "GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC"
[7] "+HWI-EAS88_1_1_1_898_392"
[8] "]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK"

A record consists of four lines. The first and third lines are 
identifiers (repeated), the second line is the sequence, the fourth line 
an ASCII character representing the score (']' is good, Z is better than 
A). Here we read a quality file into a data structure defined in 
BiostringsCinterfaceDemo designed to coordinate the sequence, name, and 
quality information.

 > system.time({
+     seqq <- readSolexaFastQ(fastqFile)
+ }, gcFirst=TRUE)
    user  system elapsed
  17.533   0.417  17.969
 > mb(object.size(seqq))
[1] 254.5 MB
 > seqq
class: SolexaSequenceQ
length: 2218237

 > sequences(seqq)[1:5]
   A DNAStringSet instance of length 5
     width seq
[1]    36 GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT
[2]    36 GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC
[3]    36 GCGGTGGTCTATAGTGTTATTAATATCAATTTGGGT
[4]    36 GTTACCATGATGTTATTTCTTCATTTGGAGGTAAAA
[5]    36 GTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTT
 > names(seqq)[1:5]
   A BStringSet instance of length 5
     width seq
[1]    24 HWI-EAS88_1_1_1_1001_499
[2]    23 HWI-EAS88_1_1_1_898_392
[3]    23 HWI-EAS88_1_1_1_922_465
[4]    23 HWI-EAS88_1_1_1_895_493
[5]    23 HWI-EAS88_1_1_1_953_493
 > scores(seqq)[1:5]
   A BStringSet instance of length 5
     width seq
[1]    36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS
[2]    36 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK
[3]    36 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUXEFLA
[4]    36 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZTQLOA
[5]    36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]MJUJVLSS

The object returned by readSolexaFastQ is an S4 object with three slots. 
Each slot contains a XStringSet, where 'X' is DNA for sequences, and 
'BString' for names and scores. This represents one way of structuring 
quality data; the S4 class coordinates subsetting, and provides 
(read-only) accessors to the underlying objects. A likely addition to 
this class as it matures is the inclusion of lane-specific phenotype 
(sample) information, much as an ExpressionSet coordinates sample and 
expression values.

We can gain some basic insight into the sequences as before, e.g.,

 > abc <- alphabetByCycle(sequences(seqq))
 > abc["N",]
  [1]     0     0     0     0     0     0     0     0     0     0     0
[12]     0  1213  1631  1155  1240   721   418  8503   526  6493   703
[23] 14999   718  1623   737   243 40704   811   590  1964   961   809
[34]   910   477   208

Solexa provides files with quality information after filtering reads 
based on 'purity', a measure that precludes uncertain bases (IUPAC code 
'N') from a user-specified region (the first 12 cycles, by default).

 > abc["T",] / colSums(abc[1:4,])
  [1] 0.298 0.290 0.292 0.290 0.295 0.279 0.292 0.298 0.301 0.293 0.295
[12] 0.299 0.296 0.293 0.294 0.294 0.293 0.295 0.294 0.297 0.297 0.299
[23] 0.298 0.303 0.302 0.307 0.312 0.312 0.321 0.331 0.334 0.351 0.361
[34] 0.374 0.380 0.423

We can also summarize quality information by cycle, using an alphabet 
that reflects the encoded scores:

 > alphabet <- sapply(as.raw(32:93), rawToChar)
 > abcScore <- alphabetByCycle(scores(seqq), alphabet=alphabet)
 >
 > rowSums(abcScore)
                 !        "        #        $        %        &        '
        0        0        0        0        0        0        0        0
        (        )        *        +        ,        -        .        /
        0        0        0        0        0        0        0        0
        0        1        2        3        4        5        6        7
        0        0        0        0        0        0        0        0
        8        9        :        ;        <        =        >        ?
        0        0        0        0        0        0        0        0
        @        A        B        C        D        E        F        G
        0  1367337        0  2035064        0  1006372   819775        0
        H        I        J        K        L        M        N        O
  1926112   116395  1614109   356496   517731  1602298   578172  2016009
        P        Q        R        S        T        U        V        W
  3043393   401196  2152160  1553511  2149231   254108  3896558   232409
        X        Y        Z        [       \\        ]
   706452  4353706  1292309   732210        0 45133419
 > abcScore[34:62,c(1:4, 33:36)]
       [,1]    [,2]    [,3]    [,4]   [,5]   [,6]   [,7]   [,8]
A        1    1788    1798       8 260944 299631 332500 354717
B        0       0       0       0      0      0      0      0
C       59   36426   12705     227     39 131268 140461 154643
D        0       0       0       0      0      0      0      0
E      133   13391    4033     180 120135      0      0      0
F        0       0       0       0      0 259549 272238 287988
G        0       0       0       0      0      0      0      0
H      272    8815    2933     357 242741 234088 239917 239157
I        0       0       0       0 116395      0      0      0
J      472    4678    1656     586      0 195991 198114 187535
K        5      59      25      11 110052  84345  83744  77353
L        0       0       0       0 102438 144459 142049 128785
M       75     171     121      92  93945 119209 116492 103765
N      653    2992    1232     769  85940  91924  89084  79246
O     1227    2719    1754    1576 175403 187339 176997 160100
P    65834   57830   60855   65863  61910      0      0      0
Q        0       0       0       0 236290      0      0      0
R     3211    4482    4321    4215      0      0      0      0
S        0       0       0       0  68378 470434 426641 444948
T     4444    5373    5868    5637      0      0      0      0
U        0       0       0       0      0      0      0      0
V     9146   10270   11922   11547 543627      0      0      0
W        0       0       0       0      0      0      0      0
X        0       0       0       0      0      0      0      0
Y    28363   29080   34093   32873      0      0      0      0
Z        0       0       0       0      0      0      0      0
[        0       0       0       0      0      0      0      0
\\       0       0       0       0      0      0      0      0
]  2104342 2040163 2074921 2094296      0      0      0      0

Output from the last line shows how scores decrease from the first four 
cycles to the last four. Standard R and Biostrings commands can be used 
to ask many interesting questions, such as an overall quality score of 
reads (e.g., summing the scores of individual nucleotides) and the 
relationship between sequence characteristics (e.g., frequency of 'T') 
and read quality.

 > atgc <- alphabetFrequency(sequences(seqq), baseOnly=TRUE)
 > qscore <- alphabetFrequency(scores(seqq))
 > dim(qscore)
[1] 2218237     256
 > mb(object.size(qscore))
[1] 2166.2 MB

 > quality <- colSums(t(qscore) * 1:ncol(qscore))
 > plot(density(quality))   # small secondary peak at low quality
 > scores(seqq)[which(quality<2925)]
   A BStringSet instance of length 87696
         width seq
     [1]    36 PPPPPPPPPPPPPEPPPPPOPPMOOPPOMMMPOOOJ
     [2]    36 PPPPPPPPPPPPPPPPPPPPHPPPOPPMOPPPNKMA
     [3]    36 PPPPPPPPPPPPPPPPPOPPPPPPEPMPPPMPOFAF
     [4]    36 PPPPPPPPPPPPPPPPPPPPOPPPPPPPPPOPOOHK
     [5]    36 PPPPPPPPPPPPPPPPPPPPPPPOPPOPOPPMKMLF
     [6]    36 PPPPPPPPPPPPPPPPOPPPPPCPPPPHOOPPOOOO
     [7]    36 PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPOOKO
     [8]    36 PPPPPPPPPPPPPPPPPPPOPPPPPPPPJPOPOMAO
     [9]    36 PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPOOOL
     ...   ... ...
[87688]    36 PPPPPPPPMPOPCOOCOOJEOMMCCOEEHCCMAAAA
[87689]    36 PTYOR]NTVVVJOHJCCPMEHCMHOJCECCCCAACA
[87690]    36 Y]]]NTVYYT]TPNRPNRYHCTECOHCHHCECHAAC
[87691]    36 YPRVNVYVPYHORCOCOPEPECJCCHCCJECHAAAA
[87692]    36 PPPPPPPPPPPPPHPPOCPOPPOPCHMMPPPEKHOO
[87693]    36 JJHPTTJYV]]]]JJCJCJJTCCOVCMHMCCOAAAA
[87694]    36 PPPPPPPPPPPPPPPPPOPPPJPPPCPOHHCMOCAF
[87695]    36 TYR]R]TN]YOEPTNERPPTVTCTVCCCRHOMIFAC
[87696]    36 YRRYJVVTPHVPPECHPNCMCCJCEECOCCCCAAAA
 >
 > t <- atgc[,"T"] / rowSums(atgc[,1:4])
 > cor(t, quality)
[1] 0.154

All of these operations are quick enough to perform in an interactive 
session; the qscore is a large matrix (it can be made smaller by 
choosing bounds that reflect allowable scores, e.g., 32:127), and its 
transposition is relatively expensive.

A final point to remember is that R stores a matrix m as a vector of 
length nrow(m) * ncol(m). R has an internal limit on the size of a 
vector equal to 2^32-1, so the maximum number of reads whose scores can 
be represented by alphabetFrequency is less than 2^32 / 256, or about 16 
million reads; this number of reads might be approached in a single 
Solexa lane; a simple solution is to divide the DNAStringSet into pieces 
that are processed separately.



I hope that the forgoing provides some indication of where we stand at 
the moment. Again, it would be great to have feedback, and to hear of 
other efforts. And again, the programming credit goes to Herve Pages.

Martin

The obligatory sessionInfo, plus some stats on processing this document 
(referencing, in the end, three different data sets).

 > sessionInfo()
R version 2.7.0 alpha (2008-03-28 r44972)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] tools     stats     graphics  grDevices utils     datasets
[7] methods   base

other attached packages:
[1] BiostringsCinterfaceDemo_0.1.2
[2] BSgenome.Hsapiens.UCSC.hg18_1.3.2
[3] BSgenome_1.7.4
[4] Biostrings_2.7.41
[5] Biobase_1.99.4
 > gc()
            used   (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 9.37e+05   50.1   4.87e+06   260 1.93e+07  1033
Vcells 6.88e+08 5252.6   1.31e+09 10031 1.31e+09  9998
 > proc.time()
    user  system elapsed
   356.7    16.6   378.5



More information about the Bioc-sig-sequencing mailing list