[BioC] FastqSampler not sampling correctly in ShortRead
Martin Morgan
mtmorgan at fhcrc.org
Wed Apr 25 13:43:46 CEST 2012
Thanks Marcus, this serious bug was introduced in ShortRead 1.12.3 /
1.13.9 on Dec 4/5 2011. It is fixed in ShortRead 1.4.3 / 1.5.4. Records
2:n were actually records 2:n in the file.
FWIW in case there is confusion the intended way to draw independent
samples is just
f = open(FastqSample(fl, 50)
s1 = yield(f); s2 = yield(f)
close(f)
Martin
On 04/25/2012 12:50 AM, Marcus Davy wrote:
> Hi,
> there appears to be a problem in FastqSampler() which seems to sample the
> first read at random, but then the next n-1 reads
> are always the same, so the next sample obtained with yield() is not
> independent. Is this a bug, or my code implementation wrong?
>
> ## From example(FastqSampler)
> sp<- SolexaPath(system.file('extdata', package='ShortRead'))
> fl<- file.path(analysisPath(sp), "s_1_sequence.txt")
>
> f<- FastqFile(fl)
> rfq<- readFastq(f)
>
> set.seed(1)
> f<- FastqSampler(fl, 50, readerBlockSize=1e8, verbose=TRUE)
> s1<- yield(f) # sample of size n=50
> set.seed(2)
> f<- FastqSampler(fl, 50, readerBlockSize=1e8, verbose=TRUE)
> s2<- yield(f) # independent sample of size 50
>
>
> ## Intersection length between samples always 50-1
> length(intersect(id(s1), id(s2)))
>
> ## First read is different, the next 2:n reads are the same
> head(sread(s1))
> head(sread(s2))
> close(f)
>
>> head(sread(s1))
> A DNAStringSet instance of length 6
> width seq
> [1] 36 GTCTGGAAACGTACGGATTGTTCAGTAACTTTACTC # Different
> [2] 36 GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC #Same
> [3] 36 GCGGTGGTCTATAGTGTTATTAATATCAATTTGGGT # ...
> [4] 36 GTTACCATGATGTTATTTCTTCATTTGGAGGTAAAA # ...
> [5] 36 GTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTT # ...
> [6] 36 GTTCTCTAAAAACCATTTTTCGTCCCCTTCGGGGCG Same
>> head(sread(s2))
> A DNAStringSet instance of length 6
> width seq
> [1] 36 GTTTACGAATTAAATCGAAGTGGACTGCTGGGGGGA # Different
> [2] 36 GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC #Same
> [3] 36 GCGGTGGTCTATAGTGTTATTAATATCAATTTGGGT # ...
> [4] 36 GTTACCATGATGTTATTTCTTCATTTGGAGGTAAAA # ...
> [5] 36 GTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTT # ...
> [6] 36 GTTCTCTAAAAACCATTTTTCGTCCCCTTCGGGGCG #Same
>
> Marcus
>
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] ShortRead_1.14.2 latticeExtra_0.6-19 RColorBrewer_1.0-5
> [4] Rsamtools_1.8.4 lattice_0.20-6 Biostrings_2.24.1
> [7] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3
> stats4_2.15.0
> [6] tools_2.15.0 zlibbioc_1.2.0
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
More information about the Bioconductor
mailing list