[Bioc-sig-seq] ShortRead, feature request (if not a bug report)
Ivan Gregoretti
ivangreg at gmail.com
Wed May 18 00:52:38 CEST 2011
Hi Martin,
I'll go for the readLines route. Anything is better than the gzip
approach I showed before.
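
For files where a sequence wraps over several lines, the two-lines-per-record indexing in Martin's example would not apply directly. A minimal base-R sketch of the readLines route for that case could look like the following. It assumes the file fits in memory and that every record starts with a ">" header line; read_fasta_gz is a made-up helper name, not a ShortRead or Biostrings function.

```r
# Hypothetical helper (not part of ShortRead or Biostrings): read a gzipped
# FASTA file with base R only, allowing sequences to wrap over several lines.
read_fasta_gz <- function(path) {
  con <- gzfile(path, open = "rt")   # decompresses on the fly
  on.exit(close(con))
  lines <- readLines(con)
  lines <- lines[nzchar(lines)]      # drop blank lines

  is_header <- startsWith(lines, ">")
  if (length(lines) == 0L || !is_header[1L])
    stop("file does not start with a FASTA header line")

  # Every line belongs to the record opened by the most recent header.
  record <- cumsum(is_header)
  ids <- sub("^>", "", lines[is_header])

  # Group sequence lines by record; records without sequence lines become "".
  groups <- split(lines[!is_header],
                  factor(record[!is_header], levels = seq_along(ids)))
  seqs <- vapply(groups, paste, character(1), collapse = "")
  setNames(unname(seqs), ids)
}

# Small demonstration on a temporary gzipped file.
tmp <- tempfile(fileext = ".fa.gz")
out <- gzfile(tmp, open = "wt")
writeLines(c(">read1", "ACGT", "TTGA", ">read2", "GGCC"), out)
close(out)

fa <- read_fasta_gz(tmp)

# With Biostrings and ShortRead loaded, the result could then be wrapped as in
# Martin's example:
#   fas <- ShortRead(sread = DNAStringSet(unname(fa)), id = BStringSet(names(fa)))
```

Opening the connection explicitly with gzfile() keeps the decompression step visible, and the named character vector it returns can be inspected before any Bioconductor object is built.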
Thank you,
Ivan
On Tue, May 17, 2011 at 6:10 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 05/17/2011 02:35 PM, Ivan Gregoretti wrote:
>>
>> Hello ShortRead connoisseurs,
>>
>> ShortRead::readAligned is very smart because it allows you to load the
>> content of a large file without decompressing it. For example:
>>
>> aln <- readAligned("s_1_export.txt.gz", type="SolexaExport")
>>
>> However, its analogous reading function ShortRead::readFasta on my
>> system complains about being unable to handle gzipped targets:
>>
>> fas <- readFasta("s_1.fa.gz")
>> Error in .normargInputFilepath(filepath) :
>> file "s_1.fa.gz" has unsupported type: gzfile
>
> This is a limitation of Biostrings' read.DNAStringSet.
>
> A work-around, if these are classic one-read-per-line files (a header
> line followed by a single sequence line), is
>
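> lines <- readLines("s_1.fa.gz")  # readLines reads gzip files transparently
> sread <- DNAStringSet(lines[c(FALSE, TRUE)])  # even lines: sequences
> # odd lines: headers, with the leading ">" removed
> id <- BStringSet(sub("^>", "", lines[c(TRUE, FALSE)]))
> fas <- ShortRead(sread=sread, id=id)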
>
> (There may be a warning from readLines about an internal error; this can be
> ignored.) Rsamtools::FaFile is another option, though it is meant more for
> reference sequences than short reads.
>
> Martin
>
>>
>>
>> Currently the solution seems to be:
>>
>> system("gunzip -f s_1.fa.gz")
>> fas <- readFasta("s_1.fa")
>> system("gzip -9f s_1.fa")
>>
>> but this code is highly inefficient, especially with large files.
>>
>> Please consider adding the missing functionality just like in readAligned.
>>
>> In case it is a bug in my ShortRead version, see my session below.
>>
>> Thank you,
>>
>> Ivan
>>
>>> sessionInfo()
>>
>> R version 2.14.0 Under development (unstable) (2011-04-14 r55450)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
>> [5] LC_MONETARY=C LC_MESSAGES=en_US.utf8
>> [7] LC_PAPER=en_US.utf8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] annotate_1.31.0 AnnotationDbi_1.15.1 Biobase_2.13.1
>> [4] ShortRead_1.11.1 Rsamtools_1.5.9 lattice_0.19-26
>> [7] Biostrings_2.21.1 GenomicRanges_1.5.0 IRanges_1.11.1
>>
>> loaded via a namespace (and not attached):
>> [1] DBI_0.2-5 grid_2.14.0 hwriter_1.3 RSQLite_0.9-4 tools_2.14.0
>> [6] xtable_1.5-6
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
> Location: M1-B861
> Telephone: 206 667-2793
>