[Bioc-sig-seq] ShortRead, feature request (if not a bug report)

Wed May 18 00:52:38 CEST 2011

Hi Martin,

I'll go for the readLines route. Anything is better than the gzip
approach I showed before.
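
A minimal sketch of that route in base R, assuming the file has exactly one
header line followed by one sequence line per record. The name parseFastaGz
is hypothetical, and unlike the workaround quoted below it strips the leading
">" from the ids and stops if the layout assumption does not hold:

```r
# parseFastaGz is a hypothetical helper name, not part of ShortRead or Biostrings.
# Assumption: strict two-line records (">id" line, then one sequence line).
parseFastaGz <- function(path) {
    con <- gzfile(path, "r")          # reads the gzip stream without unpacking to disk
    on.exit(close(con))
    all <- readLines(con)
    all <- all[nzchar(all)]           # drop empty lines, e.g. a trailing blank

    isHeader <- substr(all, 1, 1) == ">"
    # Check the layout: headers on odd lines, sequences on even lines.
    stopifnot(length(all) %% 2 == 0,
              all(isHeader[c(TRUE, FALSE)]),
              !any(isHeader[c(FALSE, TRUE)]))

    list(id    = sub("^>", "", all[c(TRUE, FALSE)]),
         sread = all[c(FALSE, TRUE)])
}
```

With ShortRead attached, the result can then be wrapped as
ShortRead(sread=DNAStringSet(rec$sread), id=BStringSet(rec$id)), where rec is
the list returned by parseFastaGz("s_1.fa.gz").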

Thank you,

Ivan

On Tue, May 17, 2011 at 6:10 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 05/17/2011 02:35 PM, Ivan Gregoretti wrote:
>>
>> Hello ShortRead connoisseurs,
>>
>> ShortRead::readAligned is very convenient because it lets you load the
>> content of a large gzipped file without first decompressing it on disk.
>> For example:
>>
>> aln <- readAligned("s_1_export.txt.gz", type="SolexaExport")
>>
>> However, the analogous reading function, ShortRead::readFasta, complains
>> on my system that it cannot handle gzipped targets:
>>
>> fas <- readFasta("s_1.fa.gz")
>> Error in .normargInputFilepath(filepath) :
>>   file "s_1.fa.gz" has unsupported type: gzfile
>
> This is a limitation of Biostrings' read.DNAStringSet.
>
> A workaround, if these are classic one-read-per-line files, is
>
>  all <- readLines("s_1.fa.gz")
>  sread <- DNAStringSet(all[c(FALSE, TRUE)])
>  id <- BStringSet(all[c(TRUE, FALSE)])
>  fas <- ShortRead(sread=sread, id=id)
>
> (There may be a warning from readLines about an internal error; this can be
> ignored.) Rsamtools::FaFile is another option, though it is meant more for
> reference sequences than for short reads.
>
> Martin
>
>>
>>
>> Currently the solution seems to be:
>>
>> system("gunzip -f s_1.fa.gz")
>> fas <- readFasta("s_1.fa")
>> system("gzip -9f s_1.fa")
>>
>> but this code is highly inefficient, especially with large files.
>>
>> Please consider adding the missing functionality just like in readAligned.
>>
>> In case it is a bug in my ShortRead version, see my session below.
>>
>> Thank you,
>>
>> Ivan
>>
>> > sessionInfo()
>>
>> R version 2.14.0 Under development (unstable) (2011-04-14 r55450)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>>  [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
>>  [5] LC_MONETARY=C             LC_MESSAGES=en_US.utf8
>>  [7] LC_PAPER=en_US.utf8       LC_NAME=C
>>  [9] LC_ADDRESS=C              LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] annotate_1.31.0      AnnotationDbi_1.15.1 Biobase_2.13.1
>> [4] ShortRead_1.11.1     Rsamtools_1.5.9      lattice_0.19-26
>> [7] Biostrings_2.21.1    GenomicRanges_1.5.0  IRanges_1.11.1
>>
>> loaded via a namespace (and not attached):
>> [1] DBI_0.2-5     grid_2.14.0   hwriter_1.3   RSQLite_0.9-4 tools_2.14.0
>> [6] xtable_1.5-6
>>
>
>
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
> Location: M1-B861
> Telephone: 206 667-2793
>