[Bioc-sig-seq] ShortRead, feature request (if not a bug report)
Martin Morgan
mtmorgan at fhcrc.org
Wed May 18 00:10:40 CEST 2011
On 05/17/2011 02:35 PM, Ivan Gregoretti wrote:
> Hello ShortRead connoisseurs,
>
> ShortRead::readAligned is very smart because it allows you to load the
> content of a large file without decompressing it. For example:
>
> aln<- readAligned("s_1_export.txt.gz", type="SolexaExport")
>
> However, the analogous reading function ShortRead::readFasta on my
> system complains that it cannot handle gzipped targets:
>
> fas<- readFasta("s_1.fa.gz")
> Error in .normargInputFilepath(filepath) :
> file "s_1.fa.gz" has unsupported type: gzfile
This is a limitation of Biostrings' read.DNAStringSet.
A work-around, if the file uses the classic layout of one header line
followed by one sequence line per read, is
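all <- readLines("s_1.fa.gz")  # readLines decompresses gzip files transparently
sread <- DNAStringSet(all[c(FALSE, TRUE)])  # even lines: sequences
id <- BStringSet(sub("^>", "", all[c(TRUE, FALSE)]))  # odd lines: headers, minus the leading ">"
fas <- ShortRead(sread=sread, id=id)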
(readLines may emit a warning about an internal error; this can be
ignored.) Another option is Rsamtools::FaFile, though that is meant more
for reference sequences than short reads.
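
If the sequences wrap over several lines, the alternating-line trick above no longer applies. A minimal sketch of one way to handle that case with base R only follows; read_fasta_gz is a hypothetical helper name, not a function in ShortRead or Biostrings, and it assumes the whole file fits in memory.

```r
# Hypothetical helper (not part of ShortRead or Biostrings): parse a
# gzip-compressed FASTA file whose records may span several lines.
read_fasta_gz <- function(path) {
    con <- gzfile(path, open = "rt")    # base R decompresses on the fly
    on.exit(close(con))
    lines <- readLines(con)
    lines <- lines[nzchar(lines)]       # drop blank lines
    is_header <- startsWith(lines, ">")
    if (length(lines) == 0L || !is_header[1])
        stop("file does not start with a FASTA header line: ", path)
    record <- cumsum(is_header)         # record index for every line
    # group sequence lines by record; the explicit levels keep records
    # that have a header but no sequence lines
    groups <- split(lines[!is_header],
                    factor(record[!is_header],
                           levels = seq_len(sum(is_header))))
    seqs <- vapply(groups, paste, character(1), collapse = "")
    list(id = sub("^>", "", lines[is_header]), sread = unname(seqs))
}

# Small self-contained demonstration on a temporary gzipped file
tmp <- tempfile(fileext = ".fa.gz")
out <- gzfile(tmp, "wt")
writeLines(c(">read1", "ACGT", "TTGA", ">read2", "GGCC"), out)
close(out)
fa <- read_fasta_gz(tmp)
```

The two components can then be passed to DNAStringSet(fa$sread) and BStringSet(fa$id) and combined with ShortRead(), as in the work-around above.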
Martin
>
>
> Currently the solution seems to be:
>
> system("gunzip -f s_1.fa.gz")
> fas<- readFasta("s_1.fa")
> system("gzip -9f s_1.fa")
>
> but this code is highly inefficient, especially with large files.
>
> Please consider adding the missing functionality just like in readAligned.
>
> In case it is a bug in my ShortRead version, see my session below.
>
> Thank you,
>
> Ivan
>
>> sessionInfo()
> R version 2.14.0 Under development (unstable) (2011-04-14 r55450)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
> [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
> [5] LC_MONETARY=C LC_MESSAGES=en_US.utf8
> [7] LC_PAPER=en_US.utf8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] annotate_1.31.0 AnnotationDbi_1.15.1 Biobase_2.13.1
> [4] ShortRead_1.11.1 Rsamtools_1.5.9 lattice_0.19-26
> [7] Biostrings_2.21.1 GenomicRanges_1.5.0 IRanges_1.11.1
>
> loaded via a namespace (and not attached):
> [1] DBI_0.2-5 grid_2.14.0 hwriter_1.3 RSQLite_0.9-4 tools_2.14.0
> [6] xtable_1.5-6
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793