[Bioc-sig-seq] read sequences from the web
Thomas Girke
thomas.girke at ucr.edu
Wed Feb 10 18:08:36 CET 2010
Great, thanks. I like to use this feature, especially for teaching, because
it avoids distractions about how to browse the file systems of different OSs.
Thomas
On Tue, Feb 09, 2010 at 11:28:46PM -0800, Hervé Pagès wrote:
> Hi Thomas,
>
> In Biostrings 2.15.21, read.*StringSet() works again with remote
> files:
>
> > aaset <- read.AAStringSet("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa")
> trying URL 'ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa'
> ftp data connection made, file length 770075 bytes
> opened URL
> ==================================================
> downloaded 752 Kb
>
> > aaset[1:3]
>   A AAStringSet instance of length 3
>     width seq                                               names
> [1]   401 MTRRSRVGAGLAAIVLALAAVSA...FKIGGAVAVIAIVVVVVRRWRNP gi|10579650|gb|AA...
> [2]   221 MSIIELEGVVKRYETGAETVEAL...THDTQLEEFSDRAVNLVDGVLHT gi|10579651|gb|AA...
> [3]   369 MAWRNLGRNRVRTALAALGIVIG...SLLSGLYPAWKAANDPPVEALGE gi|10579652|gb|AA...
>
> Note that I'm using download.file() in the background with quiet=FALSE
> (the default) hence the verbose output and progress bar.
>
> Cheers,
> H.
>
>
> Thomas Girke wrote:
> >Thanks Hervé. - For me, URL-based sequence imports are useful mainly for
> >demo purposes. For now, I can certainly work around this limitation by
> >using stepwise downloads and imports. As usual, speed matters more in this
> >area than convenience...
> >
> >Best,
> >
> >Thomas
> >
> >
> >On Fri, Feb 05, 2010 at 09:43:15AM -0800, Hervé Pagès wrote:
> >>Hi Thomas,
> >>
> >>Oops, some recent speed improvements to the read.*StringSet() family
> >>that turn out to be regressions for your use case, sorry!
> >>
> >>Back in November I re-implemented in C the FASTA parser used by the
> >>read.*StringSet() family to make it faster. Now it's 10x or 20x
> >>faster (I don't remember exactly) to load Human chr1 from a FASTA
> >>file. Because handling R connections in C is not easily doable
> >>right now (the C code in R that handles these connections has not
> >>been designed to be easily reusable in a package), this FASTA parser
> >>uses standard C facilities to read the file, with all the restrictions
> >>that this implies. For example the file must be local, no more URLs,
> >>pipes, fifos, socket connections, etc... all the fancy stuff
> >>supported by R connections (see ?file).
> >>
> >>I underestimated the value of supporting URLs so I'll work on a fix
> >>to at least support those (the fix will consist of downloading
> >>the file first to a temp file, nothing fancy). I'll post again here
> >>when this is ready.
> >>
> >>Cheers,
> >>H.
> >>
> >>
> >>Thomas Girke wrote:
> >>>Dear Biostrings Developers,
> >>>
> >>>There seems to be a change (bug?) in the behavior of the
> >>>read.XXStringSet functions in the latest Biostrings version when
> >>>pointing to files on the web.
> >>>
> >>>For instance:
> >>>
> >>>## This works under R-2.10.0
> >>>library(Biostrings)
> >>>read.AAStringSet("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa", "fasta")
> >>>
> >>>## But the same command under R-2.10.1 returns the following error:
> >>>Error in .read.fasta.in.XStringSet(filepath, set.names, elementType, lkup) :
> >>>  cannot open file 'ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa'
> >>>
> >>>My session info for R-2.10.1 is:
> >>>
> >>>R version 2.10.1 (2009-12-14)
> >>>x86_64-unknown-linux-gnu
> >>>
> >>>locale:
> >>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> >>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> >>> [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
> >>> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> >>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
> >>>[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >>>
> >>>attached base packages:
> >>>[1] stats graphics grDevices utils datasets methods base
> >>>
> >>>other attached packages:
> >>>[1] Biostrings_2.14.10 IRanges_1.4.9
> >>>
> >>>loaded via a namespace (and not attached):
> >>>[1] Biobase_2.6.1
> >>>
> >>>
> >>>Thanks in advance for your help.
> >>>
> >>>Thomas
> >>>
> >>>_______________________________________________
> >>>Bioc-sig-sequencing mailing list
> >>>Bioc-sig-sequencing at r-project.org
> >>>https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >>--
> >>Hervé Pagès
> >>
> >>Program in Computational Biology
> >>Division of Public Health Sciences
> >>Fred Hutchinson Cancer Research Center
> >>1100 Fairview Ave. N, M2-B876
> >>P.O. Box 19024
> >>Seattle, WA 98109-1024
> >>
> >>E-mail: hpages at fhcrc.org
> >>Phone: (206) 667-5791
> >>Fax: (206) 667-1319
> >>
>
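
The workaround discussed in this thread (download the file to a temporary
location first, then import it from there) could be sketched roughly as
follows. This is only a minimal illustration and not the code used inside
Biostrings: the helper names is_remote() and fetch_if_remote() are
hypothetical, and the URL check assumes only ftp/http/https schemes matter.

```r
# Hypothetical helpers (not part of Biostrings), shown for illustration only.

# TRUE if 'filepath' looks like an ftp/http/https URL (assumed scheme list).
is_remote <- function(filepath) {
  grepl("^(ftp|http|https)://", filepath)
}

# Return a local path for 'filepath': local paths are returned unchanged,
# URLs are first downloaded to a temporary file with download.file().
fetch_if_remote <- function(filepath, quiet = TRUE) {
  if (!is_remote(filepath)) {
    return(filepath)  # already local: nothing to download
  }
  tmp <- tempfile()
  # download.file() returns 0 on success; quiet = TRUE suppresses the
  # status messages and progress bar mentioned in the thread.
  status <- download.file(filepath, destfile = tmp, quiet = quiet, mode = "wb")
  if (status != 0) {
    stop("download failed for ", filepath)
  }
  tmp
}
```

With a helper like this, the import from the original report could be
written as read.AAStringSet(fetch_if_remote(url), "fasta"), where url holds
the FTP address. Passing quiet = FALSE would restore the verbose download
output shown above.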