[Bioc-sig-seq] read sequences from the web

Wed Feb 10 18:08:36 CET 2010

Great, thanks. I like to use this feature, especially for teaching, because
it avoids the distraction of explaining how to browse the file systems of
different OSs.

Thomas

On Tue, Feb 09, 2010 at 11:28:46PM -0800, Hervé Pagès wrote:
> Hi Thomas,
> 
> In Biostrings 2.15.21, read.*StringSet() works again with remote
> files:
> 
> > aaset <- read.AAStringSet("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa")
> trying URL 'ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa'
> ftp data connection made, file length 770075 bytes
> opened URL
> ==================================================
> downloaded 752 Kb
> 
> > aaset[1:3]
>   A AAStringSet instance of length 3
>     width seq                                               names
> [1]   401 MTRRSRVGAGLAAIVLALAAVSA...FKIGGAVAVIAIVVVVVRRWRNP gi|10579650|gb|AA...
> [2]   221 MSIIELEGVVKRYETGAETVEAL...THDTQLEEFSDRAVNLVDGVLHT gi|10579651|gb|AA...
> [3]   369 MAWRNLGRNRVRTALAALGIVIG...SLLSGLYPAWKAANDPPVEALGE gi|10579652|gb|AA...
> 
> Note that I'm using download.file() in the background with quiet=FALSE
> (the default) hence the verbose output and progress bar.
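
For anyone on a version where remote files are not accepted, the download-then-read approach described above can be sketched in base R. This is only an illustration under stated assumptions, not the code used inside Biostrings: fetch_to_temp() is a hypothetical helper name, and the only library call it relies on is download.file() from the utils package.

```r
## Hypothetical helper, not part of Biostrings: copy a remote file to a
## temporary local path so that a parser needing a local file can read it.
fetch_to_temp <- function(url, quiet = TRUE) {
  dest <- tempfile(fileext = ".fasta")
  ## download.file() returns 0 (invisibly) on success
  status <- download.file(url, destfile = dest, quiet = quiet, mode = "wb")
  if (status != 0) stop("download failed for ", url)
  dest
}
```

With Biostrings loaded and network access available, the returned path could then be handed to the reader from this thread, for example read.AAStringSet(fetch_to_temp(url), "fasta") with url set to the NCBI address quoted above. The temporary file can be removed with unlink() once the sequences are in memory.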
> 
> Cheers,
> H.
> 
> 
> Thomas Girke wrote:
> >Thanks Hervé. - For me, URL-based sequence imports are useful mainly for 
> >demo purposes. For now, I can certainly work around this limitation by 
> >using stepwise downloads and imports. As usual, speed matters more in this 
> >area than convenience...
> >
> >Best, 
> >
> >Thomas
> >
> >
> >On Fri, Feb 05, 2010 at 09:43:15AM -0800, Hervé Pagès wrote:
> >>Hi Thomas,
> >>
> >>Oops, some recent speed improvements to the read.*StringSet() family
> >>that turn out to be regressions for your use case, sorry!
> >>
> >>Back in November I re-implemented in C the FASTA parser used by the
> >>read.*StringSet() family to make it faster. Now it's 10x or 20x
> >>faster (I don't remember exactly) to load Human chr1 from a FASTA
> >>file. Because handling R connections in C is not easily doable
> >>right now (the C code in R that handles these connections has not
> >>been designed to be easily reusable in a package), this FASTA parser
> >>uses standard C facilities to read the file, with all the restrictions
> >>that this implies. For example the file must be local, no more URLs,
> >>pipes, fifos, socket connections, etc... all the fancy stuff
> >>supported by R connections (see ?file).
> >>
> >>I underestimated the value of supporting URLs so I'll work on a fix
> >>to at least support those (the fix will consist in downloading
> >>the file first to a temp file, nothing fancy). I'll post again here
> >>when this is ready.
> >>
> >>Cheers,
> >>H.
> >>
> >>
> >>Thomas Girke wrote:
> >>>Dear Biostrings Developers,
> >>>
> >>>There seems to be a change (bug?) in the behavior of the 
> >>>read.XXStringSet functions
> >>>in the latest Biostrings version when pointing to files on the web. 
> >>>
> >>>For instance: 
> >>>
> >>>## This works under R-2.10.0
> >>>library(Biostrings)
> >>>read.AAStringSet("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa", "fasta") 
> >>>
> >>>## But the same command under R-2.10.1 returns the following error:
> >>>Error in .read.fasta.in.XStringSet(filepath, set.names, elementType, lkup) :
> >>>  cannot open file 'ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa'
> >>>
> >>>My session info for R-2.10.1 is:
> >>>
> >>>R version 2.10.1 (2009-12-14) 
> >>>x86_64-unknown-linux-gnu 
> >>>
> >>>locale:
> >>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> >>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> >>> [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
> >>> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> >>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
> >>>[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >>>
> >>>attached base packages:
> >>>[1] stats     graphics  grDevices utils     datasets  methods   base     
> >>>
> >>>other attached packages:
> >>>[1] Biostrings_2.14.10 IRanges_1.4.9     
> >>>
> >>>loaded via a namespace (and not attached):
> >>>[1] Biobase_2.6.1
> >>>
> >>>
> >>>Thanks in advance for your help.
> >>>
> >>>Thomas
> >>>
> >>>_______________________________________________
> >>>Bioc-sig-sequencing mailing list
> >>>Bioc-sig-sequencing at r-project.org
> >>>https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >>-- 
> >>Hervé Pagès
> >>
> >>Program in Computational Biology
> >>Division of Public Health Sciences
> >>Fred Hutchinson Cancer Research Center
> >>1100 Fairview Ave. N, M2-B876
> >>P.O. Box 19024
> >>Seattle, WA 98109-1024
> >>
> >>E-mail: hpages at fhcrc.org
> >>Phone:  (206) 667-5791
> >>Fax:    (206) 667-1319
> >>