[Bioc-sig-seq] read sequences from the web

Thomas Girke thomas.girke at ucr.edu
Fri Feb 5 19:19:48 CET 2010


Thanks, Hervé. For me, URL-based sequence imports are useful mainly for demo
purposes. For now, I can certainly work around this limitation by using stepwise
downloads and imports. As usual, speed matters more in this area than convenience...
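
A minimal sketch of that stepwise workaround, assuming the reader only
accepts local file paths. The helper name read_via_tempfile is hypothetical
(it is not part of Biostrings); it downloads the URL to a temporary file with
base R's download.file() and then passes the local path to whatever reader
function is supplied.

```r
# Hypothetical helper (not part of Biostrings): fetch a remote file to a
# temporary local file, then hand the local path to a reader function that
# only accepts local files.
read_via_tempfile <- function(url, reader, ...) {
  tmp <- tempfile(fileext = ".fasta")
  on.exit(unlink(tmp), add = TRUE)   # delete the temp file on exit
  # download.file() returns 0 (invisibly) on success
  status <- download.file(url, destfile = tmp, mode = "wb", quiet = TRUE)
  if (status != 0L) stop("download failed for ", url)
  reader(tmp, ...)                   # reader must load the data eagerly
}

# Usage with the call from the original report. It needs Biostrings and
# network access, so it is guarded and not run by default.
if (FALSE) {
  library(Biostrings)
  url <- "ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa"
  aa <- read_via_tempfile(url, read.AAStringSet, "fasta")
}
```

Passing the reader as an argument keeps the sketch independent of any one
read.*StringSet() function. The temp file is removed on exit, so this only
suits readers that load the whole file before returning.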

Best, 

Thomas


On Fri, Feb 05, 2010 at 09:43:15AM -0800, Hervé Pagès wrote:
> Hi Thomas,
> 
> Oops, some recent speed improvements to the read.*StringSet() family
> turn out to be regressions for your use case, sorry!
> 
> Back in November I re-implemented in C the FASTA parser used by the
> read.*StringSet() family to make it faster. Now it's 10x or 20x
> faster (I don't remember exactly) to load Human chr1 from a FASTA
> file. Because handling R connections in C is not easily doable
> right now (the C code in R that handles these connections has not
> been designed to be easily reusable in a package), this FASTA parser
> uses standard C facilities to read the file, with all the restrictions
> that this implies. For example the file must be local, no more URLs,
> pipes, fifos, socket connections, etc... all the fancy stuff
> supported by R connections (see ?file).
> 
> I underestimated the value of supporting URLs so I'll work on a fix
> to at least support those (the fix will consist of downloading
> the file first to a temp file, nothing fancy). I'll post again here
> when this is ready.
> 
> Cheers,
> H.
> 
> 
> Thomas Girke wrote:
> >Dear Biostrings Developers,
> >
> >There seems to be a change (bug?) in the behavior of the read.XXStringSet
> >functions in the latest Biostrings version when pointing to files on the web.
> >
> >For instance: 
> >
> >## This works under R-2.10.0
> >library(Biostrings)
> >read.AAStringSet("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa", "fasta") 
> >
> >## But the same command under R-2.10.1 returns the following error:
> >Error in .read.fasta.in.XStringSet(filepath, set.names, elementType, lkup) :
> >  cannot open file 'ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa'
> >
> >My session info for R-2.10.1 is:
> >
> >R version 2.10.1 (2009-12-14) 
> >x86_64-unknown-linux-gnu 
> >
> >locale:
> > [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> > [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> > [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
> > [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> > [9] LC_ADDRESS=C               LC_TELEPHONE=C
> >[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >
> >attached base packages:
> >[1] stats     graphics  grDevices utils     datasets  methods   base     
> >
> >other attached packages:
> >[1] Biostrings_2.14.10 IRanges_1.4.9     
> >
> >loaded via a namespace (and not attached):
> >[1] Biobase_2.6.1
> >
> >
> >Thanks in advance for your help.
> >
> >Thomas
> >
> >_______________________________________________
> >Bioc-sig-sequencing mailing list
> >Bioc-sig-sequencing at r-project.org
> >https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> 
> -- 
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>


