[Bioc-sig-seq] read sequences from the web
Hervé Pagès
hpages at fhcrc.org
Wed Feb 10 08:28:46 CET 2010
Hi Thomas,
In Biostrings 2.15.21, read.*StringSet() works again with remote
files:
> aaset <- read.AAStringSet("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa")
trying URL 'ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa'
ftp data connection made, file length 770075 bytes
opened URL
==================================================
downloaded 752 Kb
> aaset[1:3]
  A AAStringSet instance of length 3
    width seq                                               names
[1]   401 MTRRSRVGAGLAAIVLALAAVSA...FKIGGAVAVIAIVVVVVRRWRNP gi|10579650|gb|AA...
[2]   221 MSIIELEGVVKRYETGAETVEAL...THDTQLEEFSDRAVNLVDGVLHT gi|10579651|gb|AA...
[3]   369 MAWRNLGRNRVRTALAALGIVIG...SLLSGLYPAWKAANDPPVEALGE gi|10579652|gb|AA...
Note that I'm using download.file() in the background with quiet=FALSE
(the default) hence the verbose output and progress bar.
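
A minimal sketch of the download-first approach, for readers who want the same behavior with an older Biostrings or with another reader function. This is not the Biostrings implementation: the helper name read_via_tempfile(), its reader argument, and the toy count_fasta_records() reader are all hypothetical, and a file:// URL stands in for the FTP URL so that the example runs without network access. Only base R calls (tempfile(), download.file(), readLines()) are used.

```r
# Hypothetical helper, not part of Biostrings: copy a remote file to a
# temporary local file, then hand the local path to a reader function.
read_via_tempfile <- function(url, reader, quiet = TRUE, ...) {
    tmp <- tempfile()
    on.exit(unlink(tmp), add = TRUE)  # remove the temporary copy on exit
    status <- download.file(url, destfile = tmp, quiet = quiet, mode = "wb")
    if (status != 0L)
        stop("download failed for ", url)
    reader(tmp, ...)
}

# Toy reader (an assumption for illustration): count FASTA header lines.
count_fasta_records <- function(path) {
    sum(grepl("^>", readLines(path)))
}

# Build a small local FASTA file and address it through a file:// URL.
local_fasta <- tempfile(fileext = ".fa")
writeLines(c(">seq1", "MTRRSRVGAG", ">seq2", "MSIIELEGVV"), local_fasta)
local_url <- paste0("file://", normalizePath(local_fasta, winslash = "/"))

n_records <- read_via_tempfile(local_url, count_fasta_records)
```

With Biostrings loaded, passing read.AAStringSet as the reader and the FTP URL above as the url should give the stepwise download-then-import behavior. Note that quiet = TRUE here suppresses the progress output shown above; set quiet = FALSE to get it back.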
Cheers,
H.
Thomas Girke wrote:
> Thanks Hervé. - For me, URL-based sequence imports are useful mainly for demo
> purposes. For now, I can certainly work around this limitation by using stepwise
> downloads and imports. As usual, speed matters more in this area than convenience...
>
> Best,
>
> Thomas
>
>
> On Fri, Feb 05, 2010 at 09:43:15AM -0800, Hervé Pagès wrote:
>> Hi Thomas,
>>
>> Oops, some recent speed improvements to the read.*StringSet() family
>> turn out to be regressions for your use case, sorry!
>>
>> Back in November I re-implemented in C the FASTA parser used by the
>> read.*StringSet() family to make it faster. Now it's 10x or 20x
>> faster (I don't remember exactly) to load Human chr1 from a FASTA
>> file. Because handling R connections in C is not easily doable
>> right now (the C code in R that handles these connections has not
>> been designed to be easily reusable in a package), this FASTA parser
>> uses standard C facilities to read the file, with all the restrictions
>> that this implies. For example the file must be local, no more URLs,
>> pipes, fifos, socket connections, etc... all the fancy stuff
>> supported by R connections (see ?file).
>>
>> I underestimated the value of supporting URLs so I'll work on a fix
>> to at least support those (the fix will consist of downloading
>> the file first to a temp file, nothing fancy). I'll post again here
>> when this is ready.
>>
>> Cheers,
>> H.
>>
>>
>> Thomas Girke wrote:
>>> Dear Biostrings Developers,
>>>
>>> There seems to be a change (bug?) in the behavior of the read.XXStringSet
>>> functions in the latest Biostrings version when pointing to files on the web.
>>>
>>> For instance:
>>>
>>> ## This works under R-2.10.0
>>> library(Biostrings)
>>> read.AAStringSet("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa", "fasta")
>>>
>>> ## But the same command under R-2.10.1 returns the following error:
>>> Error in .read.fasta.in.XStringSet(filepath, set.names, elementType, lkup) :
>>>   cannot open file 'ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa'
>>>
>>> My session info for R-2.10.0 is:
>>>
>>> R version 2.10.1 (2009-12-14)
>>> x86_64-unknown-linux-gnu
>>>
>>> locale:
>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] Biostrings_2.14.10 IRanges_1.4.9
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.6.1
>>>
>>>
>>> Thanks in advance for your help.
>>>
>>> Thomas
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M2-B876
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
>>