[Bioc-sig-seq] limit to character length for read.DNAStringSet()
    Andrew Yee 
    yee at post.harvard.edu
       
    Wed Sep 22 16:51:20 CEST 2010
    
    
  
Is there a limit to the number of characters in a line for read.DNAStringSet()?
Consider the following example, which runs fine:
library('Biostrings')
foo <- character()
number.of.characters <- 2000
foo[1] <- '>foo'
foo[2] <- paste(rep('A', number.of.characters), sep='', collapse='')
write(foo, file='~/sandbox/foo.fasta')
bar <- read.DNAStringSet(filepath='~/sandbox/foo.fasta', format='fasta')
# however, if I increase the number of characters to e.g. 20000, the example no longer works
foo <- character()
number.of.characters <- 20000
foo[1] <- '>foo'
foo[2] <- paste(rep('A', number.of.characters), sep='', collapse='')
write(foo, file='~/sandbox/foo.fasta')
bar <- read.DNAStringSet(filepath='~/sandbox/foo.fasta', format='fasta')
# the above read.DNAStringSet generates the following error message
> bar <- read.DNAStringSet(filepath='~/sandbox/foo.fasta', format='fasta')
Error in .read.fasta.in.XStringSet(filepath, set.names, elementType, lkup) :
  reading FASTA file     : cannot read line 2, line is too long
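Since the error message complains about the length of line 2 rather than the length of the sequence, one thing worth trying is to wrap the sequence over several shorter lines before writing the file, as FASTA files conventionally do. The following is only a sketch in base R: wrap.sequence() is a hypothetical helper (it is not part of Biostrings), the width of 80 is an arbitrary choice, and I am assuming, without having confirmed it, that the limit applies per line.

# Hypothetical helper (not part of Biostrings): split one long sequence
# string into chunks of at most `width` characters.
wrap.sequence <- function(sequence, width = 80) {
  if (nchar(sequence) == 0) return(character(0))
  starts <- seq(1, nchar(sequence), by = width)
  substring(sequence, starts, pmin(starts + width - 1, nchar(sequence)))
}
number.of.characters <- 20000
sequence <- paste(rep('A', number.of.characters), collapse = '')
wrapped <- wrap.sequence(sequence)
# Write the header line followed by the wrapped sequence lines.
fasta.file <- tempfile(fileext = '.fasta')
writeLines(c('>foo', wrapped), fasta.file)

The resulting file can then be passed to read.DNAStringSet(filepath=fasta.file, format='fasta') to check whether the error still occurs; that step is left out of the sketch because it is the untested part.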
Thanks,
Andrew
> sessionInfo()
R version 2.11.1 Patched (2010-09-04 r52880)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
other attached packages:
[1] Biostrings_2.16.9 IRanges_1.6.17
loaded via a namespace (and not attached):
[1] Biobase_2.8.0 tools_2.11.1
    
    