[Bioc-sig-seq] Fastq File size limit in the Short Read Package
Martin Morgan
mtmorgan at fhcrc.org
Sat Apr 3 00:37:45 CEST 2010
Hi Sirisha --
On 04/02/2010 02:57 PM, Sirisha Sunkara wrote:
> Hi Martin,
>
> The readFastq function in the devel version of ShortRead, installed with
> the devel version of R does seem to read >6.5 Gb files fine, but
> the quality scores upon extraction and conversion to a matrix, gives the
> following memory error...
>
>> reads <- readFastq("./s_7_1_sequence.txt", qualityType="SFastqQuality")
>> qual <- quality(reads)
>> qual <- as(qual, "matrix")
> Error in asMethod(object) : allocMatrix: too many elements specified
>
> This fastq file has >31 million 76 cycle reads. Is this a known issue?
I cc'd the bioc-sig-seq mailing list, as this might be useful to others.
R is not able to create a matrix of that size; the number of elements
exceeds R's maximum vector length of 2^31 - 1:
> matrix(0, 31000000, 72)
Error in matrix(0, 3.1e+07, 72) : too many elements specified
So yes, this is a fundamental limit imposed by R. If the idea is to
summarize the quality scores in some way, then perhaps
qual = as(quality(reads)[sample(length(reads), 1e7)], "matrix")
or looping over subsets would capture enough information to be useful?
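
Looping over subsets could be sketched roughly as follows. This is a
minimal base-R sketch, not part of ShortRead: the names
chunked_col_means and get_chunk are hypothetical and introduced only
for illustration, and it assumes all reads have the same number of
cycles, so no chunk contains NA padding. It accumulates per-cycle
column sums one chunk at a time, so no single matrix ever approaches
the 2^31 - 1 element limit.

```r
# Per-cycle mean quality computed chunk by chunk.
# `get_chunk` is a hypothetical callback: given a vector of read indices,
# it returns those reads' quality scores as a numeric matrix
# (one row per read, one column per cycle).
chunked_col_means <- function(n, chunk_size, get_chunk) {
    starts <- seq(1, n, by = chunk_size)
    total <- NULL
    for (s in starts) {
        idx <- s:min(s + chunk_size - 1, n)
        cs <- colSums(get_chunk(idx))      # sums for this chunk only
        total <- if (is.null(total)) cs else total + cs
    }
    total / n                              # overall per-cycle mean
}

# Why the full matrix fails: the element count exceeds 2^31 - 1.
n_elements <- 31e6 * 72
limit <- 2^31 - 1
n_elements > limit

# Assumed usage with the `reads` object from readFastq() above
# (not run here; requires ShortRead and the original file):
# chunked_col_means(length(reads), 1e6,
#                   function(idx) as(quality(reads)[idx], "matrix"))
```

A chunk size of 1e6 reads is only a guess at something that fits
comfortably in memory; any value that keeps rows times cycles well
below the limit should do.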
Martin
>
> Thank You,
> Sirisha
>
>> sessionInfo()
> R version 2.11.0 Under development (unstable) (2010-03-07 r51225)
> x86_64-unknown-linux-gnu
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] ShortRead_1.5.21 lattice_0.18-3 Biostrings_2.15.22
> [4] GenomicRanges_0.1.0 IRanges_1.5.74
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.7.5 grid_2.11.0 hwriter_1.2
>
>
>
> Martin Morgan wrote:
>> On 03/23/2010 05:00 PM, Sirisha Sunkara wrote:
>>
>>> Hi Martin,
>>>
>>> Using the ShortRead package, for files > 6.5 Gb size, I seem to be
>>> running into this error using the readFastq function:
>>>
>>> Error in .Call(.read_solexa_fastq, src, withIds) :
>>> negative length vectors are not allowed
>>>
>>> If this is memory related - is there a work-around to working with the
>>> entire file?
>>>
>>
>> Hi Sirisha,
>>
>> This is addressed in the 'devel' version of ShortRead, for which you
>> would need to install the 'devel' version of R and then re-install
>> Bioconductor packages. The workaround is to use an external tool (e.g.,
>> the command 'split' in Linux) to split the file into smaller chunks,
>> using the -l option with a line count that is a multiple of 4 so that
>> no four-line FASTQ record is cut in half.
>>
>> Martin
>>
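
The splitting workaround described above can also be done from within
R. The following is a minimal sketch under stated assumptions:
split_fastq is a hypothetical helper written for illustration (it is
not a ShortRead function), and it assumes every FASTQ record occupies
exactly four lines, i.e. sequences and qualities are not wrapped. It
does the same job as 'split -l' with a line count that is a multiple
of 4.

```r
# Split a FASTQ file into pieces holding `records_per_chunk` records each.
# Every piece gets 4 * records_per_chunk lines (the last one may be
# shorter), so no record is cut across two files.
split_fastq <- function(path, records_per_chunk,
                        prefix = paste0(path, ".part")) {
    con <- file(path, open = "r")
    on.exit(close(con))
    out_files <- character(0)
    repeat {
        # An open connection keeps its position between reads.
        lines <- readLines(con, n = 4 * records_per_chunk)
        if (length(lines) == 0) break
        out <- sprintf("%s%03d", prefix, length(out_files) + 1)
        writeLines(lines, out)
        out_files <- c(out_files, out)
    }
    out_files   # paths of the pieces, in file order
}
```

Each returned path could then be passed to readFastq in turn, one
piece at a time.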
>>
>>> Thank You,
>>> Sirisha
>>>
>>>
>>>> sessionInfo()
>>>>
>>> R version 2.10.1 (2009-12-14)
>>> x86_64-unknown-linux-gnu
>>>
>>> locale:
>>> [1] C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] ShortRead_1.4.0 lattice_0.17-26 BSgenome_1.14.2 Biostrings_2.14.12
>>> [5] IRanges_1.4.11
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.6.1 grid_2.10.1 hwriter_1.1
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>
>>
>>
>
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793