[Bioc-sig-seq] PDict question
Herve Pages
hpages at fhcrc.org
Tue Jun 3 22:48:47 CEST 2008
Hi Harris,
[Sorry, but this discussion belongs where it came from, so I'm
putting it back there.]
Harris A. Jaffee wrote:
> Nothing intelligent to say, but I agree -- it seems
> very suspicious that he had a failure. Also, if you
> use some form of malloc(), won't you have access to
> *virtual* memory, so his so-called "20GB" of RAM is
> somewhat irrelevant. More than that is available.
>
> But I don't understand your memory allocation scheme.
> I just did PDict on 4M unique strings of width 36. It
> ran up about 10 minutes of CPU time and was increasing
> in size VERY gradually from 2.5G to about 12G, but it
> didn't pass 5G until 9 minutes or so. Certainly doesn't
> sound like everything is pre-allocated, as you describe.
> Is there an easy way to delineate what I am missing?
Good point! I forgot to mention that the temp buffer uses
user-controlled memory (malloc) instead of R-controlled (aka
transient) memory (Salloc). The reason I decided to use malloc()
is that it's *much* faster than Salloc(), at least on Linux (you
don't say what your OS is), but this might just be because Linux's
malloc() is cheating, i.e. it doesn't really allocate the memory
pages until the process actually tries to access them (lazy memory
allocation). So in the end the Linux kernel ends up making a lot
of small real allocations behind the scenes as the temp buffer is
being filled with the AC tree under construction. That could
explain why top (you don't say how you monitor this) reports that
the memory used by your R process is increasing gradually.
From malloc()'s man page on my 64-bit openSUSE 10.3 system:
BUGS
    By default, Linux follows an optimistic memory allocation
    strategy. This means that when malloc() returns non-NULL there
    is no guarantee that the memory really is available. This is a
    really bad bug. In case it turns out that the system is out of
    memory, one or more processes will be killed by the infamous
    OOM killer. In case Linux is employed under circumstances where
    it would be less desirable to suddenly lose some randomly
    picked processes, and moreover the kernel version is
    sufficiently recent, one can switch off this overcommitting
    behavior using a command like:

        # echo 2 > /proc/sys/vm/overcommit_memory
Can you try the above and see whether the R process is actually using
the 5G of mem from the very beginning or not? PDict() will also need
some extra G towards the end for copying the final AC tree back to
the R space (to an R integer vector that corresponds more or less to
the @actree@nodes slot of the PDict object) but I have to admit that
this doesn't really explain why you need 7 extra G for this. I'm not
sure what's going on...
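One quick way to check is to look at the OS-level numbers for the R
process itself rather than at R's gc() stats (gc() won't see the
malloc'ed temp buffer). Something along these lines should do it on
Linux; the showProcMem() helper below is just made up for this
example and 'dict' stands for your input dictionary:

    ## Report the virtual (VSZ) and resident (RSS) sizes of the
    ## current R process, in KB, as seen by 'ps' (Linux).
    showProcMem <- function()
        system(paste("ps -o vsz,rss -p", Sys.getpid()))

    library(Biostrings)
    showProcMem()          # before building the PDict
    pdict <- PDict(dict)   # 'dict': your 4M x 36-mer dictionary
    showProcMem()          # after

If lazy allocation is what's happening, VSZ should jump to its full
size right away while RSS grows gradually as the tree is being
built.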
In theory the total amount of memory you need is BPS + RS where
BPS is the biggest possible size of the tree and RS its real size
(BPS >= RS).
Then the amount of memory used by PDict() should be something like
this:
  PDict() progression                          Memory in use
  -------------------                          -------------
  phase 1: build AC tree in temp buffer        BPS
  phase 2: copy the temp buffer to the         BPS + RS
           @actree@nodes slot of the
           PDict object
  phase 3: after freeing the temp buffer       RS
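Just to give an idea of the orders of magnitude (rough numbers
only: the per-node size below is a made-up figure, not the size
actually used internally), a constant-width dictionary of N reads
of width W can produce at most N*W + 1 nodes, so for your 4M
36-mers:

    N <- 4e6; W <- 36
    max_nodes <- N * W + 1             # upper bound on the nb of nodes
    bytes_per_node <- 20               # ASSUMPTION, for illustration only
    max_nodes * bytes_per_node / 2^30  # rough upper bound on BPS, in GB
    ## ~2.7 GB with this (made-up) per-node size; RS will be smaller
    ## since many reads share prefixes.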
That's for the AC tree only, but there is also the @dups slot in
the resulting PDict object, which can be big too: how big depends
on how many duplicated reads you have in your input dictionary.
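If you want a quick idea of how much redundancy there is in your
dictionary, something like the one-liner below should do (it goes
through a character vector, so it costs a little extra memory
itself; 'dict' again stands for your input DNAStringSet):

    ## Nb of duplicated reads in the input dictionary:
    sum(duplicated(as.character(dict)))

Since you said your 4M reads are unique, the @dups slot presumably
shouldn't add much in your case.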
Hope this helps,
H.
>
> Perhaps 'conservative' is better than your 'optimal'.
>
> On Jun 3, 2008, at 2:36 PM, hpages at fhcrc.org wrote:
>>