memory efficient hash table extension? like lchash ...
on 23.01.2010 22:11:41 by Dante Lorenso
All,
I'm loading millions of records into a backend PHP cli script that I
need to build a hash index from to optimize key lookups for data that
I'm importing into a MySQL database. The problem is that storing this
data in a PHP array is not very memory efficient and my millions of
records are consuming about 4-6 GB of ram.
I have tried using some external key/value storage solutions like
MemcacheDB, MongoDB, and straight MySQL, but none of these are fast
enough for what I'm trying to do.
Then I found the "lchash" extension for PHP and it looks like exactly
what I want.  It's a C-library hash that is accessed from PHP.  Using it
would be slightly slower than using straight PHP arrays, but would be
much more memory efficient since not all data needs to be stored as PHP
zvals, etc.
Problem is that the lchash extension can't be installed in my PHP 5.3
build because "pecl install lchash" fails with a message about an invalid
checksum on the README file. Apparently this extension has been
neglected and abandoned and hasn't been updated since 2005.
Is there something like lchash that *is* being maintained? What would
you all suggest?
-- Dante
Re: memory efficient hash table extension? like lchash ...
on 23.01.2010 22:36:46 by hSiplu
On Sun, Jan 24, 2010 at 3:11 AM, D. Dante Lorenso wrote:
> All,
>
> I'm loading millions of records into a backend PHP cli script that I
> need to build a hash index from to optimize key lookups for data that
> I'm importing into a MySQL database.  The problem is that storing this
> data in a PHP array is not very memory efficient and my millions of
> records are consuming about 4-6 GB of ram.
>
What are you storing? An array of row objects??
In that case, storing only the row id will reduce the memory.
If you are loading full row objects, it will take a lot of memory.
But if you just load the row id values, it will significantly decrease
the memory footprint.
Besides, you can load row ids on a chunk-by-chunk basis: if you have
10 million rows to process, load 10,000 rows as a chunk, process
them, then load the next chunk (see the sketch below).  This will
significantly reduce memory usage.
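A minimal sketch of that chunked approach, assuming a PDO connection in
$pdo, made-up table/column names, and a hypothetical process_row()
helper (none of these come from this thread):

    <?php
    // Sketch only: $pdo, the records/id/hash_key names and process_row()
    // are placeholders for whatever the real import uses.
    $lastId    = 0;
    $chunkSize = 10000;
    $stmt = $pdo->prepare(
        'SELECT id, hash_key FROM records WHERE id > ? ORDER BY id LIMIT ?');
    do {
        $stmt->bindValue(1, $lastId, PDO::PARAM_INT);
        $stmt->bindValue(2, $chunkSize, PDO::PARAM_INT);
        $stmt->execute();
        $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
        foreach ($rows as $row) {
            process_row($row);              // per-row work goes here
            $lastId = (int) $row['id'];     // remember where this chunk ended
        }
    } while (count($rows) === $chunkSize);

The WHERE id > ? form (keyset pagination) keeps each chunk query cheap
even millions of rows in, which a growing OFFSET would not.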
A good algorithm can solve your problem anytime. ;-)
--
Shiplu Mokaddim
My talks, http://talk.cmyweb.net
Follow me, http://twitter.com/shiplu
SUST Programmers, http://groups.google.com/group/p2psust
Innovation distinguishes bet ... ... (ask Steve Jobs the rest)
Re: memory efficient hash table extension? like lchash ...
on 24.01.2010 18:39:36 by Dante Lorenso
shiplu wrote:
> On Sun, Jan 24, 2010 at 3:11 AM, D. Dante Lorenso wrote:
>> All,
>>
>> I'm loading millions of records into a backend PHP cli script that I
>> need to build a hash index from to optimize key lookups for data that
>> I'm importing into a MySQL database. The problem is that storing this
>> data in a PHP array is not very memory efficient and my millions of
>> records are consuming about 4-6 GB of ram.
>>
>
> What are you storing? An array of row objects??
> In that case storing only the row id is will reduce the memory.
I am querying a MySQL database which contains 40 million records and
mapping string columns to numeric ids.  You could think of it as normalizing
the data.
Then, I am importing a new set of 40 million records and comparing the new
values to the old values. Where the value matches, I update records,
but where they do not match, I insert new records, and finally I go back
and delete old records. So, the net result is that I have a database
with 40 million records that I need to "sync" on a daily basis.
> If you are loading full row objects, it will take a lot of memory.
> But if you just load the row id values, it will significantly decrease
> the memory amount.
For what I am trying to do, I just need to map a string value (32 bytes)
to a bigint value (8 bytes) in a fast-lookup hash.
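To put rough numbers on that (my own back-of-the-envelope, not figures
from the thread):

    payload:  40,000,000 entries * (32-byte key + 8-byte value)  ~ 1.6 GB
    overhead: 40,000,000 entries * roughly 100+ bytes each for
              the zval, hashtable bucket and key copy in PHP 5.x  ~ several GB

so the actual data is small; it is the per-element bookkeeping of a PHP
array that pushes the total into the 4-6 GB range.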
> Besides, You can load row ids in a chunk by chunk basis. if you have
> 10 millions of rows to process. load 10000 rows as a chunk. process
> them then load the next chunk. This will significantly reduce memory
> usage.
When importing the fresh 40 million records, I need to compare each
record with 4 different indexes that will map the record to other existing
records, or into a "group_id" that the record also belongs to.  My
current solution uses a trigger in MySQL that will do the lookups inside
MySQL, but this is extremely slow.  Pre-loading the MySQL indexes into
PHP ram and processing that way is thousands of times faster.
I just need an efficient way to hold my hash tables in PHP ram. PHP
arrays are very fast, but like my original post says, they consume way
too much ram.
> A good algorithm can solve your problem anytime. ;-)
It takes about 5-10 minutes to build my hash indexes in PHP ram
currently, which is paid back by the 10,000x speedup on key lookups that I
get later on.  I just want to not use the whole 6 GB of ram to do this.
I need an efficient hashing API that supports something like:
$value = (int) fasthash_get((string) $key);
$exists = (bool) fasthash_exists((string) $key);
fasthash_set((string) $key, (int) $value);
Or ... it feels like a "memcached" api but where the data is stored
locally instead of accessed via a network. So this is how my search led
me to what appears to be a dead "lchash" extension.
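Purely to illustrate how much tighter a purpose-built structure can be
(this is not lchash, and every name below is made up): if the 32-byte
keys are hex digests such as MD5, one can pack each key/value pair into
a fixed 24-byte record, keep all records in one sorted PHP string, and
binary-search it.  That is on the order of 1 GB for 40 million entries,
at the cost of O(log n) lookups instead of O(1).  Assumes 64-bit PHP.

    <?php
    // Sketch only, not lchash: fixed-width records in one big string.
    // Assumes keys are exactly 32 hex characters and values fit in 63 bits.
    define('FH_REC', 24);               // 16-byte binary key + 8-byte value

    function fh_pack_value($v) {        // 64-bit value as two big-endian halves
        return pack('NN', ($v >> 32) & 0xFFFFFFFF, $v & 0xFFFFFFFF);
    }

    function fh_unpack_value($s) {
        $p = unpack('Nhi/Nlo', $s);
        return ($p['hi'] << 32) | $p['lo'];
    }

    // Build: pack every pair, then sort the records once by key.
    // (Shown naively; for 40M entries you would build and sort in chunks.)
    function fh_build(array $pairs) {   // $pairs: 32-char hex key => int value
        $recs = array();
        foreach ($pairs as $hexKey => $value) {
            $recs[] = pack('H32', $hexKey) . fh_pack_value($value);
        }
        sort($recs, SORT_STRING);
        return implode('', $recs);
    }

    // Lookup: binary search over the packed, sorted string.
    function fh_get($blob, $hexKey) {
        $needle = pack('H32', $hexKey);
        $lo = 0;
        $hi = strlen($blob) / FH_REC - 1;
        while ($lo <= $hi) {
            $mid = ($lo + $hi) >> 1;
            $cmp = strcmp(substr($blob, $mid * FH_REC, 16), $needle);
            if ($cmp === 0) {
                return fh_unpack_value(substr($blob, $mid * FH_REC + 16, 8));
            }
            if ($cmp < 0) {
                $lo = $mid + 1;
            } else {
                $hi = $mid - 1;
            }
        }
        return null;                    // key not present
    }

Usage would be something like $blob = fh_build($map); $id = fh_get($blob, $md5);
Slower than a native array, but a fraction of the memory.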
-- Dante
----------
D. Dante Lorenso
dante@lorenso.com
972-333-4139
Re: memory efficient hash table extension? like lchash ...
on 25.01.2010 20:40:26 by Ravi Menon
PHP does expose Sys V shared-memory APIs (the shm_* functions):
http://us2.php.net/manual/en/book.sem.php
If you already have apc installed, you could also try:
http://us2.php.net/manual/en/book.apc.php
APC also allows you to store user-specific data (it will be in
shared memory).
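For reference, the user-cache side of APC is just apc_store()/apc_fetch();
a minimal sketch (the 'idx:' prefix and the $md5/$rowId variables are
placeholders, and a CLI script needs apc.enable_cli=1):

    <?php
    apc_store('idx:' . $md5, $rowId);            // default ttl 0 = no expiry
    $rowId = apc_fetch('idx:' . $md5, $found);   // $found is set to true on a hit
    if (!$found) {
        // key is not in the cache
    }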
Haven't tried these myself, so I would do some quick tests to check
whether they meet your performance requirements. In theory, it should be
faster than Berkeley DB-like solutions (which are another option,
but it seems something similar, like MongoDB, was not good enough?).
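Another local, non-networked option in the same spirit is the dba
extension, if your PHP build includes a Berkeley DB handler - a rough
sketch, where the file path, the 'db4' handler name and the $md5/$rowId
variables are assumptions:

    <?php
    $db = dba_open('/tmp/string_to_id.db4', 'c', 'db4');  // 'c' = create if missing
    if ($db === false) {
        die("could not open dba file\n");
    }
    dba_replace($md5, (string) $rowId, $db);   // dba stores string keys and values
    $rowId = dba_fetch($md5, $db);             // false when the key is absent
    dba_close($db);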
I am curious to know if someone here has run these tests. Note that
with memcached installed locally (on the same box running PHP), it can
be surprisingly efficient - using pconnect(), caching the handler in
a static var for a given request cycle, etc.
Ravi
Re: memory efficient hash table extension? like lchash ...
on 26.01.2010 00:49:28 by Dante Lorenso
J Ravi Menon wrote:
> PHP does expose sys V shared-memory apis (shm_* functions):
> http://us2.php.net/manual/en/book.sem.php
I will look into this.  I really need a key/value map, though, and would
rather not have to write my own on top of SHM.
> If you already have apc installed, you could also try:
> http://us2.php.net/manual/en/book.apc.php
> APC also allows you to store user specific data too (it will be in a
> shared memory).
I've looked into the apc_store and apc_fetch routines:
http://php.net/manual/en/function.apc-store.php
http://www.php.net/manual/en/function.apc-fetch.php
... but quickly ran out of memory for APC, and though I figured out how
to configure it to use more (by adjusting the shared memory allotment), there were
other problems. I ran into issues with logs complaining about "cache
slamming" and other known bugs with APC version 3.1.3p1. Also, after
several million values were stored, the APC storage began to slow down
*dramatically*. I wasn't certain if APC was using only RAM or was
possibly also writing to disk. Performance tanked so quickly that I set
it aside as an option and moved on.
> Haven't tried these myself, so I would do some quick tests to ensure
> if they meet your performance requirements. In theory, it should be
> faster than berkeley-db like solutions (which is also another option
> but it seems something similar like MongoDB was not good enough?).
I will run more tests against MongoDB. Initially I tried to use it to
store everything. If I only store my indexes, it might fare better.
Certainly, though, running queries and updates against a remote server
will always be slower than doing the lookups locally in ram.
> I am curious to know if someone here has run these tests. Note that
> with memcached installed locally (on the same box running php), it can
> be surprisingly efficient - using pconnect(), caching the handler in
> a static var for a given request cycle etc...
memcached gives no guarantee about data persistence. I need to have a
hash table that will contain all the values I set. They don't need to
survive a server shutdown (don't need to be written to disk), but I can
not afford for the server to throw away values that don't fit into
memory.  If there is a way to configure memcached to guarantee storage,
that might work.
-- Dante
--
----------
D. Dante Lorenso
dante@lorenso.com
972-333-4139
Re: memory efficient hash table extension? like lchash ...
on 26.01.2010 02:35:17 by Ravi Menon
> values were stored, the APC storage began to slow down *dramatically*. I
> wasn't certain if APC was using only RAM or was possibly also writing to
> disk. Performance tanked so quickly that I set it aside as an option and
> moved on.
IIRC, I think it is built over shm and there is no disk backing store.
> memcached gives no guarantee about data persistence. I need to have a hash
> table that will contain all the values I set. They don't need to survive a
> server shutdown (don't need to be written to disk), but I can not afford for
> the server to throw away values that don't fit into memory. If there is a
> way to configure memcached to guarantee storage, that might work.
True, but the LRU policy only kicks in lazily. So if you ensure that
you never get near the max allowed limit (-m option), and you store
your key-val pairs with no expiry, they will be present till the next
restart. So essentially you would have to estimate a value for the
-m option big enough to accommodate all possible key-val pairs (the
evictions counter in memcached stats should remain 0). BTW, I have
seen this implementation behavior in the 1.2.x series but am not sure it is
necessarily guaranteed in future versions.
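A rough sketch of that setup with the pecl Memcache client (the sizes
and the $md5/$rowId variables are illustrative): start memcached with
-m comfortably above the expected data set, e.g. "memcached -d -m 4096",
then store with an expire of 0 so items never time out on their own:

    <?php
    $mc = new Memcache();
    $mc->pconnect('127.0.0.1', 11211);   // persistent connection to the local daemon
    $mc->set($md5, $rowId, 0, 0);        // flags = 0, expire = 0 (never expires)
    $rowId = $mc->get($md5);             // false if missing (or evicted)

Checking that the evictions counter in the stats output stays at 0, as
above, confirms nothing has been dropped.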
Ravi