How to increase performance BerkeleyDB?

am 14.11.2007 13:25:16 von palexvs

I have a BDB with about 10 million records, and every day I add ~100K more.
My Perl script that uses this DB now works very slowly: it spends 4-5
minutes looking up 40K records.
The script runs every 10 minutes, gets client identifiers from logs
(such as Apache's), looks each UUID up in the BDB, and adds it if it does not exist.
I use perl 5.8.8, p5-BerkeleyDB-0.31, FreeBSD 6.2.

Settings:
Algorithm: B-Tree
Key: 67-72 bytes
Value: 1 byte - 1 Kbyte
'bt_ndata' => 10500248,
'bt_int_pgfree' => 826208,
'bt_pagesize' => 16384,
'bt_free' => 0,
'bt_over_pgfree' => 0,
'bt_leaf_pg' => 197947,
'bt_dup_pg' => 0,
'bt_levels' => 3,
'bt_version' => 9,
'bt_dup_pgfree' => 0,
'bt_flags' => 0,
'bt_minkey' => 2,
'bt_re_pad' => 32,
'bt_nkeys' => 10500248,
'bt_magic' => 340322,
'bt_leaf_pgfree' => 1542999750,
'bt_metaflags' => 0,
'bt_maxkey' => 0,
'bt_re_len' => 0,
'bt_int_pg' => 348,
'bt_over_pg' => 0

How to increase performance?

P.S.: I have already tried:
- changing the page size to 1K, 6K, 8K, 16K, 32K, 64K
- splitting the one big BDB into 16 files by the first character of the key
- using Hash instead of B-Tree
....but none of it had any effect.


#### SCRIPT
#!/usr/bin/perl

use strict;
use warnings;
use 5.8.8;

use BerkeleyDB;

tie my %bdbh, 'BerkeleyDB::Btree',
    -Filename  => 'uniq.db',
    -Cachesize => 200000000,
    -Flags     => DB_RDONLY
    or die "$!\n";
open(FH, '<', 'UUID.list') or die "$!\n";
while (my $key = <FH>) {
    chomp($key);
    if (exists($bdbh{$key})) {
        ### Found key
    }
    else {
        ### Key not found
    }
}
close(FH);
untie %bdbh;

##### UUID.list (40K records)
00000000000000000000000000000000_00000000000000000000000000000000_000000
.....
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF_FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF_999999

All keys in UUID.list already exist in uniq.db.

Re: How to increase performance BerkeleyDB?

am 15.11.2007 22:03:07 von Mark Clements

palexvs@gmail.com wrote:
> I have a BDB with about 10 million records, and every day I add ~100K more.
> My Perl script that uses this DB now works very slowly: it spends 4-5
> minutes looking up 40K records.
> The script runs every 10 minutes, gets client identifiers from logs
> (such as Apache's), looks each UUID up in the BDB, and adds it if it does not exist.


I've just had a quick play with this, and I can do 40000 lookups on a 626MB
BerkeleyDB file containing 10 million records (the keys being UUIDs) in
a matter of seconds.

I suggest you identify the bottlenecks in your code using

Benchmark::Timer
Devel::DProf
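A minimal sketch of that instrumentation, using only the core Time::HiRes
module (Benchmark::Timer automates the same bookkeeping and prints nicer
reports); a plain hash stands in for the tied DB so the snippet is
self-contained:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Stand-in for the tied BerkeleyDB hash, so this runs without a DB file.
my %db = map { ("key$_" => 1) } 1 .. 10_000;

my (%elapsed, %trials);    # tag => accumulated seconds, tag => count

# Time a section of code under a named tag, Benchmark::Timer style.
sub timed {
    my ($tag, $code) = @_;
    my $t0 = time;
    $code->();
    $elapsed{$tag} += time - $t0;
    $trials{$tag}++;
}

timed('lookuploop', sub {
    for my $n (1 .. 10_000) {
        timed('lookup', sub { my $found = exists $db{"key$n"} });
    }
});

printf "%d trials of %s (%.3fs total)\n", $trials{$_}, $_, $elapsed{$_}
    for sort keys %elapsed;
```

With per-tag totals like these you can see immediately whether the time
goes into the tie, the loop as a whole, or the individual lookups.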

Do you have limited RAM? Is the data on a network filesystem? Is the
machine heavily loaded?

Mark

Re: How to increase performance BerkeleyDB?

am 16.11.2007 15:18:21 von palexvs

On 15 Nov, 23:03, Mark Clements wrote:
> pale...@gmail.com wrote:
> > I have a BDB with about 10 million records, and every day I add ~100K more.
> > My Perl script that uses this DB now works very slowly: it spends 4-5
> > minutes looking up 40K records.
> > The script runs every 10 minutes, gets client identifiers from logs
> > (such as Apache's), looks each UUID up in the BDB, and adds it if it does not exist.
>
>
> I've just had a quick play with this and can do 40000 lookups on a 626MB
> BerkeleyDB file containing 10 million records (the keys being UUIDs) in
> a matter of seconds .
>
> I suggest you identify the bottlenecks in your code using
>
> Benchmark::Timer
> Devel::DProf
Total Elapsed Time = 399.8157 Seconds
User+System Time = 2.573587 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
 81.0   2.085  2.085  40085   0.0001 0.0001 BerkeleyDB::Common::db_get
 10.5   0.272  2.358  40085   0.0000 0.0001 BerkeleyDB::_tiedHash::EXISTS
 0.58   0.015  0.037      5   0.0031 0.0074 main::BEGIN
 0.31   0.008  0.008      1   0.0078 0.0078 BerkeleyDB::Btree::_db_open_btree
 0.31   0.008  0.008      3   0.0026 0.0026 DynaLoader::dl_load_file
 0.31   0.008  0.015      7   0.0011 0.0021 IO::File::BEGIN
 0.31   0.008  0.008     34   0.0002 0.0002 Exporter::import
 0.00   0.000  0.000      1   0.0000 0.0000 BerkeleyDB::__ANON__
 0.00       - -0.000      1        -      - IO::bootstrap
 0.00       - -0.000      1        -      - BerkeleyDB::bootstrap
 0.00       - -0.000      1        -      - BerkeleyDB::AUTOLOAD
 0.00       - -0.000      1        -      - BerkeleyDB::constant
 0.00       - -0.000      1        -      - BerkeleyDB::ParseParameters
 0.00       - -0.000      1        -      - BerkeleyDB::parseEncrypt
 0.00       - -0.000      1        -      - warnings::BEGIN

#iostat -w 1
CPU: use 0-1%
HDD: 1-2MB/s

> Do you have limited RAM? Is the data on a network filesystem?
OS: FreeBSD 6.2-RELEASE-p4
CPU: Intel(R) Xeon(R) CPU 5110 @ 1.60GHz (1597.65-MHz 686-class CPU);
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
RAM: avail memory = 2094858240 (1997 MB)
HDD: mfid1: 238848MB (489160704 sectors)

> Is the machine heavily loaded?
The server is otherwise idle; it is used only for my test.

Re: How to increase performance BerkeleyDB?

am 16.11.2007 20:08:30 von Mark Clements

palexvs@gmail.com wrote:
> On 15 Nov, 23:03, Mark Clements
> wrote:
>> pale...@gmail.com wrote:
>>> I have a BDB with about 10 million records, and every day I add ~100K more.
>>> My Perl script that uses this DB now works very slowly: it spends 4-5
>>> minutes looking up 40K records.
>>> The script runs every 10 minutes, gets client identifiers from logs
>>> (such as Apache's), looks each UUID up in the BDB, and adds it if it does not exist.
>>
>> I've just had a quick play with this and can do 40000 lookups on a 626MB
>> BerkeleyDB file containing 10 million records (the keys being UUIDs) in
>> a matter of seconds .
>>
>> I suggest you identify the bottlenecks in your code using
>>
>> Benchmark::Timer
>> Devel::DProf
> Total Elapsed Time = 399.8157 Seconds
> User+System Time = 2.573587 Seconds
> Exclusive Times

OK - so there is a massive discrepancy between the time reported by
dprofpp and the total elapsed time. You need to establish why this is
the case.


http://www.perlmonks.org/?node_id=633699

explains that

Total Elapsed Time
This is wall clock time from the program's start to finish, no
matter how that time was spent.
User+System Time
This is how much CPU time the program took. This does not include
time spent waiting on the disk, the network, or other tasks. ("User
time" is time spent in your code and "system time" is time the operating
system spent serving your code.)
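The distinction can be demonstrated directly in Perl: a process that
merely waits accumulates wall-clock time but almost no CPU time. A small
sketch using the core Time::HiRes module and the built-in times():

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

my $wall0 = time;                  # wall-clock start
my ($u0, $s0) = (times)[0, 1];     # user and system CPU time used so far

sleep 0.5;                         # waiting, not computing

my $wall = time - $wall0;
my $cpu  = ((times)[0] - $u0) + ((times)[1] - $s0);

printf "wall: %.2fs  cpu: %.2fs\n", $wall, $cpu;
# wall is about 0.5s while cpu stays near zero: time spent waiting
# (on sleep here, on disk in the original script) is invisible to
# User+System Time.
```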

I'd go back to the Benchmark::Timer suggestion to try and get broader
statistics on the execution of your program. Mine gives

F:\Documents and Settings\Mark3>perl testberk.pl
1 trial of all (2.121s total)
1 trial of tie (14.760ms total)
1 trial of lookuploop (2.106s total)
40000 trials of lookup (1.285s total), 32us/trial

for example.

>
>> Do you have limited RAM? Is the data on a network filesystem?
> OS: FreeBSD 6.2-RELEASE-p4
> CPU: Intel(R) Xeon(R) CPU 5110 @ 1.60GHz (1597.65-MHz 686-class CPU);
> FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
> RAM: avail memory = 2094858240 (1997 MB)
> HDD: mfid1: 238848MB (489160704 sectors)
OK - so it's far from underpowered.

>> Is the machine heavily loaded?
> The server is otherwise idle; it is used only for my test.
Can you gather sar statistics (might be atsar, depending on your flavour
of unix)?

Mark

Re: How to increase performance BerkeleyDB?

am 16.11.2007 22:28:29 von Ilya Zakharevich

[A complimentary Cc of this posting was sent to
palexvs@gmail.com
], who wrote in article <2e5fb23c-f0e6-43c0-a518-849cfb693c3d@a28g2000hsc.googlegroups.com>:

> Total Elapsed Time = 399.8157 Seconds
> User+System Time = 2.573587 Seconds

What does running `time' on your script return? It looks like 99.5% of
the time is spent in iowait...

> #iostat -w 1
> CPU: use 0-1%
> HDD: 1-2MB/s

My iostat has no `-w'. What is the meaning of the HDD figure: the current
usage, or the maximum recorded?

Yours,
Ilya

Re: How to increase performance BerkeleyDB?

am 17.11.2007 07:44:22 von Mark Clements

Mark Clements wrote:
> palexvs@gmail.com wrote:
>> On 15 Nov, 23:03, Mark Clements
>> wrote:
>>> pale...@gmail.com wrote:
>>>> I have a BDB with about 10 million records, and every day I add ~100K more.
>>>> My Perl script that uses this DB now works very slowly: it spends 4-5
>>>> minutes looking up 40K records.
>>>> The script runs every 10 minutes, gets client identifiers from logs
>>>> (such as Apache's), looks each UUID up in the BDB, and adds it if it does not exist.
>>>
>>> I've just had a quick play with this and can do 40000 lookups on a 626MB
>>> BerkeleyDB file containing 10 million records (the keys being UUIDs) in
>>> a matter of seconds .
>>>
>>> I suggest you identify the bottlenecks in your code using
>>>
>>> Benchmark::Timer
>>> Devel::DProf
>> Total Elapsed Time = 399.8157 Seconds
>> User+System Time = 2.573587 Seconds
>> Exclusive Times
>
> OK - so there is a massive discrepancy between the time reported by
> dprofpp and the total elapsed time. You need to establish why this is
> the case.

Does another process have the db file open and locked, either with eg
flock or with BerkeleyDB's locking mechanism?

Re: How to increase performance BerkeleyDB?

am 17.11.2007 14:46:09 von palexvs

First start:
tie      1 trial of tie (21.490ms total)
while    1 trial of while (292.197s total)
get_key  40085 trials of get_key (291.223s total), 7.265ms/trial
close    1 trial of close (4.889ms total)

#iostat -w 2
tty mfid0 mfid1 cpu
tin tout KB/t tps MB/s KB/t tps MB/s us ni sy in id
0 222 0.00 0 0.00 16.00 95 1.48 1 0 0 0 99
0 340 0.00 0 0.00 16.00 95 1.48 0 0 0 0 100
0 222 0.00 0 0.00 16.00 98 1.53 0 0 0 0 99
0 221 0.00 0 0.00 16.00 100 1.56 0 0 0 0 100

Second start:
tie      1 trial of tie (4.096ms total)
while    1 trial of while (2.595s total)
get_key  40085 trials of get_key (1.712s total), 42us/trial
close    1 trial of close (4.945ms total)

#iostat -w 1
tty mfid0 mfid1 cpu
tin tout KB/t tps MB/s KB/t tps MB/s us ni sy in id
13 5072 0.00 0 0.00 0.00 0 0.00 0 0 0 0 99
0 269 0.00 0 0.00 0.00 0 0.00 15 0 11 0 74
0 225 0.00 0 0.00 0.00 0 0.00 16 0 9 0 75
0 1383 0.00 0 0.00 0.00 0 0.00 13 0 7 0 80

On the first run the script used almost no CPU and read from the disk at
only 1-2 MB/s (while the disk can sustain about 40 MB/s).
On the second run the script was very fast, probably because by then the
BDB file was cached, either by the OS or in the HDD cache.
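One way to test that hypothesis is to warm the cache with a single
sequential read of the DB file before the first run: a sequential pass
runs at full disk speed, and the script's random point reads afterwards
hit memory instead of disk. A sketch, with a small scratch file standing
in for uniq.db:

```shell
# Create a ~10 MB scratch file standing in for uniq.db.
dd if=/dev/zero of=sample.db bs=65536 count=160 2>/dev/null

# One sequential pass pulls the file into the OS page cache;
# subsequent random reads then come from memory, not disk.
dd if=sample.db of=/dev/null bs=65536

rm -f sample.db
```

This only helps, of course, while the whole file fits in the cache; if
it does not, the random reads will still miss.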