Updating counts
on 30.03.2007 09:19:42 by gbills
Hi all;
Sorry if I'm asking a silly question - I'm very inexperienced with
databases. I'm playing around with using DBI / SQLite as a back-end for
parsing and storing log files - I have multiple columns like "log date",
"logged by", etc. I parse these in a Perl script, and then store them in
the SQLite database. I would like to increment a "count" column if I
parse the same log line twice: on the first time I see that log line, I
parse it and store it in its own row, and on subsequent occasions I
increment a "count" column to say that I have seen that log line "count"
amount of times. Some quick Googling has shown some stuff to do with
auto increments, but that seemed to be much more to do with primary keys
and wasn't guaranteed to be in order / contiguous. I've also seen MySQL
syntax like "ON DUPLICATE KEY UPDATE ..." which looks good, but I'd like
to avoid an extra dependency on a database that has to be set up and
maintained (hence SQLite).
My preference for a solution would be something that's simple and
maintainable primarily, followed by fast (for reads first, writes
second) - but I don't have the database experience to know what the best
solution is. Is there an easy way? Should I be storing duplicate rows
and counting them at display time?
Using hashes, this works something like this (but I have more data than
I can safely store in memory with a hash, and sorting and merging
subhashes on top of this makes things tricky):
my %lineCount;
while (<>) {
    chomp();
    $lineCount{$_}++;
}
Using DBI, the best I've been able to get has been this:
my $dbh = DBI->connect("DBI:SQLite:dbname=mydb.sqldb", "", "");
$dbh->do("CREATE TABLE $table (line CHAR(100) PRIMARY KEY, count INT)");

# read data
while (<>) {
    chomp();
    my $seenCount = 0;
    my $rowRef = $dbh->selectall_arrayref(
        "SELECT * FROM $table WHERE line = ?", undef, $_);
    unless (@$rowRef) {
        # insert the row only if it isn't already there
        $dbh->do("INSERT INTO $table VALUES(?, ?)", undef, $_, $seenCount);
    }
    $dbh->do("UPDATE $table SET count = count + 1 WHERE line = ?", undef, $_);
}
Thanks in advance for any help.
RE: Updating counts
on 30.03.2007 15:44:50 by Philip.Garrett
Hi George,
George Bills wrote:
> Sorry if I'm asking a silly question - I'm very inexperienced with
> databases. I'm playing around with using DBI / SQLite as a back-end
> for parsing and storing log files - I have multiple columns like "log
> date", "logged by", etc. I parse these in a Perl script, and then
> store them in the SQLite database. I would like to increment a "count"
> column if I parse the same log line twice: on the first time I see
> that log line, I parse it and store it in its own row, and on
> subsequent occasions I increment a "count" column to say that I have
> seen that log line "count" number of times. Some quick Googling has
> shown some stuff to do with auto increments, but that seemed to be
> much more to do with primary keys and wasn't guaranteed to be in order
> / contiguous. I've also seen MySQL syntax like "ON DUPLICATE KEY
> UPDATE ..." which looks good, but I'd like to avoid an extra
> dependency on a database that has to be set up and maintained (hence
> SQLite).
>
> My preference for a solution would be something that's simple and
> maintainable primarily, followed by fast (for reads first, writes
> second) - but I don't have the database experience to know what the
> best solution is. Is there an easy way?
Yes, but it looks like you've already done it: try fetch, update if
successful, insert otherwise.
> Should I be storing duplicate rows and counting them at display time?
Probably not, but it depends on how many rows there are. Grouping lots
of rows to provide a count at display time would be slow. Might as well
get that step out of the way when you load the data.
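(Counting at display time would mean running something like this hypothetical query every time you report, reusing the $table name from your script:

my $counts = $dbh->selectall_arrayref(
    "SELECT line, COUNT(*) AS n FROM $table GROUP BY line");

With millions of raw rows, that scan gets expensive.)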
> $dbh->selectall_arrayref("SELECT * FROM $table WHERE line = ?", undef, $_);
Since you don't actually need the data from that row, you might speed
this up a little by changing "SELECT *" to "SELECT 1". Some DBMSes
will optimize that to only check an index instead of looking in the
table itself.
Now for some "generic" optimization hints:
1) Since you'll be calling each of these statements multiple times, you
should prepare() the statements outside the loop, and just execute()
them inside.
2) Turn off AutoCommit. Then, commit() infrequently. In most cases, this
will speed up data loading considerably.
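For example, applying both hints to your loop might look roughly like this (an untested sketch that reuses the $table name and schema from your script):

my $dbh = DBI->connect("DBI:SQLite:dbname=mydb.sqldb", "", "",
                       { RaiseError => 1, AutoCommit => 0 });

# Prepare each statement once, outside the loop.
my $check  = $dbh->prepare("SELECT 1 FROM $table WHERE line = ?");
my $insert = $dbh->prepare("INSERT INTO $table VALUES(?, 0)");
my $update = $dbh->prepare("UPDATE $table SET count = count + 1 WHERE line = ?");

my $n = 0;
while (<>) {
    chomp;
    $check->execute($_);
    $insert->execute($_) unless $check->fetchrow_array;   # row not there yet
    $check->finish;
    $update->execute($_);
    $dbh->commit unless ++$n % 10_000;   # commit every 10,000 lines
}
$dbh->commit;   # commit the final partial batch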
Regards,
Philip
RE: Updating counts
on 30.03.2007 15:55:23 by rjk-dbi
Garrett, Philip (MAN-Corporate) wrote:
>
> George Bills wrote:
> > I've also seen MySQL syntax like "ON DUPLICATE KEY
> > UPDATE ..." which looks good, but I'd like to avoid an extra
> > dependency on a database that has to be set up and maintained (hence
> > SQLite).
> >
> > My preference for a solution would be something that's simple and
> > maintainable primarily, followed by fast (for reads first, writes
> > second) - but I don't have the database experience to know what the
> > best solution is. Is there an easy way?
>
> Yes, but it looks like you've already done it: try fetch, update if
> successful, insert otherwise.
This will work as long as you don't have to worry about concurrent processes
inserting rows into the table. If you do have that, then you've created a race
condition. Two processes could try fetch at roughly the same time, fail, and
insert the same row, causing one of the processes to get a duplicate key error.
Instead you could insert while trapping errors, and update if a duplicate key
error occurs. However, that probably requires checking for a database-specific
error code or message, at which point you might as well just take advantage of
MySQL's ON DUPLICATE KEY UPDATE.
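In DBI terms, that insert-then-trap approach would be something like this sketch (with RaiseError enabled, so the failing execute() dies and eval traps it; statements prepared outside the read loop):

my $insert = $dbh->prepare("INSERT INTO $table VALUES(?, 1)");
my $update = $dbh->prepare("UPDATE $table SET count = count + 1 WHERE line = ?");

# Inside the read loop:
my $ok = eval { $insert->execute($_); 1 };
unless ($ok) {
    # Assumes the failure was a duplicate key. A robust version would
    # check $dbh->errstr for the driver-specific duplicate-key error.
    $update->execute($_);
}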
Ronald
Re: Updating counts
on 30.03.2007 17:47:26 by Tim.Bunce
On Fri, Mar 30, 2007 at 09:55:23AM -0400, Ronald Kimball wrote:
> Garrett, Philip (MAN-Corporate) wrote:
> >
> > George Bills wrote:
> > > I've also seen MySQL syntax like "ON DUPLICATE KEY
> > > UPDATE ..." which looks good, but I'd like to avoid an extra
> > > dependency on a database that has to be set up and maintained (hence
> > > SQLite).
> > >
> > > My preference for a solution would be something that's simple and
> > > maintainable primarily, followed by fast (for reads first, writes
> > > second) - but I don't have the database experience to know what the
> > > best solution is. Is there an easy way?
> >
> > Yes, but it looks like you've already done it: try fetch, update if
> > successful, insert otherwise.
>
> This will work as long as you don't have to worry about concurrent processes
> inserting rows into the table. If you do have that, then you've created a race
> condition. Two processes could try fetch at roughly the same time, fail, and
> insert the same row, causing one of the processes to get a duplicate key error.
>
> Instead you could insert while trapping errors, and update if a duplicate key
> error occurs. However, that probably requires checking for a database-specific
> error code or message, at which point you might as well just take advantage of
> MySQL's ON DUPLICATE KEY UPDATE.
Instead of "insert while trapping errors, and update if a duplicate key"
you can "update then insert if the update affected 0 rows". That avoids
the need to check for specific error codes. It also optimises for the
common case of the row already existing.
If the update affected 0 rows then the following insert may fail if
some other process inserts a matching record before you get to do your
insert. If your insert fails you can just retry the update again before
aborting.
Portable, safe, speedy.
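In DBI that might look like the sketch below, assuming $insert inserts the new line with a count of 1 and $update increments the count, both prepared outside the loop. Note that execute() returns the string "0E0" (true, but numerically zero) when no rows were affected:

my $rows = $update->execute($_);
if ($rows == 0) {                    # no existing row, so try to insert it
    my $ok = eval { $insert->execute($_); 1 };   # new row starts at count 1
    $update->execute($_) unless $ok; # lost the race: retry the update
}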
Tim.
RE: Updating counts
on 30.03.2007 17:56:02 by imharisa
I don't know if that would be speedy. Updates are far more expensive
than inserts. It does sound portable and safe, though.
RE: Updating counts
on 30.03.2007 18:06:59 by Chuck.Fox
An update that does not find a qualifying row executes in about the same
time as a select would. The only penalty is the time needed to verify
that the row does not exist.
Your Friendly Neighborhood DBA,
Chuck
Re: Updating counts
on 30.03.2007 20:12:05 by Tim.Bunce
On Fri, Mar 30, 2007 at 09:56:02AM -0600, Ian Harisay wrote:
> I don't know if that would be speedy. Updates are far more expensive
> than inserts.
That's very debatable, but the point is that updates would generally be
more frequent than inserts in a typical "increment counter" scenario.
Tim.
RE: Updating counts
on 30.03.2007 22:39:53 by imharisa
From what I see, you may want more than one table: one that stores your
log information (with columns like "log date", "logged by", "log
statement", etc.), and another that summarizes/counts how many times you
find whatever it is you wish to count (sketched below). For instance, if
it is a particular error message you want to count, you don't want to
include the "log date" and "logged by" information in the summary. If
"log date" measures time to the microsecond, most likely all your counts
will equal 1. I don't really know what your logging purpose is, though.
Have you looked at using log4perl? Log4perl can log directly to a
database. Then you could write a program to summarize your logs into
another table. I would look at that.
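To illustrate the two-table idea, a rough sketch (table and column names here are only placeholders):

# One row per raw log entry.
$dbh->do("CREATE TABLE log_entries (
              log_date      TEXT,
              logged_by     TEXT,
              log_statement TEXT)");

# One row per distinct message, with a running count.
$dbh->do("CREATE TABLE message_counts (
              message TEXT PRIMARY KEY,
              count   INTEGER NOT NULL DEFAULT 0)");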