Get an arbitrary hash key, quickly.

on 24.01.2008 18:25:07 by xhoster

I need a work queue, but I don't really care about the order (LIFO, FIFO,
random) in which things come out of it. Either is pretty simple and
efficient with a Perl array, and either would suffice. But I want the
queue to not hold duplicate entries. I could use an array as a stack or
queue, with a parallel hash to be checked to prevent duplicates from being
entered. But why keep parallel data structures? Just use the hash as the
queue:

while (keys %hash) {
  my $to_do = each %hash;
  delete $hash{$to_do};
  ## do stuff with $to_do, which might make new entries in %hash
};

Well, except that this gets really slow on large hashes.

I suspect the each operator does a linear scan of all the buckets until it
finds an occupied one. With a hash that used to be big but now isn't (and
so has lots of empty buckets), this behaves poorly, essentially being N^2
to empty a large hash.
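
To make that suspicion concrete, here is a minimal timing sketch (the hash
size and the use of Time::HiRes are illustrative choices, not from any
benchmark in this thread); the first loop resets the iterator on every
pass, the second lets it keep its place:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Time::HiRes qw(time);

    my %a = map { $_ => 1 } 1 .. 200_000;
    my %b = %a;

    my $t = time;
    while (keys %a) {            # scalar keys resets the iterator...
        my $to_do = each %a;     # ...so each rescans buckets from the start
        delete $a{$to_do};
    }
    printf "reset every pass: %.2fs\n", time - $t;

    $t = time;
    while (%b) {                 # boolean test does not reset the iterator
        my $to_do = each %b;
        next unless defined $to_do;
        delete $b{$to_do};
    }
    printf "iterator keeps its place: %.2fs\n", time - $t;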

If I just use scalar %hash instead of scalar keys %hash, then the slowdown
doesn't occur because "each" scans the buckets from where it left off last
time, instead of from the beginning each time. (At first I thought it was
scalar keys %hash that was being slow and was going to post about *that*,
but then I realized that keys' resetting of the iterator was causing "each"
to be slow). But you shouldn't change a hash at the same time you iterate
over it because things might get missed. Presumably, anything missed will
be found on the next time through the iterator, so this might work without
the slowdown:


while (%hash) {    # does not reset iterator
  my $to_do = each %hash;
  next unless defined $to_do;  # not a real hash key,
                               # signals end of iterator
  ## do stuff with $to_do, which might make new entries in %hash
};

If my speculation on the internals is right, then this would still
perform poorly if the hash first grows very large, then shrinks to
be only a few elements, but those last few elements keep triggering
the addition of just one more element each time. In my case, that
shouldn't be a problem.

Any comments on this code? Is there some less quirky way to do this
efficiently? Is there a simple (as simple as perl's internals can get)
way to fix "each" so that it doesn't have this degenerate case?

Thanks,

Xho


Re: Get an arbitrary hash key, quickly.

on 24.01.2008 18:43:32 by it_says_BALLS_on_your_forehead

On Jan 24, 12:25 pm, xhos...@gmail.com wrote:
> I need a work queue, but I don't really care about the order (LIFO, FIFO,
> random) in which things come out of it. Either is pretty simple and
> efficient with a Perl array, and either would suffice. But I want the
> queue to not hold duplicate entries. I could use an array as a stack or
> queue, with a parallel hash to be checked to prevent duplicates from being
> entered. But why keep parallel data structures? Just use the hash as the
> queue:
>
> while (keys %hash) {
>   my $to_do = each %hash;
>   delete $hash{$to_do};
>   ## do stuff with $to_do, which might make new entries in %hash
>
> };
>

if "do stuff" might make entries in %hash, how do you guarantee you
won't be operating on a duplicate entry, since you delete
$hash{$to_do}?

Re: Get an arbitrary hash key, quickly.

on 24.01.2008 19:07:41 by xhoster

nolo contendere wrote:
> On Jan 24, 12:25 pm, xhos...@gmail.com wrote:
> > I need a work queue, but I don't really care about the order (LIFO,
> > FIFO, random) in which things come out of it. Either is pretty
> > simple and efficient with a Perl array, and either would suffice.
> > But I want the queue to not hold duplicate entries. I could use
> > an array as a stack or queue, with a parallel hash to be checked to
> > prevent duplicates from being entered. But why keep parallel data
> > structures? Just use the hash as the queue:
> >
> > while (keys %hash) {
> >   my $to_do = each %hash;
> >   delete $hash{$to_do};
> >   ## do stuff with $to_do, which might make new entries in %hash
> >
> > };
> >
>
> if "do stuff" might make entries in %hash, how do you guarantee you
> won't be operating on a duplicate entry, since you delete
> $hash{$to_do}?

I want to guarantee that duplicate entries won't be in the queue *at the
same time*. Whether they can be re-entered at a later point is up to
the details of "do stuff", which vary from project to project and which
I am intentionally avoiding.

Xho


Re: Get an arbitrary hash key, quickly.

on 24.01.2008 19:25:10 by Uri Guttman

>>>>> "x" == xhoster writes:

x> I need a work queue, but I don't really care about the order (LIFO, FIFO,
x> random) in which things come out of it. Either is pretty simple and
x> efficient with a Perl array, and either would suffice. But I want the
x> queue to not hold duplicate entries. I could use an array as a stack or
x> queue, with a parallel hash to be checked to prevent duplicates from being
x> entered. But why keep parallel data structures? Just use the hash as the
x> queue:

why not use the hash/array pair? and how do you generate dup work
request keys anyway? why not use a ref to the work thing as the key as
that will be unique. then an array will be fine.

x> while (keys %hash) {
x>   my $to_do = each %hash;

why not do this instead?

while( my $to_do = each %hash ) {

    #do work
    delete $hash{$to_do};
    keys %hash ;    # reset iterator
}

the docs say IIRC that keys in a void context will reset the
iterator. keys in a while condition is in a scalar/boolean context and
may just return the number of keys.

and as another poster says, adding hash elements in a loop may change the
iterator order and cause unknown results. you need to do a clean reset
of the hash iterator so the each will work correctly.

x> If I just use scalar %hash instead of scalar keys %hash, then the slow down
x> doesn't occur because "each" scans the buckets from where it left off last
x> time, instead of from the beginning each time. (At first I thought it was
x> scalar keys %hash that was being slow and was going to post about *that*,
x> but then I realized that keys' reseting of the iterator was causing "each"
x> to be slow). But you shouldn't change a hash at the same time you iterate
x> over it because things might get missed. Presumably, anything missed will
x> be found on the next time through the iterator, so this might work without
x> the slowdown:

and you may not see all the elements of the hash as the iterator could
skip over new entries.

the best answer is what i said above, do a proper iterator reset after
you process the current work (actually just after the each() call should
work fine too and that would be clearer).

uri

--
Uri Guttman ------ uri@stemsystems.com -------- http://www.sysarch.com --
----- Perl Architecture, Development, Training, Support, Code Review ------
----------- Search or Offer Perl Jobs ----- http://jobs.perl.org ---------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------

Re: Get an arbitrary hash key, quickly.

on 24.01.2008 19:59:24 by xhoster

Uri Guttman wrote:
> >>>>> "x" == xhoster writes:
>
> x> I need a work queue, but I don't really care about the order (LIFO,
> x> FIFO, random) in which things come out of it. Either is pretty
> x> simple and efficient with a Perl array, and either would suffice.
> x> But I want the queue to not hold duplicate entries. I could use an
> x> array as a stack or queue, with a parallel hash to be checked to
> x> prevent duplicates from being entered. But why keep parallel data
> x> structures? Just use the hash as the queue:
>
> why not use the hash/array pair?

Parallel data structures are ugly and easy to screw up because they always
need to be kept in sync. Also, they need to use more memory, which in some
cases may be important. But other than that, there is no reason not to.
That is what I often do, and it works; I just want something neater.


> and how do you generate dup work
> request keys anyway? why not use a ref to the work thing as the key as
> that will be unique. then an array will be fine.

I don't understand what you are saying. What stops an array from
containing many duplicate values, be they references or anything else?

>
> x> while (keys %hash) {
> x>   my $to_do = each %hash;
>
> why not do this instead?
>
> while( my $to_do = each %hash ) {
>
>     #do work
>     delete $hash{$to_do};
>     keys %hash ;    # reset iterator

Effectively the same thing, just re-worded. Works, but has the
same slowness issue, presumably for the same reason.

>
> the docs say IIRC that keys in a void context will reset the
> iterator. keys in a while condition is in a scalar/boolean context and
> may just return the number of keys.

Nah, it resets in all contexts. In a void context, that is all it does.
(i.e. it doesn't build the return list and then throw it away, or
something like that.)
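
A quick sketch of that reset (any small hash will do; within one run an
unmodified hash iterates in a stable order, so both each calls below
return the same key):

    my %h = ( a => 1, b => 2, c => 3 );

    my $first = each %h;   # iterator advances past the first key
    keys %h;               # void context: builds no list, just resets
    my $again = each %h;   # back to the start: $again eq $first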

Xho


Re: Get an arbitrary hash key, quickly.

on 24.01.2008 20:29:48 by Mark Clements

xhoster@gmail.com wrote:
> I need a work queue, but I don't really care about the order (LIFO, FIFO,
> random) in which things come out of it. Either is pretty simple and
> efficient with a Perl array, and either would suffice. But I want the
> queue to not hold duplicate entries. I could use an array as a stack or
> queue, with a parallel hash to be checked to prevent duplicates from being
> entered. But why keep parallel data structures? Just use the hash as the
> queue:
>
> while (keys %hash) {
>   my $to_do = each %hash;
>   delete $hash{$to_do};
>   ## do stuff with $to_do, which might make new entries in %hash
> };
>
> Well, except that this gets really slow on large hashes.
>

Have you looked at one of the Heap modules, e.g. Heap::Simple?

Mark

Re: Get an arbitrary hash key, quickly.

on 24.01.2008 20:33:50 by it_says_BALLS_on_your_forehead

On Jan 24, 1:59 pm, xhos...@gmail.com wrote:
> Uri Guttman wrote:
> > >>>>> "x" == xhoster writes:
>
> >   x> while (keys %hash) {
> >   x>   my $to_do = each %hash;
>
> > why not do this instead?
>
> > while( my $to_do = each %hash ) {
>
> >     #do work
> >     delete $hash{$to_do};
> >     keys %hash ;    # reset iterator
>
> Effectively the same thing, just re-worded. Works, but has the
> same slowness issue, presumably for the same reason.
>

So if you remove the keys %hash in the above scenario, would the
condition to 'while' evaluate to true after the iterator reached the
end if keys were added? Will the iterator "know" that it needs to loop
around again? It's unclear from perldoc -f each:

When the hash is entirely read, a null array is
returned in list context (which when assigned
produces a false (0) value), and "undef" in scalar
context. The next call to "each" after that will
start iterating again. There is a single iterator
for each hash, shared by all "each", "keys", and
"values" function calls in the program; it can be
reset by reading all the elements from the hash, or
by evaluating "keys HASH" or "values HASH". If you
add or delete elements of a hash while you're
iterating over it, you may get entries skipped or
duplicated, so don't. Exception: It is always safe
to delete the item most recently returned by
"each()", which means that the following code will
work:

while (($key, $value) = each %hash) {
    print $key, "\n";
    delete $hash{$key}; # This is safe
}

It says undef will be returned if the hash is entirely read. But since
the hash is not empty, is it considered "entirely read"? Also, it says
it's always safe to delete the most recently returned item, but sounds
like it's UNsafe to add elements while iterating.

based on this, Uri's way (which is essentially the same as your
original slow way, if a tad bit cleaner), seems the correct way.

Re: Get an arbitrary hash key, quickly.

on 24.01.2008 20:43:52 by 1usa

xhoster@gmail.com wrote in news:20080124122509.078$og@newsreader.com:

> I need a work queue, but I don't really care about the order (LIFO,
> FIFO, random) in which things come out of it. Either is pretty simple
> and efficient with a Perl array, and either would suffice. But I want
> the queue to not hold duplicate entries. I could use an array as a
> stack or queue, with a parallel hash to be checked to prevent
> duplicates from being entered. But why keep parallel data structures?
> Just use the hash as the queue:
>
> while (keys %hash) {
>   my $to_do = each %hash;
>   delete $hash{$to_do};
>   ## do stuff with $to_do, which might make new entries in %hash
> };
>
> Well, except that this gets really slow on large hashes.

It is possible I am misunderstanding what you are doing, but I would
have written:

#!/usr/bin/perl

use strict;
use warnings;

my %queue;
$queue{ $_ } = undef for 1 .. 1_000_000;

while ( my @keys = keys %queue ) {
    for my $task ( @keys ) {
        delete $queue{ $task };
        #
        # Do something
        #
        # Add another task with a small probability
        #
        if ( 0.1 > rand ) {
            $queue{ 1 + int(rand 1_000_000) } = undef;
        }
    }
}


C:\Home\asu1\Src\test\hash_queue> timethis hq

TimeThis :  Command Line :  hq
TimeThis :    Start Time :  Thu Jan 24 14:42:36 2008
TimeThis :      End Time :  Thu Jan 24 14:42:46 2008
TimeThis :  Elapsed Time :  00:00:09.312

Sinan

--
A. Sinan Unur <1usa@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)
clpmisc guidelines:

Re: Get an arbitrary hash key, quickly.

on 24.01.2008 21:24:59 by it_says_BALLS_on_your_forehead

On Jan 24, 2:43 pm, "A. Sinan Unur" <1...@llenroc.ude.invalid> wrote:
> xhos...@gmail.com wrote in news:20080124122509.078$og@newsreader.com:
>
> > I need a work queue, but I don't really care about the order (LIFO,
> > FIFO, random) in which things come out of it. Either is pretty simple
> > and efficient with a Perl array, and either would suffice. But I want
> > the queue to not hold duplicate entries. I could use an array as a
> > stack or queue, with a parallel hash to be checked to prevent
> > duplicates from being entered. But why keep parallel data structures?
> > Just use the hash as the queue:
>
> > while (keys %hash) {
> >   my $to_do = each %hash;
> >   delete $hash{$to_do};
> >   ## do stuff with $to_do, which might make new entries in %hash
> > };
>
> > Well, except that this gets really slow on large hashes.
>
> It is possible I am misunderstanding what you are doing, but I would
> have written:
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> my %queue;
> $queue{ $_ } = undef for 1 .. 1_000_000;
>
> while ( my @keys = keys %queue ) {
>     for my $task ( @keys ) {
>         delete $queue{ $task };
>         #
>         # Do something
>         #
>         # Add another task with a small probability
>         #
>         if ( 0.1 > rand ) {
>             $queue{ 1 + int(rand 1_000_000) } = undef;
>         }
>     }
> }
>
> C:\Home\asu1\Src\test\hash_queue> timethis hq
>
> TimeThis :  Command Line :  hq
> TimeThis :    Start Time :  Thu Jan 24 14:42:36 2008
> TimeThis :      End Time :  Thu Jan 24 14:42:46 2008
> TimeThis :  Elapsed Time :  00:00:09.312
>

That was my initial thought as well (well, something like that
anyway), and I think that was Xho's initial thought. But I think he's
trying to devise a solution that doesn't require both %queue and
@keys.

Re: Get an arbitrary hash key, quickly.

on 24.01.2008 21:43:33 by xhoster

nolo contendere wrote:
> On Jan 24, 1:59 pm, xhos...@gmail.com wrote:
> > Uri Guttman wrote:
> > > >>>>> "x" == xhoster writes:
> >
> > > x> while (keys %hash) {
> > > x>   my $to_do = each %hash;
> >
> > > why not do this instead?
> >
> > > while( my $to_do = each %hash ) {
> >
> > >     #do work
> > >     delete $hash{$to_do};
> > >     keys %hash ;    # reset iterator
> >
> > Effectively the same thing, just re-worded. Works, but has the
> > same slowness issue, presumably for the same reason.
> >
>
> So if you remove the keys %hash in the above scenario, would the
> condition to 'while' evaluate to true after the iterator reached the
> end if keys were added?

No. Once the iterator reaches the end, it will return undef and the loop
will exit. So as far as the loop is concerned, there is no "after"
the iterator reaches the end, because the loop is done.

> will the iterator "know" that it needs to loop
> around again?

Only if you happen to invoke it once more after it already indicated
that it was done. But the loop, without the reset, doesn't invoke it
again.


> it's unclear from the perldoc -f each:
>
> When the hash is entirely read, a null array is
> returned in list context (which when assigned
> produces a false (0) value), and "undef" in scalar
> context. The next call to "each" after that will
> start iterating again. There is a single iterator
> for each hash, shared by all "each", "keys", and
> "values" function calls in the program; it can be
> reset by reading all the elements from the hash, or
> by evaluating "keys HASH" or "values HASH". If you
> add or delete elements of a hash while you're
> iterating over it, you may get entries skipped or
> duplicated, so don't. Exception: It is always safe
> to delete the item most recently returned by
> "each()", which means that the following code will
> work:
>
> while (($key, $value) = each %hash) {
>     print $key, "\n";
>     delete $hash{$key}; # This is safe
> }
>
> It says undef will be returned if the hash is entirely read. But since
> the hash is not empty, is it considered "entirely read"?

Of course. Otherwise, if you did an each without a delete (which
is the most common way to use each), then it would be an infinite loop.

> Also, it says
> it's always safe to delete the most recently returned item, but sounds
> like it's UNsafe to add elements while iterating.

If you violate their warning about adding things to the hash while
iterating, then it may be deemed fully read even though some elements have
not been iterated over during this current round of iteration. So if you
violate that warning, then when each returns empty you have to use some
other way to make sure the hash itself is empty.

I'm pretty sure it is safe in the sense that it will not segfault or
anything like that, it is only unsafe towards your assumptions that
everything has been seen.


> based on this, Uri's way (which is essentially the same as your
> original slow way, if a tad bit cleaner), seems the correct way.

I don't see that. Both methods always reset the iterator between the time
additions are made and the time each is called again. (Well, Uri's doesn't
reset the iterator before the very first each, which presumably comes
after some insert which was done before the loop was reached, but surely
that is not a problem.)

Xho


Re: Get an arbitrary hash key, quickly.

on 24.01.2008 21:54:41 by it_says_BALLS_on_your_forehead

On Jan 24, 3:43 pm, xhos...@gmail.com wrote:
> nolo contendere wrote:
> > On Jan 24, 1:59 pm, xhos...@gmail.com wrote:
> > > Uri Guttman wrote:
> > > > >>>>> "x" == xhoster writes:
>
> > based on this, Uri's way (which is essentially the same as your
> > original slow way, if a tad bit cleaner), seems the correct way.
>
> I don't see that. Both methods always reset the iterator between the time
> additions are made and the time each is called again. (Well, Uri's doesn't
> reset the iterator before the very first each, which presumably comes
> after some insert which was done before the loop was reached, but surely
> that is not a problem.)


True, the flaw in this algorithm is that it always resets the
iterator. Personally, I like Sinan's implementation.

Re: Get an arbitrary hash key, quickly.

on 24.01.2008 22:00:34 by 1usa

nolo contendere wrote in
news:36b5756e-5bfd-4d44-8109-978ef73db81d@t1g2000pra.googlegroups.com:

> On Jan 24, 2:43 pm, "A. Sinan Unur" <1...@llenroc.ude.invalid> wrote:
>> xhos...@gmail.com wrote innews:20080124122509.078$og@newsreader.com:
>>
>> > I need a work queue, but I don't really care about the order (LIFO,
>> > FIFO, random) in which things come out of it.  Either is pretty
>> > simple
>
...

>> > Well, except that this gets really slow on large hashes.
>>
>> It is possible I am misunderstanding what you are doing, but I would
>> have written:

Obviously, I was.

...

> That was my initial thought as well (well, something like that
> anyway), and I think that was Xho's initial thought. But I think he's
> trying to devise a solution that doesn't require both %queue and
> @keys.

Thank you for the clarification.

Sinan


--
A. Sinan Unur <1usa@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)
clpmisc guidelines:

Re: Get an arbitrary hash key, quickly.

on 25.01.2008 01:05:49 by Ben Morrow

Quoth xhoster@gmail.com:
> I need a work queue, but I don't really care about the order (LIFO, FIFO,
> random) in which things come out of it. Either is pretty simple and
> efficient with a Perl array, and either would suffice. But I want the
> queue to not hold duplicate entries. I could use an array as a stack or
> queue, with a parallel hash to be checked to prevent duplicates from being
> entered. But why keep parallel data structures? Just use the hash as the
> queue:
>
> while (keys %hash) {
>   my $to_do = each %hash;
>   delete $hash{$to_do};
>   ## do stuff with $to_do, which might make new entries in %hash
> };
>
> Well, except that this gets really slow on large hashes.

I would have thought the obvious method would be

my (@queue, %done);

while (defined(my $to_do = shift @queue)) {
    exists $done{$to_do} and next;
    $done{$to_do} = 1;

    process_stuff($to_do);
}

obviously, if you wanted to deliberately requeue something you would
need to know that, and delete the hash entry. I realize you're trying to
avoid parallel data structures, but $done is pretty much
self-maintaining, so I don't know if that applies here.

> I suspect the each operator does a linear scan of all the buckets until it
> finds an occupied one. With a hash that used to be big but now isn't (and
> so has lots of empty buckets), this behaves poorly, essentially being N^2
> to empty a large hash.

Presumably you could fix this with

%hash = %hash;

if that would be helpful. Of course, you'll copy everything.

> If I just use scalar %hash instead of scalar keys %hash, then the slow down
> doesn't occur because "each" scans the buckets from where it left off last
> time, instead of from the beginning each time. (At first I thought it was
> scalar keys %hash that was being slow and was going to post about *that*,
> but then I realized that keys' reseting of the iterator was causing "each"
> to be slow). But you shouldn't change a hash at the same time you iterate
> over it because things might get missed. Presumably, anything missed will
> be found on the next time through the iterator, so this might work without
> the slowdown:

Presumably also things might get found twice, depending on how the
buckets get reorganized. Since you're always deleting the current
element before calling each again that shouldn't be a problem here.

> Any comments on this code? Is there some less quirky way to do this
> efficiently? Is there a simple (as simple as perl's internals can get)
> way to fix "each" so that it doesn't have this degenerate case?

Presumably switching to something like BerkeleyDB::BTREE would also fix
this problem, since btrees are designed to be iterated over. I don't
know how much grief the tie overhead would cause in this case;
presumably an implementation could be made that used U/u magic directly
rather than tie magic, and so avoided the method calls.
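
As a rough sketch of that idea with the BerkeleyDB module's tie interface
(the Btree class, the file name, and the loop shape here are my guesses at
the sort of thing Ben means, not a tested claim about his suggestion):

    use strict;
    use warnings;
    use BerkeleyDB;

    # tie the work queue to an on-disk btree; iteration follows key order
    tie my %hash, 'BerkeleyDB::Btree',
        -Filename => 'queue.db',    # illustrative file name
        -Flags    => DB_CREATE
        or die "tie failed: $BerkeleyDB::Error";

    $hash{"job$_"} = 1 for 1 .. 3;

    # deleting the element just returned is the documented-safe case;
    # an outer loop would be needed if "do stuff" can insert keys that
    # sort before the cursor's current position
    while (my ($to_do) = each %hash) {
        delete $hash{$to_do};
        ## do stuff with $to_do, which might make new entries in %hash
    }

    untie %hash;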

Ben

Re: Get an arbitrary hash key, quickly.

on 25.01.2008 15:37:10 by it_says_BALLS_on_your_forehead

On Jan 24, 7:05 pm, Ben Morrow wrote:

> Presumably you could fix this with
...
> Presumably also things might get found twice, depending on how the
...
>
> Presumably switching to something like BerkeleyDB::BTREE would also fix
...
> presumably an implementation could be made that used U/u magic directly
> rather than tie magic, and so avoided the method calls.
>

You're very presumptuous ;-).

Re: Get an arbitrary hash key, quickly.

on 25.01.2008 18:16:38 by xhoster

"A. Sinan Unur" <1usa@llenroc.ude.invalid> wrote:
> xhoster@gmail.com wrote in news:20080124122509.078$og@newsreader.com:
>
> > I need a work queue, but I don't really care about the order (LIFO,
> > FIFO, random) in which things come out of it. Either is pretty simple
> > and efficient with a Perl array, and either would suffice. But I want
> > the queue to not hold duplicate entries. I could use an array as a
> > stack or queue, with a parallel hash to be checked to prevent
> > duplicates from being entered. But why keep parallel data structures?
> > Just use the hash as the queue:
> >
> > while (keys %hash) {
> >   my $to_do = each %hash;
> >   delete $hash{$to_do};
> >   ## do stuff with $to_do, which might make new entries in %hash
> > };
> >
> > Well, except that this gets really slow on large hashes.
>
> It is possible I am misunderstanding what you are doing, but I would
> have written:
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> my %queue;
> $queue{ $_ } = undef for 1 .. 1_000_000;
>
> while ( my @keys = keys %queue ) {
>     for my $task ( @keys ) {
>         delete $queue{ $task };
>         #
>         # Do something
>         #
>         # Add another task with a small probability
>         #
>         if ( 0.1 > rand ) {
>             $queue{ 1 + int(rand 1_000_000) } = undef;
>         }
>     }
> }

Now we've changed from an extra parallel structure to an extra caching
structure, which of course works but has the same inelegance. (Grepping
back through years of code, I see that I've actually used this method many
years ago, too, but from the comments it looks like I didn't understand why
the original was slow, so I regarded it as trial and error magic.) It does
require that "do something" can only add and not delete things from %queue,
which is fine with my original specification but lacks maximal flexibility.

There are many ways to work around this degradation in performance, but it
seems like all of them impose other limitations or inelegances. Oh well,
if these things were too easy I guess anyone could do it and I'd be out of
a job. :)


Xho


Re: Get an arbitrary hash key, quickly.

on 25.01.2008 18:32:23 by xhoster

Ben Morrow wrote:
> Quoth xhoster@gmail.com:
> > I need a work queue, but I don't really care about the order (LIFO,
> > FIFO, random) in which things come out of it. Either is pretty simple
> > and efficient with a Perl array, and either would suffice. But I want
> > the queue to not hold duplicate entries. I could use an array as a
> > stack or queue, with a parallel hash to be checked to prevent
> > duplicates from being entered. But why keep parallel data structures?
> > Just use the hash as the queue:
> >
> > while (keys %hash) {
> >   my $to_do = each %hash;
> >   delete $hash{$to_do};
> >   ## do stuff with $to_do, which might make new entries in %hash
> > };
> >
> > Well, except that this gets really slow on large hashes.
>
> I would have thought the obvious method would be
>
> my (@queue, %done);
>
> while (defined(my $to_do = shift @queue)) {
>     exists $done{$to_do} and next;
>     $done{$to_do} = 1;
>
>     process_stuff($to_do);
> }

The problem here is that, under some uses, @queue can accumulate a fatal
number of replicates. While %done prevents them from being processed
more than once, they are still stored more than once. If each of 10_000
work-items builds up on the queue 10_000 times, that becomes a storage
problem. So you need to use %done to guard against adding replicates into
@queue to defend memory usage.

But replicates can still build up to unacceptable levels before the first
instance of one ever gets $done{}. So then you need another hash,
%queued_but_not_done, to reject duplicate entries at the earliest stage.
And then you say "This is a mess, let's just use one hash", then you
discover that "each" performs poorly on a shrinking hash, and then you post
a message here. :)
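
For concreteness, a minimal sketch of that escalation stopped at the
single-guard-hash stage (the names enqueue and %in_queue are mine, not
from the thread): duplicates are rejected at insert time, and the guard
entry is cleared at dequeue time so an item can legitimately re-enter
later:

    my (@queue, %in_queue);

    # add an item only if an identical one isn't already waiting
    sub enqueue {
        my ($item) = @_;
        push @queue, $item unless $in_queue{$item}++;
    }

    while (defined(my $to_do = shift @queue)) {
        delete $in_queue{$to_do};   # allow later re-entry
        ## do stuff with $to_do, possibly calling enqueue() again
    }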

Xho


Re: Get an arbitrary hash key, quickly.

on 25.01.2008 19:02:45 by it_says_BALLS_on_your_forehead

On Jan 25, 12:32 pm, xhos...@gmail.com wrote:
...
>
> But replicates can still build up to unacceptable levels before the first
> instance of one ever gets $done{}. So then you need another hash,
> %queued_but_not_done, to reject duplicate entries at the earliest stage.

From this I infer that you are engaging in parallel processing. Is
this correct?

> And then you say "This is a mess, let's just use one hash", then you
> discover that "each" performs poorly on a shrinking hash, and then you post
> a message here. :)

Re: Get an arbitrary hash key, quickly.

on 25.01.2008 19:39:01 by Charles DeRykus

On Jan 24, 9:25 am, xhos...@gmail.com wrote:
> I need a work queue, but I don't really care about the order (LIFO, FIFO,
> random) in which things come out of it. Either is pretty simple and
> efficient with a Perl array, and either would suffice. But I want the
> queue to not hold duplicate entries. I could use an array as a stack or
> queue, with a parallel hash to be checked to prevent duplicates from being
> entered. But why keep parallel data structures? Just use the hash as the
> queue:
>
> while (keys %hash) {
>   my $to_do = each %hash;
>   delete $hash{$to_do};
>   ## do stuff with $to_do, which might make new entries in %hash
>
> };
>
> Well, except that this gets really slow on large hashes.
>
> I suspect the each operator does a linear scan of all the buckets until it
> finds an occupied one. With a hash that used to be big but now isn't (and
> so has lots of empty buckets), this behaves poorly, essentially being N^2
> to empty a large hash.
>
> If I just use scalar %hash instead of scalar keys %hash, then the slowdown
> doesn't occur because "each" scans the buckets from where it left off last
> time, instead of from the beginning each time. (At first I thought it was
> scalar keys %hash that was being slow and was going to post about *that*,
> but then I realized that keys' resetting of the iterator was causing "each"
> to be slow). But you shouldn't change a hash at the same time you iterate
> over it because things might get missed. Presumably, anything missed will
> be found on the next time through the iterator, so this might work without
> the slowdown:
>
> while (%hash) { # does not reset iterator
>   my $to_do = each %hash;
>   next unless defined $to_do;  # not a real hash key,
>                                # signals end of iterator
>   ## do stuff with $to_do, which might make new entries in %hash
>
> };
>
> If my speculation on the internals is right, then this would still
> perform poorly if the hash first grows very large, then shrinks to
> be only a few elements, but those last few elements keep triggering
> the addition of just one more element each time. In my case, that
> shouldn't be a problem.
>
> Any comments on this code? Is there some less quirky way to do this
> efficiently? Is there a simple (as simple as perl's internals can get)
> way to fix "each" so that it doesn't have this degenerate case?
>

Even with tie slowness, a Tie::IxHash solution would at least be more
straightforward since insertion order is maintained... I think.

tie my %hash, 'Tie::IxHash';
my $idx = 0;

while ( my @queue = keys %hash ) {
    my $to_do = $queue[$idx++];
    delete $hash{ $to_do };
    # do stuff with $to_do, which might make
    # new entries in %hash
}

--
Charles DeRykus

Re: Get an arbitrary hash key, quickly.

on 25.01.2008 19:57:15 by Charles DeRykus

On Jan 25, 10:39 am, "comp.llang.perl.moderated" sam-01.ca.boeing.com> wrote:
> On Jan 24, 9:25 am, xhos...@gmail.com wrote:
>
>
> ...
>
> Even with tie slowness, a Tie::IxHash solution would at least be more
> straightforward since insertion order is maintained... I think.
>
> tie my %hash, 'Tie::IxHash';
> my $idx = 0;
>
> while ( my @queue = keys %hash ) {
> my $to_do = $queue[$idx++];
> delete $hash{ $to_do };
> # do stuff with $to_do, which might make
> # new entries in %hash
>
> }

Well, it seemed like a good idea a minute ago.
But, you might need to tweak $idx for instance...
Maybe other problems too...

while ( my @queue = keys %hash ) {
    $idx = $#queue if $idx > $#queue;
    my $to_do = $queue[$idx++];

    delete $hash{ $to_do };
    # do stuff with $to_do, which might make
    # new entries in %hash
}

--
Charles DeRykus

Re: Get an arbitrary hash key, quickly.

on 25.01.2008 23:35:33 by Ben Morrow

Quoth "comp.llang.perl.moderated" :
> On Jan 24, 9:25 am, xhos...@gmail.com wrote:
>
> Even with tie slowness, a Tie::IxHash solution would at least be more
> straightforward since insertion order is maintained... I think.

I have to admit I've never seen the point of Tie::IxHash. It's just the
same as the parallel array/hash Xho'd already rejected, but slower...

Ben

Re: Get an arbitrary hash key, quickly.

on 27.01.2008 02:05:03 by xhoster

nolo contendere wrote:
> On Jan 25, 12:32=A0pm, xhos...@gmail.com wrote:
> ..
> >
> > But replicates can still build up to unacceptable levels before the
> > first instance of one ever gets $done{}. =A0So then you need another
> > hash, %queued_but_not_done, to reject duplicate entries at the earliest
> > stage.
>
> From this I infer that you are engaging in parallel processing. Is
> this correct?

I have run into these constructs in parallel processing, but that was not
the immediate case here. I was computing a single-linkage hierarchical
clustering tree from a distance matrix that is way too large for the COTS
software I have access to. I've also run into them in certain traversals
of extremely large graphs.

Xho


Re: Get an arbitrary hash key, quickly.

on 28.01.2008 16:01:34 by it_says_BALLS_on_your_forehead

On Jan 26, 8:05 pm, xhos...@gmail.com wrote:
> nolo contendere wrote:
> > On Jan 25, 12:32 pm, xhos...@gmail.com wrote:
> > ...
>
> > > But replicates can still build up to unacceptable levels before the
> > > first instance of one ever gets $done{}. So then you need another
> > > hash, %queued_but_not_done, to reject duplicate entries at the earliest
> > > stage.
>
> > From this I infer that you are engaging in parallel processing. Is
> > this correct?
>
> I have run into these constructs in parallel processing, but that was not
> the immediate case here. I was computing a single-linkage hierarchical
> clustering tree from a distance matrix that is way too large for the COTS
> software I have access to. I've also run into them in certain traversals
> of extremely large graphs.


yeah, well...that was my second guess.

Re: Get an arbitrary hash key, quickly.

on 28.01.2008 23:52:25 by Charles DeRykus

On Jan 24, 9:25 am, xhos...@gmail.com wrote:
> I need a work queue, but I don't really care about the order (LIFO, FIFO,
> random) in which things come out of it. Either is pretty simple and
> efficient with a Perl array, and either would suffice. But I want the
> queue to not hold duplicate entries. I could use an array as a stack or
> queue, with a parallel hash to be checked to prevent duplicates from being
> entered. But why keep parallel data structures? Just use the hash as the
> queue:
>
> while (keys %hash) {
>   my $to_do = each %hash;
>   delete $hash{$to_do};
>   ## do stuff with $to_do, which might make new entries in %hash
>
> };
>
> Well, except that this gets really slow on large hashes.
>
> I suspect the each operator does a linear scan of all the buckets until it
> finds an occupied one. With a hash that used to be big but now isn't (and
> so has lots of empty buckets), this behaves poorly, essentially being N^2
> to empty a large hash.
>
> If I just use scalar %hash instead of scalar keys %hash, then the slowdown
> doesn't occur because "each" scans the buckets from where it left off last
> time, instead of from the beginning each time. (At first I thought it was
> scalar keys %hash that was being slow and was going to post about *that*,
> but then I realized that keys' resetting of the iterator was causing "each"
> to be slow). But you shouldn't change a hash at the same time you iterate
> over it because things might get missed. Presumably, anything missed will
> be found on the next time through the iterator, so this might work without
> the slowdown:
>
> while (%hash) { # does not reset iterator
>   my $to_do = each %hash;
>   next unless defined $to_do;  # not a real hash key,
>                                # signals end of iterator
>   ## do stuff with $to_do, which might make new entries in %hash
>
> };
>
> If my speculation on the internals is right, then this would still
> perform poorly if the hash first grows very large, then shrinks to
> be only a few elements, but those last few elements keep triggering
> the addition of just one more element each time. In my case, that
> shouldn't be a problem.
>
> Any comments on this code? Is there some less quirky way to do this
> efficiently? Is there a simple (as simple as perl's internals can get)
> way to fix "each" so that it doesn't have this degenerate case?

Do you need 'each' since values don't seem
to be retrieved...Some simple benchmarks
suggest just looping over the keys would
be quite a bit faster if that's the case:

QUEUE: {
    foreach my $to_do (keys %hash) {
        delete $hash{$to_do};
        ## do stuff with $to_do, which might
        #  make new entries in %hash
    }
    redo QUEUE if keys %hash;
}

--
Charles DeRykus

Re: Get an arbitrary hash key, quickly.

on 29.01.2008 00:40:51 by simon.chao

On Jan 28, 5:52 pm, "comp.llang.perl.moderated" sam-01.ca.boeing.com> wrote:
> On Jan 24, 9:25 am, xhos...@gmail.com wrote:
>
> > I need a work queue, but I don't really care about the order (LIFO, FIFO,
> > random) in which things come out of it. Either is pretty simple and
> > efficient with a Perl array, and either would suffice. But I want the
> > queue to not hold duplicate entries. I could use an array as a stack or
> > queue, with a parallel hash to be checked to prevent duplicates from being
> > entered. But why keep parallel data structures? Just use the hash as the
> > queue:
>
> > while (keys %hash) {
> >   my $to_do = each %hash;
> >   delete $hash{$to_do};
> >   ## do stuff with $to_do, which might make new entries in %hash
>
> > };
>
> > Well, except that this gets really slow on large hashes.
>
> > [...]
>
> > Any comments on this code? Is there some less quirky way to do this
> > efficiently? Is there a simple (as simple as perl's internals can get)
> > way to fix "each" so that it doesn't have this degenerate case?
>
> Do you need 'each' since values don't seem
> to be retrieved...Some simple benchmarks
> suggest just looping over the keys would
> be quite a bit faster if that's the case:
>
> QUEUE: {
>   foreach my $to_do (keys %hash) {
>      delete $hash{$to_do};
>      ## do stuff with $to_do, which might
>         # make new entries in %hash
>   }
>   redo QUEUE if keys %hash;
>
> }
>

perldoc perlsyn:
    ...
    If any part of LIST is an array, "foreach" will get very confused if
    you add or remove elements within the loop body, for example with
    "splice". So don't do that.

Xho taught me that one :-).

Re: Get an arbitrary hash key, quickly.

on 29.01.2008 01:15:39 by Charles DeRykus

On Jan 28, 3:40 pm, nolo contendere wrote:
> On Jan 28, 5:52 pm, "comp.llang.perl.moderated" wrote:
...
> > > Any comments on this code? Is there some less quirky way to do this
> > > efficiently? Is there a simple (as simple as perl's internals can get)
> > > way to fix "each" so that it doesn't have this degenerate case?
>
> > Do you need 'each' since values don't seem
> > to be retrieved...Some simple benchmarks
> > suggest just looping over the keys would
> > be quite a bit faster if that's the case:
>
> > QUEUE: {
> > foreach my $to_do (keys %hash) {
> > delete $hash{$to_do};
> > ## do stuff with $to_do, which might
> > # make new entries in %hash
> > }
> > redo QUEUE if keys %hash;
>
> > }
>
> perldoc perlsyn:
>     ...
>     If any part of LIST is an array, "foreach" will get very confused if
>     you add or remove elements within the loop body, for example with
>     "splice". So don't do that.
>

Yes, that's true. Even deletes can be problematic
unless iterating with 'each', I think. But the
original code Xho demo'ed was adding elements
during the loop as well... so I'm not sure how
that could ever work...

--
Charles DeRykus

Re: Get an arbitrary hash key, quickly.

on 29.01.2008 01:26:06 by simon.chao

On Jan 28, 7:15 pm, "comp.llang.perl.moderated" sam-01.ca.boeing.com> wrote:
> On Jan 28, 3:40 pm, nolo contendere wrote:
>
> > On Jan 28, 5:52 pm, "comp.llang.perl.moderated" wrote:
> > ...
> > > > Any comments on this code? Is there some less quirky way to do this
> > > > efficiently? Is there a simple (as simple as perl's internals can get)
> > > > way to fix "each" so that it doesn't have this degenerate case?
>
> > > Do you need 'each' since values don't seem
> > > to be retrieved...Some simple benchmarks
> > > suggest just looping over the keys would
> > > be quite a bit faster if that's the case:
>
> > > QUEUE: {
> > >   foreach my $to_do (keys %hash) {
> > >      delete $hash{$to_do};
> > >      ## do stuff with $to_do, which might
> > >         # make new entries in %hash
> > >   }
> > >   redo QUEUE if keys %hash;
>
> > > }
>
> > perldoc perlsyn:
> >     ...
> >     If any part of LIST is an array, "foreach" will get very confused if
> >     you add or remove elements within the loop body, for example with
> >     "splice". So don't do that.
>
> Yes, that's true. Even deletes can be problematic
> unless iterating with 'each', I think. But the
> original code Xho demo'ed was adding elements
> during the loop as well... so I'm not sure how
> that could ever work...
>

Those were 'while' loops, which don't suffer from the same issue.

Re: Get an arbitrary hash key, quickly.

on 29.01.2008 01:54:47 by Charles DeRykus

On Jan 28, 4:26 pm, nolo contendere wrote:
> On Jan 28, 7:15 pm, "comp.llang.perl.moderated" sam-01.ca.boeing.com> wrote:
> > On Jan 28, 3:40 pm, nolo contendere wrote:
>
> > > On Jan 28, 5:52 pm, "comp.llang.perl.moderated" wrote:
> > ...
> > > > > Any comments on this code? Is there some less quirky way to do this
> > > > > efficiently? Is there a simple (as simple as perl's internals can get)
> > > > > way to fix "each" so that it doesn't have this degenerate case?
>
> > > > Do you need 'each' since values don't seem
> > > > to be retrieved...Some simple benchmarks
> > > > suggest just looping over the keys would
> > > > be quite a bit faster if that's the case:
>
> > > > QUEUE: {
> > > > foreach my $to_do (keys %hash) {
> > > > delete $hash{$to_do};
> > > > ## do stuff with $to_do, which might
> > > > # make new entries in %hash
> > > > }
> > > > redo QUEUE if keys %hash;
>
> > > > }
>
> > > perldoc perlsyn:
> > > ...
> > > If any part of LIST is an array, "foreach" will get very
> > > confused if
> > > you add or remove elements within the loop body, for example
> > > with
> > > "splice". So don't do that.
>
> > Yes, that's true. Even delete's can be problematic
> > unless iterating with 'each' I think. But the
> > original code Xho demo'ed was adding elements
> > during the loop as well... so I'm not sure how
> > that could ever work..
>
> Those were 'while' loops, which don't suffer from the same issue.

No, I believe it's not just a 'while vs. foreach'
issue but rather ensuring that the iterator is reset
before the next each call.

--
Charles DeRykus

Re: Get an arbitrary hash key, quickly.

on 29.01.2008 02:05:59 by Ben Morrow

Quoth nolo contendere:
> On Jan 28, 5:52 pm, "comp.llang.perl.moderated" sam-01.ca.boeing.com> wrote:
> >
> > Do you need 'each' since values don't seem
> > to be retrieved...Some simple benchmarks
> > suggest just looping over the keys would
> > be quite a bit faster if that's the case:
> >
> > QUEUE: {
> >   foreach my $to_do (keys %hash) {
> >      delete $hash{$to_do};
> >      ## do stuff with $to_do, which might
> >         # make new entries in %hash
> >   }
> >   redo QUEUE if keys %hash;
> >
> > }
> >
>
> perldoc perlsyn:
> ...
> If any part of LIST is an array, "foreach" will get very
^^^^^^^^^^^
It isn't. keys returns a list. What *is* true in this case is that if
any entries you haven't got to yet are deleted from the hash, they will
still be in for's list and will be returned anyway; since that isn't the
case here, it doesn't matter.

However, I would have thought that as the number of keys gets larger,
this get slower, since it has to build a complete list of the keys each
time through QUEUE. Let's see...

#!/usr/bin/perl -l

use strict;
use warnings;

use Benchmark qw/cmpthese/;

my %h =
    map { ($_, 1) }
    map { join '', map { chr rand 256 } 1..10 }
    1..100_000;

cmpthese -5, {
    keys  => sub {
        for my $x (keys %h) { 1; }
    },
    leach => sub {
        while (my ($k, $v) = each %h) { 1; }
    },
    seach => sub {
        while (my $k = each %h) { 1; }
    },
};

__END__

         Rate leach  keys seach
leach  3.76/s    --  -28%  -30%
keys   5.21/s   38%    --   -3%
seach  5.37/s   43%    3%    --

Nope, I'm wrong even on large hashes, but scalar each is faster again,
though even that seems to depend on your malloc: FreeBSD's perl is built
with perl's malloc by default, and a version built with the native
malloc has keys > seach > leach. Hmmm. :)

Ben

Re: Get an arbitrary hash key, quickly.

on 29.01.2008 17:04:53 by it_says_BALLS_on_your_forehead

On Jan 28, 8:05 pm, Ben Morrow wrote:
> Quoth nolo contendere:
>
> > On Jan 28, 5:52 pm, "comp.llang.perl.moderated" sam-01.ca.boeing.com> wrote:
>
> > > Do you need 'each' since values don't seem
> > > to be retrieved...Some simple benchmarks
> > > suggest just looping over the keys would
> > > be quite a bit faster if that's the case:
>
> > > QUEUE: {
> > >   foreach my $to_do (keys %hash) {
> > >      delete $hash{$to_do};
> > >      ## do stuff with $to_do, which might
> > >         # make new entries in %hash
> > >   }
> > >   redo QUEUE if keys %hash;
>
> > > }
>
> > perldoc perlsyn:
> >     ...
> >     If any part of LIST is an array, "foreach" will get very
>
>                ^^^^^^^^^^^
> It isn't. keys returns a list. What *is* true in this case is that if
> any entries you haven't got to yet are deleted from the hash, they will

hmm, I thought it was unsafe to delete any entries other than the one
just accessed.

> still be in for's list and will be returned anyway; since that isn't the
> case here, it doesn't matter.
>

right. my understanding is that for (and foreach) take a snapshot of
the list, and iterate through that snapshot.

> However, I would have thought that as the number of keys gets larger,
> this get slower, since it has to build a complete list of the keys each
> time through QUEUE. Let's see...
>

wouldn't every subsequent list be smaller for the most part? unless
"do stuff" generated more keys than were deleted on average.

Re: Get an arbitrary hash key, quickly.

am 29.01.2008 17:38:12 von Ben Morrow

Quoth nolo contendere:
> On Jan 28, 8:05 pm, Ben Morrow wrote:
> >
> > It isn't. keys returns a list. What *is* true in this case is that if
> > any entries you haven't got to yet are deleted from the hash, they will
>
> hmm, I thought it was unsafe to delete any entries other than the one
> just accessed.

No, this is a misconception (though one that is somewhat supported by
the docs). See http://blog.plover.com/prog/perl/undefined.html#3 .
Deleting the current entry is actually the potentially-unsafe case, but
perl special-cases it for you so it works correctly (the entry has to be
deleted lazily, after the iterator has moved on to the next entry).

In any case, none of this applies to

for (keys %h) { ... }

keys is in list context, so the complete list of keys is generated
before the loop even starts iterating.
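
A tiny illustration of the point (made-up keys):

    my %h = map { $_ => 1 } 'a' .. 'e';

    for my $k (keys %h) {   # the key list is materialized right here
        delete $h{$k};      # shrinking the hash doesn't disturb that list
        print "$k\n";       # every original key is still visited
    }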

> > still be in for's list and will be returned anyway; since that isn't the
> > case here, it doesn't matter.
>
> right. my understanding is that for (and foreach) take a snapshot of
> the list, and iterate through that snapshot.

Not quite; keys returns a list which is a snapshot of the current set of
keys. What you do with that list afterwards is irrelevant.

> > However, I would have thought that as the number of keys gets larger,
> > this get slower, since it has to build a complete list of the keys each
> > time through QUEUE. Let's see...
>
> wouldn't every subsequent list be smaller for the most part? unless
> "do stuff" generated more keys than were deleted on average.

No, that wasn't what I meant. I meant 'as the initial set of keys
becomes large, perhaps building a complete list of that set becomes
inefficient'. It seems it doesn't.

Ben

Re: Get an arbitrary hash key, quickly.

on 29.01.2008 17:50:50 by xhoster

nolo contendere wrote:
> >
> > QUEUE: {
> >   foreach my $to_do (keys %hash) {
> >      delete $hash{$to_do};
> >      ## do stuff with $to_do, which might
> >         # make new entries in %hash
> >   }
> >   redo QUEUE if keys %hash;
> >
> > }
> >
>
> perldoc perlsyn:
>     ...
>     If any part of LIST is an array, "foreach" will get very confused if
>     you add or remove elements within the loop body, for example with
>     "splice". So don't do that.

But in this case, no part of the LIST is an array. keys %hash is not
an array.


Xho


Re: Get an arbitrary hash key, quickly.

on 29.01.2008 18:00:48 by xhoster

"comp.llang.perl.moderated" wrote:
> >
> > while (%hash) { # does not reset iterator
> >   my $to_do = each %hash;
> >   next unless defined $to_do;  # not a real hash key,
> >                                # signals end of iterator
> >   ## do stuff with $to_do, which might make new entries in %hash
> >
> > };
> >
> > If my speculation on the internals is right, then this would still
> > perform poorly if the hash first grows very large, then shrinks to
> > be only a few elements, but those last few elements keep triggering
> > the addition of just one more element each time. In my case, that
> > shouldn't be a problem.
> >
> > Any comments on this code? Is there some less quirky way to do this
> > efficiently? Is there a simple (as simple as perl's internals can get)
> > way to fix "each" so that it doesn't have this degenerate case?
>
> Do you need 'each' since values don't seem
> to be retrieved...Some simple benchmarks
> suggest just looping over the keys would
> be quite a bit faster if that's the case:
>
> QUEUE: {
>     foreach my $to_do (keys %hash) {
>         delete $hash{$to_do};
>         ## do stuff with $to_do, which might
>         #  make new entries in %hash
>     }
>     redo QUEUE if keys %hash;
> }

I like it. It does have a 2nd caching structure, but that structure is
implicit and entirely managed by perl as part of the foreach loop. I might
change the outer loop syntax somewhat, as I prefer to avoid labels
whenever possible.

while (keys %hash) {
    foreach my $to_do (keys %hash) {
        delete $hash{$to_do};
        ## do stuff with $to_do, which might
        #  make new entries in %hash
    }
}
(Causes one extra scalar "keys %hash", but that shouldn't be a problem)

The "do stuff" can only add new entries, not remove entries (other than
$to_do itself) without causing problems. Deleting wasn't part of the
original specification, and that limitation is not a problem in this
particular case.

Xho


Re: Get an arbitrary hash key, quickly.

on 29.01.2008 18:20:23 by xhoster

nolo contendere wrote:
> On Jan 28, 8:05 pm, Ben Morrow wrote:
> > Quoth nolo contendere:
> >
> > > On Jan 28, 5:52 pm, "comp.llang.perl.moderated" sam-01.ca.boeing.com> wrote:
> >
> > > > Do you need 'each' since values don't seem
> > > > to be retrieved...Some simple benchmarks
> > > > suggest just looping over the keys would
> > > > be quite a bit faster if that's the case:
> >
> > > > QUEUE: {
> > > >   foreach my $to_do (keys %hash) {
> > > >      delete $hash{$to_do};
> > > >      ## do stuff with $to_do, which might
> > > >         # make new entries in %hash
> > > >   }
> > > >   redo QUEUE if keys %hash;
> >
> > > > }
> >
> > > perldoc perlsyn:
> > >     ...
> > >     If any part of LIST is an array, "foreach" will get very
> >
> >                ^^^^^^^^^^^
> > It isn't. keys returns a list. What *is* true in this case is that if
> > any entries you haven't got to yet are deleted from the hash, they will
>
> hmm, I thought it was unsafe to delete any entries other than the one
> just accessed.

That is with "each %hash", not with "keys %hash".

>
> > still be in for's list and will be returned anyway; since that isn't
> > the case here, it doesn't matter.
> >
>
> right. my understanding is that for (and foreach) take a snapshot of
> the list, and iterate through that snapshot.

Except when the LIST of the foreach has an array in it, then it doesn't
take a snapshot--it does something else (presumably as an optimization),
the details of which are not documented but the effects of which are
(weirdness when changing the array).
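
To make that weirdness concrete, a throwaway sketch (my own example;
perlsyn explicitly declines to define this behavior, so don't rely
on it):

use strict;
use warnings;

my @x = (1, 2, 3);
for my $e (@x) {
    print "saw $e\n";
    push @x, $e + 10 if $e < 3;   # grow the array mid-loop
}
# Typically prints 1, 2, 3, 11, 12 -- the loop follows the live
# array rather than a snapshot of it.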

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.

Re: Get an arbitrary hash key, quickly.

am 29.01.2008 18:31:56 von xhoster

Ben Morrow wrote:
> Quoth nolo contendere :
> > On Jan 28, 5:52 pm, "comp.llang.perl.moderated" > > sam-01.ca.boeing.com> wrote:
> > >
> > > Do you need 'each' since values don't seem
> > > to be retrieved...Some simple benchmarks
> > > suggest just looping over the keys would
> > > be quite a bit faster if that's the case:
> > >
> > > QUEUE: {
> > >   foreach my $to_do (keys %hash) {
> > >     delete $hash{$to_do};
> > >     ## do stuff with $to_do, which might
> > >     # make new entries in %hash
> > >   }
> > >   redo QUEUE if keys %hash;
> > >
> > > }
> > >
> >
> > perldoc perlsyn:
> > ...
> > If any part of LIST is an array, "foreach" will get very
>                       ^^^^^^^^^^^
> It isn't. keys returns a list. What *is* true in this case is that if
> any entries you haven't got to yet are deleted from the hash, they will
> still be in for's list and will be returned anyway; since that isn't the
> case here, it doesn't matter.
>
> However, I would have thought that as the number of keys gets larger,
> this gets slower, since it has to build a complete list of the keys each
> time through QUEUE.

I'm sure you already know this, but just to be clear for others: I think
the potential slowness you're hypothetically referring to is just a
constant multiplier of slowness (and a rather small one at that), right?
The slowness that I originally posted about was of a different
sort--pathological slowness that seems to be around O(N**2) when it
really "should" be O(N).

Once I take care of the O(N**2)->O(N) fix, I'm not so interested in
micro-optimizing the rest, except as a theoretical curiosity.
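
For anyone who'd rather see that difference than take it on faith,
here is a rough benchmark sketch (the sizes and output format are my
own choices, not from any post in this thread):

use strict;
use warnings;
use Time::HiRes qw(time);

for my $n (50_000, 100_000) {
    my %h = map { $_ => 1 } 1 .. $n;
    my $t = time;
    while (keys %h) {       # keys resets the iterator each pass...
        my $k = each %h;    # ...so each rescans the emptied buckets
        delete $h{$k};
    }
    printf "each with reset: N=%6d  %.2fs\n", $n, time - $t;

    %h = map { $_ => 1 } 1 .. $n;
    $t = time;
    while (keys %h) {
        delete $h{$_} for keys %h;   # drain via a list of the keys
    }
    printf "keys drain:      N=%6d  %.2fs\n", $n, time - $t;
}

If the diagnosis above is right, doubling N should grow the first
timing much faster than linearly while the second stays roughly
linear.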

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.

Re: Get an arbitrary hash key, quickly.

am 29.01.2008 18:32:02 von Ben Morrow

Quoth xhoster@gmail.com:
>
> while (keys %hash) {
>   foreach my $to_do (keys %hash) {
>     delete $hash{$to_do};
>     ## do stuff with $to_do, which might
>     # make new entries in %hash
>   }
> }
> (Causes one extra scalar "keys %hash", but that shouldn't be a problem)
>
> The "do stuff" can only add new entries, not remove entries (other than
> $to_do itself) without causing problems. Deleting wasn't part of the
> original specification, and that limitation is not a problem in this
> particular case.

Fixing that is just a matter of

delete $hash{$to_do} or next;

isn't it? Of course, if you delete lots of entries you're going to waste
time spinning round that loop.
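
In context, the amended loop would look something like this sketch
(the truth-test caveat in the comment is mine, not Ben's):

while (keys %hash) {
    foreach my $to_do (keys %hash) {
        # Skip keys that "do stuff" already removed after this pass's
        # key list was built. Note this assumes the stored values are
        # true; a false value (0 or "") would be skipped as if deleted.
        delete $hash{$to_do} or next;
        ## do stuff with $to_do, which may now add or delete entries
    }
}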

Ben

Re: Get an arbitrary hash key, quickly.

am 29.01.2008 18:55:16 von Charles DeRykus

On Jan 29, 9:00 am, xhos...@gmail.com wrote:
> "comp.llang.perl.moderated" wrote:
>
> > > while (%hash) { # does not reset iterator
> > >   my $to_do=each %hash;
> > >   next unless defined $to_do; # not a real hash key,
> > >                               # signals end of iterator
> > >   ## do stuff with $to_do, which might make new entries in %hash
>
> > > };
>
> > > If my speculation on the internals is right, then this would still
> > > perform poorly if the hash first grows very large, then shrinks to
> > > be only a few elements, but those last few elements keep triggering
> > > the addition of just one more element each time. In my case, that
> > > shouldn't be a problem.
>
> > > Any comments on this code? Is there some less quirky way to do this
> > > efficiently? Is there a simple (as simple as perl's internals can get)
> > > way to fix "each" so that it doesn't have this degenerate case?
>
> > Do you need 'each' since values don't seem
> > to be retrieved...Some simple benchmarks
> > suggest just looping over the keys would
> > be quite a bit faster if that's the case:
>
> > QUEUE: {
> >   foreach my $to_do (keys %hash) {
> >     delete $hash{$to_do};
> >     ## do stuff with $to_do, which might
> >     # make new entries in %hash
> >   }
> >   redo QUEUE if keys %hash;
> > }
>
> I like it. It does have a 2nd caching structure, but that structure is
> implicit and entirely managed by perl as part of the foreach loop. I might
> change the outer loop syntax somewhat, as I prefer to avoid labels
> whenever possible.
>
> while (keys %hash) {
>   foreach my $to_do (keys %hash) {
>     delete $hash{$to_do};
>     ## do stuff with $to_do, which might
>     # make new entries in %hash
>   }
> }
>
> (Causes one extra scalar "keys %hash", but that shouldn't be a problem)
>
> The "do stuff" can only add new entries, not remove entries (other than
> $to_do itself) without causing problems. Deleting wasn't part of the
> original specification, and that limitation is not a problem in this
> particular case.
>

Memory might be the only issue then. And it's
really nice to have Ben explain how it really
all works :)

--
Charles

Re: Get an arbitrary hash key, quickly.

am 30.01.2008 16:31:09 von it_says_BALLS_on_your forehead

On Jan 29, 11:38 am, Ben Morrow wrote:
> Quoth nolo contendere :
>
> > On Jan 28, 8:05 pm, Ben Morrow wrote:
>
> > > It isn't. keys returns a list. What *is* true in this case is that if
> > > any entries you haven't got to yet are deleted from the hash, they will
>
> > hmm, I thought it was unsafe to delete any entries other than the one
> > just accessed.
>
> No, this is a misconception (though one that is somewhat supported by
> the docs). See http://blog.plover.com/prog/perl/undefined.html#3.
> Deleting the current entry is actually the potentially-unsafe case, but
> perl special-cases it for you so it works correctly (the entry has to be
> deleted lazily, after the iterator has moved on to the next entry).
>

cool!

> In any case, none of this applies to
>
>     for (keys %h) { ... }
>
> keys is in list context, so the complete list of keys is generated
> before the loop even starts iterating.
>
> > > still be in for's list and will be returned anyway; since that isn't the
> > > case here, it doesn't matter.
>
> > right. my understanding is that for (and foreach) take a snapshot of
> > the list, and iterate through that snapshot.
>
> Not quite; keys returns a list which is a snapshot of the current set of
> keys. What you do with that list afterwards is irrelevant.
>

right, that was my understanding.

> > > However, I would have thought that as the number of keys gets larger,
> > > this gets slower, since it has to build a complete list of the keys each
> > > time through QUEUE. Let's see...
>
> > wouldn't every subsequent list be smaller for the most part? unless
> > "do stuff" generated more keys than were deleted on average.
>
> No, that wasn't what I meant. I meant 'as the initial set of keys
> becomes large, perhaps building a complete list of that set becomes
> inefficient'. It seems it doesn't.
>

oh.

Re: Get an arbitrary hash key, quickly.

am 30.01.2008 16:31:32 von it_says_BALLS_on_your forehead

On Jan 29, 11:50 am, xhos...@gmail.com wrote:
> nolo contendere wrote:
>
> > > QUEUE: {
> > >   foreach my $to_do (keys %hash) {
> > >     delete $hash{$to_do};
> > >     ## do stuff with $to_do, which might
> > >     # make new entries in %hash
> > >   }
> > >   redo QUEUE if keys %hash;
>
> > > }
>
> > perldoc perlsyn:
> >         ...
> >         If any part of LIST is an array, "foreach" will get very
> >         confused if you add or remove elements within the loop body,
> >         for example with "splice".  So don't do that.
>
> But in this case, no part of the LIST is an array.  keys %hash is not
> an array.
>

oops! newbie mistake, sorry!

Re: Get an arbitrary hash key, quickly.

am 30.01.2008 16:33:00 von it_says_BALLS_on_your forehead

On Jan 29, 12:20 pm, xhos...@gmail.com wrote:
> nolo contendere wrote:
> > On Jan 28, 8:05 pm, Ben Morrow wrote:
> > > Quoth nolo contendere :
>
> > > > On Jan 28, 5:52 pm, "comp.llang.perl.moderated" > > > > sam-01.ca.boeing.com> wrote:
>
> > > > > Do you need 'each' since values don't seem
> > > > > to be retrieved...Some simple benchmarks
> > > > > suggest just looping over the keys would
> > > > > be quite a bit faster if that's the case:
>
> > > > > QUEUE: {
> > > > >   foreach my $to_do (keys %hash) {
> > > > >     delete $hash{$to_do};
> > > > >     ## do stuff with $to_do, which might
> > > > >     # make new entries in %hash
> > > > >   }
> > > > >   redo QUEUE if keys %hash;
>
> > > > > }
>
> > > > perldoc perlsyn:
> > > >         ...
> > > >         If any part of LIST is an array, "foreach" will get very
>
> > >                                 ^^^^^^^^^^^
> > > It isn't. keys returns a list. What *is* true in this case is that if
> > > any entries you haven't got to yet are deleted from the hash, they will
>
> > hmm, I thought it was unsafe to delete any entries other than the one
> > just accessed.
>
> That is with "each %hash", not with "keys %hash".
>
>
>
> > > still be in for's list and will be returned anyway; since that isn't
> > > the case here, it doesn't matter.
>
> > right. my understanding is that for (and foreach) take a snapshot of
> > the list, and iterate through that snapshot.
>
> Except when the LIST of the foreach has an array in it, then it doesn't
> take a snapshot--it does something else (presumably as an optimization),
> the details of which are not documented but the effects of which are
> (weirdness when changing the array).
>

hmm, could you, or anyone with >= your knowledge about this subject
expound?

Re: Get an arbitrary hash key, quickly.

am 30.01.2008 16:48:46 von xhoster

nolo contendere wrote:
> On Jan 29, 12:20 pm, xhos...@gmail.com wrote:
> >
> > > right. my understanding is that for (and foreach) take a snapshot of
> > > the list, and iterate through that snapshot.
> >
> > Except when the LIST of the foreach has an array in it, then it doesn't
> > take a snapshot--it does something else (presumably as an
> > optimization), the details of which are not documented but the effects
> > of which are (weirdness when changing the array).
> >
>
> hmm, could you, or anyone with >= your knowledge about this subject
> expound?

I think the link to plover that Ben Morrow posted already did that.
It isn't actually weird; I just treat it as weird because it isn't
documented and so could in theory do something different in other
versions.

Well, OK, it is weird, because the docs say "If any part of LIST is an
array", but it seems that, to get the behavior shown in the plover link,
the entirety of the LIST has to be an array.  If you do something like:

foreach ("foo",@x) {

Then it appears, at least on my version, that it does in fact take a
snapshot of @x (more properly, a snapshot of the addresses to the elements
in @x) and any splices made into it during the loop have no effect on the
iteration.
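
As a sketch of that experiment (my reconstruction, not Xho's actual
test script; since the behavior is undocumented, treat the result as
version-dependent):

use strict;
use warnings;

my @x = (1, 2, 3);
for my $e ("foo", @x) {
    splice(@x, 0) if $e eq "foo";   # empty the array on the first pass
    print "iterating: $e\n";
}
# Per the observation above, this prints foo, 1, 2, 3: the splice has
# no effect on the iteration once a non-array element is in the LIST.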

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.

Re: Get an arbitrary hash key, quickly.

am 30.01.2008 16:57:21 von it_says_BALLS_on_your forehead

On Jan 30, 10:48 am, xhos...@gmail.com wrote:
> nolo contendere wrote:
> > On Jan 29, 12:20 pm, xhos...@gmail.com wrote:
>
> > > > right. my understanding is that for (and foreach) take a snapshot of
> > > > the list, and iterate through that snapshot.
>
> > > Except when the LIST of the foreach has an array in it, then it doesn't
> > > take a snapshot--it does something else (presumably as an
> > > optimization), the details of which are not documented but the effects
> > > of which are (weirdness when changing the array).
>
> > hmm, could you, or anyone with >= your knowledge about this subject
> > expound?
>
> I think the link to plover that Ben Morrow posted already did that.
> It isn't actually weird; I just treat it as weird because it isn't
> documented and so could in theory do something different in other
> versions.
>
> Well, OK, it is weird, because the docs say "If any part of LIST is an
> array", but it seems that, to get the behavior shown in the plover link,
> the entirety of the LIST has to be an array.  If you do something like:
>
> foreach ("foo",@x) {
>
> Then it appears, at least on my version, that it does in fact take a
> snapshot of @x (more properly, a snapshot of the addresses to the
> elements in @x) and any splices made into it during the loop have no
> effect on the iteration.
>

thanks Xho, I'll check out the plover site more thoroughly.