Index of first and last non-"/xff" in a long string

am 12.11.2007 19:22:56 von w.c.humann

I'm going through several PGM images, overlaying (i.e. ANDing) them
and would also like to determine the bounding box. For that I need to
find the first and last non-white (i.e. non-"\xff") pixel in every
line. Now I have one line in a string $data (one pixel per character),
possibly several 1000 charcters long. I tried 3 alternatives so far.
All 3 work, but there may be even faster ways to do this:

# slow
my $first = -1;
1 while substr($data, ++$first, 1) eq "\xff" and $first < $width - 1;
my $last = $width;
1 while substr($data, --$last, 1) eq "\xff" and $last > $first;
print STDERR "f: $first, l: $last, ";

# the match for $first2 is fast, but the one for $last2 is really slow
my $first2 = length( ($data =~ /^(\xff+)/)[0] );
my $last2 = $width - 1 - length( ($data =~ /(\xff+)$/)[0] );
print STDERR "f2: $first2, l2: $last2, ";

# best solution so far. "tr" is the slowest part of this.
# Is there a way without the "tr"?
$data =~ tr|\x00-\xfe|\x00|;
my $first3 = index $data, "\x00";
my $last3 = ($first3 > -1) ? rindex $data, "\x00" : -1;
print STDERR "f3: $first3, l3: $last3, ";

print STDERR "\n";

Thanks,
Wolfram

Re: Index of first and last non-"/xff" in a long string

am 12.11.2007 20:56:50 von Ben Morrow

Quoth w.c.humann@arcor.de:
> I'm going through several PGM images, overlaying (i.e. ANDing) them
> and would also like to determine the bounding box. For that I need to
> find the first and last non-white (i.e. non-"\xff") pixel in every
> line. Now I have one line in a string $data (one pixel per character),
> possibly several 1000 charcters long. I tried 3 alternatives so far.
> All 3 work, but there may be even faster ways to do this:

On my machine, the benchmark below gives

Rate subst sloop chop match reverse index C
subst 403/s -- -90% -94% -96% -97% -97% -100%
sloop 4078/s 912% -- -36% -58% -73% -73% -96%
chop 6340/s 1474% 55% -- -35% -58% -59% -94%
match 9754/s 2322% 139% 54% -- -36% -37% -91%
reverse 15247/s 3686% 274% 140% 56% -- -1% -85%
index 15376/s 3718% 277% 143% 58% 1% -- -85%
C 104065/s 25739% 2452% 1541% 967% 583% 577% --

so index is probably the best you're going to get without using C.

Ben

#!/usr/bin/perl

use Benchmark qw/cmpthese/;

my $str = ("\xff" x 160) . ("f" x 10_000) . ("\xff" x 150);
my $len = length $str;

use Inline C => <<'EOC';

IV
unindex(SV *sv, const char *str)
{
const char *pv;
IV len, i = 0;
const char chr = str[0];

pv = SvPV(sv, len);
while (pv[i] == chr) i++;

return i;
}

IV
unrindex(SV *sv, const char *str)
{
const char *pv;
IV len, i;
const char chr = str[0];

pv = SvPV(sv, len);
i = len - 1;
while (pv[i] == chr) i--;

return i;
}

EOC

cmpthese -3, {
match => sub {
local $_ = $str;
# this is the fastest single match I can come up with
/^((?>\xff*)).*[^\xff]((?>\xff*))$/;
161 == 1 + $+[1] or die "\$+ failed: " . (1 + $+[1]);
10_160 == $-[2] or die "\$- failed: " . $-[2];
},
index => sub {
local $_ = $str;
tr,\x00-\xfe,\x00,;
161 == 1 + index $_, "\x00"
or die "index failed: " . 1 + index $_, "\x00";
10_160 == 1 + rindex $_, "\x00"
or die "rindex failed: " . 1 + rindex $_, "\x00";
},
subst => sub {
local $_ = $str;
/[^\xff]/g;
161 == pos or die "pos failed: " . pos;
s/\xff+$//;
10_160 == length or die "subst failed: " . length;
},
sloop => sub {
local $_ = $str;
/[^\xff]/g;
161 == pos or die "pos failed: " . pos;
1 while s/\xff$//;
10_160 == length or die "sloop failed: " . length;
},
chop => sub {
local $_ = $str;
/[^\xff]/g;
161 == pos or die "pos failed: " . pos;
1 while "\xff" eq chop;
10_160 == 1 + length or die "chop failed: " . (1 + length);
},
reverse => sub {
local $_ = $str;
/[^\xff]/g;
161 == pos or die "pos failed: " . pos;
$_ = reverse;
/[^\xff]/g;
10_160 == (1 + $len - pos)
or die "reverse failed: " . (1 + $len - pos);
},
C => sub {
local $_ = $str;
161 == 1 + unindex $_, "\xff"
or die "unindex failed: " . (1 + unindex $_, "\xff");
10_160 == 1 + unrindex $_, "\xff"
or die "unrindex failed: " . (1 + unrindex $_, "\xff");
},
};

Re: Index of first and last non-"/xff" in a long string

am 12.11.2007 20:57:10 von krahnj

w.c.humann@arcor.de wrote:
>
> I'm going through several PGM images, overlaying (i.e. ANDing) them
> and would also like to determine the bounding box. For that I need to
> find the first and last non-white (i.e. non-"\xff") pixel in every
> line. Now I have one line in a string $data (one pixel per character),
> possibly several 1000 charcters long. I tried 3 alternatives so far.
> All 3 work, but there may be even faster ways to do this:
>
> # slow
> my $first = -1;
> 1 while substr($data, ++$first, 1) eq "\xff" and $first < $width - 1;
> my $last = $width;
> 1 while substr($data, --$last, 1) eq "\xff" and $last > $first;
> print STDERR "f: $first, l: $last, ";
>
> # the match for $first2 is fast, but the one for $last2 is really slow
> my $first2 = length( ($data =~ /^(\xff+)/)[0] );
> my $last2 = $width - 1 - length( ($data =~ /(\xff+)$/)[0] );
> print STDERR "f2: $first2, l2: $last2, ";
>
> # best solution so far. "tr" is the slowest part of this.
> # Is there a way without the "tr"?
> $data =~ tr|\x00-\xfe|\x00|;
> my $first3 = index $data, "\x00";
> my $last3 = ($first3 > -1) ? rindex $data, "\x00" : -1;
> print STDERR "f3: $first3, l3: $last3, ";
>
> print STDERR "\n";

Try this in your testing:

$data =~ /[^\xff].*[^\xff]/s and my ( $first, $last ) = ( $-[ 0 ], $+[ 0
] - 1 );

John
--
use Perl;
program
fulfillment

Re: Index of first and last non-"/xff" in a long string

am 12.11.2007 21:17:39 von w.c.humann

On Nov 12, 8:57 pm, "John W. Krahn" wrote:
>
> Try this in your testing:
>
> $data =~ /[^\xff].*[^\xff]/s and my ( $first, $last ) = ( $-[ 0 ], $+[ 0
> ] - 1 );

Thanks John,

at least a hundred times faster than my attempt at pattern matching --
but still several times slower than tr/index/rindex.

Wolfram

Re: Index of first and last non-"/xff" in a long string

am 12.11.2007 21:37:27 von w.c.humann

On Nov 12, 8:56 pm, Ben Morrow wrote:
>
> On my machine, the benchmark below gives
>
> Rate subst sloop chop match reverse index C
> subst 403/s -- -90% -94% -96% -97% -97% -100%
> sloop 4078/s 912% -- -36% -58% -73% -73% -96%
> chop 6340/s 1474% 55% -- -35% -58% -59% -94%
> match 9754/s 2322% 139% 54% -- -36% -37% -91%
> reverse 15247/s 3686% 274% 140% 56% -- -1% -85%
> index 15376/s 3718% 277% 143% 58% 1% -- -85%
> C 104065/s 25739% 2452% 1541% 967% 583% 577% --
>
> so index is probably the best you're going to get without using C.

Hey, some great ideas here, thanks Ben.

Glad I had already found the fastest pure-perl solution :-)
(but 'reverse' is so close, the order might change per run...)

'Inline C' is great but less portable, and I'm mainly using this on
win32.

Wolfram

Re: Index of first and last non-"/xff" in a long string

am 12.11.2007 22:08:34 von Ben Morrow

Quoth w.c.humann@arcor.de:
> On Nov 12, 8:57 pm, "John W. Krahn" wrote:
> >
> > Try this in your testing:
> >
> > $data =~ /[^\xff].*[^\xff]/s and my ( $first, $last ) = ( $-[ 0 ], $+[ 0
> > ] - 1 );

D'oh! I knew my match didn't need to do so much backtracking...

> at least a hundred times faster than my attempt at pattern matching --
> but still several times slower than tr/index/rindex.

Interesting... which version of perl? With
This is perl, v5.8.8 built for i386-freebsd-64int

and adding this

innerm => sub {
local $_ = $str;
/[^\xff].*[^\xff]/s;
161 == ($-[0] + 1) or die "innerm \$+ failed: " . ($-[0] + 1);
10_160 == $+[0] or die "innerm \$- failed: " . $+[0];
},

to my previous benchmark, I get

Rate match index innerm C
match 9609/s -- -41% -53% -91%
index 16398/s 71% -- -21% -85%
innerm 20641/s 115% 26% -- -81%
C 109225/s 1037% 566% 429% --

though seriously increasing the number of trailing "\xff"s causes both
'match' and 'innerm' to perform dramatically badly, so maybe this is an
artefact of my test string.

Ben

Re: Index of first and last non-"/xff" in a long string

am 13.11.2007 12:34:23 von w.c.humann

On Nov 12, 10:08 pm, Ben Morrow wrote:
> Interesting... which version of perl? With
> This is perl, v5.8.8 built for i386-freebsd-64int

Mine is:
This is perl, v5.8.7 built for MSWin32-x86-multi-thread

>
> and adding this
>
> innerm => sub {
> local $_ = $str;
> /[^\xff].*[^\xff]/s;
> 161 == ($-[0] + 1) or die "innerm \$+ failed: " . ($-[0] + 1);
> 10_160 == $+[0] or die "innerm \$- failed: " . $+[0];
> },
>
> to my previous benchmark, I get
>
> Rate match index innerm C
> match 9609/s -- -41% -53% -91%
> index 16398/s 71% -- -21% -85%
> innerm 20641/s 115% 26% -- -81%
> C 109225/s 1037% 566% 429% --

Well, with your test-string I get:

Rate subst sloop chop match index reverse
innerm
subst 1306/s -- -91% -94% -94% -94% -95%
-97%
sloop 13917/s 965% -- -33% -41% -41% -52%
-72%
chop 20894/s 1499% 50% -- -11% -11% -28%
-59%
match 23417/s 1693% 68% 12% -- -1% -19%
-54%
index 23561/s 1704% 69% 13% 1% -- -18%
-53%
reverse 28902/s 2112% 108% 38% 23% 23% --
-43%
innerm 50510/s 3767% 263% 142% 116% 114% 75%
--

so indeed 'innerm' wins, but...

> though seriously increasing the number of trailing "\xff"s causes both
> 'match' and 'innerm' to perform dramatically badly, so maybe this is an
> artefact of my test string.

typical files have blank lines at top and bottom and if I modify your
script like this:

# lenght of left, middle and right part of string
#my ($l,$m,$r) = (160, 10_000, 150);
my ($l,$m,$r) = (5000, 2, 5000);

my $str = ("\xff" x $l) . ("f" x $m) . ("\xff" x $r);
my $len = length $str;

cmpthese -3, {
match => sub {
local $_ = $str;
# this is the fastest single match I can come up with
/^((?>\xff*)).*[^\xff]((?>\xff*))$/;
$l+1 == 1 + $+[1] or die "\$+ failed: " . (1 + $+[1]);
$l+$m == $-[2] or die "\$- failed: " . $-[2];
},
etc.

the result is:

Rate subst sloop chop match innerm
reverse index
subst 1.44/s -- -100% -100% -100% -100%
-100% -100%
sloop 462/s 32117% -- -37% -75% -82%
-86% -98%
chop 737/s 51284% 59% -- -60% -71%
-78% -96%
match 1844/s 128383% 299% 150% -- -28%
-45% -91%
innerm 2575/s 179348% 457% 249% 40% --
-23% -87%
reverse 3352/s 233498% 625% 355% 82% 30%
-- -84%
index 20452/s 1424967% 4323% 2673% 1009% 694%
510% --

Stange enough with my ($l,$m,$r) = (5000, 1, 5000); 'innerm' fails!?

An optimization not considered so far: Once I've found a left and
right bound in a line, I only need to check from the edges up these
bounds in all following lines, because my bounding-box can only grow
(and never shrink) while checking further lines. As a faked test I've
used this:

index2 => sub {
my $left = substr($str,0,$l+500);
my $right = substr($str,$l+$m-500,$r+500);
$left =~ tr,\x00-\xfe,\x00,;
$right =~ tr,\x00-\xfe,\x00,;
$l+1 == 1 + index $left, "\x00"
or die "index failed: " . (1 + index $left, "\x00");
500 == 1 + rindex $right, "\x00"
or die "rindex failed: " . (1 + rindex $right, "\x00");
},

The potential savings are big but of course highly dependent on the
actual image:
Rate index reverse innerm index2
index 23924/s -- -17% -53% -82%
reverse 28749/s 20% -- -43% -79%
innerm 50574/s 111% 76% -- -63%
index2 136616/s 471% 375% 170% --

Wolfram

Re: Index of first and last non-"/xff" in a long string

am 13.11.2007 17:50:42 von krahnj

w.c.humann@arcor.de wrote:
>
> On Nov 12, 10:08 pm, Ben Morrow wrote:
> > Interesting... which version of perl? With
> > This is perl, v5.8.8 built for i386-freebsd-64int
> >
> > and adding this
> >
> > innerm => sub {
> > local $_ = $str;
> > /[^\xff].*[^\xff]/s;

[ SNIP ]

> Stange enough with my ($l,$m,$r) = (5000, 1, 5000); 'innerm' fails!?

Not strange at all. The pattern has to match at least two [^\xff]
characters.

John
--
use Perl;
program
fulfillment