Problem with reg expression

on 05.09.2007 04:09:19 by Peter Jamieson

#I want my script to parse HTML tables such as the one included below:

#!/usr/bin/perl -w
use strict;
use warnings;

my $moggy = '









RADIO TAB 3-2 520M
class="TrackCond">FINE
GOOD
';

# I tried this

$_ = $moggy;
my ($d,$e,$f);
$d=''; $e=''; $f='';

($d,$e,$f) = /TrackCond(.*)<\/TD>/g;

print "d ",$d," e ",$e," f ",$f,"\n";


This produces for $d
> 520M FINE class="TrackCondR">GOOD
and no value for $e or $f

I would have expected
$d to be: > 520M, $e to be: ">FINE and $f: to be R">GOOD

Can anyone suggest why I don't get this and where I am going wrong here?
Any comments appreciated!

Re: Problem with reg expression

on 05.09.2007 04:35:41 by Lars Eighner

In our last episode, the
lovely and talented Peter Jamieson broadcast on comp.lang.perl.misc:

> #I want my script to parse HTML tables such as the one included below:

> #!/usr/bin/perl -w
> use strict;
> use warnings;

> my $moggy = '


>
>
>
>
>
>

>
RADIO TAB 3-2 520M
> class="TrackCond">FINE
GOOD
';

> # I tried this

> $_ = $moggy;
> my ($d,$e,$f);
> $d=''; $e=''; $f='';

> ($d,$e,$f) = /TrackCond(.*)<\/TD>/g;

> print "d ",$d," e ",$e," f ",$f,"\n";


> This produces for $d
> > 520M FINE > class="TrackCondR">GOOD
> and no value for $e or $f

> I would have expected
> $d to be: > 520M, $e to be: ">FINE and $f: to be R">GOOD

> Can anyone suggest why I don't get this and where I am going wrong here?
> Any comments appreciated!


First, regexes are extremely difficult to use to parse HTML. Use
the HTML::Parser module. (Yes, if you are a regex expert and know the
files you are working with, sometimes you can use quick and dirty
expressions for a particular ad hoc task, but if the nature of the files
changes, your quick and dirty solution from last week is likely to be broken
this week.)

Second, regex quantifiers are greedy by default. Left unmodified they will
make the largest match possible, which is to say .* will not stop at the
first occurrence of </TD> but will take everything up to the last </TD>. You
can consult the manual for ways of modifying this behavior, but it is still
not the way to parse HTML.

Third, what exactly did you think the values of $e and $f would be?
The assignment ($d,$e,$f) = /TrackCond(.*)<\/TD>/g; is nonsense whether
you are parsing html or a grocery list.

--
Lars Eighner
Countdown: 503 days to go.
What do you do when you're debranded?

Re: Problem with reg expression

on 05.09.2007 05:06:57 by Gunnar Hjalmarsson

Peter Jamieson wrote:
> #I want my script to parse HTML tables such as the one included below:
>
> #!/usr/bin/perl -w
> use strict;
> use warnings;
>
> my $moggy = '


>
>
>
>
>
>
>
>
RADIO TAB 3-2 520M >
> class="TrackCond">FINE
GOOD
';
>
> # I tried this
>
> $_ = $moggy;
> my ($d,$e,$f);
> $d=''; $e=''; $f='';
>
> ($d,$e,$f) = /TrackCond(.*)<\/TD>/g;
>
> print "d ",$d," e ",$e," f ",$f,"\n";
>
>
> This produces for $d
> > 520M FINE > class="TrackCondR">GOOD
> and no value for $e or $f
>
> I would have expected
> $d to be: > 520M, $e to be: ">FINE and $f: to be R">GOOD
>
> Can anyone suggest why I don't get this

Because regexes are greedy by default.

($d,$e,$f) = /TrackCond(.*?)<\/TD>/g;
--------------------------^
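
For anyone following along, a minimal sketch of the difference. The table markup from the original post is not reproduced here, so the $html string below is a made-up stand-in, not the OP's actual data:

```perl
use strict;
use warnings;

# Hypothetical sample row (an assumption; the real markup is not shown above).
my $html = '<TD class=TrackCond> 520M</TD>'
         . '<TD class="TrackCond">FINE</TD>'
         . '<TD class="TrackCondR">GOOD</TD>';

# Greedy: .* runs on to the last </TD>, so the whole row is one match.
my @greedy = $html =~ /TrackCond(.*)<\/TD>/g;

# Non-greedy: .*? stops at the first </TD> after each "TrackCond".
my @lazy = $html =~ /TrackCond(.*?)<\/TD>/g;

print scalar(@greedy), " greedy match(es)\n";
print scalar(@lazy), " non-greedy match(es): ", join(' | ', @lazy), "\n";
```

On this sample the greedy pattern returns a single capture that runs through to the last </TD>, while the non-greedy one returns three captures, which is the shape of result the OP was expecting.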

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

Re: Problem with reg expression

on 05.09.2007 05:09:44 by Gunnar Hjalmarsson

Lars Eighner wrote:
> In our last episode, the
> lovely and talented Peter Jamieson broadcast on comp.lang.perl.misc:
>>
>> I would have expected
>> $d to be: > 520M, $e to be: ">FINE and $f: to be R">GOOD
>
> Third, what exactly did you think the values of $e and $f would be?

The OP already let us know that, didn't he?

> The assignment ($d,$e,$f) = /TrackCond(.*)<\/TD>/g; is nonsense whether
> you are parsing html or a grocery list.

Why?


Re: Problem with reg expression

on 05.09.2007 05:38:12 by Lars Eighner

In our last episode,
<5k6l00F283tnU2@mid.individual.net>,
the lovely and talented Gunnar Hjalmarsson
broadcast on comp.lang.perl.misc:

> Lars Eighner wrote:
>> In our last episode, the
>> lovely and talented Peter Jamieson broadcast on comp.lang.perl.misc:
>>>
>>> I would have expected
>>> $d to be: > 520M, $e to be: ">FINE and $f: to be R">GOOD
>>
>> Third, what exactly did you think the values of $e and $f would be?

> The OP already let us know that, didn't he?

>> The assignment ($d,$e,$f) = /TrackCond(.*)<\/TD>/g; is nonsense whether
>> you are parsing html or a grocery list.

> Why?

because .* will eat everything that matches (if anything does) so
$e and $f will always be empty (and $d will be empty if there is no match).



Re: Problem with reg expression

on 05.09.2007 05:47:23 by Peter Jamieson

>> #I want my script to parse HTML tables such as the one included below:
>
>> #!/usr/bin/perl -w
>> use strict;
>> use warnings;
>
>> my $moggy = '


>>
>>
>>
>>
>>
>>
>>
>
>>
RADIO TAB 3-2 >>class=Tips> 520M >
>> class="TrackCond">FINE
GOOD
';
>
>> # I tried this
>
>> $_ = $moggy;
>> my ($d,$e,$f);
>> $d=''; $e=''; $f='';
>
>> ($d,$e,$f) = /TrackCond(.*)<\/TD>/g;
>
>> print "d ",$d," e ",$e," f ",$f,"\n";
>
>
>> This produces for $d
>> > 520M FINE >> class="TrackCondR">GOOD
>> and no value for $e or $f
>
>> I would have expected
>> $d to be: > 520M, $e to be: ">FINE and $f: to be R">GOOD
>
>> Can anyone suggest why I don't get this and where I am going wrong here?
>> Any comments appreciated!
>
>
> First, regexes are extremely difficult to use to parse html. Use
> the HTML::Parser module. (Yes, if you are a regex expert and know the
> files you are working with, sometimes you can use quick and dirty
> expressions for a particular ad hoc task, but if the nature of the files
> change, your quick and dirty solution from last week is likely to be
> broken this week.)

Thanks for the suggestion, Lars. I will have a look at the HTML::Parser module.
I have used my script for over 2 years on 62000 tables, and this is one of very
few failures, so I am not too unhappy with it. If HTML::Parser beats this then
I'll be very pleased.

> Second, regexes are naturally greedy. Left unmodified they will make the
> largest match possible, which is to say .* will not stop at the first
> occurrence of </TD> but will do everything up to the last </TD>. You
> can consult the manual for ways of modifying this behavior, but it is
> still not the way to parse HTML.

Yes I hear what you claim but my script has performed very well so far,
perhaps I was lucky.

> Third, what exactly did you think the values of $e and $f would be?

Perhaps you failed to read that part of the post? I stated quite explicitly
what I thought the values should be, as a guide to any would-be helper.

> The assignment ($d,$e,$f) = /TrackCond(.*)<\/TD>/g; is nonsense whether
> you are parsing html or a grocery list.

With "use strict" and "use warnings" enabled I have been getting no warning
messages, and the output sent to my db is exactly as expected except for the
one table above amongst many thousands.
Cheers and thanks for the advice to use HTML::Parser. I will have a look at it.

Re: Problem with reg expression

on 05.09.2007 09:20:34 by Mirco Wahab

Peter Jamieson wrote:
> ($d,$e,$f) = /TrackCond(.*)<\/TD>/g;
>
> print "d ",$d," e ",$e," f ",$f,"\n";
> I would have expected
> $d to be: > 520M, $e to be: ">FINE and $f: to be R">GOOD
> Can anyone suggest why I don't get this and where I am going wrong here?

All has been said so far (all mysteries solved),
but I'd straighten up the whole thing a little bit:

...
my $moggy = '



...
...

';

my ($d, $e, $f) = ('','',''); # why is this necessary at all?

($d, $e, $f) = $moggy =~ /TrackCon[^>]+>\s*(.+?)<\/TD>/g;

print "d=>'$d', e=>'$e', f=>'$f'\n"; # expand scalars in quotes
...

You don't need to put things into $_ in order
to get regular expressions applied; a $var =~ /regex/
will do fine. Furthermore, you can use [^>]+> if
you want to jump to the end of the opening tag of
any "TrackCond" variation.
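
A runnable sketch of the same idea, assuming a hypothetical row of three cells (the real markup is elided above, so the contents of $moggy here are invented for illustration):

```perl
use strict;
use warnings;

# Hypothetical markup standing in for the original table row.
my $moggy = '<TD class=TrackCond> 520M</TD>'
          . '<TD class="TrackCond">FINE</TD>'
          . '<TD class="TrackCondR">GOOD</TD>';

# Bind the regex to $moggy directly, no $_ needed.
# [^>]+> skips whatever is left of the opening tag, and \s* drops
# leading whitespace, so only the cell text is captured.
my ($d, $e, $f) = $moggy =~ /TrackCon[^>]+>\s*(.+?)<\/TD>/g;

print join(', ', "d=$d", "e=$e", "f=$f"), "\n";
```

On this sample the three variables end up holding just the cell text (520M, FINE, GOOD) rather than fragments of the class attribute.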


Regards

M.

Re: Problem with reg expression

on 05.09.2007 11:48:06 by Gunnar Hjalmarsson

Lars Eighner wrote:
> In our last episode,
> <5k6l00F283tnU2@mid.individual.net>,
> the lovely and talented Gunnar Hjalmarsson
> broadcast on comp.lang.perl.misc:
>> Lars Eighner wrote:
>>> The assignment ($d,$e,$f) = /TrackCond(.*)<\/TD>/g; is nonsense whether
>>> you are parsing html or a grocery list.
>>
>> Why?
>
> because .* will eat everything that matches (if anything does) so
> $e and $f will always be empty (and $d will be empty if there is no match).

I thought you had covered the greediness thing in your "Second"
comment... The failure to make .* non-greedy doesn't make the whole
statement "nonsense" IMO.


Re: Problem with reg expression

on 05.09.2007 18:50:13 by Sherm Pendley

Gunnar Hjalmarsson writes:

> Lars Eighner wrote:
>
>> The assignment ($d,$e,$f) = /TrackCond(.*)<\/TD>/g; is nonsense whether
>> you are parsing html or a grocery list.
>
> Why?

In list context, m// returns the captured subexpressions. There are three
elements in the list being assigned to, but only one captured subexpression
in the regex.

sherm--

--
Web Hosting by West Virginians, for West Virginians: http://wv-www.net
Cocoa programming in Perl: http://camelbones.sourceforge.net

Re: Problem with reg expression

on 05.09.2007 20:03:58 by Sherm Pendley

Sherm Pendley writes:

> Gunnar Hjalmarsson writes:
>
>> Lars Eighner wrote:
>>
>>> The assignment ($d,$e,$f) = /TrackCond(.*)<\/TD>/g; is nonsense whether
>>> you are parsing html or a grocery list.
>>
>> Why?
>
> In list context, m// returns the captured subexpressions. There are three
> elements in the list being assigned to, but only one captured subexpression
> in the regex.

Sorry, my bad - I didn't notice the 'g' modifier. That will allow multiple
matches of the subexpression to be captured, and returned as a list.
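
To illustrate the corrected point, here is a small sketch comparing the same match with and without /g in list context. The input string $text is made up for the example and is not from the thread:

```perl
use strict;
use warnings;

my $text = 'a=1 b=2 c=3';   # invented input for illustration

# Without /g, list context returns the captures of the first match only.
my @first = $text =~ /=(\d)/;

# With /g, list context returns the capture from every match.
my @all = $text =~ /=(\d)/g;

print "without /g: @first\n";
print "with /g:    @all\n";
```

So a three-element list on the left of the assignment is reasonable with /g, provided the pattern can actually match three times.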

Note to self: Never post before coffee.

sherm--


Re: Problem with reg expression

on 05.09.2007 20:27:23 by Lars Eighner

In our last episode, the lovely and talented
Sherm Pendley broadcast on comp.lang.perl.misc:

> Sherm Pendley writes:

>> Gunnar Hjalmarsson writes:
>>
>>> Lars Eighner wrote:
>>>
>>>> The assignment ($d,$e,$f) = /TrackCond(.*)<\/TD>/g; is nonsense whether
>>>> you are parsing html or a grocery list.
>>>
>>> Why?
>>
>> In list context, m// returns the captured subexpressions. There are three
>> elements in the list being assigned to, but only one captured subexpression
>> in the regex.

> Sorry, my bad - I didn't notice the 'g' modifier. That will allow multiple
> matches of the subexpression to be captured, and returned as a list.

Well, no, you were right the first time, if for the wrong reasons. Because
of .* being unmodified, this kind of expression can never produce more than
one match, no matter how many g's you stick on the end. That is why it is
nonsense: putting a g on the end of something that can match at most once is
nonsense.

Something(.*)anotherthing can produce at most one match. The usual culprit
is the . because it matches just anything. Many times it does not have to
be . and replacing . with a bracketed range will help. In this case, for
example [^<]* has a chance of producing several matches. They would not
necessarily be right because in HTML a different tag could be nested in the
TD element, but you would be right to think there could be more than one
match, so /g would make sense.
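
A sketch of that substitution, again on invented sample rows ($html and $nested are assumptions, not the OP's markup). It also shows the caveat about a tag nested inside the TD:

```perl
use strict;
use warnings;

# Hypothetical row with plain text cells.
my $html = '<TD class=TrackCond> 520M</TD>'
         . '<TD class="TrackCond">FINE</TD>'
         . '<TD class="TrackCondR">GOOD</TD>';

# Hypothetical cell with another tag nested inside it.
my $nested = '<TD class="TrackCond"><B>FINE</B></TD>';

# [^<]* cannot run past the next "<", so each match stays inside one cell.
my @cells  = $html   =~ /TrackCond([^<]*)<\/TD>/g;
my @missed = $nested =~ /TrackCond([^<]*)<\/TD>/g;

print scalar(@cells), " match(es) in the plain row\n";
print scalar(@missed), " match(es) in the nested row\n";
```

The plain row gives three matches; the nested row gives none, because [^<]* stops at the <B> and the pattern then requires </TD> at that spot.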

> Note to self: Never post before coffee.

--
Lars Eighner
Countdown: 502 days to go.
What do you do when you're debranded?

Re: Problem with reg expression

on 05.09.2007 22:13:15 by Gunnar Hjalmarsson

Lars Eighner wrote:
> Because of .* being unmodified, this kind of expression can never
> produce more than one match, not matter how many g's you stick on
> the end.



> Something(.*)anotherthing can produce at most one match. The usual
> culprit is the . because it matches just anything.

Those statements are not true. Without the /s modifier, the . matches
any character but a newline.

C:\home>type test.pl
my $list = <<EOL;
1. Milk
2. Sugar
3. Apples
EOL

my @items = $list =~ /\d+\.\s+(.*)/g;

print join(', ', @items), "\n";

C:\home>test.pl
Milk, Sugar, Apples

C:\home>
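
A related sketch, not part of the test above: with the /s modifier the . matches newlines as well, and the same three-line list then collapses into a single greedy match.

```perl
use strict;
use warnings;

my $list = <<EOL;
1. Milk
2. Sugar
3. Apples
EOL

# Without /s, .* stops at each newline: one item per line.
my @per_line = $list =~ /\d+\.\s+(.*)/g;

# With /s, .* also crosses newlines and swallows the rest of the string.
my @with_s = $list =~ /\d+\.\s+(.*)/gs;

print scalar(@per_line), " item(s) without /s\n";
print scalar(@with_s), " item(s) with /s\n";
```

So whether an unmodified .* can match more than once depends on the data and the modifiers, not on the .* alone.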

> Many times it does not have to
> be . and replacing . with a bracketed range will help.

That, OTOH, is true.


Re: Problem with reg expression

on 05.09.2007 22:53:40 by Lars Eighner

In our last episode,
<5k8gv4F2ke1oU1@mid.individual.net>,
the lovely and talented Gunnar Hjalmarsson
broadcast on comp.lang.perl.misc:

> Lars Eighner wrote:
>> Because of .* being unmodified, this kind of expression can never
>> produce more than one match, not matter how many g's you stick on
>> the end.

>

>> Something(.*)anotherthing can produce at most one match. The usual
>> culprit is the . because it matches just anything.

> Those statements are not true. Without the /s modifier, the . matches
> any character but a newline.

> C:\home>type test.pl
> my $list = <<EOL;
> 1. Milk
> 2. Sugar
> 3. Apples
> EOL

The OP would not have been in trouble if he had convenient line breaks,
but

#!/usr/local/bin/perl

my $list = <<EOL;
1. Milk 2. Sugar 3. Apples
EOL

my @items = $list =~ /\d+\.\s+(.*)/g;

foreach my $thing (@items){
    print "$thing |";
}
print "\n";

yields:

Milk 2. Sugar 3. Apples |

or in other words, only one match.
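
One possible way to get three items out of the single-line list, offered only as a sketch: make the quantifier non-greedy and stop each item just before the next "number-dot" marker with a lookahead. The pattern below is an illustration and was not proposed by anyone in the thread.

```perl
use strict;
use warnings;

my $list = "1. Milk 2. Sugar 3. Apples\n";

# .*? is non-greedy; the lookahead ends each item right before the next
# "number-dot" marker, or before trailing whitespace at the end of the string.
my @items = $list =~ /\d+\.\s+(.*?)(?=\s+\d+\.|\s*\z)/g;

print join(' | ', @items), "\n";
```

This relies on the items themselves never containing something that looks like " 2.", which is exactly the kind of assumption that makes such patterns fragile.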


> my @items = $list =~ /\d+\.\s+(.*)/g;

> print join(', ', @items), "\n";

> C:\home>test.pl
> Milk, Sugar, Apples

> C:\home>

>> Many times it does not have to
>> be . and replacing . with a bracketed range will help.

> That, OTOH, is true.


Re: Problem with reg expression

on 06.09.2007 05:11:23 by Peter Jamieson

"Gunnar Hjalmarsson" wrote in message
news:5k6kqpF283tnU1@mid.individual.net...
> Peter Jamieson wrote:
>> #I want my script to parse HTML tables such as the one included below:
>>
>> #!/usr/bin/perl -w
>> use strict;
>> use warnings;
>>
>> my $moggy = '


>>
>>
>>
>>
>>
>>
>>
>>
>>
RADIO TAB 3-2 >> class=Tips> 520M >>
>> class="TrackCond">FINE
GOOD
';
>>
>> # I tried this
>>
>> $_ = $moggy;
>> my ($d,$e,$f);
>> $d=''; $e=''; $f='';
>>
>> ($d,$e,$f) = /TrackCond(.*)<\/TD>/g;
>>
>> print "d ",$d," e ",$e," f ",$f,"\n";
>>
>>
>> This produces for $d
>> > 520M FINE >> class="TrackCondR">GOOD
>> and no value for $e or $f
>>
>> I would have expected
>> $d to be: > 520M, $e to be: ">FINE and $f: to be R">GOOD
>>
>> Can anyone suggest why I don't get this
>
> Because regexes are greedy by default.
>
> ($d,$e,$f) = /TrackCond(.*?)<\/TD>/g;
> ------------------------------^

Thanks Gunnar! Fixed the errant table immediately....brilliant!....
case of cyber-beer on its way!....I should have seen this .....alas
too much Merlot last night.
Cheers and thanks again.

Re: Problem with reg expression

on 06.09.2007 05:30:37 by Peter Jamieson

"Mirco Wahab" wrote in message
news:fbllc1$19p$1@nserver.hrz.tu-freiberg.de...
> Peter Jamieson wrote:
>> ($d,$e,$f) = /TrackCond(.*)<\/TD>/g;
>>
>> print "d ",$d," e ",$e," f ",$f,"\n";
>> I would have expected
>> $d to be: > 520M, $e to be: ">FINE and $f: to be R">GOOD
>> Can anyone suggest why I don't get this and where I am going wrong here?
>
> All has been said so far (all mysteries solved),
> but I'd straighten up the whole thing a little bit:
>
> ...
> my $moggy = '
>


>
> ...
> ...
>
>
';
>
> my ($d, $e, $f) = ('','',''); # why is this necessary at all?
>
> ($d, $e, $f) = $moggy =~ /TrackCon[^>]+>\s*(.+?)<\/TD>/g;
>
> print "d=>'$d', e=>'$e', f=>'$f'\n"; # expand scalars in quotes
> ...
>
> You don't need to put things into $_ in order
> to get regular expressions applied, a $var =~ /regex/
> will do fine. Furthermore, you can use [^>]+> if
> you want to jump to the end of the opening tag of
> any "TrackCond" variation.
>
>
> Regards
>
> M.

Thanks Mirco!
Your comments and code suggestions have been most helpful
and I will incorporate your ideas.
Despite what has been said by others my script has collected
approx 50K pages of data with only one or two failures
and no warnings.
I'm only a Perl newbie. Your suggestions are instructive.
Thanks again!