Regular Expression

on 07.09.2007 14:28:56 by fritz-bayer

Hi,

I'm looking for a regular expression which will find a certain word
in a text and replace it if and only if it does not appear inside
an HTML link or inside a tag, for example as an attribute or tag name.

So, for example, the following text should not be matched and replaced:

....
WORD TO MATCH ..

but the following should be replaced

WORD TO MATCH

...

I guess I would have to use a positive lookahead or lookaround
construct to achieve this. I have tried, but could not come up with
anything that will do the job.

Can some pro help me out?

Fritz

Re: Regular Expression

on 07.09.2007 16:41:30 by Klaus

On Sep 7, 2:28 pm, "fritz-ba...@web.de" wrote:
> I'm looking for a regular expression which will find a certain word
> in a text and replace it if and only if it does not appear inside
> an HTML link or inside a tag

see Perlfaq 4 - How do I find matching/nesting anything?

==================================
This isn't something that can be done in one regular expression, no
matter how complicated. To find something between two single
characters, a pattern like /x([^x]*)x/ will get the intervening bits
in $1. For multiple ones, then something more like /alpha(.*?)omega/
would be needed. But none of these deals with nested patterns. For
balanced expressions using (, {, [ or < as delimiters, use the CPAN
module Regexp::Common, or see (??{ code }) in the perlre manpage. For
other cases, you'll have to write a parser.

If you are serious about writing a parser, there are a number of
modules or oddities that will make your life a lot easier. There are
the CPAN modules Parse::RecDescent, Parse::Yapp, and Text::Balanced;
and the byacc program. Starting from perl 5.8 the Text::Balanced is
part of the standard distribution.

One simple destructive, inside-out approach that you might try is to
pull out the smallest nesting parts one at a time:

    while (s/BEGIN((?:(?!BEGIN)(?!END).)*)END//gs) {
        # do something with $1
    }

A more complicated and sneaky approach is to make Perl's regular
expression engine do it for you. This is courtesy Dean Inada, and
rather has the nature of an Obfuscated Perl Contest entry, but it
really does work:

# $_ contains the string to parse
# BEGIN and END are the opening and closing markers for the
# nested text.

@( = ('(','');
@) = (')','');
($re=$_)=~s/((BEGIN)|(END)|.)/$)[!$3]\Q$1\E$([!$2]/gs;
@$ = (eval{/$re/},$@!~/unmatched/i);
print join("\n",@$[0..$#$]) if( $$[-1] );
==================================
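
As a rough illustration of the inside-out approach quoted above, here is a minimal sketch. The input string and the sub name peel_innermost are made up for the example, and the substitution is run without /g so that every pass exposes exactly one innermost part in $1.

```perl
use strict;
use warnings;

# Hypothetical helper: repeatedly remove the innermost BEGIN...END pair
# and return the captured parts (innermost first) plus what is left over.
sub peel_innermost {
    my ($text) = @_;
    my @parts;
    # The two lookaheads stop the capture from running across another
    # BEGIN or END, so only a pair with nothing nested inside can match.
    while ($text =~ s/BEGIN((?:(?!BEGIN)(?!END).)*)END//s) {
        push @parts, $1;
    }
    return (\@parts, $text);
}

# Made-up sample input with one level of nesting.
my ($parts, $rest) = peel_innermost('a BEGIN b BEGIN c END d END e');
print "part: [$_]\n" for @$parts;
print "rest: [$rest]\n";
```

The inner pair is reported before the outer one, and the outer capture no longer contains the text that was already pulled out, which is why the FAQ calls the approach destructive.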

--
Klaus

Re: Regular Expression

on 07.09.2007 16:43:23 by Benoit Lefebvre

On Sep 7, 8:28 am, "fritz-ba...@web.de" wrote:
> Hi,
>
> I'm looking for a regular expression which will find a certain word
> in a text and replace it if and only if it does not appear inside
> an HTML link or inside a tag, for example as an attribute or tag name.
>
> So, for example, the following text should not be matched and replaced:
>
> ....
> WORD TO MATCH ..
>
> but the following should be replaced
>
> WORD TO MATCH
>
> ...
>
> I guess I would have to use a positive lookahead or lookaround
> construct to achieve this. I have tried, but could not come up with
> anything that will do the job.
>
> Can some pro help me out?
>
> Fritz

I'm sure there is a much better way to do this, but here is a
solution that seems to work.

----------------8<--------------------------------------
#!/usr/bin/perl -w

use strict;

my $to_replace = "WORD";
my $replacement = "BLEH";

my @list = (" ....",
"WORD ..",
"

this is my WORD !

... ");

foreach my $line (@list) {
    # Only touch text that sits between a '>' and the next '<'.
    if ($line =~ m/>([^<]*$to_replace[^>]*)</) {
        my $match = $1;
        $match =~ s/$to_replace/$replacement/g;
        $line =~ s/>([^<]*$to_replace[^>]*)</>$match</;
    }
    print $line . "\n";
}
--------------------------------------------------------

output:
....
WORD ..

this is my BLEH !

...

Re: Regular Expression

on 07.09.2007 16:51:45 by fritz-bayer

On 7 Sep., 17:41, Klaus wrote:
> On Sep 7, 2:28 pm, "fritz-ba...@web.de" wrote:
>
> > I'm looking for a regular expression which will find a certain word
> > in a text and replace it if and only if it does not appear inside
> > an HTML link or inside a tag
>
> see Perlfaq 4 - How do I find matching/nesting anything?


Well, I wouldn't know whether it's possible, but positive and negative
lookaheads seem worth considering. The following shows how:

http://frank.vanpuffelen.net/2007/04/how-to-optimize-regular-expression.html

Re: Regular Expression

on 07.09.2007 17:28:58 by Klaus

On Sep 7, 4:51 pm, "fritz-ba...@web.de" wrote:
> On 7 Sep., 17:41, Klaus wrote:
>
> > On Sep 7, 2:28 pm, "fritz-ba...@web.de" wrote:
>
> > > I'm looking for a regular expression which will find a certain word
> > > in a text and replace it if and only if it does not appear inside
> > > an HTML link or inside a tag
>
> > see Perlfaq 4 - How do I find matching/nesting anything?

[ snip contents of Perlfaq 4 ]

> Well, I wouldn't know whether it's possible, but positive and negative
> lookaheads seem worth considering. The following shows how:
>
> http://frank.vanpuffelen.net/2007/04/how-to-optimize-regular-expression.html

The document claims:
" [...] apparently there aren't many good HTML parsers available
for .NET [...] "

That might be true for .NET, but as far as Perl is concerned, there
are many HTML parsers available on CPAN, and HTML::Parser looks
perfect for the job (although I would have to admit that I haven't yet
tested it myself) :

http://search.cpan.org/~gaas/HTML-Parser-3.56/Parser.pm

========================================
Here is an extract from the HTML::Parser documentation:
========================================
HTML::Parser is not a generic SGML parser. We have tried to make it
able to deal with the HTML that is actually "out there", and it
normally parses as closely as possible to the way the popular web
browsers do it instead of strictly following one of the many HTML
specifications from W3C. Where there is disagreement, there is often
an option that you can enable to get the official behaviour.

The document to be parsed may be supplied in arbitrary chunks. This
makes on-the-fly parsing as documents are received from the network
possible.

If event driven parsing does not feel right for your application, you
might want to use HTML::PullParser. This is an HTML::Parser subclass
that allows a more conventional program structure.
========================================
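
For what it's worth, a minimal sketch of how the event-driven interface described in that extract could be applied to the original question. It assumes HTML::Parser is installed from CPAN (it is not a core module); the sub name replace_outside_links and the sample markup are made up for the example, and it has only been tried on small inputs like the one shown.

```perl
use strict;
use warnings;
use HTML::Parser ();    # CPAN module, not part of core Perl

# Hypothetical helper: replace $word in text nodes only, skipping tags,
# attributes and anything inside <a>...</a>.
sub replace_outside_links {
    my ($html, $word, $replacement) = @_;
    my $out     = '';
    my $in_link = 0;    # how many <a> elements are currently open

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Tags are copied through verbatim; only track entering/leaving <a>.
        start_h => [ sub {
            my ($tag, $text) = @_;
            $in_link++ if $tag eq 'a';
            $out .= $text;
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            $in_link-- if $tag eq 'a' && $in_link > 0;
            $out .= $text;
        }, 'tagname, text' ],
        # Plain text is the only place where the replacement happens.
        text_h => [ sub {
            my ($text) = @_;
            $text =~ s/\Q$word\E/$replacement/g unless $in_link;
            $out .= $text;
        }, 'text' ],
        # Comments, declarations etc. are passed through unchanged.
        default_h => [ sub { $out .= $_[0] if defined $_[0] }, 'text' ],
    );
    $parser->unbroken_text(1);    # deliver each text run in one piece
    $parser->parse($html);
    $parser->eof;
    return $out;
}

# Made-up sample: the word appears in an attribute, in a link and in text.
print replace_outside_links(
    '<p title="WORD">WORD <a href="WORD.html">WORD</a> WORD</p>',
    'WORD', 'BLEH'
), "\n";
```

Because the parser decides what is a tag and what is text, the replacement itself can stay a trivial s///.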

--
Klaus

Re: Regular Expression

on 07.09.2007 20:17:49 by fritz-bayer

On 7 Sep., 18:28, Klaus wrote:
> The document claims:
> " [...] apparently there aren't many good HTML parsers available
> for .NET [...] "
>
> That might be true for .NET, but as far as Perl is concerned, there
> are many HTML parsers available on CPAN, and HTML::Parser looks
> perfect for the job (although I would have to admit that I haven't yet
> tested it myself) :
>
> http://search.cpan.org/~gaas/HTML-Parser-3.56/Parser.pm

I'm looking for a regular expression which is platform independent
and works for Java, Perl or .NET.

Re: Regular Expression

on 08.09.2007 00:20:20 by Ben Morrow

Quoth "fritz-bayer@web.de" :
>
> I'm looking for a regular expression [to parse HTML] which is
> platform independent and works for Java, Perl or .NET.

Here we go again. Clpmisc is for discussing Perl. If you want to
discuss Java or .NET their newsgroups are -->thataway.

In any case, regular expressions (and Perl5 regexps, which are not quite
the same thing) are not an appropriate tool to parse HTML with. If you
have a limited set of documents you may be able to hack up something
that works, but it will be fragile.
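
To make "fragile" concrete, here is a sketch of the kind of lookahead pattern that is often tried for this. It is not taken from this thread: the sub name replace_outside_tags and the sample strings are invented, and the heuristic simply treats a word as being inside a tag when a '>' follows it before any '<' does.

```perl
use strict;
use warnings;

# Hypothetical helper using a negative lookahead: skip any occurrence of
# $word that is followed by '>' with no '<' in between.
sub replace_outside_tags {
    my ($html, $word, $replacement) = @_;
    $html =~ s/\Q$word\E(?![^<>]*>)/$replacement/g;
    return $html;
}

# Tidy markup: the attribute is left alone, the text is replaced.
print replace_outside_tags('<p class="WORD">a WORD</p>', 'WORD', 'BLEH'), "\n";

# A bare '>' in running text makes the heuristic skip a word it should replace.
print replace_outside_tags('if WORD > 0 then <b>stop</b>', 'WORD', 'BLEH'), "\n";

# A '<' inside an attribute value makes it replace a word it should skip.
print replace_outside_tags('<img alt="WORD < 5">', 'WORD', 'BLEH'), "\n";
```

The first call behaves as hoped; the other two show how little it takes to get either a missed replacement or a damaged attribute.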

Now, did you have a Perl question?

Ben

Re: Regular Expression

on 08.09.2007 00:41:26 by Tad McClellan

fritz-bayer@web.de wrote:

> I'm looking for a regular expression which will find a certain word
> in a text and replace it if and only if it does not appear inside
> an HTML link or inside a tag, for example as an attribute or tag name.

> Can some pro help me out?


Sure.

A regular expression is not the Right Tool for this job.

Use a real parser instead.


--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"

Re: Regular Expression

on 08.09.2007 06:50:45 by Joe Smith

fritz-bayer@web.de wrote:

> I'm looking for a regular expression which is platform independent
> and works for Java, Perl or .NET.

I'd say you have an impossible task. The advanced parts of perl
regular expressions that almost do what you want are not implemented
the same way (if at all) on the other platforms.

-Joe

Re: Regular Expression

on 11.09.2007 11:05:23 by fritz-bayer

On 8 Sep., 07:50, Joe Smith wrote:
> fritz-ba...@web.de wrote:
> > I'm looking for a regular expression which is platform independent
> > and works for Java, Perl or .NET.
>
> I'd say you have an impossible task. The advanced parts of perl
> regular expressions that almost do what you want are not implemented
> the same way (if at all) on the other platforms.
>
> -Joe


What about finding all words which are not inside a href tag? So if
I'm looking for the word OUTSIDE, then it should match, if it's not
inside a href. So the following should not match


but this should match twice!

OUTSIDE OUTSIDE

Can somebody come up with a regular expression that does the job?

Re: Regular Expression

on 11.09.2007 13:47:29 by Tad McClellan

fritz-bayer@web.de wrote:
> On 8 Sep., 07:50, Joe Smith wrote:
>> fritz-ba...@web.de wrote:
>> > I'm looking for a regular expression which is platform independent
>> > and works for Java, Perl or .NET.
>>
>> I'd say you have an impossible task. The advanced parts of perl
>> regular expressions that almost do what you want are not implemented
>> the same way (if at all) on the other platforms.
>>
>> -Joe
>
>
> What about finding all words which are not inside a href tag? So if
> I'm looking for the word OUTSIDE, then it should match, if it's not
> inside a href. So the following should not match
>
>
> but this should match twice!
>
> OUTSIDE OUTSIDE


So the below should match twice also?



And the below should match once (since it does not appear in an anchor)?




> Can somebody come up with a regular expression that does the job?


A regular expression is not the Right Tool for this job.

Use a real parser instead.

Strip all of the anchor elements, then match against what remains.
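
A rough sketch of that two-step idea follows. For brevity the stripping step is done here with a naive substitution standing in for a real parser, so it assumes anchors are not nested and that no attribute value contains '>'; the sub name count_outside_anchors and the sample strings are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: count occurrences of $word outside <a>...</a>.
sub count_outside_anchors {
    my ($html, $word) = @_;
    # Step 1: drop every anchor element together with its contents.
    # This is a simplification; a parser would be the robust way to do it.
    (my $stripped = $html) =~ s{<a\b[^>]*>.*?</a\s*>}{}gis;
    # Step 2: match against what remains and count the hits.
    my $count = () = $stripped =~ /\b\Q$word\E\b/g;
    return $count;
}

# Made-up inputs: the word only inside a link, then twice outside one.
print count_outside_anchors('<a href="x.html">OUTSIDE</a>', 'OUTSIDE'), "\n";
print count_outside_anchors('OUTSIDE <a href="x.html">link</a> OUTSIDE', 'OUTSIDE'), "\n";
```

Note that step 2 still looks at the attributes of any other tags that remain, so this only answers the "not inside an anchor" version of the question.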


--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"