Libhtml parser 3.43 ??
on 27.12.2004 18:53:00 by thesaltydog
I am seeing "strange" behaviour with libhtml-parser-perl v3.43.
The "strange" behaviour occurs ONLY on
this web page:
http://communicator.virgilio.it
Take this simple script:
###########################
#!/usr/bin/perl -w
use LWP::UserAgent;
use HTML::Form;
use Data::Dumper;
my $ua = LWP::UserAgent->new;
$ua->env_proxy;
my $url = "http://communicator.virgilio.it/";
my $response = $ua->get($url);
die "Can't get $url -- ", $response->status_line unless
$response->is_success;
my @forms = HTML::Form->parse($response->content, $response->base);
print Dumper(@forms);
###########################
It doesn't work with libhtml-parser-perl v3.43 (but it works with
previous versions). If you change $url to another address (e.g.
www.altavista.com), it DOES work. What is
strange about that page as far as the new module is concerned?
Re: Libhtml parser 3.43 ??
on 28.12.2004 14:55:22 by gisle
The Saltydog writes:
> I am experiencing a "strange" behaviour on libhtml-parser-perl v.3.43
>
> The "strange" behaviour is ONLY on
> this web page:
>
> http://communicator.virgilio.it
HTML::Parser got confused about how quoted strings nest when parsing
one of the script tags. This made it assume large parts of the
document to be the script element.
This buggy behaviour was introduced in v3.40 (v3.39_91). The
following patch fixes this problem and will be present in v3.44 when
ready. I expect that to happen soonish.
Regards,
Gisle
Index: hparser.c
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/hparser.c,v
retrieving revision 2.118
retrieving revision 2.119
diff -u -p -u -r2.118 -r2.119
--- hparser.c 2 Dec 2004 11:52:32 -0000 2.118
+++ hparser.c 28 Dec 2004 13:47:44 -0000 2.119
@@ -1,4 +1,4 @@
-/* $Id: hparser.c,v 2.118 2004/12/02 11:52:32 gisle Exp $
+/* $Id: hparser.c,v 2.119 2004/12/28 13:47:44 gisle Exp $
*
* Copyright 1999-2004, Gisle Aas
* Copyright 1999-2000, Michael A. Chase
@@ -1522,7 +1522,7 @@ parse_buf(pTHX_ PSTATE* p_state, char *b
inside_quote = 0;
else if (*s == '\r' || *s == '\n')
inside_quote = 0;
- else if (*s == '"' || *s == '\'')
+ else if (!inside_quote && (*s == '"' || *s == '\''))
inside_quote = *s;
}
}
Re: Libhtml parser 3.43 ??
on 28.12.2004 16:45:19 by thesaltydog
Ok. Thanks.
In the meantime, while waiting for v3.44, is there any way to make my
program work on that URL?
It runs on more than 20 computers in different parts of the
world, so it is hard to patch all of them.
If the new release is coming very soon, I can wait for it.