Re: HTML::HeadParser
on 16.11.2004 00:12:02 by gisle
"David Hofmann" writes:
> I'm currently using your perl module for processing input from a
> spider I wrote.
>
> The problem I'm encountering is some pages have <> in the title.
>
> Example HTML:
>
>
> Changes Are Made
>
> The result I get back is "XL: "Save Changes in ". Also the
> description, keywords and last-modified come back blank on these pages
> if they were after the title on the page.
It looks like most other browsers parse <title> content in what the
HTML::Parser sources call literal mode. I've now applied the
following patch to my sources, but I'm not really sure this is a good
idea. I might still decide to revert it before release.
Index: hparser.c
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/hparser.c,v
retrieving revision 2.98
retrieving revision 2.99
diff -u -p -u -r2.98 -r2.99
--- hparser.c 11 Nov 2004 10:12:51 -0000 2.98
+++ hparser.c 15 Nov 2004 22:19:49 -0000 2.99
@@ -1,4 +1,4 @@
-/* $Id: hparser.c,v 2.98 2004/11/11 10:12:51 gisle Exp $
+/* $Id: hparser.c,v 2.99 2004/11/15 22:19:49 gisle Exp $
*
* Copyright 1999-2002, Gisle Aas
* Copyright 1999-2000, Michael A. Chase
@@ -27,6 +27,7 @@ literal_mode_elem[] =
{5, "style", 1},
{3, "xmp", 1},
{9, "plaintext", 1},
+ {5, "title", 0},
{8, "textarea", 0},
{0, 0, 0}
};
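As a rough illustration of what the added table row does, here is a minimal sketch of a lookup against such a table. The struct field names (len, str, is_cdata) and the helper find_literal_elem are assumptions made for this example, since the patch only shows the values; the real table may also contain entries above the ones visible in the diff context. The third field is assumed to mean "content is raw CDATA", so 0 for title would mean entities inside the title are still decoded.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Field names are assumptions; the patch only shows the values. */
struct literal_tag {
    int len;          /* length of the element name */
    const char *str;  /* element name, lower case */
    int is_cdata;     /* assumed: 1 = raw CDATA, 0 = entities still decoded */
};

/* Only the rows visible in the diff, including the newly added "title". */
static const struct literal_tag literal_mode_elem[] = {
    {5, "style", 1},
    {3, "xmp", 1},
    {9, "plaintext", 1},
    {5, "title", 0},
    {8, "textarea", 0},
    {0, 0, 0}         /* sentinel: len == 0 ends the table */
};

/* Hypothetical helper: return the table entry for a start tag name,
 * compared case-insensitively, or NULL if the element does not switch
 * the parser into literal mode. */
static const struct literal_tag *
find_literal_elem(const char *name, size_t name_len)
{
    assert(name != NULL);
    for (const struct literal_tag *e = literal_mode_elem; e->len; e++) {
        if ((size_t)e->len != name_len)
            continue;
        size_t i = 0;
        while (i < name_len && tolower((unsigned char)name[i]) == e->str[i])
            i++;
        if (i == name_len)
            return e;
    }
    return NULL;
}
```

With the new row in place, find_literal_elem("title", 5) returns an entry, so a parser built this way would stop recognizing tags after <title> until it sees the matching end tag.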
The problem here is that other browsers seem to switch into a mode
where tags inside <title> are recognized when no </title>
was found in the document. HTML-Parser does not have the brains to do
something like this. It tries to parse the document in a stream-like
fashion, and buffering all of it to figure out which quirk mode to
parse in does not seem attractive.
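To make the trade-off concrete, the following is a minimal sketch of what literal mode amounts to once it has been entered: scan the available input for the matching end tag and treat everything before it as text. This is not the code in hparser.c; find_literal_end is a hypothetical name, and the sketch assumes the element name is passed in lower case.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return a pointer to the "</name" that closes a literal element, or NULL
 * if no such end tag occurs in buf[0..len). `name` must be lower case.
 * Anything else that looks like a tag (e.g. "<>" or "<b>") is skipped. */
static const char *
find_literal_end(const char *buf, size_t len, const char *name)
{
    size_t nlen = strlen(name);

    /* Need room for '<', '/', the name, and one delimiter character. */
    for (size_t i = 0; i + 2 + nlen < len; i++) {
        if (buf[i] != '<' || buf[i + 1] != '/')
            continue;

        size_t j = 0;
        while (j < nlen && tolower((unsigned char)buf[i + 2 + j]) == name[j])
            j++;
        if (j < nlen)
            continue;               /* some other end tag, e.g. "</b>" */

        unsigned char c = (unsigned char)buf[i + 2 + nlen];
        if (c == '>' || isspace(c))
            return buf + i;         /* found the closing tag */
    }
    return NULL;                    /* no end tag in the data seen so far */
}
```

A NULL result is the awkward case: a streaming parser that has seen only part of the document cannot tell whether the end tag is still to come or is missing entirely, which is why falling back to normal tag parsing would require buffering the whole document.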
Regards,
Gisle