Re: HTML::HeadParser

on 16.11.2004 00:12:02 by gisle

"David Hofmann" writes:

> I'm currently using your perl module for processing input from a
> spider I wrote.
>
> The problem I'm encountering is some pages have <> in the title.
>
> Example HTML:
>
> <title>274500 - XL: "Save Changes in <Bookname>" Prompt Even If No
> Changes Are Made</title>
>
> The result I get back is "XL: "Save Changes in ". Also the
> description, keywords and last-modified come back blank on these pages
> if they were after the title on the page.

It looks like most other browsers parse <title> content in what the
HTML::Parser sources call literal mode.  I've now applied the following
patch to my sources, but I'm not really sure this is a good idea.  I
might still decide to revert it before release.

Index: hparser.c
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/hparser.c,v
retrieving revision 2.98
retrieving revision 2.99
diff -u -p -u -r2.98 -r2.99
--- hparser.c 11 Nov 2004 10:12:51 -0000 2.98
+++ hparser.c 15 Nov 2004 22:19:49 -0000 2.99
@@ -1,4 +1,4 @@
-/* $Id: hparser.c,v 2.98 2004/11/11 10:12:51 gisle Exp $
+/* $Id: hparser.c,v 2.99 2004/11/15 22:19:49 gisle Exp $
  *
  * Copyright 1999-2002, Gisle Aas
  * Copyright 1999-2000, Michael A. Chase
@@ -27,6 +27,7 @@ literal_mode_elem[] =
     {5, "style", 1},
     {3, "xmp", 1},
     {9, "plaintext", 1},
+    {5, "title", 0},
     {8, "textarea", 0},
     {0, 0, 0}
 };

The problem here is that other browsers seem to switch into a mode
where tags inside <title> are still recognized if no end tag was found
in the document.  HTML-Parser does not have the brains to do something
like this.  It tries to parse the document in a stream-like fashion,
and buffering all of it to figure out what quirk mode to parse in does
not seem attractive.
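
To make the idea of literal mode concrete, here is a minimal standalone
sketch in C.  It is not the code in hparser.c: the struct field names and
the functions lookup_literal() and find_literal_end() are hypothetical,
the table holds only the entries visible in the patch context above, and
the third field is copied without being interpreted.  It shows that once
an element is in the table, every '<' before its own end tag is treated
as text, and that a missing end tag cannot be detected from one chunk.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the table in the patch: name length, name, flag.
   Field names are assumptions; the patch only shows the values. */
struct literal_elem {
    size_t len;
    const char *name;
    int flag;
};

/* Only the entries visible in the patch context, plus the new "title". */
static const struct literal_elem literal_elems[] = {
    {5, "style", 1},
    {3, "xmp", 1},
    {9, "plaintext", 1},
    {5, "title", 0},
    {8, "textarea", 0},
    {0, NULL, 0}
};

/* Case-insensitive comparison of n bytes; b must hold n non-NUL bytes. */
static int ieq(const char *a, const char *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    }
    return 1;
}

/* Return the table entry for a tag name, or NULL if it is not literal. */
static const struct literal_elem *lookup_literal(const char *tag, size_t len)
{
    const struct literal_elem *e;
    assert(tag != NULL);
    for (e = literal_elems; e->len; e++) {
        if (e->len == len && ieq(tag, e->name, len))
            return e;
    }
    return NULL;
}

/* Scan literal content.  Returns a pointer to the "</name" that closes
   the element, or NULL if no end tag is present in this chunk.  Any
   other '<' seen on the way is plain text, not markup. */
static const char *find_literal_end(const char *text,
                                    const struct literal_elem *e)
{
    const char *p;
    assert(text != NULL && e != NULL);
    for (p = strchr(text, '<'); p != NULL; p = strchr(p + 1, '<')) {
        if (p[1] == '/' && ieq(p + 2, e->name, e->len)) {
            char c = p[2 + e->len];
            if (c == '>' || isspace((unsigned char)c))
                return p;
        }
    }
    return NULL;
}
```

With the "title" entry present, find_literal_end() skips over <Bookname>
and stops at </title>, so the whole title survives.  When it returns NULL
a streaming parser cannot tell whether the end tag is merely in a later
chunk or missing altogether, which is the case where buffering the whole
document would be needed.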

Regards,
Gisle