Re: HTML::HeadParser

on 16.11.2004 00:12:02 by gisle

"David Hofmann" writes:

> I'm currently using your perl module for processing input from a
> spider I wrote.
>
> The problem I'm encountering is some pages have <> in the title.
>
> Example HTML:
>
> 274500 - XL: "Save Changes in <Bookname>" Prompt Even If No
> Changes Are Made
>
> The result I get back is "XL: "Save Changes in ". Also the
> description, keywords and last-modified come back blank on these pages
> if they were after the title on the page.

It looks like most other browsers parse the content of <title> in what
the HTML::Parser sources call literal mode. I've now applied the
following patch to my sources, but I'm not really sure this is a good
idea. I might still decide to revert it before release.

Index: hparser.c
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/hparser.c,v
retrieving revision 2.98
retrieving revision 2.99
diff -u -p -u -r2.98 -r2.99
--- hparser.c 11 Nov 2004 10:12:51 -0000 2.98
+++ hparser.c 15 Nov 2004 22:19:49 -0000 2.99
@@ -1,4 +1,4 @@
-/* $Id: hparser.c,v 2.98 2004/11/11 10:12:51 gisle Exp $
+/* $Id: hparser.c,v 2.99 2004/11/15 22:19:49 gisle Exp $
  *
  * Copyright 1999-2002, Gisle Aas
  * Copyright 1999-2000, Michael A. Chase
@@ -27,6 +27,7 @@ literal_mode_elem[] =
     {5, "style", 1},
     {3, "xmp", 1},
     {9, "plaintext", 1},
+    {5, "title", 0},
     {8, "textarea", 0},
     {0, 0, 0}
 };

The problem here is that other browsers seem to switch into a mode
where tags inside <title> are still recognized if no end tag
was found in the document. HTML-Parser does not have the brains to do
something like this. It tries to parse the document in a stream-like
fashion, and buffering all of it to figure out which quirk mode to
parse in does not seem attractive.
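
For readers who have not looked at hparser.c, here is a rough, standalone sketch of how a table like literal_mode_elem could drive literal mode. It is not the actual HTML::Parser code. The field names (len, str, is_cdata), the helper names find_literal_elem and find_literal_end, and the reading of the third column as "raw text, entities not decoded" are all assumptions made for illustration; the patch itself only shows three positional values per row, and only the rows visible in the hunk are copied here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Field names are assumed; the patch only shows three positional values. */
struct literal_tag {
    int len;          /* length of the element name */
    const char *str;  /* lowercase element name */
    int is_cdata;     /* assumed meaning: 1 = raw text, entities not decoded */
};

/* Only the rows visible in the patch hunk; the real table may have more. */
static const struct literal_tag literal_mode_elem[] = {
    {5, "style", 1},
    {3, "xmp", 1},
    {9, "plaintext", 1},
    {5, "title", 0},
    {8, "textarea", 0},
    {0, 0, 0}
};

/* Hypothetical helper: return the table entry for a start tag name of the
 * given length, or NULL if the element does not switch to literal mode.
 * The comparison is case-insensitive. */
static const struct literal_tag *
find_literal_elem(const char *name, size_t len)
{
    const struct literal_tag *t;
    size_t i;

    assert(name != NULL);
    for (t = literal_mode_elem; t->len; t++) {
        if ((size_t)t->len != len)
            continue;
        for (i = 0; i < len; i++) {
            if (tolower((unsigned char)name[i]) != t->str[i])
                break;
        }
        if (i == len)
            return t;
    }
    return NULL;
}

/* Hypothetical helper: in literal mode nothing is treated as markup until
 * the matching end tag. Return a pointer to its "</name", or NULL if the
 * text contains no such end tag. */
static const char *
find_literal_end(const char *text, const struct literal_tag *elem)
{
    const char *p = text;
    int i;

    assert(text != NULL && elem != NULL);
    while ((p = strstr(p, "</")) != NULL) {
        const char *q = p + 2;
        for (i = 0; i < elem->len; i++) {
            /* A terminating NUL never matches, so this stays in bounds. */
            if (tolower((unsigned char)q[i]) != elem->str[i])
                break;
        }
        if (i == elem->len)
            return p;
        p += 2;
    }
    return NULL;
}
```

With "title" in the table, the <Bookname> in David's example is passed through as plain text up to </title>. The NULL return from find_literal_end is the awkward case described above: a streaming parser cannot know that the end tag is missing until it has seen the whole document.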

Regards,
Gisle