HTML::FormatText problem

on 06.05.2006 22:09:11 by EmmettPower

Hi,

I have a curious problem with HTML::FormatText and I wonder if anybody
can help me.

I have a bunch of patent documents in a local directory from which I am
extracting the title, abstract, etc. for each patent to insert into a
MySQL database. The core lines of the script where I am having problems
are:

use HTML::FormatText;
......
my $plain_page =
HTML::FormatText->new->format(parse_htmlfile($local_patent_file));
....do regex stuff with $plain_page...

This works fine - except - it seems - when the patent document contains
the string "##STR1##" which is used in the patent documents to
represent a complex formula. This seems to kill HTML::FormatText, in
other words $plain_page is undefined.

Obviously '#' is used in Perl to represent a comment, but I'd be surprised
if it affected HTML::FormatText in such a simple way. Maybe ##X## does
something special; I honestly don't know.
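
On the comment theory: '#' only starts a comment in Perl source code, not in data held in a string, so the placeholder itself should be harmless to Perl. A minimal sketch, using made-up sample text in place of a real formatted patent page and not touching HTML::FormatText at all:

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for a formatted patent page.
my $plain_page = "Title: Example compound\n"
               . "Abstract: A compound of formula ##STR1## wherein R is alkyl.\n";

# Inside a string or a regex (without the /x flag) '#' is an ordinary character.
my $has_placeholder = $plain_page =~ /##STR\d+##/ ? 1 : 0;

# Replace each placeholder with a readable marker before further regex work.
(my $cleaned = $plain_page) =~ s/##STR(\d+)##/[structure $1]/g;

print "placeholder found: $has_placeholder\n";
print $cleaned;
```

This only rules out Perl's comment syntax as the cause; it says nothing about how the HTML parser or the formatter handles the markup around the placeholder.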

If anybody has any suggestions, opinions, work-arounds or alternative
approaches, I'd be very grateful.

Thanks

Emmett