Angle Brackets remain when tags removed using HTML::TreeBuilder
On 13.06.2006 20:50:47, DMcGovern wrote:
Hello -
Hopefully, this is an easy one.
I have some ugly HTML like this:
style="font-family: Arial;"
face=Arial>MMCM4
I am trying to get rid of the tags using HTML::TreeBuilder.
Here is my script:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
my $filename = "test.htm";
# "or" binds more loosely than "||", so the die actually fires if open fails
open OUT, ">", "output.txt" or die "Can't open output.txt: $!";
my $root = HTML::TreeBuilder->new;
$root->ignore_text(0);
$root->ignore_ignorable_whitespace(0);
$root->no_space_compacting(1);
$root->parse_file($filename);
my @fonts = $root->look_down('_tag', 'font');
foreach my $font (@fonts) {
    $font->tag(undef);
    $font->attr('face',undef);
    $font->attr('style',undef);
}
print OUT $root->as_HTML("","",{});
$root->delete();
And here is what the output looks like:
<>MMCM4>
The problem is that although the font tags/attributes themselves are
removed, the angle bracket pairs <> and >
are left behind. This causes the starting <> to be rendered in the
browser.
I've tried using $font->detach and $font->delete, but these methods also
delete the text content, which must be preserved.
It seems there must be something obvious I am missing.
Thanks
Dave
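
A possible direction, offered as a sketch rather than a confirmed fix: setting the tag to undef appears to leave the element in the tree, so as_HTML still prints a pair of brackets around an empty name. HTML::Element documents a replace_with_content method that splices an element's children into its parent and detaches the element itself, which may be closer to what is wanted here. The helper name strip_font_tags and the sample markup below are hypothetical, made up for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not from the original script): unwrap every
# <font> element so that its children take its place in the tree.
sub strip_font_tags {
    my ($html) = @_;

    my $root = HTML::TreeBuilder->new;
    $root->ignore_text(0);
    $root->ignore_ignorable_whitespace(0);
    $root->no_space_compacting(1);
    $root->parse($html);
    $root->eof;

    foreach my $font ($root->look_down('_tag', 'font')) {
        # Move the children up into the parent, then free the
        # now-empty, detached <font> element.
        $font->replace_with_content->delete;
    }

    # Same as_HTML arguments as in the original script.
    my $out = $root->as_HTML("", "", {});
    $root->delete;
    return $out;
}

# Assumed sample input: nested font tags around the text.
print strip_font_tags(
    '<p><font style="font-family: Arial;"><font face="Arial">MMCM4</font></font></p>'
), "\n";
```

Under these assumptions the text node survives because it is re-parented before the font element is discarded, which is the difference from calling detach or delete on the element directly.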