Angle Brackets remain when tags removed using HTML::TreeBuilder

on 13.06.2006 20:50:47 by DMcGovern

Hello -
Hopefully, this is an easy one.

I have some ugly HTML like this:

style="font-family: Arial;"
face=Arial>MMCM4


I am trying to get rid of the tags using HTML::TreeBuilder.

Here is my script:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $filename = "test.htm";
# "or" instead of "||": "||" binds tighter than the comma, so die would never fire
open OUT, ">", "output.txt" or die "Can't open output.txt: $!";

my $root = HTML::TreeBuilder->new;
$root->ignore_text(0);
$root->ignore_ignorable_whitespace(0);
$root->no_space_compacting(1);
$root->parse_file($filename);

my @fonts = $root->look_down('_tag', 'font');

foreach my $font (@fonts) {
    $font->tag(undef);
    $font->attr('face',undef);
    $font->attr('style',undef);
}
print OUT $root->as_HTML("","",{});

$root->delete();

And here is what the output looks like:

<>MMCM4

The problem is that although the font tags and attributes themselves are
removed, the angle bracket pairs <> and </> are left behind. This causes
the starting <> to be rendered in the browser.

I've tried using $font->detach and $font->delete, but these methods also
delete the text content, which must be preserved.

It seems there must be something obvious I am missing.

Thanks
Dave

RE: Angle Brackets remain when tags removed using HTML::TreeBuilder

on 13.06.2006 21:00:20 by Forrest.Cahoon

> -----Original Message-----
> From: DMcGovern@sungardfutures.com
> [mailto:DMcGovern@sungardfutures.com]
> Sent: Tuesday, June 13, 2006 1:51 PM
> To: libwww@perl.org
> Subject: Angle Brackets remain when tags removed using
> HTML::TreeBuilder
>
> Hello -
> Hopefully, this is an easy one.
>
> I have some ugly HTML like this:
>
> style="font-family: Arial;"
> face=Arial>MMCM4
>
> I am trying to get rid of the tags using HTML::TreeBuilder.
>
> Here is my script:
>
> #!/usr/bin/perl
> use strict;
> use warnings;
> use HTML::TreeBuilder;
>
> my $filename = "test.htm";
> open OUT, ">", "output.txt" or die "Can't open output.txt: $!";
>
> my $root = HTML::TreeBuilder->new;
> $root->ignore_text(0);
> $root->ignore_ignorable_whitespace(0);
> $root->no_space_compacting(1);
> $root->parse_file($filename);
>
> my @fonts = $root->look_down('_tag', 'font');
>
> foreach my $font (@fonts) {
>     $font->tag(undef);
>     $font->attr('face',undef);
>     $font->attr('style',undef);
> }

Try:

foreach my $font (@fonts) { $font->replace_with_content->delete; }

That's untested, but I think it will do what you want.
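
For reference, a fuller sketch of the same idea, kept self-contained so it can be run without test.htm. The sample markup and the helper name strip_font_tags are made up for illustration and are not from the original script. replace_with_content moves an element's children up into its parent's content list and returns the now-empty element, which is then safe to delete, so the text survives while the tag disappears.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse an HTML string, unwrap every <font>
# element (keeping its children), and return the serialized result.
sub strip_font_tags {
    my ($html) = @_;

    my $root = HTML::TreeBuilder->new;
    $root->no_space_compacting(1);
    $root->parse_content($html);    # parse the string and finish the tree

    foreach my $font ($root->look_down('_tag', 'font')) {
        # Splice the element's content into its parent in place of the
        # element itself, then free the emptied element.
        $font->replace_with_content->delete;
    }

    # Same as_HTML arguments as in the original script.
    my $out = $root->as_HTML("", "", {});
    $root->delete;
    return $out;
}

# Made-up sample input standing in for the contents of test.htm.
my $sample = '<p><font style="font-family: Arial;" face="Arial">MMCM4</font></p>';
print strip_font_tags($sample), "\n";
```

Setting the tag to undef only blanks the element's name, so as_HTML still emits a start and end tag for it; unwrapping the element removes it from the tree altogether.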

Forrest Cahoon
not speaking for merrill corporation