utf8 and HTML Entities

on 19.09.2007 14:59:02 by Nick Gerber

Hi

I'm lost :-(

I have a string encoded in utf8 with part HTML Entities and part
characters in utf-8.

How do I translate the HTML Entities into proper utf-8?

Thanks

Re: utf8 and HTML Entities

on 19.09.2007 15:58:40 by benkasminbullock

On Wed, 19 Sep 2007 14:59:02 +0200, Nick Gerber wrote:


> I have a string encoded in utf8 with part HTML Entities and part
> characters in utf-8.
>
> How do I translate the HTML Entities into proper utf-8?

Since this must be a common problem, my first step would be to check CPAN
rather than write the code myself. I quickly found:

http://search.cpan.org/~gaas/HTML-Parser-3.56/lib/HTML/Entities.pm

Please note that I can't vouch for this software since I have not tried it.

As far as UTF-8 goes, you need to use the "Encode" module.
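
A minimal sketch of how the two modules might be combined, assuming the input
arrives as UTF-8 encoded bytes. The helper name entities_to_utf8 is made up
for this illustration and is not part of either module:

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::Entities qw(decode_entities);

# Hypothetical helper: takes UTF-8 encoded bytes that may contain HTML
# entities and returns UTF-8 encoded bytes with the entities expanded.
sub entities_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);   # bytes -> Perl character string
    $chars = decode_entities($chars);      # e.g. &eacute; -> U+00E9
    return encode('UTF-8', $chars);        # characters -> UTF-8 bytes
}

# "caf&eacute;" followed by a literal UTF-8 encoded u-umlaut (0xC3 0xBC)
my $mixed = "caf&eacute; \xC3\xBC";
print entities_to_utf8($mixed), "\n";
```

Decoding to characters first means decode_entities works on one consistent
character string rather than on a mix of raw bytes and wide characters.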

Re: utf8 and HTML Entities

on 20.09.2007 16:49:29 by Nick Gerber

I tried HTML/Entities.pm, but it didn't do the trick for me. That said, the
problem was probably me: I couldn't get it to do the conversion. I'll try again.

Thanks


Re: utf8 and HTML Entities

on 21.09.2007 03:31:44 by sln

On Wed, 19 Sep 2007 14:59:02 +0200, Nick Gerber wrote:

>Hi
>
>I'm lost :-(
>
>I have a string encoded in utf8 with part HTML Entities and part
>characters in utf-8.
>
>How do I translate the HTML Entities into proper utf-8?
>
>Thanks

This should be enough to get you going:



sub convertEntities
{
    my ($self, $str_ref, $opts) = @_;
    my $alt_str = '';
    my $res = 0;
    my ($entchr);

    # Usage info:
    # Option bitmask: 1=char reference, 2=general reference, 4=parameter reference
    # Default option is char and general references (&)
    # Ignore Parameter references (%) in Attvalue and Content
    # Process PE's in DTD and Entity decls

    $opts = 3 unless defined $opts;

    while ($$str_ref =~ /$self->{'RxEntConv'}/gc)
    {
        # Unicode character reference
        if (defined $4) {
            # decimal
            if (($opts & 1) && defined ($entchr = getEntityUchar($self, $4))) {
                $alt_str .= "$1$entchr";
                $res = 1;
            } else {
                $alt_str .= "$1$2#$4;";
            }
        } elsif (defined $5) {
            # hex
            if (($opts & 1) && length($5) < 9 && defined ($entchr = getEntityUchar($self, hex($5)))) {
                $alt_str .= "$1$entchr";
                $res = 1;
            } else {
                $alt_str .= "$1$2#$5;";
            }
        }
        else {
            # General reference
            if ($2 eq '&') {
                if (($opts & 2) && exists $self->{'general_ent_subst'}->{$3}) {
                    $alt_str .= $1;

                    # expand general references,
                    # bypass if seen in the recursion ring
                    # ----
                    if (defined $self->{'ring_ent_subst'}->{$3}) {
                        # $1 was already appended above, so emit only the reference itself
                        $alt_str .= "$2$3;";
                    } else {
                        # recurse expansion
                        # ----
                        my ($entname, $alt_entval) = ($3, undef);
                        my $entval = $self->{'general_ent_subst'}->{$entname};
                        $self->{'ring_ent_subst'}->{$entname} = 1;

                        if (defined ($alt_entval = convertEntities ($self, \$entval, 2))) {
                            $alt_str .= $$alt_entval;
                        } else {
                            $alt_str .= $self->{'general_ent_subst'}->{$entname};
                        }
                        $self->{'ring_ent_subst'}->{$entname} = undef;
                        $res = 1;
                    }
                } else {
                    $alt_str .= "$1$2$3;";
                }
            } else {
                # Parameter reference
                if (($opts & 4) && exists $self->{'parameter_ent_subst'}->{$3}) {
                    $alt_str .= "$1$self->{'parameter_ent_subst'}->{$3}";
                    $res = 1;
                } else {
                    $alt_str .= "$1$2$3;";
                }
            }
        }
    }
    if ($res) {
        # append whatever follows the last matched reference
        $alt_str .= substr $$str_ref, pos($$str_ref);
        return \$alt_str;
    }
    return undef;
}

sub getEntityUchar
{
    my ($self, $code) = @_;
    if (($code >= 0x01 && $code <= 0xD7FF) ||
        ($code >= 0xE000 && $code <= 0xFFFD) ||
        ($code >= 0x10000 && $code <= 0x10FFFF)) {
        return chr($code);
    }
    return undef;
}

sub addEntity
{
    my ($self, $peflag, $entname, $entval) = @_;

    # Non-normalized, internal entities only
    # (no external defs yet, ie:SYSTEM/PUBLIC/NDATA)
    return undef unless
        ($entval =~ s/^\s*'([^']*?)'\s*$/$1/s || $entval =~ s/^\s*"([^"]*?)"\s*$/$1/s);

    # Replacement text: convert parameter and character references only
    my ($alt_entval);
    if (defined ($alt_entval = convertEntities ($self, \$entval, 5))) {
        $entval = $$alt_entval;
    }
    my $enttype = 'general_ent_subst';
    $enttype = 'parameter_ent_subst' if ($peflag);

    # look up the name in the selected table (general or parameter entities)
    if (exists $self->{$enttype}->{$entname}) {
        # warn, pre-existing ent name
        return undef;
    }
    $self->{$enttype}->{$entname} = $entval;
    # quote the name so that '.' and similar characters match literally
    $self->{'Entities'} .= "|(?:\Q$entname\E)";
    # recompile regexp
    # As posted, this pattern captures only $1..$3; convertEntities also tests
    # $4 and $5 (decimal and hex character references), so the full pattern
    # presumably has two more groups that are not shown here.
    $self->{'RxEntConv'} = qr/(.*?)(&|%)($self->{'Entities'});/s;
    return \$entval;
}



@UC_Nstart = (
    "\\x{C0}-\\x{D6}",
    "\\x{D8}-\\x{F6}",
    "\\x{F8}-\\x{2FF}",
    "\\x{370}-\\x{37D}",
    "\\x{37F}-\\x{1FFF}",
    "\\x{200C}-\\x{200D}",
    "\\x{2070}-\\x{218F}",
    "\\x{2C00}-\\x{2FEF}",
    "\\x{3001}-\\x{D7FF}",
    "\\x{F900}-\\x{FDCF}",
    "\\x{FDF0}-\\x{FFFD}",
    "\\x{10000}-\\x{EFFFF}",
);
@UC_Nchar = (
    "\\x{B7}",
    "\\x{0300}-\\x{036F}",
    "\\x{203F}-\\x{2040}",
);
$Nstrt = "[A-Za-z_:".join ('',@UC_Nstart)."]";
$Nchar = "[-\\w:\\.".join ('',@UC_Nchar).join ('',@UC_Nstart)."]";
$Name = "(?:$Nstrt$Nchar*?)";

$RxENTITY = qr/^\s+(?:($Name)|(?:%\s+($Name)))\s+(.*?)$/s;

Re: utf8 and HTML Entities

on 21.09.2007 07:27:16 by Helmut Wollmersdorfer

Nick Gerber wrote:
> I tried HTML/Entities.pm, but it didn't do the trick for me. That said, the
> problem was probably me: I couldn't get it to do the conversion. I'll try again.

This is the approach I use; it has worked for millions of HTML (or XML) files:

use HTML::Entities;

my $ENCODING = 'utf8'; # or iso-8859-7, CP1250 etc.

# $DIR and $file are set elsewhere
open (HTML, "<:encoding($ENCODING)", "$DIR/$file")
    or die "Can't open $DIR/$file: $!";

my $data = do { local $/; <HTML> }; # read the whole file at once
close (HTML);

my $content = decode_entities($data);

binmode(STDOUT, ":utf8");

print "$content\n";

It is also safe (in most cases) to use

my $content = decode_entities(decode_entities($data));

which decodes something like

&amp;
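
To illustrate the double decoding above, here is a small sketch. The sample
string is an assumption chosen for the illustration and is not taken from the
original data:

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# An entity whose ampersand was itself escaped (doubly encoded input).
my $data = '&amp;eacute;';

# In non-void context decode_entities returns a decoded copy
# and leaves $data untouched.
my $once  = decode_entities($data);                   # still an entity
my $twice = decode_entities(decode_entities($data));  # the character itself

binmode(STDOUT, ':encoding(UTF-8)');
print "once:  $once\n";
print "twice: $twice\n";
```

The "in most cases" caveat matters when the text is meant to show an entity
literally: a page that displays the five characters "&lt;" to its readers
would have them turned into "<" by the second pass.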



| $ perl -version
| This is perl, v5.8.8 built for i486-linux-gnu-thread-multi

Helmut Wollmersdorfer

Re: utf8 and HTML Entities

on 21.09.2007 08:36:05 by paduille.4061.mumia.w+nospam

On 09/20/2007 08:31 PM, sln@netherlands.co wrote:
> On Wed, 19 Sep 2007 14:59:02 +0200, Nick Gerber wrote:
>
>> Hi
>>
>> I'm lost :-(
>>
>> I have a string encoded in utf8 with part HTML Entities and part
>> characters in utf-8.
>>
>> How do I translate the HTML Entities into proper utf-8?
>>
>> Thanks
>
> Should be enough here to get you going:
>
> [ long program snipped ]

No, that's too much.

Mr. Gerber didn't post any code or data, and so he didn't get many
responses because no one knew exactly what he was talking about.

As Mr. Bullock said, HTML::Entities should do it. Here is an example:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities;

binmode(STDOUT, ':utf8');
local $/;
my $data = <DATA>;

$data = decode_entities($data);

print $data, "\n";

__DATA__
&#33156; &#33157; &#33158;
&aacute; &eacute; &iacute; &oacute; &uacute;
&auml; &euml; &iuml; &ouml; &uuml;

Re: utf8 and HTML Entities

on 25.09.2007 12:14:14 by Nick Gerber

Thanks all.

Nick
