utf8 and HTML Entities
on 19.09.2007 14:59:02 by Nick Gerber
Hi
I'm lost :-(
I have a string encoded in utf8 with part HTML Entities and part
characters in utf-8.
How do I translate the HTML Entities into proper utf-8?
Thanks
On Wed, 19 Sep 2007 14:59:02 +0200, Nick Gerber wrote:
> I have a string encoded in utf8 with part HTML Entities and part
> characters in utf-8.
>
> How do I translate the HTML Entities into proper utf-8?
Since this must be a commonly encountered problem, my first guess would be
to try CPAN to save myself the bother of writing it myself. I rapidly found:
http://search.cpan.org/~gaas/HTML-Parser-3.56/lib/HTML/Entities.pm
Please note that I can't vouch for this software since I have not tried it.
As far as utf8 goes you need to use the "Encode" module.
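A minimal sketch of how the two modules mentioned above might be combined, assuming the input arrives as raw UTF-8 bytes that also contain HTML entities. The helper name entities_to_utf8 and the sample string are made up for illustration; decode, encode and decode_entities are the functions exported by Encode and HTML::Entities.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::Entities qw(decode_entities);

# Hypothetical helper: takes UTF-8 encoded bytes that may contain HTML
# entities and returns UTF-8 encoded bytes with the entities expanded.
sub entities_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);   # bytes -> Perl character string
    $chars = decode_entities($chars);      # &eacute; -> e-acute, &#228; -> a-umlaut
    return encode('UTF-8', $chars);        # characters -> UTF-8 bytes
}

# "\xC3\xBC" is u-umlaut as UTF-8 bytes; the rest of the sample uses entities.
my $in  = "Gr\xC3\xBC\xC3\x9Fe, caf&eacute; &#228;";
my $out = entities_to_utf8($in);
print $out, "\n";
```

The order matters in this sketch: if decode_entities were applied to the undecoded bytes, the expanded entities would become single characters sitting next to multi-byte UTF-8 sequences, and a later encoding step would mangle one or the other.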
I tried HTML/Entities.pm, but it didn't do the trick for me. The fault is
probably mine, though: I could not get it to do the conversion. I'll try again.
Thanks
Ben Bullock wrote:
> On Wed, 19 Sep 2007 14:59:02 +0200, Nick Gerber wrote:
>
>
>> I have a string encoded in utf8 with part HTML Entities and part
>> characters in utf-8.
>>
>> How do I translate the HTML Entities into proper utf-8?
>
> Since this must be a commonly encountered problem, my first guess would be
> to try CPAN to save myself the bother of writing it myself. I rapidly found:
>
> http://search.cpan.org/~gaas/HTML-Parser-3.56/lib/HTML/Entities.pm
>
> Please note that I can't vouch for this software since I have not tried it.
>
> As far as utf8 goes you need to use the "Encode" module.
On Wed, 19 Sep 2007 14:59:02 +0200, Nick Gerber wrote:
>Hi
>
>I'm lost :-(
>
>I have a string encoded in utf8 with part HTML Entities and part
>characters in utf-8.
>
>How do I translate the HTML Entities into proper utf-8?
>
>Thanks
Should be enough here to get you going:
sub convertEntities
{
    my ($self, $str_ref, $opts) = @_;
    my $alt_str = '';
    my $res = 0;
    my ($entchr);

    # Usage info:
    # Option bitmask: 1=char reference, 2=general reference, 4=parameter reference
    # Default option is char and general references (&)
    # Ignore Parameter references (%) in Attvalue and Content
    # Process PE's in DTD and Entity decls
    $opts = 3 unless defined $opts;

    while ($$str_ref =~ /$self->{'RxEntConv'}/gc)
    {
        # Unicode character reference
        if (defined $4) {
            # decimal
            if (($opts & 1) && defined ($entchr = getEntityUchar($self, $4))) {
                $alt_str .= "$1$entchr";
                $res = 1;
            } else {
                $alt_str .= "$1$2#$4;";
            }
        } elsif (defined $5) {
            # hex
            if (($opts & 1) && length($5) < 9 && defined ($entchr = getEntityUchar($self, hex($5)))) {
                $alt_str .= "$1$entchr";
                $res = 1;
            } else {
                $alt_str .= "$1$2#$5;";
            }
        }
        else {
            # General reference
            if ($2 eq '&') {
                if (($opts & 2) && exists $self->{'general_ent_subst'}->{$3}) {
                    $alt_str .= $1;
                    # expand general references,
                    # bypass if seen in the recursion ring
                    # ----
                    if (defined $self->{'ring_ent_subst'}->{$3}) {
                        # $1 was already appended above, so emit only the reference
                        $alt_str .= "$2$3;";
                    } else {
                        # recurse expansion
                        # ----
                        my ($entname, $alt_entval) = ($3, undef);
                        my $entval = $self->{'general_ent_subst'}->{$entname};
                        $self->{'ring_ent_subst'}->{$entname} = 1;
                        if (defined ($alt_entval = convertEntities ($self, \$entval, 2))) {
                            $alt_str .= $$alt_entval;
                        } else {
                            $alt_str .= $self->{'general_ent_subst'}->{$entname};
                        }
                        $self->{'ring_ent_subst'}->{$entname} = undef;
                        $res = 1;
                    }
                } else {
                    $alt_str .= "$1$2$3;";
                }
            } else {
                # Parameter reference
                if (($opts & 4) && exists $self->{'parameter_ent_subst'}->{$3}) {
                    $alt_str .= "$1$self->{'parameter_ent_subst'}->{$3}";
                    $res = 1;
                } else {
                    $alt_str .= "$1$2$3;";
                }
            }
        }
    }
    if ($res) {
        # append whatever follows the last match
        $alt_str .= substr $$str_ref, pos($$str_ref);
        return \$alt_str;
    }
    return undef;
}
sub getEntityUchar
{
    my ($self, $code) = @_;
    if (($code >= 0x01 && $code <= 0xD7FF) ||
        ($code >= 0xE000 && $code <= 0xFFFD) ||
        ($code >= 0x10000 && $code <= 0x10FFFF)) {
        return chr($code);
    }
    return undef;
}
sub addEntity
{
    my ($self, $peflag, $entname, $entval) = @_;

    # Non-normalized, internal entities only
    # (no external defs yet, ie:SYSTEM/PUBLIC/NDATA)
    return undef unless
        ($entval =~ s/^\s*'([^']*?)'\s*$/$1/s || $entval =~ s/^\s*"([^"]*?)"\s*$/$1/s);

    # Replacement text: convert parameter and character references only
    my ($alt_entval);
    if (defined ($alt_entval = convertEntities ($self, \$entval, 5))) {
        $entval = $$alt_entval;
    }
    my $enttype = 'general_ent_subst';
    $enttype = 'parameter_ent_subst' if ($peflag);
    # look up under the value of $enttype (the quoted '$enttype' was a literal key)
    if (exists $self->{$enttype}->{$entname}) {
        # warn, pre-existing ent name
        return undef;
    }
    $self->{$enttype}->{$entname} = $entval;
    $self->{'Entities'} .= "|(?:$entname)";
    # recompile regexp
    $self->{'RxEntConv'} = qr/(.*?)(&|%)($self->{'Entities'});/s;
    return \$entval;
}
@UC_Nstart = (
    "\\x{C0}-\\x{D6}",
    "\\x{D8}-\\x{F6}",
    "\\x{F8}-\\x{2FF}",
    "\\x{370}-\\x{37D}",
    "\\x{37F}-\\x{1FFF}",
    "\\x{200C}-\\x{200D}",
    "\\x{2070}-\\x{218F}",
    "\\x{2C00}-\\x{2FEF}",
    "\\x{3001}-\\x{D7FF}",
    "\\x{F900}-\\x{FDCF}",
    "\\x{FDF0}-\\x{FFFD}",
    "\\x{10000}-\\x{EFFFF}",
);
@UC_Nchar = (
    "\\x{B7}",
    "\\x{0300}-\\x{036F}",
    "\\x{203F}-\\x{2040}",
);
$Nstrt = "[A-Za-z_:".join ('',@UC_Nstart)."]";
$Nchar = "[-\\w:\\.".join ('',@UC_Nchar).join ('',@UC_Nstart)."]";
$Name = "(?:$Nstrt$Nchar*?)";
$RxENTITY = qr/^\s+(?:($Name)|(?:%\s+($Name)))\s+(.*?)$/s;
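The code above depends on object fields that are not shown, so it cannot be run as posted. As a stripped-down sketch of just its character-reference branch, the following expands decimal and hexadecimal references using the same code point ranges as getEntityUchar. The names code_point_to_char and expand_char_refs are invented for this example, and named entities such as &eacute; are deliberately left alone.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the character for a code point if it lies in one of the ranges
# accepted by getEntityUchar above, otherwise undef.
sub code_point_to_char {
    my ($code) = @_;
    return chr($code)
        if ($code >= 0x01    && $code <= 0xD7FF)
        || ($code >= 0xE000  && $code <= 0xFFFD)
        || ($code >= 0x10000 && $code <= 0x10FFFF);
    return undef;
}

# Hypothetical helper: expand &#NNN; and &#xHHH; references in a string.
# References outside the accepted ranges are left untouched.
sub expand_char_refs {
    my ($str) = @_;
    # $1 = whole reference, $2 = decimal digits, $3 = hex digits
    $str =~ s{(&#(?:([0-9]{1,7})|[xX]([0-9A-Fa-f]{1,6}));)}{
        my $whole = $1;
        my $code  = defined $2 ? $2 : hex($3);
        my $chr   = code_point_to_char($code);
        defined $chr ? $chr : $whole;
    }ge;
    return $str;
}

binmode(STDOUT, ':utf8');
print expand_char_refs('&#228; &#xE9; &#xD800;'), "\n";
```

In the sample call the surrogate reference &#xD800; falls outside the accepted ranges and is passed through unchanged, while the first two references are replaced by single characters.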
Nick Gerber wrote:
> I tried HTML/Entities.pm, but it didn't do the trick for me. But, it was
> me that could not make it to do the conversion for me. I'll try again.
Here is my approach, which has worked for millions of HTML (or XML) files:
use HTML::Entities;
my $ENCODING = 'utf8'; # or iso-8859-7, CP1250 etc.
# $DIR and $file are assumed to be set elsewhere
open (HTML, "<:encoding($ENCODING)", "$DIR/$file")
    or die "Can't open $DIR/$file: $!";
my $data = do { local $/; <HTML> }; # slurp the whole file
close (HTML);
my $content = decode_entities($data);
binmode(STDOUT, ":utf8");
print "$content\n";
It is also safe (in most cases) to use
my $content = decode_entities(decode_entities($data));
which decodes something like
&amp;amp;
| $ perl -version
| This is perl, v5.8.8 built for i486-linux-gnu-thread-multi
Helmut Wollmersdorfer
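To illustrate the double decoding suggested above, here is a small sketch. The helper name decode_twice and the sample string are made up for this example; decode_entities is the function from HTML::Entities.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: run decode_entities twice, so that input which was
# entity-encoded two times over ends up as plain characters.
sub decode_twice {
    my ($text) = @_;
    my $once = decode_entities($text);   # 'caf&amp;eacute;' -> 'caf&eacute;'
    return decode_entities($once);       # 'caf&eacute;'     -> 'caf' . e-acute
}

binmode(STDOUT, ':utf8');
print decode_twice('caf&amp;eacute;'), "\n";
```

The "in most cases" caveat applies when a document deliberately contains something like &amp;lt; to display the literal text &lt;. A second pass would turn that into a bare < character.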
On 09/20/2007 08:31 PM, sln@netherlands.co wrote:
> On Wed, 19 Sep 2007 14:59:02 +0200, Nick Gerber wrote:
>
>> Hi
>>
>> I'm lost :-(
>>
>> I have a string encoded in utf8 with part HTML Entities and part
>> characters in utf-8.
>>
>> How do I translate the HTML Entities into proper utf-8?
>>
>> Thanks
>
> Should be enough here to get you going:
>
> [ long program snipped ]
No, that's too much.
Mr. Gerber didn't post any code or data, and so he didn't get many
responses because no one knew exactly what he was talking about.
As Mr. Bullock said, HTML::Entities should do it. Here is an example:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities;
binmode(STDOUT, ':utf8');
local $/;
my $data = <DATA>;
$data = decode_entities($data);
print $data, "\n";
__DATA__
膄 膅 膆
&aacute; &eacute; &iacute; &oacute; &uacute;
&auml; &euml; &iuml; &ouml; &uuml;
Thanks all.
Nick
Mumia W. wrote:
> On 09/20/2007 08:31 PM, sln@netherlands.co wrote:
>> On Wed, 19 Sep 2007 14:59:02 +0200, Nick Gerber wrote:
>>
>>> Hi
>>>
>>> I'm lost :-(
>>>
>>> I have a string encoded in utf8 with part HTML Entities and part
>>> characters in utf-8.
>>>
>>> How do I translate the HTML Entities into proper utf-8?
>>>
>>> Thanks
>>
>> Should be enough here to get you going:
>>
>> [ long program snipped ]
>
> No, that's too much.
>
> Mr. Gerber didn't post any code or data, and so he didn't get many
> responses because no one knew exactly what he was talking about.
>
> As Mr. Bullock said, HTML::Entities should do it. Here is an example:
>
> #!/usr/bin/perl
> use strict;
> use warnings;
> use HTML::Entities;
>
> binmode(STDOUT, ':utf8');
> local $/;
> my $data = <DATA>;
>
> $data = decode_entities($data);
>
> print $data, "\n";
>
> __DATA__
> 膄 膅 膆
> &aacute; &eacute; &iacute; &oacute; &uacute;
> &auml; &euml; &iuml; &ouml; &uuml;
>