encoding and PDF::API2

on 07.10.2011 09:54:58 by oleber

Hi all

I'm trying to get the info from a PDF with a code like:

#######################################################

....
use Data::Dumper;
use PDF::API2;
....
my $pdf = PDF::API2->open('/home/.../PDF.pdf');
print Dumper +{ $pdf->info() };

#######################################################

This code gets me something like:

#######################################################

$VAR1 = {
'Subject' => 'my subject',
'CreationDate' => 'D:20111006161347+02\'00\'',
'Producer' => 'LibreOffice 3.3',
'Creator' => 'Writer',
'Author' => 'Marcos Rebelo',
'Title' => 'my title',
'Keywords' => 'my keywords'
};

#######################################################

Unfortunately someone has the code: < use encoding 'utf8'; >

and now I get:

#######################################################

$VAR1 = {
'Subject' => "\x{fffd}\x{fffd}my subject",
'CreationDate' => 'D:20111006161347+02\'00\'',
'Producer' => "\x{fffd}\x{fffd}LibreOffice 3.3",
'Creator' => "\x{fffd}\x{fffd}Writer",
'Author' => "\x{fffd}\x{fffd}Marcos Rebelo",
'Title' => "\x{fffd}\x{fffd}my title",
'Keywords' => "\x{fffd}\x{fffd}my keywords"
};

#######################################################

I can't remove the < use encoding 'utf8'; >, but I need to clean the hash.

How can I clean the hash?

Best Regards
Marcos Rebelo

--
Marcos Rebelo
http://www.oleber.com/
Webmaster of http://perl5notebook.oleber.com

--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
http://learn.perl.org/

Re: encoding and PDF::API2

on 07.10.2011 10:39:10 by Igor Dovgiy


Hi Marcos,

my %pdf_info = $pdf->info();
# iterate over the hash itself (%pdf_info); $pdf_info is an undeclared scalar
foreach (keys %pdf_info) {
    $pdf_info{$_} =~ s/[^\x00-\xFF]//g;
}

Perhaps that'll do? )

-- iD


Re: encoding and PDF::API2

on 07.10.2011 17:19:40 by Brandon McCaig

On Fri, Oct 7, 2011 at 4:39 AM, Igor Dovgiy wrote:
>> $VAR1 = {
>> 'Subject' => "\x{fffd}\x{fffd}my subject",
>> 'CreationDate' => 'D:20111006161347+02\'00\'',
>> 'Producer' => "\x{fffd}\x{fffd}LibreOffice 3.3",
>> 'Creator' => "\x{fffd}\x{fffd}Writer",
>> 'Author' => "\x{fffd}\x{fffd}Marcos Rebelo",
>> 'Title' => "\x{fffd}\x{fffd}my title",
>> 'Keywords' => "\x{fffd}\x{fffd}my keywords"
>> };
*snip*
>> How can I clean the hash?
>>

I know next to nothing about Unicode programming (in any
language), but it seems to always be the same prefix. Printing
this out in Windows' cmd shell seems to yield the same prefix
that I see in UTF-8 files with a BOM (byte-order mark). Oddly,
your data seems to have two of them, which I can't explain, but I
digress. Could you not just remove those two characters with a
s///?

# info() returns a list of key/value pairs, so wrap it in a hash ref
my $info = { $pdf->info() };

for my $key (keys %{$info})
{
    next if ref $info->{$key};
    $info->{$key} =~ s/^\x{fffd}+//;
}

(Untested)

Note that I didn't bother traversing beyond the first level of
the data structure, but you may want to if the data can be more
complex than that... I don't know.

I'm sure that this is a bad way to handle Unicode, but perhaps it
will be "good enough" for now.

Maybe look here for some possibly better advice:

http://ahinea.com/en/tech/perl-unicode-struggle.html

-- 
Brandon McCaig
Castopulence Software
Blog
perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }.
q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.};
tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say'


Re: encoding and PDF::API2

on 07.10.2011 18:41:49 by Brian Fraser


On Fri, Oct 7, 2011 at 4:54 AM, marcos rebelo wrote:

> I can't remove the < use encoding 'utf8'; >, but I need to clean the hash.
>
>
Find a way. Seriously! use encoding ...; is broken. A drop-in replacement
(mostly) would be 'use utf8; use open qw( :encoding(UTF-8) );'

If that's not feasible, well.. I haven't tried this, but localizing the
${^ENCODING} variable might make do:

my $pdf;    # declared outside the block so it is still usable afterwards
{
    local ${^ENCODING};
    $pdf = PDF::API2->open('/home/.../PDF.pdf');
}

> How can I clean the hash?
>

use charnames qw( :full );
use PDF::API2;
....

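# info() returns a list, so copy it into a hash first; the values are
# then modified in place
my %info = $pdf->info();
tr/\N{REPLACEMENT CHARACTER}//d for values %info;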


Re: encoding and PDF::API2

on 07.10.2011 18:55:10 by Brian Fraser


On Fri, Oct 7, 2011 at 12:19 PM, Brandon McCaig wrote:

>
> I know next to nothing about Unicode programming (in any
> language), but it seems to always be the same prefix. Printing
> this out in Windows' cmd shell seems to yield the same prefix
> that I see in UTF-8 files with a BOM (byte-order mark). Oddly,
> your data seems to have two of them, which I can't explain, but I
> digress. Could you not just remove those two characters with a
> s///?
>
>
Well, it's a replacement character, not a BOM. UTF-8 files aren't supposed
to have BOMs (they can, but it's a no-op -- byte order only matters for
UTF-16 and UTF-32), and in any case BOMs are supposed to be the first couple
of bytes in a file, not in every line.
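
To see the two characters side by side, something like the following could be run. It is only a small sketch using the core charnames module; the variable names are mine.

```perl
use strict;
use warnings;
use charnames ':full';

# U+FFFD is what shows up in the dumped hash; U+FEFF is the code point
# that serves as a byte-order mark.
my $seen_in_hash = charnames::viacode(0xFFFD);
my $bom          = charnames::viacode(0xFEFF);

print "U+FFFD: $seen_in_hash\n";
print "U+FEFF: $bom\n";
```

The two names differ, which is a quick way to confirm that the prefix in the hash is not a BOM.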

I have no idea how those came to be though. But I'll gladly just blame use
encoding ...; until proven otherwise : )

>
> Maybe look here for some possibly better advice:
>
> http://ahinea.com/en/tech/perl-unicode-struggle.html
>
>
That's an alright introduction, but Unicode is so much more complex than
that.

I'm only beginning to grasp this stuff myself, but basically any search
result that contains "Unicode" and "Tom Christiansen" is a must read these
days. These two links should get you started:

http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default
and
http://98.245.80.27/tcpc/OSCON2011/index.html

And since he is the author of the new camel (coming out in December!), I'm
assuming that the Unicode chapter there should also be kept in mind.


Re: encoding and PDF::API2

on 07.10.2011 19:00:09 by Brian Fraser


On Fri, Oct 7, 2011 at 5:39 AM, Igor Dovgiy wrote:

> Hi Marcos,
>
> my %pdf_info = $pdf->info();
> foreach (keys %pdf_info) {
> $pdf_info{$_} =~ s/[^\x00-\xFF]//g;
> }
>
> Perhaps that'll do? )
>
Nope. That'll restrict the text to the latin-1 charset.
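
A quick illustration of the difference, as a sketch only: the title below is made up and contains one legitimate character outside Latin-1 (U+0151) after the two replacement characters.

```perl
use strict;
use warnings;

# Made-up value: two U+FFFD, then text containing U+0151, which is a
# perfectly valid character that just happens to be outside Latin-1.
my $title = "\x{fffd}\x{fffd}Erd\x{151}s number";

# The character-class approach drops everything above U+00FF ...
(my $latin1_only = $title) =~ s/[^\x00-\xFF]//g;

# ... whereas deleting only U+FFFD leaves the rest of the text alone.
(my $targeted = $title) =~ tr/\x{fffd}//d;

print "latin-1 filter keeps ", length($latin1_only), " characters\n";
print "targeted removal keeps ", length($targeted), " characters\n";
```

The first result has silently lost the U+0151, so the targeted removal is the safer of the two whenever the metadata may contain non-Latin-1 text.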


Re: encoding and PDF::API2

on 07.10.2011 19:41:28 by John Delacour

At 09:54 +0200 7/10/11, marcos rebelo wrote:

>Unfortunately someone has the code: < use encoding 'utf8'; >
>
>and now I get:
>
>#######################################################
>
>$VAR1 = {
> 'Subject' => "\x{fffd}\x{fffd}my subject",
> 'CreationDate' => 'D:20111006161347+02\'00\'',
> 'Producer' => "\x{fffd}\x{fffd}LibreOffice 3.3",
> 'Creator' => "\x{fffd}\x{fffd}Writer",
> 'Author' => "\x{fffd}\x{fffd}Marcos Rebelo",
> 'Title' => "\x{fffd}\x{fffd}my title",
> 'Keywords' => "\x{fffd}\x{fffd}my keywords"
> };
>
>#######################################################
>
>I can't remove the < use encoding 'utf8'; >, but I need to clean the hash.
>
>How can I clean the hash?

Without reiterating the demerits of encoding.pm: if the only Unicode
character you are getting is \x{fffd} (REPLACEMENT CHARACTER), then
you just need to get rid of it by looping through the hash. Or are
you getting other spurious characters?

#!/usr/local/bin/perl
use strict;
use encoding 'utf8';
use Data::Dumper;
my $VAR1 = {
'Subject' => "\x{fffd}\x{fffd}my subject",
'CreationDate' => 'D:20111006161347+02\'00\'',
'Producer' => "\x{fffd}\x{fffd}LibreOffice 3.3",
'Creator' => "\x{fffd}\x{fffd}Writer",
'Author' => "\x{fffd}\x{fffd}Marcos Rebelo",
'Title' => "\x{fffd}\x{fffd}my title",
'Keywords' => "\x{fffd}\x{fffd}my keywords"
};
my %pdf_hash = %$VAR1;
for (keys %pdf_hash){ $pdf_hash{$_} =~ s~\x{fffd}~~g }
print Dumper \%pdf_hash;
__END__

Result:
$VAR1 = {
'Subject' => 'my subject',
'CreationDate' => 'D:20111006161347+02\'00\'',
'Producer' => 'LibreOffice 3.3',
'Creator' => 'Writer',
'Author' => 'Marcos Rebelo',
'Title' => 'my title',
'Keywords' => 'my keywords'
};
