#1: encoding and PDF::API2

Posted on 2011-10-07 09:54:58 by oleber

Hi all

I'm trying to get the info from a PDF with code like:

#######################################################

....
use Data::Dumper;
use PDF::API2;
....
my $pdf = PDF::API2->open('/home/.../PDF.pdf');
print Dumper +{ $pdf->info() };

#######################################################

This code gets me something like:

#######################################################

$VAR1 = {
'Subject' => 'my subject',
'CreationDate' => 'D:20111006161347+02\'00\'',
'Producer' => 'LibreOffice 3.3',
'Creator' => 'Writer',
'Author' => 'Marcos Rebelo',
'Title' => 'my title',
'Keywords' => 'my keywords'
};

#######################################################

Unfortunately, someone has the code: < use encoding 'utf8'; >

and now I get:

#######################################################

$VAR1 = {
'Subject' => "\x{fffd}\x{fffd}my subject",
'CreationDate' => 'D:20111006161347+02\'00\'',
'Producer' => "\x{fffd}\x{fffd}LibreOffice 3.3",
'Creator' => "\x{fffd}\x{fffd}Writer",
'Author' => "\x{fffd}\x{fffd}Marcos Rebelo",
'Title' => "\x{fffd}\x{fffd}my title",
'Keywords' => "\x{fffd}\x{fffd}my keywords"
};

#######################################################

I can't remove the < use encoding 'utf8'; >, but I need to clean the hash.

How can I clean the hash?


Best Regards
Marcos Rebelo

--
Marcos Rebelo
http://www.oleber.com/
Webmaster of http://perl5notebook.oleber.com

--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
http://learn.perl.org/


#2: Re: encoding and PDF::API2

Posted on 2011-10-07 10:39:10 by Igor Dovgiy


Hi Marcos,

my %pdf_info = $pdf->info();
foreach (keys %pdf_info) {
$pdf_info{$_} =~ s/[^\x00-\xFF]//g;
}

Perhaps that'll do? )

-- iD


#3: Re: encoding and PDF::API2

Posted on 2011-10-07 17:19:40 by Brandon McCaig

On Fri, Oct 7, 2011 at 4:39 AM, Igor Dovgiy <ivd.privat@gmail.com> wrote:
>> $VAR1 = {
>> 'Subject' => "\x{fffd}\x{fffd}my subject",
>> 'CreationDate' => 'D:20111006161347+02\'00\'',
>> 'Producer' => "\x{fffd}\x{fffd}LibreOffice 3.3",
>> 'Creator' => "\x{fffd}\x{fffd}Writer",
>> 'Author' => "\x{fffd}\x{fffd}Marcos Rebelo",
>> 'Title' => "\x{fffd}\x{fffd}my title",
>> 'Keywords' => "\x{fffd}\x{fffd}my keywords"
>> };
*snip*
>> How can I clean the hash?
>>

I know next to nothing about Unicode programming (in any
language), but it seems to always be the same prefix. Printing
this out in Windows' cmd shell seems to yield the same prefix
that I see in UTF-8 files with a BOM (byte-order mark). Oddly,
your data seems to have two of them, which I can't explain, but I
digress. Could you not just remove those two characters with a
s///?

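# info() returns a list of key/value pairs, so wrap it in a hash reference
my $info = { $pdf->info() };

for my $key (keys %{$info})
{
    # skip nested structures; only plain strings are cleaned here
    next if ref $info->{$key};
    # strip the leading run of replacement characters
    $info->{$key} =~ s/^\x{fffd}+//;
}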

(Untested)

Note that I didn't bother traversing beyond the first level of
the data structure, but you may want to if the data can be more
complex than that... I don't know.
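
If the data does turn out to be nested, the traversal mentioned above could be sketched roughly like this. The helper name strip_replacement_chars and the sample hash are made up for illustration; nothing here is part of PDF::API2, and the sample only assumes the same kind of strings shown earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PDF::API2): removes every U+FFFD from the
# plain strings found in a nested structure of hashes and arrays, in place.
sub strip_replacement_chars {
    my ($node) = @_;
    my $type = ref $node;

    if ($type eq 'HASH') {
        # values() yields aliases, so changing $value changes the hash entry
        for my $value (values %$node) {
            if    (ref $value)     { strip_replacement_chars($value) }
            elsif (defined $value) { $value =~ tr/\x{FFFD}//d }
        }
    }
    elsif ($type eq 'ARRAY') {
        # foreach aliases the array elements as well
        for my $value (@$node) {
            if    (ref $value)     { strip_replacement_chars($value) }
            elsif (defined $value) { $value =~ tr/\x{FFFD}//d }
        }
    }
    return $node;
}

# Made-up sample data in the shape of the hash from the original post
my %info = (
    Title  => "\x{FFFD}\x{FFFD}my title",
    Nested => { Author => "\x{FFFD}\x{FFFD}Marcos Rebelo" },
    List   => [ "\x{FFFD}keyword" ],
);

strip_replacement_chars(\%info);
print "$info{Title}\n";    # prints: my title
```

Other reference types (code refs, objects and so on) are simply left alone in this sketch.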

I'm sure that this is a bad way to handle Unicode, but perhaps it
will be "good enough" for now.

Maybe look here for some possibly better advice:

http://ahinea.com/en/tech/perl-unicode-struggle.html


--
Brandon McCaig <bamccaig@gmail.com> <bamccaig@castopulence.org>
Castopulence Software <https://www.castopulence.org/>
Blog <http://www.bamccaig.com/>
perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }.
q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.};
tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say'


#4: Re: encoding and PDF::API2

Posted on 2011-10-07 18:41:49 by Brian Fraser


On Fri, Oct 7, 2011 at 4:54 AM, marcos rebelo <oleber@gmail.com> wrote:

> I can't remove the < use encoding 'utf8'; >, but I need to clean the hash.
>
>
Find a way. Seriously! use encoding ...; is broken. A drop-in replacement
(mostly) would be 'use utf8; use open qw( :encoding(UTF-8) );'

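If that's not feasible, well... I haven't tried this, but localizing the
${^ENCODING} variable might do:

my $pdf;    # declared outside the block so it is still in scope afterwards
{
    local ${^ENCODING};    # undefined for the duration of this block
    $pdf = PDF::API2->open('/home/.../PDF.pdf');
}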


> How can I clean the hash?
>

use charnames qw( :full );
use PDF::API2;
....

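my %info = $pdf->info();    # info() returns a list of key/value pairs
tr/\N{REPLACEMENT CHARACTER}//d for values %info;    # values() aliases, so %info is cleaned in place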


#5: Re: encoding and PDF::API2

Posted on 2011-10-07 18:55:10 by Brian Fraser


On Fri, Oct 7, 2011 at 12:19 PM, Brandon McCaig <bamccaig@gmail.com> wrote:

>
> I know next to nothing about Unicode programming (in any
> language), but it seems to always be the same prefix. Printing
> this out in Windows' cmd shell seems to yield the same prefix
> that I see in UTF-8 files with a BOM (byte-order mark). Oddly,
> your data seems to have two of them, which I can't explain, but I
> digress. Could you not just remove those two characters with a
> s///?
>
>
Well, it's a replacement character, not a BOM. UTF-8 files aren't supposed
to have BOMs (they can, but it's a no-op -- byte order only matters for
UTF-16 and UTF-32), and in any case BOMs are supposed to be the first couple
of bytes in a file, not in every line.
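
To make the distinction above concrete, here is a small sketch that asks Perl for the names of the two code points and shows what a UTF-8 BOM looks like as bytes. It only uses the core charnames and Encode modules; the variable names are mine.

```perl
use strict;
use warnings;
use charnames ();
use Encode qw(encode);

# U+FFFD is what shows up in the dumped hash; U+FEFF is the byte-order mark.
my $replacement_name = charnames::viacode(0xFFFD);
my $bom_name         = charnames::viacode(0xFEFF);

# A BOM encoded as UTF-8 is three bytes, shown here in hex.
my $bom_utf8_hex = uc unpack 'H*', encode('UTF-8', "\x{FEFF}");

print "U+FFFD: $replacement_name\n";
print "U+FEFF: $bom_name\n";
print "U+FEFF as UTF-8 bytes: $bom_utf8_hex\n";
```

So the \x{fffd} pairs in the dump are a different code point from a BOM, whatever produced them.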

I have no idea how those came to be though. But I'll gladly just blame use
encoding ...; until proven otherwise : )



>
> Maybe look here for some possibly better advice:
>
> http://ahinea.com/en/tech/perl-unicode-struggle.html
>
>
That's an alright introduction, but Unicode is so much more complex than
that.

I'm only beginning to grasp this stuff myself, but basically any search
result that contains "Unicode" and "Tom Christiansen" is a must read these
days. These two links should get you started:

http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default
and
http://98.245.80.27/tcpc/OSCON2011/index.html

And since he is the author of the new camel (coming out in December!), I'm
assuming that the Unicode chapter there should also be kept in mind.


#6: Re: encoding and PDF::API2

Posted on 2011-10-07 19:00:09 by Brian Fraser


On Fri, Oct 7, 2011 at 5:39 AM, Igor Dovgiy <ivd.privat@gmail.com> wrote:

> Hi Marcos,
>
> my %pdf_info = $pdf->info();
> foreach (keys %pdf_info) {
> $pdf_info{$_} =~ s/[^\x00-\xFF]//g;
> }
>
> Perhaps that'll do? )
>
Nope. That'll restrict the text to the latin-1 charset.
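
A quick sketch of the difference, using a made-up sample string (two replacement characters followed by a name containing U+0159, which is outside Latin-1). The variable names are only for illustration.

```perl
use strict;
use warnings;

# Made-up value: U+FFFD U+FFFD, then "Dvo", U+0159 (r with caron), U+00E1, "k"
my $original = "\x{FFFD}\x{FFFD}Dvo\x{159}\x{E1}k";

# The substitution quoted above: drops everything above U+00FF
(my $latin1_only = $original) =~ s/[^\x00-\xFF]//g;

# Removing only the replacement character
(my $targeted = $original) =~ tr/\x{FFFD}//d;

# Lengths in characters: original, Latin-1 filter, targeted removal
printf "%d %d %d\n", length $original, length $latin1_only, length $targeted;
# prints: 8 5 6
```

The Latin-1 filter loses the U+0159 along with the junk, while the tr only removes the two U+FFFD characters.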


#7: Re: encoding and PDF::API2

Posted on 2011-10-07 19:41:28 by John Delacour

At 09:54 +0200 7/10/11, marcos rebelo wrote:

>Unfortunately, someone has the code: < use encoding 'utf8'; >
>
>and now I get:
>
>#######################################################
>
>$VAR1 = {
> 'Subject' => "\x{fffd}\x{fffd}my subject",
> 'CreationDate' => 'D:20111006161347+02\'00\'',
> 'Producer' => "\x{fffd}\x{fffd}LibreOffice 3.3",
> 'Creator' => "\x{fffd}\x{fffd}Writer",
> 'Author' => "\x{fffd}\x{fffd}Marcos Rebelo",
> 'Title' => "\x{fffd}\x{fffd}my title",
> 'Keywords' => "\x{fffd}\x{fffd}my keywords"
> };
>
>#######################################################
>
>I can't remove the < use encoding 'utf8'; >, but I need to clean the hash.
>
>How can I clean the hash?

Without reiterating the demerits of encoding.pm, if the only Unicode
character you are getting is \x{fffd} (REPLACEMENT CHARACTER), then
you just need to get rid of it by looping through the hash -- or are
you getting other spurious characters?



#!/usr/local/bin/perl
use strict;
use encoding 'utf8';
use Data::Dumper;
my $VAR1 = {
'Subject' => "\x{fffd}\x{fffd}my subject",
'CreationDate' => 'D:20111006161347+02\'00\'',
'Producer' => "\x{fffd}\x{fffd}LibreOffice 3.3",
'Creator' => "\x{fffd}\x{fffd}Writer",
'Author' => "\x{fffd}\x{fffd}Marcos Rebelo",
'Title' => "\x{fffd}\x{fffd}my title",
'Keywords' => "\x{fffd}\x{fffd}my keywords"
};
my %pdf_hash = %$VAR1;
for (keys %pdf_hash){ $pdf_hash{$_} =~ s~\x{fffd}~~g }
print Dumper \%pdf_hash;
__END__

Result:
$VAR1 = {
'Subject' => 'my subject',
'CreationDate' => 'D:20111006161347+02\'00\'',
'Producer' => 'LibreOffice 3.3',
'Creator' => 'Writer',
'Author' => 'Marcos Rebelo',
'Title' => 'my title',
'Keywords' => 'my keywords'
};
