POSTing content with accented characters

on 07.04.2005 01:46:24 by johnblumel

I'm new to using libwww (and to this list) and only moderately
experienced with Perl. I'm currently working on a set of scripts to
bulk load some data into a MediaWiki wiki that I'm working with several
other people to set up. For the curious, the wiki focus is the writings
of Patrick O'Brian. We have a number of resources that people have
developed over the years, including a glossary of all non-English words
used in O'Brian's books. The glossary includes terms from a number of
languages, some of which use accented characters, esp. French & Irish.

I've written a bot, using LWP, to upload articles extracted from the
glossary and it works fine for those that don't contain accented
characters. Unfortunately, articles containing accented characters have
those characters corrupted when they are uploaded. I've been able to
deal with these characters when they end up in the URL as the page
title in the wiki (by converting them to '%xx', although URI::Escape
doesn't seem to work for this) but I can't seem to figure out how to
get the article content with these characters up to the wiki without
corruption.
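
As an illustration of the '%xx' conversion described above, here is a minimal sketch that percent-encodes the UTF-8 bytes of a title using only the core Encode module. The helper name escape_title and the sample title are hypothetical and not taken from the original script. It assumes the title is a decoded character string and that the wiki expects UTF-8 in its URLs. URI::Escape also provides uri_escape_utf8 for the same purpose.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: percent-encode a character string as UTF-8.
sub escape_title {
    my ($title) = @_;
    my $bytes = encode('UTF-8', $title);    # characters -> UTF-8 bytes
    # Escape every byte outside the URI "unreserved" set.
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $bytes;
}

my $title = "caf\x{E9}";                    # "cafe" with e-acute, as characters
print escape_title($title), "\n";           # prints caf%C3%A9
```

If the title is already a string of raw UTF-8 bytes, the encode() call would have to be dropped, otherwise the bytes get encoded a second time.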

OS = Mac OS X 10.3, perl version = 5.8.1, LWP = latest from CPAN

Here's my function to POST the articles:

sub SubmitArticle
{
    # get params
    my ($refArticle) = @_;

    # retrieve an Edit page for the new article and get the edit token
    my $url = $gWikiURL . $refArticle->{title} . $gActionEdit;
    my $response = $gBot->request(GET $url);
    my ($editToken) =
        ($response->content =~ m/.*value="(.*?)".*name="wpEditToken"/s);

    # create & send the submission request
    $url = $gWikiURL . $refArticle->{title} . $gActionSubmit;
    $response = $gBot->request(POST $url,
        Content_Type => 'form-data',
        Content      => [wpSave      => "Save page",
                         wpSection   => "",
                         wpEdittime  => "",
                         wpEditToken => $editToken,
                         wpSummary   => $gSubmissionComment,
                         wpTextbox1  => $refArticle->{wikitext}]);

    # return the outcome based on the response status
    return $response->is_error ? $FALSE : $TRUE;
}

The text causing the problems is in $refArticle->{wikitext} and I get
the following message from perl when running the script:

"Parsing of undecoded UTF-8 will give garbage when decoding
entities at /Library/Perl/5.8.1/LWP/Protocol.pm line 114."

But, of course, I don't know how to correct this problem.

Any help would be greatly appreciated. I'd also like to figure out why
URI::Escape isn't escaping these same characters in the URLs. They come
from

$refArticle->{title}

which also appears in the code above.


John Blumel

Re: POSTing content with accented characters

on 09.04.2005 01:35:50 by rho

On Wed, Apr 06, 2005 at 07:46:24PM -0400, John Blumel wrote:
> I've written a bot, using LWP, to upload articles extracted from the
> glossary and it works fine for those that don't contain accented
> characters. Unfortunately, articles containing accented characters have
> those characters corrupted when they are uploaded.

John,

Without looking at your code, this could be a number of things: you may
never tell the server which encoding you are using when uploading, the
server may mangle the content when storing it, or the server may mangle
it when it serves the content back.

I found the intro at the Perl XML FAQ

http://perl-xml.sourceforge.net/faq/#encodings

quite helpful to understand the relevant issues.
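
One way to rule out the first of these causes is to hand LWP explicit UTF-8 bytes and look at the request body before anything is sent. The following is only a sketch: the URL, the summary and the article text are made-up values, and it assumes the wiki expects UTF-8. The multipart body built here carries no charset label of its own, so client and server simply have to agree on the encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Request::Common qw(POST);

# Hypothetical stand-ins for the wiki URL and the article text.
my $url      = 'http://wiki.example.org/index.php?title=Test&action=submit';
my $wikitext = "caf\x{E9}";    # decoded character string

# Build the request without sending it, passing UTF-8 bytes explicitly.
my $request = POST($url,
    Content_Type => 'form-data',
    Content      => [
        wpSummary  => 'bulk load',
        wpTextbox1 => encode('UTF-8', $wikitext),
    ]);

# Inspect what would go over the wire.
print $request->as_string;
```

If the accented characters already look wrong in this dump, the problem is on the client side; if they look right, the server's storage or output is the next place to check.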

\rho

Re: POSTing content with accented characters

on 09.04.2005 01:54:05 by johnblumel

On Apr 8, 2005, at 7:35pm, Robert Barta wrote:

> On Wed, Apr 06, 2005 at 07:46:24PM -0400, John Blumel wrote:
>> I've written a bot, using LWP, to upload articles extracted from the
>> glossary and it works fine for those that don't contain accented
>> characters. Unfortunately, articles containing accented characters
>> have
>> those characters corrupted when they are uploaded.
>
> Without looking at your code, this could be a number of things: you may
> never tell the server which encoding you are using when uploading, the
> server may mangle the content when storing it, or the server may mangle
> it when it serves the content back.

Thanks for your response. I finally solved the problem, although I
still don't understand why it works this way.

I stumbled across a fix while I was in the midst of trying out various
encoding options. I was trying the file in UTF-8 "one last time" and,
after removing some encoding statements from my bot's source file,
forgot to save the last changes before running it. (I had saved some
earlier changes.) As it turned out, it worked, although I don't really
understand why -- it could be a weird MediaWiki or Mac OS X quirk -- but
here's what did work.

The input files have to be saved in UTF-8 -- not a problem since
TextEdit (I'm on Mac OS X 10.3) can save in any of the many encodings
supported by the system. Then the files must be read in with just a
normal open() with no special encoding parameter. Then the strange
part. Once read in, I must encode the file contents as 'latin1' before
submitting the article (and obviously, I have to do this to the title
*before* escaping it). If I don't do the latin1 encoding, it doesn't
work, which I don't understand, since I'm submitting to a UTF-8 server
application, and which might mean I've got something else not quite
right.
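
A possible reading of why the latin1 step helps, offered as an assumption and not as a diagnosis of this script: a plain open() returns the file's UTF-8 bytes, and Perl treats each byte as one character in the range 0-255. Encoding such a string as latin1 maps every character back to the same single byte, so the UTF-8 byte sequence goes out unchanged. Encoding the same string as UTF-8 without decoding it first would encode it a second time. The sketch below uses only the core Encode module, and the sample word is made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "cafe" with e-acute as the UTF-8 bytes a plain open() would return.
my $raw = "caf\xC3\xA9";

# Perl sees five one-byte characters here, not four letters.
print "raw length: ", length($raw), "\n";              # 5

# latin1 maps characters 0..255 to the same bytes: the data is unchanged.
my $via_latin1 = encode('iso-8859-1', $raw);
print "latin1 is a no-op\n" if $via_latin1 eq $raw;

# Encoding undecoded bytes as UTF-8 double-encodes them.
my $double = encode('UTF-8', $raw);
print "double-encoded length: ", length($double), "\n"; # 7

# The explicit route: decode on input, encode on output.
my $text  = decode('UTF-8', $raw);                      # 4 characters
my $bytes = encode('UTF-8', $text);                     # same bytes as $raw
print "round trip ok\n" if $bytes eq $raw;
```

Decoding once on input and encoding once on output keeps the intent visible, whereas the latin1 call only works as long as the string really holds undecoded bytes.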

Someone else suggested that they had eliminated the message

"Parsing of undecoded UTF-8 will give garbage when decoding
entities at /Library/Perl/5.8.1/LWP/Protocol.pm line 114."

by upgrading to a newer version of Perl (I'm at 5.8.1). I'm not too
worried about this at the moment (it's "unrelated" to the submission
since it occurs when retrieving an edit page to get a "token" and the
program keeps going), although I may look into a Perl upgrade for Mac
OS X at my earliest opportunity.
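
The warning text itself comes from HTML::Parser, which LWP uses to read the <head> section of HTML responses. Assuming that is where it is triggered here, one option is to switch that step off on the user agent, as in the sketch below. The variable name $bot is hypothetical and stands in for the script's own user agent object.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical user agent standing in for the bot's own object.
my $bot = LWP::UserAgent->new;

# Do not run the HTML head parser on fetched pages.
$bot->parse_head(0);

# When the page body is needed as text, it can be decoded explicitly, e.g.
#   my $html = $response->decoded_content;
# before matching the edit token against it.
```

With head parsing disabled, headers that would have been derived from the page's meta tags are no longer added to the response, which does not matter for extracting the token from the body.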


John Blumel

Re: POSTing content with accented characters

on 10.04.2005 06:20:47 by jalotta

On Apr 8, 2005, at 6:54 PM, John Blumel wrote:

> The input files have to be saved in UTF-8 -- not a problem since
> TextEdit (I'm on Mac OS X 10.3) can save in any of the many encodings
> supported by the system. Then the files must be read in with just a
> normal open() with no special encoding parameter. Then the strange
> part. Once read in, I must encode the file contents as 'latin1' before
> submitting the article (and obviously, I have to do this to the title
> *before* escaping it). If I don't do the latin1 encoding, it doesn't
> work, which I don't understand, since I'm submitting to a UTF-8 server
> application, and which might mean I've got something else not quite
> right.

Wow. This sure sounds like alchemy. Are you sure this is programming?

peajoe.

Re: POSTing content with accented characters

on 12.04.2005 19:18:37 by johnblumel

On Apr 10, 2005, at 12:20am, Joseph Alotta wrote:

>> The input files have to be saved in UTF-8 -- not a problem since
>> TextEdit (I'm on Mac OS X 10.3) can save in any of the many encodings
>> supported by the system. Then the files must be read in with just a
>> normal open() with no special encoding parameter. Then the strange
>> part. Once read in, I must encode the file contents as 'latin1'
>> before submitting the article (and obviously, I have to do this to
>> the title *before* escaping it). If I don't do the latin1 encoding,
>> it doesn't work, which I don't understand, since I'm submitting to a
>> UTF-8 server application, and which might mean I've got something
>> else not quite right.
>
> Wow. This sure sounds like alchemy. Are you sure this is programming?

I think it may have something to do with the BOM not being correctly
set by TextEdit, although I haven't had the opportunity to
conclusively determine this. (I've had some issues in other scripts
related to whether I need to chop() once or twice to remove newline
characters.)
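
Both suspicions can be tested on the raw file content before any decoding. The sketch below strips a UTF-8 byte order mark if one is present and normalises line endings. Needing to chop() twice would be consistent with two-character CRLF line endings, but that is an assumption. The helper name normalise_raw and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: clean up raw file bytes before decoding them.
sub normalise_raw {
    my ($raw) = @_;
    $raw =~ s/\A\xEF\xBB\xBF//;    # drop a UTF-8 byte order mark, if present
    $raw =~ s/\r\n?/\n/g;          # CRLF or lone CR -> LF
    return $raw;
}

# Sample input with a BOM, one CRLF ending and one lone CR ending.
my $sample = "\xEF\xBB\xBFfirst line\r\nsecond line\r";
my @lines  = split /\n/, normalise_raw($sample);
print scalar(@lines), " lines\n";    # 2 lines
```

After this step a single chomp() per line is enough, whatever editor produced the file.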

In the interim, I've stopped sacrificing chickens before each run.


John Blumel