utf8 urls

utf8 urls

on 19.03.2008 13:06:59 by Eli Shemer

Hey there

For some reason the following test doesn't print anything out to the screen

Do I need to change something in the apache configuration, or mod_perl's ?

/articles_read.pl?id=חוזרת

## get http parameters
$r = shift;
$apr = Apache2::Request->new($r);
print $apr->param('id');

thanks in advance.

Internal Virus Database is out-of-date.
Checked by AVG Free Edition.
Version: 7.5.503 / Virus Database: 269.16.4/1146 - Release Date: 22/11/2007 18:55


Re: utf8 urls

on 19.03.2008 13:18:47 by John ORourke


Eli Shemer wrote:
>
> For some reason the following test doesn't print anything out to the
> screen
>
> Do I need to change something in the apache configuration, or mod_perl's ?
>
> /articles_read.pl?id=חוזרת
>
> ## get http parameters
>
> $r = shift;
>
> $apr = Apache2::Request->new($r);
>
> print $apr->param('id');
>

I'm not sure why you get nothing, but I can tell you strings read from
Apache objects come through as octets and need to be decoded before
use. We're using UTF-8 chars in URLs but I've never used one in a GET
request parameter.
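
To make the octets-versus-characters point concrete, here is a minimal sketch using the core Encode module. It assumes the value arrived as UTF-8 octets; the byte string below is the UTF-8 form of the Hebrew word from the original post, written out by hand for illustration rather than read from a real request.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The five Hebrew letters from the thread as raw UTF-8 octets
# (two octets per letter), the form a handler would receive.
my $octets = "\xD7\x97\xD7\x95\xD7\x96\xD7\xA8\xD7\xAA";

# decode() turns the octets into a Perl character string.
my $chars = decode('UTF-8', $octets);

printf "octets: %d, characters: %d\n", length($octets), length($chars);
```

Until the value is decoded, length(), substr() and regex character classes all operate on the ten octets rather than on the five letters.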

hope that helps,
John




Re: utf8 urls

on 19.03.2008 13:42:14 by aw

From a previous message by Adam Prime in this same list :
[...]
SetHandler modperl doesn't bind 'print' to '$r->print'. Try SetHandler
perl-script, or change your code to pass in the request object and use
$r->print instead of print.
[...]

or, more verbosely and explicitly:
if, in your Apache configuration for this "location", you used

SetHandler modperl

then you should not assume that print() sends its output to the
browser. But if you did (as you did)

$r = shift; # get the Apache2::RequestRec object

then $r->print() does go back as a response to the browser.
You should probably at least set a content-type header though,
like

$r->content_type('text/plain');
$r->print($apr->param('id'));

and, in your case, it might also be a good idea to send back a header
indicating which character set is used (presumably UTF-8), since the
default HTTP character set is iso-8859-1, and the string you send back
does not look printable in that charset.

But I don't know exactly how to do that best in mod_perl.
Would the following work ?
$r->content_type('text/plain; charset="UTF-8"');

Also, the previous message talking about how to handle your (apparently)
UTF-8 request should be taken into account.
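
Put together, the two suggestions above (write through $r->print, and declare the charset) could look like the sketch below. StubRequest is a hypothetical stand-in for the real request object so that the snippet runs outside Apache, and the handler signature is simplified for illustration; under mod_perl the real $r would be used and the value would come from $apr->param('id').

```perl
use strict;
use warnings;

# Hypothetical stand-in for the request object; it only records
# what the handler sends.
package StubRequest;

sub new {
    my ($class) = @_;
    my $self = { type => undef, body => '' };
    return bless $self, $class;
}

sub content_type {
    my ($self, $type) = @_;
    $self->{type} = $type if defined $type;
    return $self->{type};
}

sub print {
    my ($self, @data) = @_;
    $self->{body} .= join '', @data;
    return 1;
}

package main;

# Declare the charset first, then write through $r->print instead of
# relying on a bare print().
sub handler {
    my ($r, $id_octets) = @_;
    $r->content_type('text/plain; charset=UTF-8');
    $r->print($id_octets);
    return 0;    # stands in for Apache2::Const::OK
}

my $r = StubRequest->new;
handler($r, "\xD7\x97\xD7\x95\xD7\x96\xD7\xA8\xD7\xAA");
my $type = $r->content_type;
print "$type\n";
```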


André



Re: utf8 urls

on 19.03.2008 13:54:08 by Geoffrey Young

John ORourke wrote:
> Eli Shemer wrote:
>>
>> For some reason the following test doesn’t print anything out to the
>> screen
>>
>> Do I need to change something in the apache configuration, or
>> mod_perl’s ?
>>
>>
>>
>> /articles_read.pl?id=חוזרת
>>
>>
>>
>> ## get http parameters
>>
>> $r = shift;
>>
>> $apr = Apache2::Request->new($r);
>>
>> print $apr->param('id');
>>
>
> I'm not sure why you get nothing, but I can tell you strings read from
> Apache objects come through as octets and need to be decoded before
> use. We're using UTF-8 chars in URLs but I've never used one in a GET
> request parameter.

I can't say why it doesn't work, but I'm surprised it would in either
case - the only characters explicitly allowed in a uri are us-ascii.
from rfc2396:

2.4. Escape Sequences

Data must be escaped if it does not have a representation using an
unreserved character; this includes data that does not correspond to
a printable character of the US-ASCII coded character set, or that
corresponds to any US-ASCII character that is disallowed, as
explained below.

A bit of googling turned up this cpan module:

http://search.cpan.org/dist/URI-Find-UTF8/lib/URI/Find/UTF8.pm

where the docs point to a ja.wikipedia.org page. for me (firefox 2.0)
clicking on the "original" uri (the one with the japanese characters)
opens up a uri with the uri-escaped character sequence. it's like magic ;)

anyway, my point wasn't to get into some huge debate on whether people
are (successfully) using utf-8 characters in uris, etc. rather, it is
that mod_perl is (mostly) merely a wrapper around apache, and if
something is improper wrt an official rfc apache generally dismisses it
rather than bending to a behavior which people may be using anyway.

so, if it works, great. if not, try making your urls conform to 2396
and see if you have better results.

--Geoff

Re: utf8 urls

on 19.03.2008 13:54:21 by torsten.foertsch

On Wed 19 Mar 2008, Eli Shemer wrote:
> For some reason the following test doesn’t print anything out to the screen
>
> Do I need to change something in the apache configuration, or mod_perl’s ?
>
> /articles_read.pl?id=חוזרת

This is probably a bug in libapreq2. I have tried this handler:

sub {
    my $r=$_[0];
    $r->content_type('text/html; charset=UTF-8');
    my $x=Apache2::Request->new($r);
    $r->print("\nargs=".$r->args."\nparam(x)=".
              $x->param('x')."\n\n");
    return Apache2::Const::OK;
}

http://localhost/test?x=חוזרת entered in FF changes on the fly into
http://localhost/test?x=%D7%97%D7%95%D7%96%D7%A8%D7%AA and it works.

But on the command line with curl it doesn't:

$ curl 'http://localhost/test?x=חוזרת' -v
* About to connect() to localhost port 80 (#0)
* Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 80 (#0)
> GET /test?x=חוזרת HTTP/1.1
> User-Agent: curl/7.16.4 (i686-suse-linux-gnu) libcurl/7.16.4 OpenSSL/0.9.8e zlib/1.2.3 libidn/1.0
> Host: localhost
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Wed, 19 Mar 2008 12:45:29 GMT
< Server: Apache/2.2.6 (Unix) mod_ssl/2.2.6 OpenSSL/0.9.8e DAV/2 SVN/1.4.5 mod_apreq2-20051231/2.6.0 mod_perl/2.0.4-dev Perl/v5.8.8
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=UTF-8
<

args=x=חוזרת
param(x)=

* Connection #0 to host localhost left intact
* Closing connection #0
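
For clients such as curl that send the URL exactly as typed, the escaping Firefox does on the fly has to be done by hand. A minimal sketch follows, assuming the value should go out as percent-encoded UTF-8. The helper name escape_query_value is made up for this example, and it uses a deliberately conservative safe set (letters, digits and - . _ ~). The CPAN module URI::Escape offers uri_escape_utf8 for the same job.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a character string into a percent-encoded
# UTF-8 query value.
sub escape_query_value {
    my ($chars) = @_;
    my $octets = encode('UTF-8', $chars);    # characters -> octets
    # Escape every octet outside the conservative safe set.
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $octets;
}

# The Hebrew word from the thread, written as code points.
my $value = "\x{5D7}\x{5D5}\x{5D6}\x{5E8}\x{5EA}";
my $url   = 'http://localhost/test?x=' . escape_query_value($value);
print "$url\n";
```

The printed URL matches the one Firefox produced above, so passing it to curl should exercise the same code path on the server.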

Torsten

Re: utf8 urls

on 19.03.2008 14:13:43 by John ORourke

Geoffrey Young wrote:
> John ORourke wrote:
>> Eli Shemer wrote:
>>>
>>> For some reason the following test doesn’t print anything out to the
>>> screen
>>>
>> I'm not sure why you get nothing, but I can tell you strings read
>> from Apache objects come through as octets and need to be decoded
>> before use. We're using UTF-8 chars in URLs but I've never used one
>> in a GET request parameter.
>
> I can't say why it doesn't work, but I'm surprised it would in either
> case - the only characters explicitly allowed in a uri are us-ascii.
> from rfc2396:
>

My bad memory there - you are quite correct. The way we do it is the
accepted way - to URL-encode the UTF-8 encoded text, and that will work
with URLs and parameters.

eg:

http://www....../categories/name/ty%C3%B6kalut-lamput

is the correct form of:

http://www....../categories/name/työkalut-lamput


encode before printing:

use Encode qw(encode decode); # core module

$octets = encode('UTF-8', $my_utf8_string); # characters -> UTF-8 octets
$octets =~ s/([^\041-\176])/sprintf("%%%02X",ord($1))/ge; # URL-encode non-ASCII octets, spaces and control chars
$r->print($octets);
(the above is simplified - you'll also need to encode question marks etc)


decode after reading:

$url = decode('UTF-8', $r->uri()); # octets -> characters
or
$param = decode('UTF-8', $r->param('info'));

cheers
John

Re: utf8 urls

on 19.03.2008 14:32:58 by aw

I think that these things can get very confused and confusing very
quickly, unless one steps through them one step at a time.
Let me try a first iteration :

1) URIs, as sent to the HTTP server, should contain only US-ASCII
characters (and no spaces). If there are other characters, they should
be encoded using the appropriate RFC-dictated URI-encoding scheme.
2) Whether Firefox is smart enough to automatically encode a URI
properly, when it notices that it contains non-US-ASCII characters, is a
nice aspect of Firefox if it does, but should not confuse the main issue.
In other words, if you send a non-ASCII URI to a server (e.g. via curl or
lwp-request), then you have to URI-encode the request yourself.
3) According to a previous response, at the receiving side, when Apache
gets a properly-encoded request URI containing non-ASCII characters, it
leaves it encoded and passes it "as is" (or "as bytes") to the
processing layer, which in this case is mod_perl.
4) mod_perl parses the URI and makes it accessible in several ways to
the modules running under it (in this case a request handler or a script).
Question : does mod_perl decode the URI string prior to passing it in
bits and pieces to the handler/script, or not ?
(From another response, it would seem that it doesn't)
5) the handler/script obtains the URI parts from mod_perl, possibly
through the RequestRec or Request object.
If such URI parts contained non-ASCII characters, do these modules
perform any translation, or does the handler/script still receive them
as URI-encoded ?
(From another response, it would seem that they don't, and it does)
6) Now the handler/script has the value of the (for instance) query
parameter "id" (and assume it contains non-ASCII characters), and it
wants to output it back to the browser.
To do that, it must arrange to send to the browser a HTTP header that
will tell the browser in which character set this response is encoded,
since by default the HTTP protocol says it is iso-8859-1.
And it seems that in order to do that, it should use, at a minimum

$param = $apr->param('id');
$r->content_type('text/plain; charset="UTF-8"');
$r->print($param);

There are a couple of aspects not mentioned above, such as
- how does the handler/script "know" which decoding it should apply to
the URI elements ? Is it certain that it is UTF-8 ?
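
On that open question, one common heuristic (a sketch, not something mod_perl or libapreq2 does for you) is to try a strict UTF-8 decode first and fall back to a single-byte charset only if it fails. The helper name decode_param and the windows-1255 fallback are assumptions for this example. The fallback is chosen here only because the original mail was sent in that charset.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: strict UTF-8 first, then an assumed fallback.
sub decode_param {
    my ($octets, $fallback) = @_;
    $fallback = 'cp1255' unless defined $fallback;
    my $copy  = $octets;    # decode() with a CHECK flag may modify its input
    my $chars = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return $chars if defined $chars;
    return decode($fallback, $octets);
}

# The same Hebrew word, once as UTF-8 octets and once as windows-1255 octets.
my $from_utf8   = decode_param("\xD7\x97\xD7\x95\xD7\x96\xD7\xA8\xD7\xAA");
my $from_cp1255 = decode_param("\xE7\xE5\xE6\xF8\xFA");
print $from_utf8 eq $from_cp1255 ? "same word\n" : "different\n";
```

This works because text in a legacy single-byte encoding is rarely valid UTF-8 by accident, but it remains a guess: the request itself does not say which charset the client used.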


Another go, anyone ?

André




