upper and UTF-8

on 26.07.2010 23:03:54 by Benjamin Krajmalnik

I just used the upper(text) function on a database which is UTF8-encoded
and which has Spanish text.

All of the regular characters were properly converted, except for
characters which had accents.


Re: upper and UTF-8

on 26.07.2010 23:17:16 by Scott Marlowe

On Mon, Jul 26, 2010 at 3:03 PM, Benjamin Krajmalnik wrote:
> I just used the upper(text) function on a database which is utf8 encoded and
> which has spanish text.
>
> All of the regular characters were properly converted, except for characters
> which had accents.

What are your various LC_* variables for that database?
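
One way to look those settings up from psql is sketched below. It assumes PostgreSQL 8.4 or later, where collation and character classification are stored per database in pg_database. The SHOW commands apply to releases of that era; the read-only lc_collate and lc_ctype variables were removed in PostgreSQL 16, so on current servers only the catalog query applies.

```sql
-- Locale settings of the database you are connected to.
SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate AS lc_collate,
       datctype   AS lc_ctype
FROM pg_database
WHERE datname = current_database();

-- The same values as read-only session parameters (releases before 16).
SHOW lc_collate;
SHOW lc_ctype;
```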

--
To understand recursion, one must first understand recursion.

--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Re: upper and UTF-8

on 26.07.2010 23:18:15 by Benjamin Krajmalnik

CREATE DATABASE ishield
  WITH OWNER = postgres
       ENCODING = 'UTF8'
       LC_COLLATE = 'C'
       LC_CTYPE = 'C'
       CONNECTION LIMIT = -1;



Re: upper and UTF-8

on 26.07.2010 23:39:09 by Scott Marlowe

I'd try creating a db with en_US, or even better whatever the Spanish
locale is, for lc_collate and see what happens.


Re: upper and UTF-8

on 26.07.2010 23:47:13 by Benjamin Krajmalnik

Unfortunately, the database has to accept data in multiple languages, since it is a SaaS offering.
It is not a big deal - I just found it interesting that it did not uppercase the accented letters.
The reason I came across it is that I created a table of all the ISO countries. I had found a MySQL script which created it, and it had the fields in both upper case and mixed case. Since our platform is multi-lingual, we expanded the table to add the language code and started adding the translation. After I finished the translation, I figured that for consistency I would upper-case the one field into the other, and this is where I saw the inconsistency.
Operationally, it does not affect me in any way - but I found it strange that it did not handle the accented characters.
For now we are keeping the column to facilitate the translation to other languages - ultimately it will be dropped.



Re: upper and UTF-8

on 26.07.2010 23:51:58 by Scott Marlowe

On Mon, Jul 26, 2010 at 3:47 PM, Benjamin Krajmalnik wrote:
> Unfortunately, the database has to accept data in multiple languages, since it is a SaaS offering.

The encoding determines that, not the collation. UTF-8 allows you to
insert various languages in that encoding.

> It is not a big deal - I just found it interesting that it did not uppercase the accented letters.

Just tested it and the lc_collate seems to make the difference.


Re: upper and UTF-8

on 26.07.2010 23:58:26 by Scott Marlowe


To be more specific, when my lc_collate is en_US, it works properly.
I didn't have to use a Spanish collation to make it work. Note that
changing the collation will change sort order, some matching rules, and
things like that. Also, a db is usually noticeably faster working
with text in the C locale, because it then treats the data mostly as
though it's in byte order.
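
On releases newer than the ones discussed in this thread (9.1 and later), a collation can also be attached to a single expression, which leaves the database's sort order and its C-locale speed alone. The following is only a sketch: the collation name "es_ES.utf8" is an assumption, and the names actually available depend on the operating system locales that were present when the cluster was initialized.

```sql
-- List the Spanish collations this server actually knows about.
SELECT collname FROM pg_collation WHERE collname LIKE 'es%';

-- Apply one to a single call; upper() takes its case rules from the
-- collation of its argument rather than from the database default.
SELECT upper('benjamín' COLLATE "es_ES.utf8");
```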


Re: upper and UTF-8

on 27.07.2010 04:09:36 by alvherre

Excerpts from Benjamin Krajmalnik's message of Mon Jul 26 17:03:54 -0400 2010:
> I just used the upper(text) function on a database which is utf8 encoded
> and which has spanish text.
>
> All of the regular characters were properly converted, except for
> characters which had accents.

FWIW it works fine for me:

alvherre=# show lc_collate ;
 lc_collate
------------
 es_CL.utf8
(1 row)

alvherre=# select upper('benjamín');
  upper
----------
 BENJAMÍN
(1 row)

I suspect that the problem is an incorrect client_encoding setting.
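
A quick way to test that suspicion from psql is sketched below. It assumes the server encoding is UTF8, as in the original report, and that the terminal itself sends UTF-8.

```sql
-- Encoding the server stores data in, and the encoding it assumes the
-- client is sending.
SHOW server_encoding;
SHOW client_encoding;

-- Tell the server the client really sends UTF-8 (session-level setting).
SET client_encoding TO 'UTF8';

-- If the literal arrived intact, the accented letter counts as one character.
SELECT char_length('benjamín');
```

With a matching client_encoding the count should be 8. A count of 9 would hint that the two UTF-8 bytes of the accented letter were converted as two separate single-byte characters.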


Re: upper and UTF-8

on 27.07.2010 05:12:08 by Scott Marlowe

On Mon, Jul 26, 2010 at 8:09 PM, Alvaro Herrera wrote:
> I suspect that the problem is an incorrect client_encoding setting.

Yeah, OP had set lc_collate to C under the mistaken impression that
collation controlled the character sets you could insert into the
database. If you create a db with lc_collate='C', then upper() only
works on basic ASCII characters, as near as I can tell.


Re: upper and UTF-8

on 27.07.2010 05:36:33 by alvherre

Excerpts from Scott Marlowe's message of Mon Jul 26 23:12:08 -0400 2010:
> On Mon, Jul 26, 2010 at 8:09 PM, Alvaro Herrera wrote:

> > I suspect that the problem is an incorrect client_encoding setting.
>
> Yeah, OP had set lc_collate to C under the mistaken impression that
> collation controlled the character sets you could insert into the
> database. If you create a db with lc_collate='C' then the upper only
> works on basic ascii characters near as I can tell.

Makes sense. The code seems to say that it's lc_ctype that's important
though, see str_toupper in formatting.c. So I think you could still set
collation to C and use a language-specific lc_ctype.
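
That suggestion could be tried roughly as follows. This is an untested sketch: the database name ishield_ctype_test is hypothetical, and the locale name 'es_ES.UTF-8' is an assumption that has to match a locale installed on the server's operating system (for example, one listed by `locale -a`).

```sql
-- TEMPLATE template0 is needed when the locale differs from template1's.
CREATE DATABASE ishield_ctype_test
  WITH OWNER = postgres
       ENCODING = 'UTF8'
       LC_COLLATE = 'C'
       LC_CTYPE = 'es_ES.UTF-8'
       TEMPLATE = template0;

-- After connecting to the new database, check the accented letter:
SELECT upper('benjamín');
```

Sorting would still follow byte order because of LC_COLLATE = 'C', while case conversion would use the rules of the chosen LC_CTYPE.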


Re: upper and UTF-8

on 27.07.2010 16:54:15 by Michael Gould

Benjamin,


We're using the contrib module citext for all text columns so that we can
do case-insensitive searches, and so far we haven't found any text that it
fails to match.


Best Regards


Mike Gould
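
For readers who have not used citext, a minimal sketch follows. The table country_name is hypothetical, and CREATE EXTENSION assumes PostgreSQL 9.1 or later; at the time of this thread the module was loaded by running the citext.sql script from contrib.

```sql
CREATE EXTENSION IF NOT EXISTS citext;

-- Hypothetical table holding translated country names.
CREATE TABLE country_name (
    iso_code char(2) PRIMARY KEY,
    name     citext NOT NULL
);

INSERT INTO country_name VALUES ('ES', 'España');

-- Comparison ignores case, so a lower-case search term still matches.
SELECT iso_code FROM country_name WHERE name = 'españa';
```

citext folds case with the same machinery as lower(), so its handling of accented letters also depends on the database's LC_CTYPE. With LC_CTYPE = 'C', a search for 'ESPAÑA' may not match, because Ñ is not folded to ñ.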

