Fun with UTF-8 and Oracle

am 23.08.2007 16:20:16 von hjp

--VUDLurXRWRKrGuMn
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

While investigating a performance problem I found a curious behaviour.

I have an oracle database (10.2.0.3) and matching client, DBI 1.50 and
DBD::Oracle 1.19. Database and client are set to use the AL32UTF8
charset.=20

Now I do a simple query on one of the tables:

select id, name from datasets ds where name =3D ?

The table has about 300k rows and an index on the name column.=20

The name I'm searching for contains only ASCII characters, but depending
on the source, the string may have the utf8 flag set.=20

So my testcase currently looks like this:

1 #!/usr/bin/perl
=20
2 use warnings;
3 use strict;
4 use utf8;
=20
5 use DBI;
6 use Time::HiRes qw(time);
=20
=20
7 my $dbh =3D db_connect();
=20
8 datasets_by_name($dbh, 'CN-03021200-import-1000 kg-CMX-rs4');
=20
9 datasets_by_name($dbh, 'CN-03021200-import-1000 ECU-CMX-rs4');
=20
10 my $s =3D'CN-03021200-import-1000 ECU-CMX-rs5';
11 utf8::upgrade $s;
12 datasets_by_name($dbh, $s);
=20
13 datasets_by_name($dbh, 'CN-03021200-import-1000 ECU-CMX-rs6');
=20
14 datasets_by_name($dbh, 'CN-03021200-import-1000 kg-CMX-rs6');
=20
=20
15 sub read_cred {
16 my ($fn) =3D @_;
=20
17 open(FN, "<$fn") or die "cannot open $fn: $!";
18 my $line =3D ;=20
19 close(FN);
20 my @cred =3D split(/[\s\n]+/, $line);=20
21 return @cred;
22 }
=20
23 sub db_connect {
24 my $dbi_credential_file =3D $ENV{WZRP_CONN};
25 my ($data_source, $username, $auth) =3D read_cred($dbi_credenti=
al_file);
26 $dbh =3D DBI->connect($data_source, $username, $auth,
27 { AutoCommit =3D> 0,
28 PrintError =3D> 0,
29 RaiseError =3D> 1
30 }
31 );
32 $dbh->{FetchHashKeyName} =3D 'NAME_lc';
33 return $dbh;
34 }
=20
35 my $sth;
=20
36 sub datasets_by_name {
37 my ($dbh, $name) =3D @_;
38 my $t0 =3D time;
39 unless ($ENV{PREPARE_ONCE} && $sth) {
40 $sth =3D $dbh->prepare("select id, name from datasets ds wh=
ere name =3D ?");
41 }
42 my $r =3D $dbh->selectall_hashref($sth, 'id', {}, $name);
43 my $t1 =3D time;
44 print $t1-$t0, "\n";
45 }

The function datasets_by_name invokes the query above. If it prepares
the query on each request, the timings look like this:

0.00154304504394531
0.00111699104309082
0.27887487411499
0.00116086006164551
0.00110697746276855

I.e., searching for a byte string takes about 1 millisecond and
searching for a utf8 string takes about 280 milliseconds! Oracle
enterprise manager tells me that in the slow case the index is ignored
and a full table scan is performed.=20

But wait, it gets even better. If the statement handle for the query is
cached and reused, the timings look like this:

0.00197005271911621
0.000653982162475586
0.280965089797974
0.2483971118927
0.250221967697144

I.e., the information that the index can't be used (which I don't
understand in the first place) seems to be "sticky".

hp

--=20
_ | Peter J. Holzer | If I wanted to be "academically correct",
|_|_) | Sysadmin WSR | I'd be programming in Java.
| | | hjp@wsr.ac.at | I don't, and I'm not.
__/ | http://www.hjp.at/ | -- Jesse Erlbaum on dbi-users

--VUDLurXRWRKrGuMn
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFGzZegMdFfQa64PCwRAnrqAJ0f8yPYokz0+b0lJrz7M3cd/4wARwCe OSSR
1bPv7fyXeaAA0GTPriDrmlQ=
=daMG
-----END PGP SIGNATURE-----

--VUDLurXRWRKrGuMn--

Re: Fun with UTF-8 and Oracle

am 24.08.2007 09:26:30 von rvtol+news

I have no idea about why the query with the utf8-upgraded name-value
takes so much longer.

In your code there is:

my @cred = split(/[\s\n]+/, $line);

which I think is better written as

my @cred = split ' ', $line;

--
Affijn, Ruud

"Gewoon is een tijger."
""Peter J. Holzer"" wrote in message
news:20070823142016.GE32586@wsr.ac.at...

Re: Fun with UTF-8 and Oracle

am 03.09.2007 14:03:37 von Tim.Bunce

Oh the joys of Oracle character set support.

See the dbd_rebind_ph() code in dbdimp.c. Especially from the line
"csform = phs->csform;" onwards.

You can explicitly set the csform and csid using bind_param(..., { ... })

Have fun!

But do please write up what you find as a patch the the Oracle docs.
It's entirely possible that the code is doing the wrong thing.

Tim.

On Thu, Aug 23, 2007 at 04:20:16PM +0200, Peter J. Holzer wrote:
> While investigating a performance problem I found a curious behaviour.
>
> I have an oracle database (10.2.0.3) and matching client, DBI 1.50 and
> DBD::Oracle 1.19. Database and client are set to use the AL32UTF8 charset.
>
> Now I do a simple query on one of the tables:
> select id, name from datasets ds where name = ?
> The table has about 300k rows and an index on the name column.
>
> The name I'm searching for contains only ASCII characters, but depending
> on the source, the string may have the utf8 flag set.

> 36 sub datasets_by_name {
> 37 my ($dbh, $name) = @_;
> 38 my $t0 = time;
> 39 unless ($ENV{PREPARE_ONCE} && $sth) {
> 40 $sth = $dbh->prepare("select id, name from datasets ds where name = ?");
> 41 }
> 42 my $r = $dbh->selectall_hashref($sth, 'id', {}, $name);
> 43 my $t1 = time;
> 44 print $t1-$t0, "\n";
> 45 }
>
> The function datasets_by_name invokes the query above. If it prepares
> the query on each request, the timings look like this:
>
> 0.00154304504394531
> 0.00111699104309082
> 0.27887487411499
> 0.00116086006164551
> 0.00110697746276855
>
> I.e., searching for a byte string takes about 1 millisecond and
> searching for a utf8 string takes about 280 milliseconds! Oracle
> enterprise manager tells me that in the slow case the index is ignored
> and a full table scan is performed.
>
> But wait, it gets even better. If the statement handle for the query is
> cached and reused, the timings look like this:
>
> 0.00197005271911621
> 0.000653982162475586
> 0.280965089797974
> 0.2483971118927
> 0.250221967697144
>
> I.e., the information that the index can't be used (which I don't
> understand in the first place) seems to be "sticky".

Re: Fun with UTF-8 and Oracle

am 03.09.2007 19:20:42 von hjp

--CNK/L7dwKXQ4Ub8J
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On 2007-09-03 13:03:37 +0100, Tim Bunce wrote:
> Oh the joys of Oracle character set support.
>=20
> See the dbd_rebind_ph() code in dbdimp.c. Especially from the line
> "csform =3D phs->csform;" onwards.
>=20
> You can explicitly set the csform and csid using bind_param(..., { ... })
>=20
> Have fun!

I had :-). You gave me a push into the right direction, and I think I
found the problem:

1375 csform =3D phs->csform;
1376=20
1377 if (!csform && SvUTF8(phs->sv)) {
1378 /* try to default csform to avoid translation through non-u=
nicode */
1379 if (CSFORM_IMPLIES_UTF8(SQLCS_NCHAR)) /* prefer N=
CHAR */
1380 csform =3D SQLCS_NCHAR;
1381 else if (CSFORM_IMPLIES_UTF8(SQLCS_IMPLICIT))
1382 csform =3D SQLCS_IMPLICIT;
1383 /* else leave csform == 0 */
1384 if (trace_level)
1385 PerlIO_printf(DBILOGFP, " rebinding %s with UTF8 =
value %s", phs->name,
1386 (csform == SQLCS_NCHAR) ? "so setting csform=
=3DSQLCS_IMPLICIT" :
1387 (csform == SQLCS_IMPLICIT) ? "so setting csform=
=3DSQLCS_NCHAR" :
1388 "but neither CHAR nor NCHAR are unicode\n");
1389 }

The first thing to notice is that the trace message is backwards: It
prints setting csform=3DSQLCS_IMPLICIT if the csform is set to SQLCS_NCHAR
and vice versa. So that explains why I didn't notice anything strange in
the trace output. While it claimed to set csform=3DSQLCS_IMPLICIT, it was
actually setting it to SQLCS_NCHAR, and I guess that was what prevented
the use of the index on the varchar2 column. Sure enough, if I
explicitely set ora_csform to 1, it's fast. It is also fast, if I set
NLS_NCHAR to US7ASCII, so that it isn't used.=20

Maybe the tests in line 1379 and 1381 should be reversed so that
SQLCS_IMPLICIT is preferred over SQLCS_NCHAR? I guess there was some
good reason to prefer SQLCS_NCHAR, but it seems to me that SQLCS_IMPLICIT is
usually what you would want (but then I'm probably biased from my
absolute failure to use nvarchar or nclob columns with Oracle 8.0 from
Java or Perl many moons ago).

> But do please write up what you find as a patch the the Oracle docs.
> It's entirely possible that the code is doing the wrong thing.

Like this?

Index: Oracle.pm
==================== =====3D=
==================== =====3D=
=================3D
--- Oracle.pm (revision 9909)
+++ Oracle.pm (working copy)
@@ -1462,7 +1462,14 @@
=20
Specify the OCI_ATTR_CHARSET_FORM for the bind value. Valid values
are SQLCS_IMPLICIT (1) and SQLCS_NCHAR (2). Both those constants can
-be imported from the DBD::Oracle module. Rarely needed.
+be imported from the DBD::Oracle module.
+This should be rarely needed, but at least some versions of Oracle
+(e.g., 10.2) don't use an index if this is set wrong, which can happen
+if both the character set and the national character set are UTF-8 and
+the placeholder value has the utf8 flag set. In this case you can either
+choose ora_csform to match your column or (if you don't have any NCHAR,
+NVARCHAR2 or NCLOB columns) you can set NLS_NCHAR to US7ASCII to prevent
+its use.
=20
=3Ditem ora_csid
=20

hp

--=20
_ | Peter J. Holzer | If I wanted to be "academically correct",
|_|_) | Sysadmin WSR | I'd be programming in Java.
| | | hjp@wsr.ac.at | I don't, and I'm not.
__/ | http://www.hjp.at/ | -- Jesse Erlbaum on dbi-users

--CNK/L7dwKXQ4Ub8J
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFG3EJqMdFfQa64PCwRAnhMAJ94/EZ8ZYAoGg3KAmVZ1qofQjhDQACf T9Em
m4tD+HO1xNjfR5yW6N8wztI=
=hAeY
-----END PGP SIGNATURE-----

--CNK/L7dwKXQ4Ub8J--

Re: Fun with UTF-8 and Oracle

am 04.09.2007 15:20:20 von Tim.Bunce

On Mon, Sep 03, 2007 at 07:20:42PM +0200, Peter J. Holzer wrote:
> On 2007-09-03 13:03:37 +0100, Tim Bunce wrote:
> > Oh the joys of Oracle character set support.
> >
> > See the dbd_rebind_ph() code in dbdimp.c. Especially from the line
> > "csform = phs->csform;" onwards.
> >
> > You can explicitly set the csform and csid using bind_param(..., { ... })
> >
> > Have fun!
>
> I had :-). You gave me a push into the right direction, and I think I
> found the problem:
>
> 1375 csform = phs->csform;
> 1376
> 1377 if (!csform && SvUTF8(phs->sv)) {
> 1378 /* try to default csform to avoid translation through non-unicode */
> 1379 if (CSFORM_IMPLIES_UTF8(SQLCS_NCHAR)) /* prefer NCHAR */
> 1380 csform = SQLCS_NCHAR;
> 1381 else if (CSFORM_IMPLIES_UTF8(SQLCS_IMPLICIT))
> 1382 csform = SQLCS_IMPLICIT;
> 1383 /* else leave csform == 0 */
> 1384 if (trace_level)
> 1385 PerlIO_printf(DBILOGFP, " rebinding %s with UTF8 value %s", phs->name,
> 1386 (csform == SQLCS_NCHAR) ? "so setting csform=SQLCS_IMPLICIT" :
> 1387 (csform == SQLCS_IMPLICIT) ? "so setting csform=SQLCS_NCHAR" :
> 1388 "but neither CHAR nor NCHAR are unicode\n");
> 1389 }
>
> The first thing to notice is that the trace message is backwards: It
> prints setting csform=SQLCS_IMPLICIT if the csform is set to SQLCS_NCHAR
> and vice versa. So that explains why I didn't notice anything strange in
> the trace output. While it claimed to set csform=SQLCS_IMPLICIT, it was
> actually setting it to SQLCS_NCHAR, and I guess that was what prevented
> the use of the index on the varchar2 column. Sure enough, if I
> explicitely set ora_csform to 1, it's fast. It is also fast, if I set
> NLS_NCHAR to US7ASCII, so that it isn't used.
>
> Maybe the tests in line 1379 and 1381 should be reversed so that
> SQLCS_IMPLICIT is preferred over SQLCS_NCHAR?

Sounds very plausible.

> I guess there was some good reason to prefer SQLCS_NCHAR, but it seems
> to me that SQLCS_IMPLICIT is usually what you would want

It was probably just the way the code evolved.

> (but then I'm probably biased from my absolute failure to use nvarchar
> or nclob columns with Oracle 8.0 from Java or Perl many moons ago).
>
> > But do please write up what you find as a patch the the Oracle docs.
> > It's entirely possible that the code is doing the wrong thing.
>
> Like this?

That's great, but why stop there? Now you seem to have identified the
problem, how about patching the code instead ;-)

Tim.

> Index: Oracle.pm
> ============================================================ =======
> --- Oracle.pm (revision 9909)
> +++ Oracle.pm (working copy)
> @@ -1462,7 +1462,14 @@
>
> Specify the OCI_ATTR_CHARSET_FORM for the bind value. Valid values
> are SQLCS_IMPLICIT (1) and SQLCS_NCHAR (2). Both those constants can
> -be imported from the DBD::Oracle module. Rarely needed.
> +be imported from the DBD::Oracle module.
> +This should be rarely needed, but at least some versions of Oracle
> +(e.g., 10.2) don't use an index if this is set wrong, which can happen
> +if both the character set and the national character set are UTF-8 and
> +the placeholder value has the utf8 flag set. In this case you can either
> +choose ora_csform to match your column or (if you don't have any NCHAR,
> +NVARCHAR2 or NCLOB columns) you can set NLS_NCHAR to US7ASCII to prevent
> +its use.
>
> =item ora_csid
>
>
> hp
>
> --
> _ | Peter J. Holzer | If I wanted to be "academically correct",
> |_|_) | Sysadmin WSR | I'd be programming in Java.
> | | | hjp@wsr.ac.at | I don't, and I'm not.
> __/ | http://www.hjp.at/ | -- Jesse Erlbaum on dbi-users

Re: Fun with UTF-8 and Oracle

am 12.11.2007 16:35:55 von mark.bracher

Peter J. Holzer wrote:
> On 2007-09-03 13:03:37 +0100, Tim Bunce wrote:
>> Oh the joys of Oracle character set support.
>>
>> See the dbd_rebind_ph() code in dbdimp.c. Especially from the line
>> "csform = phs->csform;" onwards.
>>
>> You can explicitly set the csform and csid using bind_param(..., { ... })
>>
>> Have fun!
>
> I had :-). You gave me a push into the right direction, and I think I
> found the problem:
>
> 1375 csform = phs->csform;
> 1376
> 1377 if (!csform && SvUTF8(phs->sv)) {
> 1378 /* try to default csform to avoid translation through non-unicode */
> 1379 if (CSFORM_IMPLIES_UTF8(SQLCS_NCHAR)) /* prefer NCHAR */
> 1380 csform = SQLCS_NCHAR;
> 1381 else if (CSFORM_IMPLIES_UTF8(SQLCS_IMPLICIT))
> 1382 csform = SQLCS_IMPLICIT;
> 1383 /* else leave csform == 0 */
> 1384 if (trace_level)
> 1385 PerlIO_printf(DBILOGFP, " rebinding %s with UTF8 value %s", phs->name,
> 1386 (csform == SQLCS_NCHAR) ? "so setting csform=SQLCS_IMPLICIT" :
> 1387 (csform == SQLCS_IMPLICIT) ? "so setting csform=SQLCS_NCHAR" :
> 1388 "but neither CHAR nor NCHAR are unicode\n");
> 1389 }
>
> The first thing to notice is that the trace message is backwards: It
> prints setting csform=SQLCS_IMPLICIT if the csform is set to SQLCS_NCHAR
> and vice versa. So that explains why I didn't notice anything strange in
> the trace output. While it claimed to set csform=SQLCS_IMPLICIT, it was
> actually setting it to SQLCS_NCHAR, and I guess that was what prevented
> the use of the index on the varchar2 column. Sure enough, if I
> explicitely set ora_csform to 1, it's fast. It is also fast, if I set
> NLS_NCHAR to US7ASCII, so that it isn't used.
>
> Maybe the tests in line 1379 and 1381 should be reversed so that
> SQLCS_IMPLICIT is preferred over SQLCS_NCHAR? I guess there was some
> good reason to prefer SQLCS_NCHAR, but it seems to me that SQLCS_IMPLICIT is
> usually what you would want (but then I'm probably biased from my
> absolute failure to use nvarchar or nclob columns with Oracle 8.0 from
> Java or Perl many moons ago).

stumbled on this thread while debugging a problem with full table scans
binding utf8 strings into oracle queries...

it seems the degraded performance is "expected"

http://download-uk.oracle.com/docs/cd/B10501_01/server.920/a 96529/ch6.htm#1008756

when you used csform SQLCS_NCHAR and the database column is CHAR,
VARCHAR2 or CLOB. same is true of the reverse, using SQLCS_IMPLICIT
when the column is NCHAR, NVARCHAR2, NCLOB.

perhaps we need to rework CSFORM_IMPLIES_UTF8 so it doesn't return true
for csform SQLCS_NCHAR if the columns are CHAR, VARCHAR2 or CLOB?

- mark

Re: Fun with UTF-8 and Oracle

am 12.11.2007 20:34:40 von mbracher

mark warren bracher wrote:
> Peter J. Holzer wrote:
>> On 2007-09-03 13:03:37 +0100, Tim Bunce wrote:
>>> Oh the joys of Oracle character set support.
>>>
>>> See the dbd_rebind_ph() code in dbdimp.c. Especially from the line
>>> "csform = phs->csform;" onwards.
>>>
>>> You can explicitly set the csform and csid using bind_param(..., {
>>> ... })
[snip]
> stumbled on this thread while debugging a problem with full table scans
> binding utf8 strings into oracle queries...
>
> it seems the degraded performance is "expected"
>
> http://download-uk.oracle.com/docs/cd/B10501_01/server.920/a 96529/ch6.htm#1008756
>
> when you used csform SQLCS_NCHAR and the database column is CHAR,
> VARCHAR2 or CLOB. same is true of the reverse, using SQLCS_IMPLICIT
> when the column is NCHAR, NVARCHAR2, NCLOB.
>
> perhaps we need to rework CSFORM_IMPLIES_UTF8 so it doesn't return true
> for csform SQLCS_NCHAR if the columns are CHAR, VARCHAR2 or CLOB?

As I think about this some more, I realize that this sort of
introspection into the table structure may not be possible at the point
we're binding values. Merely reversing the preference won't work, as
SQLCS_IMPLICIT when using N* types is equally troublesome or worse.
Beyond performance concerns, the grid in the OCI docs mention data loss
for that case...

Perhaps it would work to have a ora_default_csform_behavior (or similar)
that could be set on the dbh and checked in dbd_rebind_ph(), so that
users of CHAR, VARCHAR2 and CLOB would not need to bind every utf8 value
explicitly. That strikes me as a bit clunky, but not as clunky as
recoding every bind_param() call to pass an explicit ora_csform value...

- mark

Re: Fun with UTF-8 and Oracle [PATCH]

am 05.12.2007 11:47:03 von hjp

--vEao7xgI/oilGqZ+
Content-Type: multipart/mixed; boundary="3lcZGd9BuhuYXNfi"
Content-Disposition: inline

--3lcZGd9BuhuYXNfi
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On 2007-09-04 14:20:20 +0100, Tim Bunce wrote:
> On Mon, Sep 03, 2007 at 07:20:42PM +0200, Peter J. Holzer wrote:
> > On 2007-09-03 13:03:37 +0100, Tim Bunce wrote:
> > > Oh the joys of Oracle character set support.
> > >=20
> > > See the dbd_rebind_ph() code in dbdimp.c. Especially from the line
> > > "csform =3D phs->csform;" onwards.
> > >=20
> > > You can explicitly set the csform and csid using bind_param(..., { ..=
})
> > >=20
> > > Have fun!
> >=20
> > I had :-). You gave me a push into the right direction, and I think I
> > found the problem:

[code deleted]

> > While it claimed to set csform=3DSQLCS_IMPLICIT, it was actually
> > setting it to SQLCS_NCHAR, and I guess that was what prevented the
> > use of the index on the varchar2 column. Sure enough, if I
> > explicitely set ora_csform to 1, it's fast. It is also fast, if I
> > set NLS_NCHAR to US7ASCII, so that it isn't used.=20
> >=20
> > Maybe the tests in line 1379 and 1381 should be reversed so that
> > SQLCS_IMPLICIT is preferred over SQLCS_NCHAR?
>=20
> Sounds very plausible.
[...]
> That's great, but why stop there? Now you seem to have identified the
> problem, how about patching the code instead ;-)

Sorry for the long delay. I wanted to test the patch properly first and
then forgot to post it. Now I stumbled upon the same problem on a
different server which reminded me that I never posted the patch. So
here it is.

(I couldn't think of a proper test case: Building a table with 4 billion
rows and then checking whether the query takes a few seconds or an hour
doesn't seem a good idea :-). Anyway, I've used it in production for
almost 3 months.)

hp

--=20
_ | Peter J. Holzer | If I wanted to be "academically correct",
|_|_) | Sysadmin WSR | I'd be programming in Java.
| | | hjp@wsr.ac.at | I don't, and I'm not.
__/ | http://www.hjp.at/ | -- Jesse Erlbaum on dbi-users

--3lcZGd9BuhuYXNfi
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="dbdimp-utf8.patch"
Content-Transfer-Encoding: quoted-printable

--- dbdimp.c.orig 2006-11-03 15:05:46.000000000 +0100
+++ dbdimp.c 2007-09-24 10:36:25.000000000 +0200
@@ -1376,15 +1376,15 @@
=20
if (!csform && SvUTF8(phs->sv)) {
/* try to default csform to avoid translation through non-unicode */
- if (CSFORM_IMPLIES_UTF8(SQLCS_NCHAR)) /* prefer NCHAR */
- csform =3D SQLCS_NCHAR;
- else if (CSFORM_IMPLIES_UTF8(SQLCS_IMPLICIT))
+ if (CSFORM_IMPLIES_UTF8(SQLCS_IMPLICIT)) /* prefer IMPLICIT */
csform =3D SQLCS_IMPLICIT;
+ else if (CSFORM_IMPLIES_UTF8(SQLCS_NCHAR))
+ csform =3D SQLCS_NCHAR;
/* else leave csform == 0 */
if (trace_level)
PerlIO_printf(DBILOGFP, " rebinding %s with UTF8 value %s", phs=
->name,
- (csform == SQLCS_NCHAR) ? "so setting csform=3DSQLCS_IMPLICIT" :
- (csform == SQLCS_IMPLICIT) ? "so setting csform=3DSQLCS_NCHAR" :
+ (csform == SQLCS_IMPLICIT) ? "so setting csform=3DSQLCS_IMPLICIT" :
+ (csform == SQLCS_NCHAR) ? "so setting csform=3DSQLCS_NCHAR" :
"but neither CHAR nor NCHAR are unicode\n");
}
=20

--3lcZGd9BuhuYXNfi--

--vEao7xgI/oilGqZ+
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHVoGnMdFfQa64PCwRAnM9AKCQ+Zq49yyjydMkJ1M8F+UY9AOpBwCe IALg
s57QVedMz/oN3xT1v5S5fyg=
=Itb7
-----END PGP SIGNATURE-----

--vEao7xgI/oilGqZ+--

Re: Fun with UTF-8 and Oracle [PATCH]

am 05.12.2007 13:24:19 von scoles

Thanks for that I will include that in the release and in the next RC

Peter J. Holzer wrote:
> On 2007-09-04 14:20:20 +0100, Tim Bunce wrote:
>
>> On Mon, Sep 03, 2007 at 07:20:42PM +0200, Peter J. Holzer wrote:
>>
>>> On 2007-09-03 13:03:37 +0100, Tim Bunce wrote:
>>>
>>>> Oh the joys of Oracle character set support.
>>>>
>>>> See the dbd_rebind_ph() code in dbdimp.c. Especially from the line
>>>> "csform = phs->csform;" onwards.
>>>>
>>>> You can explicitly set the csform and csid using bind_param(..., { ... })
>>>>
>>>> Have fun!
>>>>
>>> I had :-). You gave me a push into the right direction, and I think I
>>> found the problem:
>>>
>
> [code deleted]
>
>
>>> While it claimed to set csform=SQLCS_IMPLICIT, it was actually
>>> setting it to SQLCS_NCHAR, and I guess that was what prevented the
>>> use of the index on the varchar2 column. Sure enough, if I
>>> explicitely set ora_csform to 1, it's fast. It is also fast, if I
>>> set NLS_NCHAR to US7ASCII, so that it isn't used.
>>>
>>> Maybe the tests in line 1379 and 1381 should be reversed so that
>>> SQLCS_IMPLICIT is preferred over SQLCS_NCHAR?
>>>
>> Sounds very plausible.
>>
> [...]
>
>> That's great, but why stop there? Now you seem to have identified the
>> problem, how about patching the code instead ;-)
>>
>
> Sorry for the long delay. I wanted to test the patch properly first and
> then forgot to post it. Now I stumbled upon the same problem on a
> different server which reminded me that I never posted the patch. So
> here it is.
>
> (I couldn't think of a proper test case: Building a table with 4 billion
> rows and then checking whether the query takes a few seconds or an hour
> doesn't seem a good idea :-). Anyway, I've used it in production for
> almost 3 months.)
>
> hp
>
>
> ------------------------------------------------------------ ------------
>
> --- dbdimp.c.orig 2006-11-03 15:05:46.000000000 +0100
> +++ dbdimp.c 2007-09-24 10:36:25.000000000 +0200
> @@ -1376,15 +1376,15 @@
>
> if (!csform && SvUTF8(phs->sv)) {
> /* try to default csform to avoid translation through non-unicode */
> - if (CSFORM_IMPLIES_UTF8(SQLCS_NCHAR)) /* prefer NCHAR */
> - csform = SQLCS_NCHAR;
> - else if (CSFORM_IMPLIES_UTF8(SQLCS_IMPLICIT))
> + if (CSFORM_IMPLIES_UTF8(SQLCS_IMPLICIT)) /* prefer IMPLICIT */
> csform = SQLCS_IMPLICIT;
> + else if (CSFORM_IMPLIES_UTF8(SQLCS_NCHAR))
> + csform = SQLCS_NCHAR;
> /* else leave csform == 0 */
> if (trace_level)
> PerlIO_printf(DBILOGFP, " rebinding %s with UTF8 value %s", phs->name,
> - (csform == SQLCS_NCHAR) ? "so setting csform=SQLCS_IMPLICIT" :
> - (csform == SQLCS_IMPLICIT) ? "so setting csform=SQLCS_NCHAR" :
> + (csform == SQLCS_IMPLICIT) ? "so setting csform=SQLCS_IMPLICIT" :
> + (csform == SQLCS_NCHAR) ? "so setting csform=SQLCS_NCHAR" :
> "but neither CHAR nor NCHAR are unicode\n");
> }
>
>