Seltsame Query-Vorschau/Ausführung

am 22.11.2010 21:34:49 von volker

Hallo,
ich habe bei einer eigentlich einfachen Abfrage eine äußerst seltsame
Erscheinung:

Ich habe zwei Tabellen:
eine Adress-Tabelle (ca. 125.000 Sätze)
> CREATE TABLE as400.ass_adresse
> (
> adressnr_pk int4 NOT NULL,
> anrede_fk int2,
> titel varchar(20),
> name varchar(61),
> strasse varchar(50),
> lkz varchar(3),
> plz varchar(6),
> ort varchar(50),
> -- und eine Reihe weiterer normaler Felder
> );
> ALTER TABLE as400.ass_adresse
> ADD CONSTRAINT ass_adresse_pkey PRIMARY KEY (adressnr_pk);
>
> CREATE INDEX ass_adresse_plz ON ass_adresse USING btree (plz);
und eine Tabelle mit Suchworten (ca. 650.000 Sätze)
> CREATE TABLE as400.adrwords
> (
> type_pk int2 NOT NULL,
> adressnr_pk int4 NOT NULL,
> word_pk varchar(50) NOT NULL
> );
> ALTER TABLE as400.adrwords
> ADD CONSTRAINT adrwords_pkey PRIMARY KEY (type_pk, adressnr_pk,
> word_pk);
>
> CREATE INDEX ass_adrwords_word ON adrwords USING btree (type_pk, word_p=
k);
in der einzelne (normalisierte) Worte aus den Felder Name, Straße und
Ort der Adressen der ersten Tabelle stehen.

Meine Abfrage lautet jetzt
> select a.adressnr_pk,
> COALESCE(a.name, '') AS name,
> COALESCE(a.strasse, '') AS strasse,
> COALESCE(a.plz, '') AS plz,
> COALESCE(a.ort, '') AS ort
> from as400.ass_adresse a
> where a.plz =3D '47589'
> OR a.adressnr_pk in (
> select adressnr_pk
> from as400.adrwords
> where type_pk =3D 1 AND word_pk IN ('goertz', 'haftungsbeschraenkt'=
,
> 'pflegeteam')
> OR type_pk =3D 2 AND word_pk IN ('kervenheimer')
> OR type_pk =3D 3 AND word_pk IN ('uedem'))
> ORDER BY adressnr_pk;
d.h. ich habe zwei mit OR verknüpfte Teilbedingungen, die darauf
hinauslaufen, dass alle Sätze eingeschlossen werden, bei denen entweder
die PLZ stimmt oder in Name, Straße oder Ort ein übereinstimmender
Suchbegriff gefunden wird.

Obwohl nur gut 40 Sätze als Ergebnis herauskommen und alles Relevante
indiziert ist, dauert die Query recht lange (mehrere Sekunden). Ein
EXPLAIN ANALYZE zeigt auch warum:
> Sort (cost=3D37024.87..37181.79 rows=3D62766 width=3D60) (actual
> time=3D957.211..957.308 rows=3D46 loops=3D1)
> Sort Key: a.adressnr_pk
> Sort Method: quicksort Memory: 22kB
> -> Seq Scan on ass_adresse a (cost=3D790.64..29769.14 rows=3D62766
> width=3D60) (actual time=3D47.745..956.781 rows=3D46 loops=3D1)
> Filter: (((plz)::text =3D '47589'::text) OR (hashed subplan))
> SubPlan
> -> Bitmap Heap Scan on adrwords (cost=3D55.40..790.11
> rows=3D212 width=3D4) (actual time=3D0.801..2.533 rows=3D46 loops=3D1)
> Recheck Cond: (((type_pk =3D 1) AND ((word_pk)::text =3D
> ANY ('{goertz,haftungsbeschraenkt,pflegeteam}'::text[]))) OR ((type_pk
> =3D 2) AND ((word_pk)::text =3D 'kervenheimer'::text)) OR ((type_pk =3D=
3)
> AND ((word_pk)::text =3D 'uedem'::text)))
> -> BitmapOr (cost=3D55.40..55.40 rows=3D212 width=3D0=
)
> (actual time=3D0.745..0.745 rows=3D0 loops=3D1)
> -> Bitmap Index Scan on ass_adrwords_word=20
> (cost=3D0.00..39.75 rows=3D170 width=3D0) (actual time=3D0.505..0.505 r=
ows=3D38
> loops=3D1)
> Index Cond: ((type_pk =3D 1) AND
> ((word_pk)::text =3D ANY
> ('{goertz,haftungsbeschraenkt,pflegeteam}'::text[])))
> -> Bitmap Index Scan on ass_adrwords_word=20
> (cost=3D0.00..9.78 rows=3D25 width=3D0) (actual time=3D0.105..0.105 row=
s=3D0
> loops=3D1)
> Index Cond: ((type_pk =3D 2) AND
> ((word_pk)::text =3D 'kervenheimer'::text))
> -> Bitmap Index Scan on ass_adrwords_word=20
> (cost=3D0.00..5.71 rows=3D18 width=3D0) (actual time=3D0.119..0.119 row=
s=3D8
> loops=3D1)
> Index Cond: ((type_pk =3D 3) AND
> ((word_pk)::text =3D 'uedem'::text))
> Total runtime: 957.668 ms

Der Analyzer ist der irrigen Meinung, es kämen mehr als 60.000 Sätze =
aus
der Query heraus; damit ist natürlich ein sequenzieller Scan schneller
als ein indizierter Zugriff.

Komisch ist nur, dass die beiden Teile des ORs getrennt
nur PLZ:
> select a.adressnr_pk,
> COALESCE(a.name, '') AS name,
> COALESCE(a.strasse, '') AS strasse,
> COALESCE(a.plz, '') AS plz,
> COALESCE(a.ort, '') AS ort, r.rechtsform_txt
> from as400.ass_adresse a
> where a.plz =3D '47589'
> ORDER BY adressnr_pk;
EXPLAIN ANALYZE:
> Sort (cost=3D134.62..134.70 rows=3D32 width=3D60) (actual
> time=3D97.428..97.444 rows=3D8 loops=3D1)
> Sort Key: adressnr_pk
> Sort Method: quicksort Memory: 17kB
> -> Bitmap Heap Scan on ass_adresse a (cost=3D8.72..133.82 rows=3D32
> width=3D60) (actual time=3D97.177..97.380 rows=3D8 loops=3D1)
> Recheck Cond: ((plz)::text =3D '47589'::text)
> -> Bitmap Index Scan on ass_adresse_plz (cost=3D0.00..8.71
> rows=3D32 width=3D0) (actual time=3D97.068..97.068 rows=3D8 loops=3D1)
> Index Cond: ((plz)::text =3D '47589'::text)
> Total runtime: 97.531 ms
und nur restliche Daten:
>> select a.adressnr_pk,
>> COALESCE(a.name, '') AS name,
>> COALESCE(a.strasse, '') AS strasse,
>> COALESCE(a.plz, '') AS plz,
>> COALESCE(a.ort, '') AS ort
>> from as400.ass_adresse a
>> where a.adressnr_pk in (
>> select adressnr_pk
>> from as400.adrwords
>> where type_pk =3D 1 AND word_pk IN ('goertz',
>> 'haftungsbeschraenkt', 'pflegeteam')
>> OR type_pk =3D 2 AND word_pk IN ('kervenheimer')
>> OR type_pk =3D 3 AND word_pk IN ('uedem'))
>> ORDER BY adressnr_pk;
EXPLAIN ANALYZE:
> Sort (cost=3D940.64..940.69 rows=3D18 width=3D60) (actual
> time=3D466.769..466.864 rows=3D46 loops=3D1)
> Sort Key: a.adressnr_pk
> Sort Method: quicksort Memory: 22kB
> -> Nested Loop (cost=3D790.64..940.26 rows=3D18 width=3D60) (actual
> time=3D47.404..466.351 rows=3D46 loops=3D1)
> -> HashAggregate (cost=3D790.64..790.82 rows=3D18 width=3D4)
> (actual time=3D3.447..3.635 rows=3D46 loops=3D1)
> -> Bitmap Heap Scan on adrwords (cost=3D55.40..790.11
> rows=3D212 width=3D4) (actual time=3D1.231..3.260 rows=3D46 loops=3D1)
> Recheck Cond: (((type_pk =3D 1) AND ((word_pk)::tex=
t
> =3D ANY ('{goertz,haftungsbeschraenkt,pflegeteam}'::text[]))) OR
> ((type_pk =3D 2) AND ((word_pk)::text =3D 'kervenheimer'::text)) OR
> ((type_pk =3D 3) AND ((word_pk)::text =3D 'uedem'::text)))
> -> BitmapOr (cost=3D55.40..55.40 rows=3D212 width=
=3D0)
> (actual time=3D1.126..1.126 rows=3D0 loops=3D1)
> -> Bitmap Index Scan on ass_adrwords_word=20
> (cost=3D0.00..39.75 rows=3D170 width=3D0) (actual time=3D0.753..0.753 r=
ows=3D38
> loops=3D1)
> Index Cond: ((type_pk =3D 1) AND
> ((word_pk)::text =3D ANY
> ('{goertz,haftungsbeschraenkt,pflegeteam}'::text[])))
> -> Bitmap Index Scan on ass_adrwords_word=20
> (cost=3D0.00..9.78 rows=3D25 width=3D0) (actual time=3D0.174..0.174 row=
s=3D0
> loops=3D1)
> Index Cond: ((type_pk =3D 2) AND
> ((word_pk)::text =3D 'kervenheimer'::text))
> -> Bitmap Index Scan on ass_adrwords_word=20
> (cost=3D0.00..5.71 rows=3D18 width=3D0) (actual time=3D0.184..0.184 row=
s=3D8
> loops=3D1)
> Index Cond: ((type_pk =3D 3) AND
> ((word_pk)::text =3D 'uedem'::text))
> -> Index Scan using ass_adresse_pkey on ass_adresse a=20
> (cost=3D0.00..8.29 rows=3D1 width=3D60) (actual time=3D9.801..10.040 ro=
ws=3D1
> loops=3D46)
> Index Cond: (a.adressnr_pk =3D adrwords.adressnr_pk)
> Total runtime: 467.172 ms
lediglich 32 Sätze bzw. 18 Sätze prognostizieren. Die Anzahl der
Ergebnissätze eines OR kann aber nicht größer sein, als die Summe d=
er
beiden Teilbedingungen.

Hat der Postgres-Analyzer hier vielleicht eine kleine Macke?

Wenn ich die vollständige Query als
> select a.adressnr_pk,
> COALESCE(a.name, '') AS name,
> COALESCE(a.strasse, '') AS strasse,
> COALESCE(a.plz, '') AS plz,
> COALESCE(a.ort, '') AS ort
> from as400.ass_adresse a
> where a.adressnr_pk in (
> select adressnr_pk
> from as400.adrwords
> where type_pk =3D 1 AND word_pk IN ('goertz', 'haftungsbeschraenkt'=
,
> 'pflegeteam')
> OR type_pk =3D 2 AND word_pk IN ('kervenheimer')
> OR type_pk =3D 3 AND word_pk IN ('uedem')
> union
> select adressnr_pk
> from ass_adresse
> where plz =3D '47589')
> ORDER BY adressnr_pk;
schreibe, ist das Ergebnis
> Nested Loop (cost=3D936.05..2957.09 rows=3D244 width=3D60) (actual
> time=3D2.501..6.672 rows=3D46 loops=3D1)
> -> Unique (cost=3D936.05..937.27 rows=3D244 width=3D4) (actual
> time=3D2.417..2.745 rows=3D46 loops=3D1)
> -> Sort (cost=3D936.05..936.66 rows=3D244 width=3D4) (actual
> time=3D2.412..2.520 rows=3D54 loops=3D1)
> Sort Key: adrwords.adressnr_pk
> Sort Method: quicksort Memory: 18kB
> -> Append (cost=3D55.40..926.37 rows=3D244 width=3D4)
> (actual time=3D0.617..2.247 rows=3D54 loops=3D1)
> -> Bitmap Heap Scan on adrwords=20
> (cost=3D55.40..790.11 rows=3D212 width=3D4) (actual time=3D0.613..1.703
> rows=3D46 loops=3D1)
> Recheck Cond: (((type_pk =3D 1) AND
> ((word_pk)::text =3D ANY
> ('{goertz,haftungsbeschraenkt,pflegeteam}'::text[]))) OR ((type_pk =3D
> 2) AND ((word_pk)::text =3D 'kervenheimer'::text)) OR ((type_pk =3D 3) =
AND
> ((word_pk)::text =3D 'uedem'::text)))
> -> BitmapOr (cost=3D55.40..55.40 rows=3D212
> width=3D0) (actual time=3D0.569..0.569 rows=3D0 loops=3D1)
> -> Bitmap Index Scan on
> ass_adrwords_word (cost=3D0.00..39.75 rows=3D170 width=3D0) (actual
> time=3D0.369..0.369 rows=3D38 loops=3D1)
> Index Cond: ((type_pk =3D 1) AND
> ((word_pk)::text =3D ANY
> ('{goertz,haftungsbeschraenkt,pflegeteam}'::text[])))
> -> Bitmap Index Scan on
> ass_adrwords_word (cost=3D0.00..9.78 rows=3D25 width=3D0) (actual
> time=3D0.114..0.114 rows=3D0 loops=3D1)
> Index Cond: ((type_pk =3D 2) AND
> ((word_pk)::text =3D 'kervenheimer'::text))
> -> Bitmap Index Scan on
> ass_adrwords_word (cost=3D0.00..5.71 rows=3D18 width=3D0) (actual
> time=3D0.075..0.075 rows=3D8 loops=3D1)
> Index Cond: ((type_pk =3D 3) AND
> ((word_pk)::text =3D 'uedem'::text))
> -> Bitmap Heap Scan on ass_adresse=20
> (cost=3D8.72..133.82 rows=3D32 width=3D4) (actual time=3D0.145..0.319 r=
ows=3D8
> loops=3D1)
> Recheck Cond: ((plz)::text =3D '47589'::text)
> -> Bitmap Index Scan on ass_adresse_plz=20
> (cost=3D0.00..8.71 rows=3D32 width=3D0) (actual time=3D0.113..0.113 row=
s=3D8
> loops=3D1)
> Index Cond: ((plz)::text =3D '47589'::t=
ext)
> -> Index Scan using ass_adresse_pkey on ass_adresse a=20
> (cost=3D0.00..8.26 rows=3D1 width=3D60) (actual time=3D0.069..0.073 row=
s=3D1
> loops=3D46)
> Index Cond: (a.adressnr_pk =3D adrwords.adressnr_pk)
> Total runtime: 6.968 ms
mit geplanten 244 Zeilen und gut 5 Millisekunden Ausführungszeit
deutlich dichter an der Wahrheit.

Habe ich hier irgendwo einen Denkfehler oder Postgres einen Analyse-Fehle=
r?

mfg Volker

PS:
- Die Beispiele stammen von einer Postgres 8.3.9 (Linux, i386) sind abe=
r
auch auf einer 8.4.4 mit annähernd gleichen Daten reproduzierbar
(die einzelnen Schritte des EXPLAIN unterscheiden sich ein wenig,
aber die viel zu hohe erwartete Satzanzahl, die sequenzielle Suche
und die damit verbundene lange Ausführungszeit bleiben).

- Ich habe schon früher einmal ein ähnliches Beispiel gehabt, in
dem ich ein 'feld =3D 5 or feld =3D 7 or feld =3D 23' in ein
'feld in (5, 7, 23)' umwandelte und dadurch die Ausführungszeit
deutlich verkürzte; auch wieder mit einer viel zu hohen
erwarteten Satzanzahl, die einen sequeziellen Scan verursachte.

- Das die Ausführungszeit des Analyzer-Statements mit 957 ms deutlich
unter der von mir genannten Gesamt-Ausführungszeit von mehr als fü=
nf
Sekunden liegt, kommt wohl von Caching-Effekten. Jedenfalls egal ob
1 oder 5 Sekunden: Die Analyse des Ursprungs-Statements ist falsch.=20

--=20
Volker Böhm Tel.: +49 4141 981152 www.vboehm.de
Voßkuhl 5 Fax: +49 4141 981154
D-21682 Stade mailto:volker@vboehm.de

--=20
Sent via pgsql-de-allgemein mailing list (pgsql-de-allgemein@postgresql.o=
rg)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-de-allgemein