Shutdown fails with both "fast" and "immediate"

on 12.05.2010 16:22:14 by David Schnur

I develop an app that uses a back-end Postgres database, currently 8.3.9.
The database is started when the app starts up, and stopped when it shuts
down. Shutdown uses pg_ctl with -m fast, and waits two minutes for the
process to complete. If it doesn't, it tries -m immediate, and waits two
more minutes before logging an error and giving up.
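
The escalation described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the application's actual code: `pg_ctl_stop`, `pg_ctl_running`, and `stop_with_escalation` are hypothetical names, pg_ctl is assumed to be on the PATH, and `pg_ctl status` is assumed to exit with 0 only while the server is running.

```python
import subprocess
import time


def pg_ctl_stop(datadir, mode):
    """Ask pg_ctl to stop the server without waiting (-W) for it to finish."""
    subprocess.run(["pg_ctl", "stop", "-D", datadir, "-m", mode, "-W"], check=False)


def pg_ctl_running(datadir):
    """Assumption: 'pg_ctl status' exits with 0 only while the server is up."""
    result = subprocess.run(
        ["pg_ctl", "status", "-D", datadir],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def stop_with_escalation(request_stop, is_running, wait_seconds=120,
                         poll_seconds=1.0, sleep=time.sleep):
    """Try 'fast', then 'immediate'; return the mode that worked, or None."""
    for mode in ("fast", "immediate"):
        request_stop(mode)
        waited = 0.0
        while waited < wait_seconds:
            if not is_running():
                return mode
            sleep(poll_seconds)
            waited += poll_seconds
        if not is_running():
            return mode
    return None  # both modes timed out; the caller logs an error and gives up
```

A caller might wire it up as `stop_with_escalation(lambda m: pg_ctl_stop(datadir, m), lambda: pg_ctl_running(datadir))`. Passing the two callables in keeps the timing logic testable without a real server.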

One user, on OSX 10.5.8, has a script that stops the app each morning, to
upgrade to the newest build. In his case, both the fast and immediate
shutdowns time out, and Postgres continues running for at least 2-4 hours.
At that point he brings up the terminal to kill all the back-ends manually,
so we haven't seen it finish shutting down on its own yet. It is in fact
shutting down, because all queries fail with the 'database system is
shutting down' error.

The query running during this time is a DELETE that runs as part of the
application's daily maintenance. The size of the DELETE varies, and in his
case happened to be unusually large one day, which is apparently what
triggered the problem. Since the DELETE never gets a chance to finish, the
problem recurs every morning.

I'll obviously need to deal with that query, but I'm concerned that Postgres
is unable to interrupt it. Why might this be happening? Thanks,

David

Re: Shutdown fails with both "fast" and "immediate"

on 12.05.2010 16:32:33 by Tom Lane

David Schnur writes:
> I develop an app that uses a back-end Postgres database, currently 8.3.9.
> The database is started when the app starts up, and stopped when it shuts
> down. Shutdown uses pg_ctl with -m fast, and waits two minutes for the
> process to complete. If it doesn't, it tries -m immediate, and waits two
> more minutes before logging an error and giving up.

Hm, does it shut down properly if you use -m immediate immediately
instead of trying fast first?

regards, tom lane

--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Re: Shutdown fails with both "fast" and "immediate"

on 12.05.2010 16:45:25 by Kenneth Marshall

On Wed, May 12, 2010 at 10:22:14AM -0400, David Schnur wrote:
> I develop an app that uses a back-end Postgres database, currently 8.3.9.
> The database is started when the app starts up, and stopped when it shuts
> down. Shutdown uses pg_ctl with -m fast, and waits two minutes for the
> process to complete. If it doesn't, it tries -m immediate, and waits two
> more minutes before logging an error and giving up.
>
> One user, on OSX 10.5.8, has a script that stops the app each morning, to
> upgrade to the newest build. In his case, both the fast and immediate
> shutdowns time out, and Postgres continues running for at least 2-4 hours.
> At that point he brings up the terminal to kill all the back-ends manually,
> so we haven't seen it finish shutting down on its own yet. It is in fact
> shutting down, because all queries fail with the 'database system is
> shutting down' error.
>
> The query running during this time is a DELETE that runs as part of the
> application's daily maintenance. The size of the DELETE varies, and in his
> case happened to be unusually large one day, which is apparently what
> triggered the problem. Since the DELETE never gets a chance to finish, the
> problem recurs every morning.
>
> I'll obviously need to deal with that query, but I'm concerned that Postgres
> is unable to interrupt it. Why might this be happening? Thanks,
>
> David

In many cases, I/O requests are not interruptible until they complete,
and a DELETE causes a lot of I/O. Check whether the processes are in
device wait (state D in top or ps). The solution is to fix the DELETE processing.
One option would be to batch it into smaller numbers of rows, which should
allow the quit to squeeze in between batches.
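
A rough sketch of the batching idea, under several assumptions: the table name `events`, the key column `id`, and the helper `delete_in_batches` are all hypothetical, and Python's built-in sqlite3 stands in for Postgres only so the example is self-contained. The same statement shape (a DELETE driven by a LIMITed subquery on the key) should carry over to a Postgres driver, but that is not tested here.

```python
import sqlite3


def delete_in_batches(conn, table, where, batch_size=1000, key="id"):
    """Delete rows matching `where` in batches, committing after each one.

    `table`, `where`, and `key` are trusted, hypothetical identifiers; they are
    interpolated directly into the SQL, so never pass user input here.
    """
    sql = (
        f"DELETE FROM {table} WHERE {key} IN "
        f"(SELECT {key} FROM {table} WHERE {where} LIMIT {int(batch_size)})"
    )
    total = 0
    while True:
        cur = conn.cursor()
        cur.execute(sql)
        deleted = cur.rowcount
        conn.commit()  # each batch is its own short transaction
        cur.close()
        if deleted <= 0:
            return total
        total += deleted


if __name__ == "__main__":
    # Demo against an in-memory database (a stand-in, not Postgres).
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, old INTEGER)")
    demo.executemany("INSERT INTO events (old) VALUES (?)", [(1,)] * 25 + [(0,)] * 5)
    demo.commit()
    print(delete_in_batches(demo, "events", "old = 1", batch_size=10))
```

Because every batch commits separately, work already done is not lost if the server is stopped partway through, so the next maintenance run has less left to delete.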

Cheers,
Ken

Re: Shutdown fails with both "fast" and "immediate"

on 12.05.2010 18:14:16 by Scott Marlowe

On Wed, May 12, 2010 at 8:45 AM, Kenneth Marshall wrote:
> On Wed, May 12, 2010 at 10:22:14AM -0400, David Schnur wrote:
>> I develop an app that uses a back-end Postgres database, currently 8.3.9.
>> The database is started when the app starts up, and stopped when it shuts
>> down. Shutdown uses pg_ctl with -m fast, and waits two minutes for the
>> process to complete. If it doesn't, it tries -m immediate, and waits two
>> more minutes before logging an error and giving up.
>>
>> One user, on OSX 10.5.8, has a script that stops the app each morning, to
>> upgrade to the newest build. In his case, both the fast and immediate
>> shutdowns time out, and Postgres continues running for at least 2-4 hours.
>> At that point he brings up the terminal to kill all the back-ends manually,
>> so we haven't seen it finish shutting down on its own yet. It is in fact
>> shutting down, because all queries fail with the 'database system is
>> shutting down' error.
>>
>> The query running during this time is a DELETE that runs as part of the
>> application's daily maintenance. The size of the DELETE varies, and in his
>> case happened to be unusually large one day, which is apparently what
>> triggered the problem. Since the DELETE never gets a chance to finish, the
>> problem recurs every morning.
>>
>> I'll obviously need to deal with that query, but I'm concerned that Postgres
>> is unable to interrupt it. Why might this be happening? Thanks,
>>
>> David
>
> In many cases, I/O requests are not interruptible until they complete
> and DELETE causes a lot of I/O. Check to see if the processes are in
> device-wait, D in top or ps. The solution is to fix the DELETE processing.
> One option would be to batch it in smaller numbers of rows which should
> allow the quit to squeeze in between one of the batches.

Also see if truncate can be used here or not.

Re: Shutdown fails with both "fast" and "immediate"

on 12.05.2010 19:03:57 by David Schnur

@Julio Leyva: The table does get vacuumed at the end of the maintenance
tasks; in this case it's not making it that far, of course.

@Scott Marlowe: Truncate isn't an option here, unfortunately.

I'm less concerned with the particular query than with the general question
of when a shutdown could hang like this. I expected this to be possible
when using -m fast, but my understanding was that -m immediate really forced
termination.

I'm setting up a test on the user's machine where it will try immediate
first, rather than fast.

David

Re: Shutdown fails with both "fast" and "immediate"

on 12.05.2010 19:16:13 by Tom Lane

David Schnur writes:
> I'm less concerned with the particular query than with the general question
> of when a shutdown could hang like this. I expected this to be possible
> when using -m fast, but my understanding was that -m immediate really forced
> termination.

Yeah, it's supposed to. The sequence is: pg_ctl -m immediate sends
SIGQUIT to the postmaster, which in turn sends SIGQUIT to all its child
processes, and their SIGQUIT interrupt handlers just immediately exit().
I was thinking earlier that there might be a bug in the postmaster state
machine that prevented it from sending SIGQUIT if it had already
received SIGINT (-m fast), but a look at the sources doesn't support
that theory. The only obvious theory at this point is that the backend
is stuck in some uninterruptible kernel call, but it's hard to imagine
what.
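
As a toy illustration of that sequence (not the actual postmaster code), the sketch below forks a few child processes whose SIGQUIT handler exits at once, then signals them all and collects their exit codes. The names `quickdie`, `spawn_backend`, and `immediate_shutdown` and the exit status 2 are choices made for this sketch; it assumes a POSIX system where `os.fork` is available.

```python
import os
import signal
import time


def quickdie(signum, frame):
    """Backend-style SIGQUIT handler: exit at once, with no cleanup."""
    os._exit(2)  # exit status 2 is an arbitrary choice for this sketch


def spawn_backend():
    """Fork a child that installs the handler, then idles until signalled."""
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:  # child process
        try:
            os.close(ready_r)
            signal.signal(signal.SIGQUIT, quickdie)
            os.write(ready_w, b"1")  # tell the parent the handler is installed
            while True:
                time.sleep(1)
        finally:
            os._exit(1)  # never fall back into the parent's code path
    os.close(ready_w)
    os.read(ready_r, 1)  # wait until the child is ready
    os.close(ready_r)
    return pid


def immediate_shutdown(child_pids):
    """Postmaster-style step: SIGQUIT every child, then collect exit codes."""
    for pid in child_pids:
        os.kill(pid, signal.SIGQUIT)
    codes = {}
    for pid in child_pids:
        _, status = os.waitpid(pid, 0)
        codes[pid] = os.WEXITSTATUS(status) if os.WIFEXITED(status) else None
    return codes
```

In this toy the children are merely sleeping, so the handler runs as soon as the signal arrives. A process blocked inside an uninterruptible kernel call would not get to run its handler until that call returned, which is the scenario the theory above describes.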

Is the postmaster still there after -m immediate, or does it quit?
If it's still there, maybe there's some problem in the earlier part
of the sequence.

A gdb stack trace from whichever processes are still there after -m
immediate could be informative. Another thing you could try is a
manual "kill -QUIT pid" on the uncooperative backend(s).

regards, tom lane

Re: Shutdown fails with both "fast" and "immediate"

on 12.05.2010 22:03:44 by Donald Fraser

----- Original Message -----
Subject: Re: [ADMIN] Shutdown fails with both 'fast' and 'immediate'

> David Schnur writes:
>> I'm less concerned with the particular query than with the general question
>> of when a shutdown could hang like this. I expected this to be possible
>> when using -m fast, but my understanding was that -m immediate really forced
>> termination.
>
> Yeah, it's supposed to. The sequence is pg_ctl -m immediate sends
> SIGQUIT to the postmaster, which in turn sends SIGQUIT to all its child
> processes, and their SIGQUIT interrupt handlers just immediately exit().
> I was thinking earlier that there might be a bug in the postmaster state
> machine that prevented it from sending SIGQUIT if it had already
> received SIGINT (-m fast), but a look at the sources doesn't support
> that theory. The only obvious theory at this point is that the backend
> is stuck in some uninterruptible kernel call, but it's hard to imagine
> what.
>
> Is the postmaster still there after -m immediate, or does it quit?
> If it's still there, maybe there's some problem in the earlier part
> of the sequence.
>
> A gdb stack trace from whichever processes are still there after -m
> immediate could be informative. Another thing you could try is a
> manual "kill -QUIT pid" on the uncooperative backend(s).
>
> regards, tom lane

Just to add some comments on a similar observation.
We have a restore script that restores a database from a backup (pg_dump).
The only user connected during the restore is postgres on localhost. In
the script we use
pg_ctl stop -D $PGDATA -m immediate
to stop the database, and we have noted that this doesn't always work.
We recently upgraded from 8.1 to 8.3 and had never noticed this issue
before the upgrade.
The main difference in our configuration between 8.1 and 8.3 is that we now
have "autovacuum = on".
So for what it's worth, the DELETE might be a red herring.
Given our circumstances, I would be more inclined to think the issue has
something to do with autovacuum, as there are no DELETE statements in our
restore procedure and we don't execute pg_ctl stop until all statements are
complete.

Regards
Donald Fraser
