Strange performance issue apparently caused by pg_stat timeout

am 03.06.2010 05:48:35 von Benjamin Krajmalnik

System is running PG 8.4.0 (I have been unable to upgrade because the
system needs to be up 24x7), FreeBSD 7.2 amd64, 8 cores, 16GB RAM.

Our application is a network monitoring system, so we are constantly
inserting vast amounts of data (server presently processes about 50
million transactions per day).

As digests of test points come in, they are stored in message queues on
a second server (running PG 8.4.4). A set of daemons process the
digests and insert the data into the main database residing on a second
server. Presently, the database has about 60GB of data.

A few days ago, I noticed that the data in the message queues on the
other server was getting backed up, and then after a few minutes it
would process and clear. This was a totally new behavior. Initially I
suspected deadlocks caused by background processes which create
materialized views, so I stopped those, yet the behavior continued.
Then I suspected server load, yet CPU utilization and load were fine (the
1-minute and 5-minute load averages were below 4), and iostat did not show
an overly busy disk subsystem (I had seen it at much higher utilization
levels on both the data and log partitions without any performance hits).

I then suspected network issues, so I checked the infrastructure (running
on Juniper Gig switches, of course non-blocking, and all of the port
information was clean).

Finally, I noticed that the logs had "pgstat wait timeout" warnings.
Initially I thought these were caused by our checking the running processes
via pgadmin from our office to the data center, yet even when I exited
pgadmin, the warnings were still there.

After further testing, I saw a correlation between the data getting
queued up and the "pgstat wait timeout" warnings. As data would begin
to queue up, I would see the warnings, and about a minute later it would
start to dequeue and get stored on the server.

I searched the archives and found some messages stating that this has
been observed before. The interesting thing is that nothing had changed on
the server when it started to manifest itself. I ran ANALYZE on the
entire database, hoping this might rectify the issue, but unfortunately to
no avail.

Any suggestions would be deeply appreciated - this behavior is
definitely not good.

Below is a snapshot from the log files:

2010-06-02 20:59:19 MDT WARNING: pgstat wait timeout

2010-06-02 21:10:42 MDT WARNING: pgstat wait timeout

2010-06-02 21:16:21 MDT WARNING: pgstat wait timeout

2010-06-02 21:21:23 MDT WARNING: pgstat wait timeout

2010-06-02 21:25:50 MDT WARNING: pgstat wait timeout

2010-06-02 21:35:20 MDT WARNING: pgstat wait timeout

2010-06-02 21:39:13 MDT WARNING: pgstat wait timeout

2010-06-02 21:45:32 MDT WARNING: pgstat wait timeout
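
A minimal sketch of how the spacing between these warnings could be measured, so that it can be compared against the times at which the queues back up. It only parses the eight log lines shown above; the function names are illustrative and not part of any PostgreSQL tooling.

```python
from datetime import datetime

# The log snapshot from this message, copied verbatim.
LOG_LINES = """\
2010-06-02 20:59:19 MDT WARNING: pgstat wait timeout
2010-06-02 21:10:42 MDT WARNING: pgstat wait timeout
2010-06-02 21:16:21 MDT WARNING: pgstat wait timeout
2010-06-02 21:21:23 MDT WARNING: pgstat wait timeout
2010-06-02 21:25:50 MDT WARNING: pgstat wait timeout
2010-06-02 21:35:20 MDT WARNING: pgstat wait timeout
2010-06-02 21:39:13 MDT WARNING: pgstat wait timeout
2010-06-02 21:45:32 MDT WARNING: pgstat wait timeout
"""


def warning_times(text, needle="pgstat wait timeout"):
    """Return the timestamps of all log lines containing `needle`."""
    times = []
    for line in text.splitlines():
        if needle in line:
            # The first two whitespace-separated fields are date and time;
            # the time zone field (MDT) is ignored since it is constant here.
            stamp = " ".join(line.split()[:2])
            times.append(datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"))
    return times


def gaps_in_seconds(times):
    """Return the number of seconds between each pair of consecutive timestamps."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


times = warning_times(LOG_LINES)
gaps = gaps_in_seconds(times)
print(gaps)
```

For this snapshot the gaps fall between roughly 4 and 11 minutes, so the warnings are frequent but not evenly spaced.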

Thanks in advance.



Re: Strange performance issue apparently caused by pg_stat

am 05.06.2010 16:19:51 von Anj Adu

We run a similar system with very large volumes.

I had similar queue build-ups (not related to pg_stat timeouts, though).

I was using rules for my partition management instead of triggers.

Changing from rules to triggers fixed the problem (the rules were
extremely slow).

Hope this helps you.


--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Re: Strange performance issue apparently caused by pg_stat

am 05.06.2010 16:20:57 von Anj Adu

To clarify about the rules: I was using rules on the tables that are
the entry point for insertion into the database.
