ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such file

ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such file

am 28.03.2010 19:41:41 von Mike Williams

Hi guys,

We upgraded from 8.3.5 to 8.3.9 yesterday, after an extended period of
testing, to 8.3.10 today in hope of fixing it, and we're getting the following
error on occasion.

ERROR: could not open segment 1 of relation 1663/743352/743420 (target block
2171336): No such file or directory
CONTEXT: SQL statement "SELECT day FROM min_stats WHERE day<= $1 AND
unique_ip=-1 ORDER BY day DESC LIMIT 1"
PL/pgSQL function "update_unique_ips" line 13 at SQL statement


update_unique_ips() is called once per minute, but we get anywhere from one
error per minute, to one error per hour.

Relation 743420 in database 743352 is 441270272 bytes long and growing by a
few thousand bytes here and there.
Obviously, as we've gone from 8.3.9 to 8.3.10, we're restarted postgres a few
times. We have also REINDEXd each table, the database, dropped and re-created
all the indexes.
I've even pg_dump'd the database, and loaded it into a fresh database. The
relation numbers changed, but the 1663 bit remained the same.

Best I can tell it's this bit of code from the beginning of
update_unique_ips() that's doing it, but it seems completely innocuous to me.


CREATE FUNCTION update_unique_ips() RETURNS timestamp without time zone
AS $$

DECLARE
last_day timestamp;
day_for_update timestamp;

BEGIN
-- declare last_day
SELECT INTO last_day day-interval '30 min' as day
FROM min_stats
ORDER BY day DESC
LIMIT 1;

SELECT INTO day_for_update day
FROM min_stats
WHERE day<=last_day AND unique_ip=-1
ORDER BY day DESC
LIMIT 1;



Could anyone help us out here please?
Thanks

--
Mike Williams

--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Re: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such f

am 28.03.2010 20:20:27 von Tom Lane

Mike Williams writes:
> We upgraded from 8.3.5 to 8.3.9 yesterday, after an extended period of
> testing, to 8.3.10 today in hope of fixing it, and we're getting the following
> error on occasion.

> ERROR: could not open segment 1 of relation 1663/743352/743420 (target block
> 2171336): No such file or directory
> CONTEXT: SQL statement "SELECT day FROM min_stats WHERE day<= $1 AND
> unique_ip=-1 ORDER BY day DESC LIMIT 1"
> PL/pgSQL function "update_unique_ips" line 13 at SQL statement

> update_unique_ips() is called once per minute, but we get anywhere from one
> error per minute, to one error per hour.

> Relation 743420 in database 743352 is 441270272 bytes long and growing by a
> few thousand bytes here and there.
> Obviously, as we've gone from 8.3.9 to 8.3.10, we're restarted postgres a few
> times. We have also REINDEXd each table, the database, dropped and re-created
> all the indexes.
> I've even pg_dump'd the database, and loaded it into a fresh database. The
> relation numbers changed, but the 1663 bit remained the same.

Wow. The symptom seems to be what you'd get from a corrupt tuple
pointer in an index, but it's hard to believe that such a thing would
persist over reindexes and even reloads. Is the "target block" number
always the same, or does that change? Can you reproduce the problem on
a different machine? What platform are you running on?

regards, tom lane

--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Re: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such f

am 28.03.2010 21:29:52 von Mike Williams

On Sun, March 28, 2010 7:20 pm, Tom Lane wrote:
> Mike Williams writes:
>> ERROR: could not open segment 1 of relation 1663/743352/743420 (targe=
t
>> block
>> 2171336): No such file or directory
>> CONTEXT: SQL statement "SELECT day FROM min_stats WHERE day<=3D $1 A=
ND
>> unique_ip=3D-1 ORDER BY day DESC LIMIT 1"
>> PL/pgSQL function "update_unique_ips" line 13 at SQL statement


> Wow. The symptom seems to be what you'd get from a corrupt tuple
> pointer in an index, but it's hard to believe that such a thing would
> persist over reindexes and even reloads. Is the "target block" number
> always the same, or does that change? Can you reproduce the problem on
> a different machine? What platform are you running on?

While it's always pleasing to find an error condition that makes an exper=
t
go "Wow", it is also quite troubling! :)

The "target block" stays the same for extended periods, but does change.
Oddly it's been quite for a while now, half an hour or so ago it had been
going nearly every minute for a good hour. For the last 30 times the bloc=
k
was 11387925, before then 6634333 once, then 15520041 for 4 times, 666719=
0
before that for a while.
And just as I'm about to hit send, 3 more errors. All block 14528541.

Where does the 1663 number come from?
/743352/743420 is a real file, but I don't have any file 16=
63.

The server is Gentoo, and I don't have an identical server to use. This
machines HA partner is staying at 8.3.5 until we're completely and
entirely sure 8.3.10 works right.

--=20
Mike Williams


--=20
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Re: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such f

am 28.03.2010 21:49:50 von Tom Lane

"Mike Williams" writes:
> The "target block" stays the same for extended periods, but does change.

Fascinating. Don't know what it means, but it's fascinating.

> Where does the 1663 number come from?
> /743352/743420 is a real file, but I don't have any file 1663.

1663 is the OID of the pg_default tablespace; that's never going to
change unless you move the table to a non-default tablespace.

> The server is Gentoo, and I don't have an identical server to use. This
> machines HA partner is staying at 8.3.5 until we're completely and
> entirely sure 8.3.10 works right.

I wasn't really suggesting that you need an *identical* machine. If
this is a software problem, as seems moderately likely at this point,
it should be possible to reproduce it on a different machine. Even more
to the point, it's going to be difficult to find the problem without
tracing through the fault case with a debugger --- which isn't something
you're going to want to do on a production server. Have you got a test
machine you can load the data onto? Preferably one you could give a PG
developer access to?



regards, tom lane

--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Re: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such f

am 29.03.2010 00:27:33 von Mike Williams

On Sun, March 28, 2010 8:49 pm, Tom Lane wrote:
>> The server is Gentoo, and I don't have an identical server to use. Thi=
s
>> machines HA partner is staying at 8.3.5 until we're completely and
>> entirely sure 8.3.10 works right.
>
> I wasn't really suggesting that you need an *identical* machine. If
> this is a software problem, as seems moderately likely at this point,
> it should be possible to reproduce it on a different machine. Even mor=
e
> to the point, it's going to be difficult to find the problem without
> tracing through the fault case with a debugger --- which isn't somethin=
g
> you're going to want to do on a production server. Have you got a test
> machine you can load the data onto? Preferably one you could give a PG
> developer access to?

We do have a test system, and the only significant difference from
production is the kernel.
Test are para-virt VMs with "regular" kernels, production are real
machines with hardened kernels (grsec+pax).
I got prod onto a regular kernel and nothing in nearly an hour.

8.3.5 has worked forever and a day on this type of setup.

--=20
Mike Williams


--=20
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Re: ERROR: could not open segment 1 of relation1663/743352/743420 (target block 6407642): No such fi

am 29.03.2010 15:49:36 von ALEXANDER JOSE

--_73568e4a-76b3-4740-b480-eaf133a3b185_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable


Good=20
morning

You=20
have to revise the base folder to create postgres when you perform the=20
initial installation and verify the relation fallanado that I spend And=20
it turned out he had a folder in C: \ Program Files \ PostgreSQL \ 8.2 \
data \ base folder exists the number 1=2C this=20
is where the file belongs to the database that lost the connection does=20
not exist=2C determine why and try to copy again=2C delete the database tha=
t
you are creating and then think again.


Alexander Angel
Venezuela
=20
____________________________________________________________ _____
Invite your mail contacts to join your friends list with Windows Live Space=
s. It's easy!
http://spaces.live.com/spacesapi.aspx?wx_action=3Dcreate&wx_ url=3D/friends.=
aspx&mkt=3Den-us=

--_73568e4a-76b3-4740-b480-eaf133a3b185_
Content-Type: text/html; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable






Goo=
d=20
morning

5)=3B" title=3D"Tienes que revisar la carpeta base que crea postgres=20
cuando realiza la primera instalaci=F3n y verificar la relacion que esta=20
fallanado=2C me paso y resulto ser que habia una carpeta dentro de=20
C:\Program Files\PostgreSQL\8.2\data\base existe una carpeta con">You=20
have to revise the base folder to create postgres when you perform the=20
initial installation and verify the relation fallanado that I spend And=20
it turned out he had a folder in C: \ Program Files \ PostgreSQL \ 8.2 \
data \ base folder exists
=2C 255=2C 255)=3B" title=3D"el numero 1=2C es alli en donde el archivo=20
perteneciente a la base de datos que perdio la relaci=F3n no existe,
determina cual es y trata de copiarla de nuevo=2C elimina esa base de=20
datos que estas creando y luego la creas de nuevo.">the number 1=2C this=20
is where the file belongs to the database that lost the connection does=20
not exist=2C determine why and try to copy again=2C delete the database tha=
t
you are creating and then think again.


Alexander Angel
Venez=
uela


Invite your mail contacts to =
join your friends list with Windows Live Spaces. It's easy! ://spaces.live.com/spacesapi.aspx?wx_action=3Dcreate&wx_url= 3D/friends.aspx=
&mkt=3Den-us' target=3D'_new'>Try it!

=

--_73568e4a-76b3-4740-b480-eaf133a3b185_--

Re: ERROR: could not open segment 1 of relation

am 29.03.2010 21:26:08 von Alex Hunsaker

On Sun, Mar 28, 2010 at 16:27, Mike Williams wrote:
> Test are para-virt VMs with "regular" kernels, production are real
> machines with hardened kernels (grsec+pax).

Ive seen this error on a few boxes around here, using a non grsec
kernel fixes it. I never bothered to report it because I cant
reproduce it and it goes away without grsec. (2.6.30-32 IIRC, have
not tired the latest 33). Interestingly I have the *exact* same
hardware and setup (kernel/userspace/postgres etc) on a test machine
that I've never been able to trigger it on.

--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Re: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such

am 30.03.2010 12:16:28 von Mike Williams

On Monday 29 March 2010 20:26:08 Alex Hunsaker wrote:
> > Test are para-virt VMs with "regular" kernels, production are real
> > machines with hardened kernels (grsec+pax).
>
> Ive seen this error on a few boxes around here, using a non grsec
> kernel fixes it. I never bothered to report it because I cant
> reproduce it and it goes away without grsec. (2.6.30-32 IIRC, have
> not tired the latest 33). Interestingly I have the *exact* same
> hardware and setup (kernel/userspace/postgres etc) on a test machine
> that I've never been able to trigger it on.

Thanks Alex, good to know I've not screwed up the kernel somehow.

I've been using 2.6.32 with grsecurity-2.1.14-2.6.32.9-201002231820 applied.

--
Mike Williams

--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Re: ERROR: could not open segment 1 of relation

am 30.03.2010 22:28:09 von Alex Hunsaker

On Tue, Mar 30, 2010 at 04:16, Mike Williams wrote:
> Thanks Alex, good to know I've not screwed up the kernel somehow.
>
> I've been using 2.6.32 with grsecurity-2.1.14-2.6.32.9-201002231820 applied.

Looks like the first instance I had of this problem was with
2.6.31.1-rc1-grsec. I know I tried 2.6.32-grsec and various
2.6.32.X-grsecs but all those had this issue at some point. Currently
im on a mostly stock 2.6.33.1 with no problems. I have not had the
nerve to try a -grsec kernel on it again.

For reference here are the errors I got:

could not open segment 3 of relation base/4440720/8003730

COPY public.page_loads (cgi, content_length, date_created, defunct,
host, ip, page_load_id, protocol, proxy_ip, referrer, request_method,
sessionid, url, user_id, audit_tid, user_agent_id, action, server) TO
'/tmp/blah.sql';
ERROR: invalid memory alloc request size 18446744073709551613


There were more could not open segment errors... but I seem to have lost them.

Normally I would think the above is corrupt data, but it would
sometimes work. It *always* worked on the non grsec kernel. So
instead it smells like bad ram, well its got ecc ram and survived
multiple runs of memtest, memtest86+ various versions. [ Yeah I know
people including me have seen ram that passes all that and is still
bad ]

Since you are having similar problems with a -grsec kernel sounds like
there might be some kind of memory corruption bug with it. I would
recommend trying a stock kernel and seeing if the problem goes away.
I also think the general attitude here is if you run crazy security
patches you get to keep both pieces. :)

Another fact that seemed to point to bad ram or some kind of kernel
corruption was trying to find the bad row COPY reported above:

SELECT count(*) from (select * from page_loads order by page_load_id
desc limit 937980) as foo;
ERROR: could not open segment 3 of relation base/4440720/8003730
(target block 4680336): No such file or directory

SELECT count(*) from ( select * from page_loads order by page_load_id
desc limit 937970) as foo;
count
--------
937970

<70-79 snipped all worked>

SELECT count(*) from (select * from page_loads order by page_load_id
desc limit 937979) as foo;
count
--------
937979

-- Uhh this was just broken...
SELECT count(*) from (select * order by page_load_id desc limit 937980) as foo;
count
--------
937980


I thought I had some stacktraces... but they are not in my notes... I
do remember tracing through them and coming to the conclusion that its
most likely some kind of kernel bug. (That error can only happen if we
try to open a file that does not exist, but we can only get that far
if the file existed or some such) Sorry Im a bit hazy this was back
in September.

--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin