Fix corrupt pg_toast table?

am 30.07.2009 23:48:08 von Michael Clark

--00151750e8b65f351f046ff342c6
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Hello everyone.
Sorry if this has been covered already, I haven't had any luck searching and
finding a solution.

I have a database which appears to have some corruption in a toast value.

When I select a certain table, I get the following error:
ERROR: invalid page header in block 984 of relation "pg_toast_17341_index"

I am new enough to PG that I am having some troubles navigating around the
system tables and being able to do anything useful.

This error seems to be preventing a vacuum, reindex, and PGCOPY from going
through on the database.

I wrote a little script to fetch this table row by row until I hit an error,
and I think this has narrowed it down to a specific row, although trying to
fetch a row beyond this also gave this error (I am hoping that is a side
effect of the problem and that not the rest of the table is lost).

What is the best course of action to try and fix this and recover the data?
I am not concerned with recovering the particular row that appears to be
having problems, but the rest of the data in the table is important.

Thanks a lot for any help on this,
Michael.

--00151750e8b65f351f046ff342c6
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Hello everyone.

Sorry if this has been covered already, I haven't h=
ad any luck searching and finding a solution.

I ha=
ve a database which appears to have some corruption in a toast value.

When I select a certain table, I get the following erro=
r:

ERROR: invalid page header in block 984 of relation "pg_t=
oast_17341_index"

I am new enough to PG that =
I am having some troubles navigating around the system tables and being abl=
e to do anything useful.

This error seems to be preventing a vacuum, reindex, an=
d PGCOPY from going through on the database.

I wro=
te a little script to fetch this table row by row until I hit an error, and=
I think this has narrowed it down to a specific row, although trying to fe=
tch a row beyond this also gave this error (I am hoping that is a side effe=
ct of the problem and that not the rest of the table is lost).

What is the best course of action to try and fix this a=
nd recover the data? =A0I am not concerned with recovering the particular r=
ow that appears to be having problems, but the rest of the data in the tabl=
e is important.

Thanks a lot for any help on this,

Michael. div>

--00151750e8b65f351f046ff342c6--

Re: Fix corrupt pg_toast table?

am 30.07.2009 23:59:08 von Tom Lane

Michael Clark writes:
> I have a database which appears to have some corruption in a toast value.

> When I select a certain table, I get the following error:
> ERROR: invalid page header in block 984 of relation "pg_toast_17341_index"

If it's only that index that has gotten damaged, REINDEX will fix it, eg
reindex index pg_toast.pg_toast_17341_index;

The bigger concern is whether there is other damage. Have you had any
system crashes, indications of flaky hardware, etc on that machine?

regards, tom lane

--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Re: Fix corrupt pg_toast table?

am 31.07.2009 21:01:16 von Michael Clark

--0015175cb0587c7e7e0470050ba2
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Hello again.

I have an update, hoping someone can steer me in the right direction here.

After receiving the "could not access transaction" error mentioned in my
previous email I did some more digging and found that people have had
success resolving this issue using pg_resetxlog.

I gave this a try, first without specifying any options, which did not
resolve anything (it did not give an error on the command line). I then
read up on the command and passed a value for every swtich to set new
values, and again this had no effect. It did not give an error on the
command line, but trying to reset the index gives the same error. The
response I get at the command line after running the pg_resetxlog command
is: Transaction log reset

Here is the error I get when trying to reset this particular index:
2009-07-31 12:58:07.503 EDT [Test] - ERROR: could not access status of
transaction 3839923882
2009-07-31 12:58:07.503 EDT [Test] - DETAIL: Could not open file
"pg_clog/0E4E": No such file or directory.
2009-07-31 12:58:07.503 EDT [Test] - STATEMENT: reindex index
pg_toast.pg_toast_17431_index;

Here is the final pg_resetxlog command I tried to resolve the issue:
pg_resetxlog -f -x 0x100000 -m 0x10000 -o 0x10000 -l 0x1,0x1,0xCD ./

Here are the filenames for the files in the various directories that led me
to these switch values, as per the docs on pg_resetxlog:
-x switch - pg_clog folder had 1 file named: 0000
-m switch - pg_multixact/offsets had 1 file named: 0000
-o switch - pg_multixact/members had 1 file named: 0000
-l swtich - pg_xlog had 7 files, the one with the biggest named is:
0000000100000000000000CC

After running the pg_resetxlog command above I get the exact same error
about pg_clog/0E4E missing.

Does anyone have any other ideas about how to solve this?

Thanks,
Michael.

On Thu, Jul 30, 2009 at 6:08 PM, Michael Clark wrote:

> Thanks for the reply!
> On Thu, Jul 30, 2009 at 5:59 PM, Tom Lane wrote:
>
>> Michael Clark writes:
>> > I have a database which appears to have some corruption in a toast
>> value.
>>
>> > When I select a certain table, I get the following error:
>> > ERROR: invalid page header in block 984 of relation
>> "pg_toast_17341_index"
>>
>> If it's only that index that has gotten damaged, REINDEX will fix it, eg
>> reindex index pg_toast.pg_toast_17341_index;
>>
>
>
> I tried this, and might have exposed another issue.
> Now I am getting:
> ERROR: could not access status of transaction 3839923882
> DETAIL: could not open file "pg_clog/0E4E": No such file or directory.
>
> I checked the pg_clog folder, and there is only a 0000 file.
>
> Any ideas on this one?
>
>
>> The bigger concern is whether there is other damage. Have you had any
>> system crashes, indications of flaky hardware, etc on that machine?
>>
>
> This happened on a customer machine which I have not had a chance to
> investigate. I have copied the PGDATA folder to one of our systems to
> diagnose.
>
> Thanks again,
> Michael.
>
>

--0015175cb0587c7e7e0470050ba2
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Hello again.

I have an update, hoping someone=
can steer me in the right direction here.

After r=
eceiving the "could not access transaction" error mentioned in my=
previous email I did some more digging and found that people have had succ=
ess resolving this issue using pg_resetxlog.

I gave this a try, first without specifying any options=
, which did not resolve anything (it did not give an error on the command l=
ine). =A0I then read up on the command and passed a value for every swtich =
to set new values, and again this had no effect. =A0It did not give an erro=
r on the command line, but trying to reset the index gives the same error. =
=A0The response I get at the command line after running the pg_resetxlog co=
mmand is: Transaction log reset

Here is the error I get when trying to reset this parti=
cular index:

2009-07-31 12:58:07.503 EDT [Test] - ERROR: could no=
t access status of transaction 3839923882

2009-07-31 12:58:07.503=
EDT [Test] - DETAIL: Could not open file "pg_clog/0E4E": No such=
file or directory.

2009-07-31 12:58:07.503 EDT [Test] - STATEMENT: reindex index pg_toast=
..pg_toast_17431_index;

Here is the final pg_resetx=
log command I tried to resolve the issue:

pg_resetxlog -f -x 0x10=
0000 -m 0x10000 -o 0x10000 -l 0x1,0x1,0xCD ./

Here are the filenames for the files in the various dir=
ectories that led me to these switch values, as per the docs on pg_resetxlo=
g:

-x switch - pg_clog folder had 1 file named: 0000

-m switch - pg_multixact/offsets had 1 file named: 0000

-o switch=
- pg_multixact/members had 1 file named: 0000

-l swtich - pg_xlo=
g had 7 files, the one with the biggest named is: 0000000100000000000000CC<=
/div>

After running the pg_resetxlog command a=
bove I get the exact same error about pg_clog/0E4E missing.

<=
/div>

Does anyone have any other ideas about how to solve this?

Thanks,=A0

Michael.

<=
div class=3D"gmail_quote">On Thu, Jul 30, 2009 at 6:08 PM, Michael Clark pan dir=3D"ltr"><codingninja@gm=
ail.com> wrote:

x #ccc solid;padding-left:1ex;">Thanks for the reply!

"gmail_quote">
On Thu, Jul 30, 2009 at 5:59 PM, Tom Lane <=
span dir=3D"ltr">< >tgl@sss.pgh.pa.us> wrote:

x #ccc solid;padding-left:1ex">

Michael Clark < blank">codingninja@gmail.com> writes:

> I have a database which appears to have some corruption in a toast val=
ue.

> When I select a certain table, I get the following error:

> ERROR: invalid page header in block 984 of relation "pg_toast_173=
41_index"

If it's only that index that has gotten damaged, REINDEX will fix=
it, eg

=A0 =A0 =A0 =A0reindex index pg_toast.pg_toast_17341_index;

I tried this, and mig=
ht have exposed another issue.
Now I am getting:
ERROR:=
could not access status of transaction 3839923882
DETAIL: could =
not open file "pg_clog/0E4E": No such file or directory.

I checked the pg_clog folder, and there is only a 0000 =
file.

Any ideas on this one?
">

ex;border-left:1px #ccc solid;padding-left:1ex">

The bigger concern is whether there is other damage. =A0Have you had any >
system crashes, indications of flaky hardware, etc on that machine?

This happened on a customer machine =
which I have not had a chance to investigate. =A0I have copied the PGDATA f=
older to one of our systems to diagnose.

Thanks ag=
ain,

Michael.

--0015175cb0587c7e7e0470050ba2--

Re: Fix corrupt pg_toast table?

am 31.07.2009 23:24:19 von Greg Stark

On Fri, Jul 31, 2009 at 8:01 PM, Michael Clark wrote:
> I tried this, and might have exposed another issue.
> Now I am getting:
> ERROR: could not access status of transaction 3839923882
> DETAIL: could not open file "pg_clog/0E4E": No such file or directory.
> I checked the pg_clog folder, and there is only a 0000 file.

How long has this database been in use? That's a very high transaction
number that would only be reached on a heavily used database after a
reasonably long period of use. If that's not likely than a likely
possibility is that your table has been overwritten with garbage.
Possibly by a filesystem bug or hardware failure.

You can work around this problem for that row by creating a file named
0E4E in the clog directory. I think you can fill it with zeros but if
you search the mailing list you'll find instructions for doing this
that are might have better suggestions.

You'll still be best off recovering rows one by one. If you have
trouble using the index you might find it more reliable (but more
tedious) to do it using:

select * from tab where ctid = '(0,0)'
select * from tab where ctid = '(0,1)'
select * from tab where ctid = '(0,2)'
....

The first digit is the page number and the second is the index on the
page which can go as high as about 250.

--
greg
http://mit.edu/~gsstark/resume.pdf

--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Re: Fix corrupt pg_toast table?

am 01.08.2009 00:57:03 von Michael Clark

--0015175cb100b5835004700856fb
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Hello,
Thanks for the reply.

On Fri, Jul 31, 2009 at 5:24 PM, Greg Stark wrote:

> > I tried this, and might have exposed another issue.
> > Now I am getting:
> > ERROR: could not access status of transaction 3839923882
> > DETAIL: could not open file "pg_clog/0E4E": No such file or directory.
> > I checked the pg_clog folder, and there is only a 0000 file.
>
> How long has this database been in use? That's a very high transaction
> number that would only be reached on a heavily used database after a
> reasonably long period of use. If that's not likely than a likely
> possibility is that your table has been overwritten with garbage.
> Possibly by a filesystem bug or hardware failure.
>

Not really that long. The database was created in early June, and if that
is supposed to be a transaction count (3839923882), then yeah that seems way
out there.

>
> You can work around this problem for that row by creating a file named
> 0E4E in the clog directory. I think you can fill it with zeros but if
> you search the mailing list you'll find instructions for doing this
> that are might have better suggestions.
>

When you say fill it with zeros, is there a certain file size I need to
create?
(I will do some more googling to see if I can find more examples of people
recovering from this problem)

>
> You'll still be best off recovering rows one by one. If you have
> trouble using the index you might find it more reliable (but more
> tedious) to do it using:
>
> select * from tab where ctid = '(0,0)'
> select * from tab where ctid = '(0,1)'
> select * from tab where ctid = '(0,2)'
> ...
>
> The first digit is the page number and the second is the index on the
> page which can go as high as about 250.
>

I am sorry, that is a little over my head with my knowledge gained with PG
thus far.
Is there some background information I could read that would explain this to
me?

I tried that select a couple of times, and eventually got the "invalid page
header in the block 984 of relation "pg_toast_17431_index" error message
again. ctid = '(0,6)' gave it to me.
(Although, I don't know what I am doing here! :)

Thanks again for the reply, your help is appreciated!
Michael.

--0015175cb100b5835004700856fb
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Hello,

Thanks for the reply.

quote">On Fri, Jul 31, 2009 at 5:24 PM, Greg Stark < href=3D"mailto:gsstark@mit.edu">gsstark@mit.edu> wrote:
<=
blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px=
#ccc solid;padding-left:1ex;">

> I tried this, and might have exposed another issue.<=
br>
> Now I am getting:

> ERROR: could not access status of transaction 3839923882

> DETAIL: could not open file "pg_clog/0E4E": No such file or =
directory.

> I checked the pg_clog folder, and there is only a 0000 file.

How long has this database been in use? That's a very high transa=
ction

number that would only be reached on a heavily used database after a

reasonably long period of use. If that's not likely than a likely

possibility is that your table has been overwritten with garbage.

Possibly by a filesystem bug or hardware failure.

Not really that long. =A0The database was =
created in early June, and if that is supposed to be a transaction count (3=
839923882), then yeah that seems way out there.

=A0

ote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc s=
olid;padding-left:1ex;">

You can work around this problem for that row by creating a file named

0E4E in the clog directory. I think you can fill it with zeros but if

you search the mailing list you'll find instructions for doing this

that are might have better suggestions.

When you say fill it with zeros, is there =
a certain file size I need to create?

(I will do some more googli=
ng to see if I can find more examples of people recovering from this proble=
m)

=A0

border-left:1px #ccc solid;padding-left:1ex;">

You'll still be best off recovering rows one by one. If you have

trouble using the index you might find it more reliable (but more

tedious) to do it using:

select * from tab where ctid =3D '(0,0)'

select * from tab where ctid =3D '(0,1)'

select * from tab where ctid =3D '(0,2)'

....

The first digit is the page number and the second is the index on the

page which can go as high as about 250.

I am sorry,=
that is a little over my head with my knowledge gained with PG thus far. div>

Is there some background information I could read that would expla=
in this to me? =A0

I tried that select a couple of times, and eventually g=
ot the "invalid page header in the block 984 of relation "pg_toas=
t_17431_index" error message again. =A0 ctid =3D '(0,6)' gave =
it to me.=A0

(Although, I don't know what I am doing here! :)

iv>

Thanks again for the reply, your help is appreciated=
!

Michael.

--0015175cb100b5835004700856fb--

Re: Fix corrupt pg_toast table?

am 01.08.2009 01:18:03 von Tom Lane

Michael Clark writes:
> On Fri, Jul 31, 2009 at 5:24 PM, Greg Stark wrote:
>>> Now I am getting:
>>> ERROR: could not access status of transaction 3839923882

>> How long has this database been in use? That's a very high transaction
>> number that would only be reached on a heavily used database after a
>> reasonably long period of use. If that's not likely than a likely
>> possibility is that your table has been overwritten with garbage.
>> Possibly by a filesystem bug or hardware failure.

> Not really that long. The database was created in early June, and if that
> is supposed to be a transaction count (3839923882), then yeah that seems way
> out there.

It seems certain that you're looking at corrupt data. Trashed data
frequently manifests this way, because the transaction ID is the first
field of tuple headers that the database can cross-check with any
amount of rigor.

At this point we know that at least two unrelated disk blocks have been
clobbered by something (first that index page, and now this). There's
no very good reason to think there are only two :-(. I'd bet at least
a cheese sandwich on hardware problems. Test and fix/replace your
hardware, then go back to your last backup (I hope you've got one).

regards, tom lane

--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Re: Fix corrupt pg_toast table?

am 03.08.2009 21:01:28 von Michael Clark

--00151750e8b6b2b3fe04704165dd
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Hello Tom and others who replied.
I appreciate the help tracking this down and eventually determining there is
no point tracking it down further.
We will do our best recovering what is accessible from the corrupt DB and an
older backup.

Thanks,
Michael.

On Fri, Jul 31, 2009 at 7:18 PM, Tom Lane wrote:

> Michael Clark writes:
> > On Fri, Jul 31, 2009 at 5:24 PM, Greg Stark wrote:
> >>> Now I am getting:
> >>> ERROR: could not access status of transaction 3839923882
>
> >> How long has this database been in use? That's a very high transaction
> >> number that would only be reached on a heavily used database after a
> >> reasonably long period of use. If that's not likely than a likely
> >> possibility is that your table has been overwritten with garbage.
> >> Possibly by a filesystem bug or hardware failure.
>
> > Not really that long. The database was created in early June, and if
> that
> > is supposed to be a transaction count (3839923882), then yeah that seems
> way
> > out there.
>
> It seems certain that you're looking at corrupt data. Trashed data
> frequently manifests this way, because the transaction ID is the first
> field of tuple headers that the database can cross-check with any
> amount of rigor.
>
> At this point we know that at least two unrelated disk blocks have been
> clobbered by something (first that index page, and now this). There's
> no very good reason to think there are only two :-(. I'd bet at least
> a cheese sandwich on hardware problems. Test and fix/replace your
> hardware, then go back to your last backup (I hope you've got one).
>
> regards, tom lane
>

--00151750e8b6b2b3fe04704165dd
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Hello Tom and others who replied.

I appreciate the help =
tracking this down and eventually determining there is no point tracking it=
down further. =A0

We will do our best recovering what is accessi=
ble from the corrupt DB and an older backup.

Thanks,

Michael.

=3D"gmail_quote">On Fri, Jul 31, 2009 at 7:18 PM, Tom Lane "><>=
wrote:

x #ccc solid;padding-left:1ex;">
Michael Clark < =3D"mailto:codingninja@gmail.com">codingninja@gmail.com> writes:

> On Fri, Jul 31, 2009 at 5:24 PM, Greg Stark < ark@mit.edu">gsstark@mit.edu> wrote:

>>> Now I am getting:

>>> ERROR: could not access status of transaction 3839923882

>> How long has this database been in use? Th=
at's a very high transaction

>> number that would only be reached on a heavily used database after=
a

>> reasonably long period of use. If that's not likely than a lik=
ely

>> possibility is that your table has been overwritten with garbage.<=
br>
>> Possibly by a filesystem bug or hardware failure.

> Not really that long. =A0The database was created in early June, and i=
f that

> is supposed to be a transaction count (3839923882), then yeah that see=
ms way

> out there.

It seems certain that you're looking at corrupt data. =A0Trashed =
data

frequently manifests this way, because the transaction ID is the first

field of tuple headers that the database can cross-check with any

amount of rigor.

At this point we know that at least two unrelated disk blocks have been

clobbered by something (first that index page, and now this). =A0There'=
s

no very good reason to think there are only two :-(. =A0I'd bet at leas=
t

a cheese sandwich on hardware problems. =A0Test and fix/replace your

hardware, then go back to your last backup (I hope you've got one).

=A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0regards, tom lane

--00151750e8b6b2b3fe04704165dd--