Fixing a corrupt mbox format file?

on 14.06.2007 14:56:53 by tuxedo

Hello!

In the course of my daily work, I'm faced with the task of repairing a
gigantic 'mbox' file that appears to be corrupted: only the 20-odd most
recent of approximately 3000 messages are displayed.

The mail application used is Mozilla on Windows. The exact same error
occurs in all Mozilla applications tested, i.e. the full Mozilla suite, or
the more recent Seamonkey, as well as the stand-alone Thunderbird.

The identical error happens when testing the mbox file in Mozilla
Thunderbird on a Linux system. However, the same mbox works fine when
viewed in for example Kmail, the standard KDE mailer. The Kmail application
used was also configured to display the file in mbox format. In other
words, the error may partly be attributed to how Mozilla parses its own
mbox and partly to the incorrectly formatted mail message(s).

I realize that there are various drawbacks with the mbox format. However,
I'm not sure whether Mozilla on Windows offers any other options. Also, it
is a customer's machine, and the customer is unfortunately resisting the
idea of switching to an operating system with better alternatives, such as
Linux.

Naturally, I can edit the faulty mbox by hand, but as it contains over 3000
messages, I'd hate to spend the next couple of days trying to identify
which messages prevent the index (.msf file) from displaying the complete
mbox file.

I have also tried copying all messages into an mdir formatted folder and
then back to an mbox formatted file, hoping that might fix the problem,
but this did not work :-(

Does the problem sound familiar and can anyone recommend a Unix/Linux
script or program which can traverse through an mbox file to identify
incorrectly formatted messages? Once I know which messages are the culprits
I can simply remove them by hand using a plain text editor.

Many thanks for any ideas!

Tuxedo

--
"Imagine if every Thursday your shoes exploded if you tied them the
usual way. This happens to us all the time with computers, and nobody
thinks of complaining."
-- Jef Raskin, interviewed in Dr. Dobb's Journal

Re: Fixing a corrupt mbox format file?

on 15.06.2007 11:39:28 by chris-usenet

Tuxedo wrote:
> In the course of my daily work routines, I'm faced with the task of
> repairing a gigantic 'mbox' file that appears to have been corrupted
> in that it only displays the 20 odd most recent of approximately
> 3000 messages.

> The mail application used is Mozilla on Windows [...] The identical
> error happens when testing the mbox file in Mozilla Thunderbird on a
> Linux system [...]

First off, simply delete the associated .msf file and let
Mozilla/Thunderbird rebuild it for you. (Don't do this while the
application is running as it may have cached the faulty index.)
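
As an illustration of the step above, here is a minimal shell sketch that
moves the index aside instead of deleting it, so it can be restored if
needed. The function name backup_msf and the ".bak" suffix are assumptions
made for this example; the only thing it relies on is that the index sits
next to the mbox file with ".msf" appended to the name. Quit the mail
application before running it.

```shell
# Hypothetical helper: move the Mozilla index (.msf) of one mbox file aside
# so that the mail client rebuilds it on its next start.
# Usage: backup_msf /path/to/mboxfile
backup_msf() {
    mbox=$1
    msf="$mbox.msf"
    if [ ! -f "$msf" ]; then
        echo "no index file found: $msf" >&2
        return 1
    fi
    # Rename rather than delete, in case the old index is wanted later.
    mv -- "$msf" "$msf.bak"
}
```

For example, "backup_msf ~/Mail/myCorruptInbox" would rename
~/Mail/myCorruptInbox.msf to ~/Mail/myCorruptInbox.msf.bak and leave the
mbox file itself untouched.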

If that doesn't work you can use csplit (unix) to split the file into
its constituent mail messages. Once you have those there are various
things you can do with them (including feeding them back into sendmail
for delivery as "new" messages).

Chris

Re: Fixing a corrupt mbox format file?

on 15.06.2007 14:52:13 by Mark Crispin

On Fri, 15 Jun 2007, Chris Davies wrote:
> First off, simply delete the associated .msf file and let
> Mozilla/Thunderbird rebuild it for you. (Don't do this while the
> application is running as it may have cached the faulty index.)

So Mozilla/Thunderbird maintains an index file for traditional UNIX
format mailboxes, but doesn't have adequate measures in place to detect
when the index is out of sync with the mailbox?

Interesting. This has been the standard argument against trying to use an
index file with traditional UNIX format and other formats which were not
designed to use an index from the outset.

One of the benefits -- perhaps the only benefit -- of traditional UNIX
format is that it is difficult to corrupt and trivial to repair. A skewed
index file that is not reliably detected (and rebuilt) removes this
advantage.

-- Mark --

http://panda.com/mrc
Democracy is two wolves and a sheep deciding what to eat for lunch.
Liberty is a well-armed sheep contesting the vote.

Re: Fixing a corrupt mbox format file?

on 15.06.2007 18:22:03 by tuxedo

Chris Davies wrote:

[...]

> First off, simply delete the associated .msf file and let
> Mozilla/Thunderbird rebuild it for you. (Don't do this while the
> application is running as it may have cached the faulty index.)

Thanks, that's how I did it but it didn't do the trick :-(

> If that doesn't work you can use csplit (unix) to split the file into
> its constituent mail messages. Once you have those there are various
> things you can do with them (including feeding them back into sendmail
> for delivery as "new" messages).

Thanks for this, too. I have csplit on my Linux box. However, I'm not
familiar with this utility. For example, with an mbox in
~/Mail/myCorruptInbox exactly how would I run csplit to process that file?

Even if I feed them into sendmail afterwards, or somehow merge them into one
mbox file, perhaps the errors - which occur exclusively in Mozilla
regardless of platform, but not in MUTT or other native Linux mailers -
may carry themselves forward once all messages are placed back into a
Mozilla folder again. Judging by the fact that whichever messages are
corrupt do not affect other mail programs, it seems to be a Mozilla mail
handling bug, and this makes it difficult to find any existing utilities
which could help identify which of the 3000+ messages cause the error.

Re: Fixing a corrupt mbox format file?

on 16.06.2007 23:34:24 by feenberg

On Jun 15, 12:22 pm, Tuxedo wrote:
> Chris Davies wrote:
>
> [...]
>
> > First off, simply delete the associated .msf file and let
> > Mozilla/Thunderbird rebuild it for you. (Don't do this while the
> > application is running as it may have cached the faulty index.)
>
> Thanks, that's how I did it but it didn't do the trick :-(
>
> > If that doesn't work you can use csplit (unix) to split the file into
> > its constituent mail messages. Once you have those there are various
> > things you can do with them (including feeding them back into sendmail
> > for delivery as "new" messages).
>
> Thanks for this, too. I have csplit on my Linux box. However, I'm not
> familiar with this utility. For example, with an mbox in
> ~/Mail/myCorruptInbox exactly how would I run csplit to process that file?
>
> Even if I feed them into sendmail afterwards, or somehow merge them into one
> mbox file, perhaps the errors - which occur exclusively in Mozilla
> regardless of platform, but not in MUTT or other native Linux mailers -
> may carry themselves forward once all messages are placed back into a
> Mozilla folder again. Judging by the fact that whichever messages are
> corrupt do not affect other mail programs, it seems to be a Mozilla mail
> handling bug, and this makes it difficult to find any existing utilities
> which could help identify which of the 3000+ messages cause the error.

I am no expert, but my first guess is that somewhere in the body of
the 21st message there is a line beginning with "From " that didn't
get properly escaped. Delete that line and things should improve. Why
doesn't it bother the other mail clients? Perhaps they parse the rest
of the line, and realize the line isn't a message separator.

If that doesn't work, I would delete a couple of messages in the mbox
file around the 20th and see if that doesn't fix things. It doesn't
seem like the problem could be in any of the 2979 messages preceding
that one, or clearing the index would have fixed matters.
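
A rough way to hunt for such lines is sketched below. It prints every line
that starts with "From " but does not look like a separator of the form
"From sender Day Mon DD HH:MM:SS YYYY". The function name
suspect_from_lines and the exact pattern are assumptions for illustration;
real mailboxes may use separator variants this pattern does not cover, so
treat each hit as a candidate to inspect, not as proof of corruption.

```shell
# Hypothetical helper: list "From " lines that do not look like mbox
# message separators. Output format: file:line-number: offending line
# Usage: suspect_from_lines /path/to/mboxfile
suspect_from_lines() {
    awk '
        /^From / {
            # Expected separator shape (an assumption, see the text above):
            # From <sender> <weekday> <month> <day> <hh:mm:ss> ...
            if ($0 !~ /^From [^ ]+ +(Mon|Tue|Wed|Thu|Fri|Sat|Sun) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +[0-9][0-9]? [0-9][0-9]:[0-9][0-9]:[0-9][0-9] /)
                print FILENAME ":" FNR ": " $0
        }
    ' "$1"
}
```

Run against a copy of the mailbox, each reported line number can then be
opened in a text editor; the usual repair is to escape the line by putting
a ">" in front of "From ".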

Daniel Feenberg

Re: Fixing a corrupt mbox format file?

on 18.06.2007 18:19:07 by Neil Woods

Tuxedo writes:

> Chris Davies wrote:
>
> [...]
>
>> First off, simply delete the associated .msf file and let
>> Mozilla/Thunderbird rebuild it for you. (Don't do this while the
>> application is running as it may have cached the faulty index.)
>
> Thanks, that's how I did it but it didn't do the trick :-(
>
>> If that doesn't work you can use csplit (unix) to split the file into
>> its constituent mail messages. Once you have those there are various
>> things you can do with them (including feeding them back into sendmail
>> for delivery as "new" messages).
>
> Thanks for this, too. I have csplit on my Linux box. However, I'm not
> familiar with this utility. For example, with an mbox in
> ~/Mail/myCorruptInbox exactly how would I run csplit to process that file?

Well, in situations like this it's always a good idea to work on a copy
of the data. I would do something like the following:

$ mkdir /tmp/test && cp ~/Mail/myCorruptInbox /tmp/test/mbox && cd /tmp/test

Now to run csplit. You mentioned this mailbox has approximately 3000
messages in it, so use '-n 4' to get four-digit suffixes on the output
file names. We also don't want csplit to remove the output files if it
hits an error; that's the '-k' argument.

$ csplit -n 4 -k mbox '/^From /' '{*}'

(Before doing this I recommend reading the man/info pages for csplit, to
make sure you understand exactly what the above command is doing).

This will produce files named xx0000 to xx3000 (if there are 3000
messages, for example). The first file, xx0000, should be empty, because
the mbox itself starts with a 'From ' line, and can be deleted.

Now it's 'just' a case of going through each file to check which one(s)
may be causing errors within Mozilla. Things to look out for might be
malformed 'From ' lines, and possibly binary data. Ask in one of the
Mozilla groups for help here.
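
A first pass over the pieces can be automated along the lines of the sketch
below. It flags pieces whose second line is not a header ("Name: value"),
which is what a false split on a body line starting with "From " tends to
produce, and pieces that contain NUL bytes. The function name check_pieces
and both checks are assumptions for illustration; they narrow down where to
look and are not a full validity test.

```shell
# Hypothetical helper: scan the xx* pieces written by csplit in directory $1
# and report the ones that look suspicious.
# Usage: check_pieces /tmp/test
check_pieces() {
    for f in "$1"/xx*; do
        [ -s "$f" ] || continue     # skip empty pieces such as xx0000
        # In a real message, the line after "From " is a header line.
        if ! sed -n '2p' "$f" | grep -Eq '^[A-Za-z0-9-]+:'; then
            echo "no header after From line: $f"
        fi
        # Stripping NUL bytes changes the file only if it contains some.
        if ! LC_ALL=C tr -d '\000' < "$f" | cmp -s - "$f"; then
            echo "contains NUL bytes: $f"
        fi
    done
}
```

A piece reported as having no header is often the tail of the previous
message, so the previous piece is worth inspecting too.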

> Even if I feed them into sendmail afterwards, or somehow merge them into one
> mbox file, perhaps the errors - which occur exclusively in Mozilla
> regardless of platform, but not in MUTT or other native Linux mailers -
> may carry themselves forward once all messages are placed back into a
> Mozilla folder again. Judging by the fact that whichever messages are
> corrupt do not affect other mail programs, it seems to be a Mozilla mail
> handling bug, and this makes it difficult to find any existing utilities
> which could help identify which of the 3000+ messages cause the error.

No need to run these messages through sendmail, just use the 'cat'
command. E.g. something like:

$ cat xx* > new_mbox

to concatenate _all_ messages into new_mbox, or

$ cat xx0* xx1000 > new_mbox1

to concatenate the first 1000 messages into new_mbox1.
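
Shell globs get awkward for arbitrary ranges, so a small helper can be
easier when narrowing down the bad message by halves: join one half, test
it in Mozilla, then repeat on whichever half fails. The sketch below
assumes the four-digit xxNNNN names produced by the csplit command above
and is run in the directory that holds them; the function name join_range
is made up for this example.

```shell
# Hypothetical helper: concatenate pieces xxFIRST..xxLAST (inclusive) from
# the current directory into one mbox file. Pass plain decimal numbers
# without leading zeros, e.g. 1 and 1500.
# Usage: join_range FIRST LAST OUTPUT
join_range() {
    first=$1
    last=$2
    out=$3
    : > "$out"                      # start with an empty output file
    i=$first
    while [ "$i" -le "$last" ]; do
        piece=$(printf 'xx%04d' "$i")
        # Missing pieces are skipped silently.
        [ -f "$piece" ] && cat "$piece" >> "$out"
        i=$((i + 1))
    done
}
```

For example, "join_range 1 1500 first_half" followed by
"join_range 1501 3000 second_half" would give two mailboxes to test
separately.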

--
Neil.
Even more amazing was the realization that God has Internet access. I
wonder if He has a full newsfeed?
-- Matt Welsh

Re: Fixing a corrupt mbox format file?

on 19.06.2007 20:30:05 by tuxedo

Neil Woods wrote:

> [...]

Thanks for the above examples. I will keep these procedures in mind for
another purpose. The problematic mbox has been repaired using a free
Windows-based program named Tbird2OE:
http://www.download.com/Tbird2OE/3000-2369_4-10601980.html

This small and excellent GUI application, built with Perl, is designed to
convert mbox format to Outlook format, for whoever needs that. All it does
is split the mbox into separate .eml messages, which can easily be imported
into Outlook. By then converting the resulting Outlook messages back into
the Mozilla mbox format using the built-in mail import function in Mozilla,
all messages were successfully restored. Simply importing the corrupt mbox
using Outlook's built-in import function did not work, in that it just
carried the errors into Outlook. In other words, Tbird2OE does a better job
for this type of conversion. I guess Windows applications must also come to
the rescue on the odd occasion :-)
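
For readers without Windows at hand, the general idea of splitting an mbox
into one .eml file per message can be sketched in shell. This is not how
Tbird2OE works internally, which the thread does not describe; it is an
independent illustration. The function name mbox_to_eml, the msgNNNN.eml
naming and the separator pattern are all assumptions. Only lines of the
form "From sender Day Mon DD HH:MM:SS YYYY" are treated as message
boundaries, so a stray body line starting with "From " stays inside its
message.

```shell
# Hypothetical helper: split mbox file $1 into DIR/msg0001.eml,
# DIR/msg0002.eml, ... dropping the "From " separator lines.
# Usage: mbox_to_eml /path/to/mboxfile DIR
mbox_to_eml() {
    mkdir -p "$2"
    awk -v dir="$2" '
        # A strict separator line starts a new output file.
        /^From [^ ]+ +(Mon|Tue|Wed|Thu|Fri|Sat|Sun) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +[0-9][0-9]? [0-9][0-9]:[0-9][0-9]:[0-9][0-9] / {
            if (out != "") close(out)
            n++
            out = sprintf("%s/msg%04d.eml", dir, n)
            next
        }
        # Every other line belongs to the current message.
        out != "" { print > out }
    ' "$1"
}
```

Comparing the number of files written with the number of messages another
mailer shows for the same mbox gives a quick sanity check.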

Re: Fixing a corrupt mbox format file?

on 20.06.2007 19:08:58 by keeling

Neil Woods :
>
> No need to run these messages through sendmail, just use the 'cat'
> command. E.g. something like:
>
> $ cat xx* > new_mbox
>
> to concatenate _all_ messages into new_mbox, or
>
> $ cat xx0* xx1000 > new_mbox1
..........^^^^
>
> to concatenate the first 1000 messages into new_mbox1.

How about "xx0[0-9]*"? That'll get 0000 - 0999, which really is the
first thousand.


--
Any technology distinguishable from magic is insufficiently advanced.
(*) http://www.spots.ab.ca/~keeling Linux Counter #80292
- - http://www.faqs.org/rfcs/rfc1855.html Please, don't Cc: me.

Re: Fixing a corrupt mbox format file?

on 20.06.2007 21:01:13 by Neil Woods

"s. keeling" writes:

> Neil Woods :
>>
>> No need to run these messages through sendmail, just use the 'cat'
>> command. E.g. something like:
>>
>> $ cat xx* > new_mbox
>>
>> to concatenate _all_ messages into new_mbox, or
>>
>> $ cat xx0* xx1000 > new_mbox1
> .........^^^^
>>
>> to concatenate the first 1000 messages into new_mbox1.
>
> How about "xx0[0-9]*"? That'll get 0000 - 0999, which really is the
> first thousand.

Yes but the first file produced by the csplit command[1] (i.e. xx0000)
is actually empty, thus we need to start at xx0001.

[1] $ csplit -n 4 -k mbox '/^From /' '{*}'

--
Neil.
If parents would only realize how they bore their children.
-- G. B. Shaw

Re: Fixing a corrupt mbox format file?

on 29.06.2007 11:37:19 by chris-usenet

On Fri, 15 Jun 2007, Chris Davies wrote:
> First off, simply delete the associated .msf file and let
> Mozilla/Thunderbird rebuild it for you. (Don't do this while the
> application is running as it may have cached the faulty index.)

Mark Crispin wrote:
> So Mozilla/Thunderbird maintains an index file for traditional UNIX
> format mailboxes, but doesn't have adequate measures in place to detect
> when the index is out of sync with the mailbox?

No..., I suggested deleting the index file, not amending the mailbox
itself. It's quite reasonable for TB to have cached the index file, and
as the OP wouldn't have changed the mailbox there's no reason for TB to
invalidate its index cache.

Regards,
Chris