Fixing mangled mbox "From " header lines?
on 07.07.2005 23:42:31 by mark
Hello,
I have an archive of a 10 year old public mailing list that I plan to
import into GoogleGroups for archival and retrieval. There are over
27000 messages in the archive. It is in standard 'mbox' format.
In preparation for uploading to Google, I've been doing a lot of
cleanup of the archive -- finding duplicate and off-topic posts,
fixing some mangled headers, removing excess EOL spaces, etc. The
tools I've used for this cleanup are 'vi' and 'The Bat' email client.
One problem I notice is that over 2000 messages have badly misdated
'From ' header fields (the first line in the header). The date in the
field is essentially bogus (however, the data in the 'Date:' and the
various 'Received:' fields look correct.)
So, is there a tool or script which will fix the 'From ' lines?
If you can, post your reply to this newsgroup.
Thanks!
Mark
Re: Fixing mangled mbox "From " header lines?
on 08.07.2005 00:38:36 by Alan Connor
On comp.mail.misc, "Mark" wrote:
> Hello,
>
> I have an archive of a 10 year old public mailing list
> that I plan to import into GoogleGroups for archival and
> retrieval. There are over 27000 messages in the archive. It is
> in standard 'mbox' format.
>
> In preparation for uploading to Google, I've been doing a lot
> of cleanup of the archive -- finding duplicate and off-topic
> posts, fixing some mangled headers, removing excess EOL spaces,
> etc. The tools I've used for this cleanup are 'vi' and 'The
> Bat' email client.
>
> One problem I notice is that over 2000 messages have badly
> misdated 'From ' header fields (the first line in the
> header). The date in the field is essentially bogus (however,
> the data in the 'Date:' and the various 'Received:' fields look
> correct.)
>
> So, is there a tool or script which will fix the 'From ' lines?
>
> If you can, post your reply to this newsgroup.
That's how it's normally done, and it is obvious that you know
that full well.
>
> Thanks!
>
> Mark
>
Why would someone need to cover their tracks in google groups
by using a common first name for an alias and a newsserver that
doesn't give their IP, and a forged Message-ID, for something
like this?
Not to mention that there are DNS problems resolving
nowhere.com, which is (marginally) registered by Tucows.
Piss off.
AC
--
http://angel.1jh.com./nanae/kooks/alanconnor.html
http://home.earthlink.net/~alanconnor/
FAQ: Canonical list of questions Beavis refuses to answer (V1.40) (was Re: Fixing mangled mbo
on 08.07.2005 01:00:09 by Sam
Usenet Beavis writes:
> On comp.mail.misc, "Mark" wrote:
>
>> One problem I notice is that over 2000 messages have badly
>> misdated 'From ' header fields (the first line in the
>> header). The date in the field is essentially bogus (however,
>> the data in the 'Date:' and the various 'Received:' fields look
>> correct.)
>>
>> So, is there a tool or script which will fix the 'From ' lines?
>>
>> If you can, post your reply to this newsgroup.
>
> That's how it's normally done, and it is obvious that you know
> that full well.
How's _what_ is normally done, Beavis?
>> Thanks!
>>
>> Mark
>>
>
> Why would someone need to cover their tracks in google groups
> by using a common first name for an alias and a newsserver that
> doesn't give their IP, and a forged Message-ID, for something
> like this?
Why does everyone call you Beavis, Beavis?
> Not to mention that there are DNS problems resolving
> nowhere.com, which is (marginally) registered by Tucows.
Tell us all you know about DNS, Beavis. That should be a fascinating
conversation.
============================================================================
FAQ: Canonical list of questions Beavis refuses to answer (V1.40)
This is a canonical list of questions that Beavis never answers. This FAQ is
posted on a semi-regular schedule, as circumstances warrant.
For more information on Beavis, see:
http://angel.1jh.com/nanae/kooks/alanconnor.shtml
Although Beavis has been posting for a long time, he always remains silent
on the subjects enumerated below. His response, if any, usually consists of
replying to the parent post with a loud proclamation that his Usenet-reading
software runs a magical filter that automatically identifies anyone who's
making fun of him, and hides those offensive posts. For more information
see question #9 below.
============================================================================
1) If spammers avoid forging real E-mail addresses on spam, then where do
all these bounces everyone reports getting (for spam that their return
address was forged onto) come from?
2) If your Challenge-Response filter is so great, why do you still munge
when posting to Usenet?
3) Do you still believe that rsh is the best solution for remote access?
(http://tinyurl.com/5qqb6)
4) What is your evidence that everyone who disagrees with you, and thinks
that you're a moron, is a spammer?
5) How many different individuals do you believe really post to
comp.mail.misc? What is the evidence for your paranoid belief that everyone,
except you, who posts here is some unknown arch-nemesis of yours?
6) How many times, or how often, do you believe is necessary to announce
that you do not read someone's posts? What is your reason for making these
regularly-scheduled proclamations? Who do you believe is so interested in
keeping track of your Usenet-reading habits?
7) When was the last time you saw Bigfoot (http://tinyurl.com/23r3f)?
8) If your C-R system employs a spam filter so that it won't challenge spam,
then why does any of the mail that passes the filter, and is thusly presumed
not to be spam, need to be challenged?
9) You claim that the software you use to read Usenet magically identifies
any post that makes fun of you. In http://tinyurl.com/3swes you explain
that "What I get in my newsreader is a mock post with fake headers and no
body, except for the first parts of the Subject and From headers."
Since your headers indicate that you use slrn and, as far as anyone knows,
the stock slrn doesn't work that way, is this interesting patch to slrn
available for download anywhere?
10) You regularly post alleged logs of your procmail recipe autodeleting a
bunch of irrelevant mail that you've received. Why, and who exactly do you
believe is interested in your mail logs?
11) How exactly do you "enforce" an "order" to stay out of your mailbox,
supposedly (http://tinyurl.com/cs8jt)? Since you issue this "order" about
every week, or so, apparently nobody wants to follow it. What are you going
to do about it?
12) What's with your fascination with shit (also http://tinyurl.com/cs8jt)?
13) You complain about some arch-nemesis of yours always posting forged
messages in your name. Can you come up with even a single URL, as an example
of what you're talking about?
14) You always complain about some mythical spammers that pretend to be
spamfighters (http://tinyurl.com/br4td). Who exactly are those people, and
can you post a copy of a spam that you supposedly received from them, that
proves that they're really spammers, and not spamfighters?
Re: Fixing mangled mbox "From " header lines?
on 08.07.2005 07:23:55 by AK
Mark wrote:
> Hello,
>
> I have an archive of a 10 year old public mailing list that I plan to
> import into GoogleGroups for archival and retrieval. There are over
> 27000 messages in the archive. It is in standard 'mbox' format.
>
> In preparation for uploading to Google, I've been doing a lot of
> cleanup of the archive -- finding duplicate and off-topic posts,
> fixing some mangled headers, removing excess EOL spaces, etc. The
> tools I've used for this cleanup are 'vi' and 'The Bat' email client.
>
> One problem I notice is that over 2000 messages have badly misdated
> 'From ' header fields (the first line in the header). The date in the
> field is essentially bogus (however, the data in the 'Date:' and the
> various 'Received:' fields look correct.)
>
> So, is there a tool or script which will fix the 'From ' lines?
>
> If you can, post your reply to this newsgroup.
>
> Thanks!
>
> Mark
>
Mark,
In vi, once you have determined the pattern the "bogus" entries share, you
can use the substitution command to correct them: :%s/pattern/newpattern/g
It might be better to separate the mbox stored messages into individual
messages using formail.
If you can determine a pattern, you can then use perl (perl -pi.bak -e
's/pattern/newpattern/g;' *) to work on all files and alter the
information. -i.bak will create a backup of every file with a .bak extension.
Another thought is, how important is the information to you that exists
in the 'From ' header entry? Delete the "bogus" entry or leave it alone
since it is quite apparent that your interests lie in the data within
the message and not by whom or when it was sent.
Alan,
What was the point you were trying to convey in your response?
Usenet is full of trolls trolling for email addresses, as you might
know. I do not believe there is any RFC requirement for USENET to use a
valid email address (only exception might be the moderated groups since
they rely on the SMTP process to exchange data). Have a look and
correct me should I be wrong:
http://www.google.com/search?hl=en&q=RFC+NNTP&btnG=Google+Search
AK
Re: Fixing mangled mbox "From " header lines?
on 08.07.2005 16:52:35 by mark
AK wrote:
> Mark wrote:
>> I have an archive of a 10 year old public mailing list that I plan to
>> import into GoogleGroups for archival and retrieval. There are over
>> 27000 messages in the archive. It is in standard 'mbox' format.
>>
>> In preparation for uploading to Google, I've been doing a lot of
>> cleanup of the archive -- finding duplicate and off-topic posts,
>> fixing some mangled headers, removing excess EOL spaces, etc. The
>> tools I've used for this cleanup are 'vi' and 'The Bat' email client.
>>
>> One problem I notice is that over 2000 messages have badly misdated
>> 'From ' header fields (the first line in the header). The date in the
>> field is essentially bogus (however, the data in the 'Date:' and the
>> various 'Received:' fields look correct.)
>>
>> So, is there a tool or script which will fix the 'From ' lines?
> In vi, if you determined the pattern the "bogus" entries have, you
> can use the substitution mechanism to correct it
> :%s/pattern/newpattern/g.
Well, with 2000 of them to fix, and each one having its own date/time
(which should closely correspond to the datestamp in the last
Received: header line), I think that 'vi' would take too long. If
there were only 200 posts to fix, I'd just manually do it as described
(since the time to post here, write/run scripts, etc. probably exceeds
fixing 'em the brute force way.)
> Another thought is, how important is the information to you that
> exists in the 'From ' header entry? Delete the "bogus" entry or
> leave it alone since it is quite apparent that your interests lie in
> the data within the message and not by whom or when it was sent.
Some mbox-capable mail clients use the datestamp info in the 'From '
header line for sorting/presentation purposes. "The Bat", which I
use, does so. The data in the 'From ' line should accurately
correspond to the timestamp in the last Received: line and the source
from the "From:" line.
Now, one approach is simply to trash the current 'From ' lines and
reconstruct them using the timestamp in the last 'Received:' line
(or the 'Date:' line) and the source from the 'From:' line. This
requires mbox header parsing when done by machine. That
would be acceptable.
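
For what it's worth, here is a minimal Python sketch of that rebuild, offered as one possible approach rather than a tested tool. It assumes the standard-library mailbox module can parse the archive, takes the timestamp from the 'Date:' header (not the last 'Received:' line) and the address from the 'From:' header, and writes the result to a new file. The function names, the UTC choice for the rebuilt timestamp, and the MAILER-DAEMON fallback are assumptions of this sketch. Messages whose 'Date:' cannot be parsed keep their old 'From ' line.

```python
import mailbox
import time
from email.utils import mktime_tz, parseaddr, parsedate_tz


def rebuilt_from_line(msg):
    """Return the text that follows 'From ' on the envelope line, or None.

    Built from the message's own Date: and From: headers (an assumption of
    this sketch; the last Received: line would be another possible source).
    """
    parsed = parsedate_tz(str(msg.get("Date", "")))
    if parsed is None:
        return None  # unparseable or missing Date: header
    sender = parseaddr(str(msg.get("From", "")))[1] or "MAILER-DAEMON"
    # asctime() gives the 'Thu Jul  7 21:42:31 2005' layout used on 'From ' lines.
    stamp = time.asctime(time.gmtime(mktime_tz(parsed)))
    return f"{sender} {stamp}"


def fix_mbox(src_path, dst_path):
    """Copy src_path to dst_path, rewriting 'From ' lines; return how many changed.

    dst_path should not exist yet, because mailbox.mbox appends to existing files.
    """
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    fixed = 0
    try:
        for msg in src:
            new_from = rebuilt_from_line(msg)
            if new_from is not None:
                msg.set_from(new_from)  # replaces the old 'From ' line content
                fixed += 1
            dst.add(msg)
        dst.flush()
    finally:
        dst.close()
        src.close()
    return fixed
```

Calling fix_mbox("archive.mbox", "archive-fixed.mbox") (both file names hypothetical) returns the number of rewritten 'From ' lines and leaves the source file as it was.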
> Alan,
>
> What was the point you were trying to convey in your response?
> Usenet is full of trolls trolling for email addresses, as you might
> know. I do not believe there is any RFC requirement for USENET to
> use a valid email address (only exception might be the moderated
> groups since they rely on the SMTP process to exchange data). Have a
> look and correct me should I be wrong:
> http://www.google.com/search?hl=en&q=RFC+NNTP&btnG=Google+Search
Well, I probably should have set up an email address using fastmail or
something, and then provide the email address in the body using a
mangled form to foil spambots.
I simply want to protect my privacy and keep my mailbox free of spam, which
I believe is my right on Usenet, which I've used for almost
twenty years now (there's not yet a law or general consensus that
everybody posting to Usenet must use their real identity and traceback
information -- it would not surprise me if a substantial number of
people, many of whom post here, are at least pseudonymous.)
The proof as to whether my message is acceptable or not should not lie
in the use of anonymity or pseudonymity, but in the "content" of what
is being said -- is it objective and rational, or is it inflammatory,
off-topic, dangerous or even illegal (such as distributing
child-porn). I do understand Alan's position, though, but don't
believe that full anonymity is evil and somehow anti-social. The focus
should be on the content, not the person. (Anyway, had I broken the
law, law enforcement could easily track me down in minutes with a
simple phone call -- I didn't take the final steps of covering my
tracks where even the authorities could not locate me -- it's
inconvenient for me to be so extreme in anonymity, and I don't plan to
violate the law any time soon! Now, if I lived in a dictatorship, I
certainly would be more extreme in my anonymity.)
Anyway, I digress. This is comp.mail.misc, and not a newsgroup to
discuss the legal and social aspects of anonymous posting. If Alan
would like to further discuss this, let's move that discussion to a
newsgroup relevant to such discussion. I'll be happy to participate
to give a rational and objective perspective of why using varying
levels of anonymity in posting to Usenet should be acceptable so
long as the posts are on-topic, reasonable and rational, and do not
violate the law.
Mark
Re: Fixing mangled mbox "From " header lines?
on 08.07.2005 17:25:56 by NormanM
On Thu, 07 Jul 2005 22:38:36 GMT, Alan Connor wrote:
> Not to mention that there are DNS problems resolving
> nowhere.com, which is (marginally) registered by Tucows.
I expect that those problems are minuscule when compared to resolving DNS
for "immoral.invalid".
--
Norman
~Win dain a lotica, En vai tu ri, Si lo ta
~Fin dein a loluca, En dragu a sei lain
~Vi fa-ru les shutai am, En riga-lint
Re: Fixing mangled mbox "From " header lines?
on 09.07.2005 14:06:10 by AK
Mark wrote:
> AK wrote:
>
>>Mark wrote:
>
>
>>>I have an archive of a 10 year old public mailing list that I plan to
>>>import into GoogleGroups for archival and retrieval. There are over
>>>27000 messages in the archive. It is in standard 'mbox' format.
>>>
>>>In preparation for uploading to Google, I've been doing a lot of
>>>cleanup of the archive -- finding duplicate and off-topic posts,
>>>fixing some mangled headers, removing excess EOL spaces, etc. The
>>>tools I've used for this cleanup are 'vi' and 'The Bat' email client.
>>>
>>>One problem I notice is that over 2000 messages have badly misdated
>>>'From ' header fields (the first line in the header). The date in the
>>>field is essentially bogus (however, the data in the 'Date:' and the
>>>various 'Received:' fields look correct.)
>>>
>>>So, is there a tool or script which will fix the 'From ' lines?
>
>
>>In vi, if you determined the pattern the "bogus" entries have, you
>>can use the substitution mechanism to correct it
>>:%s/pattern/newpattern/g.
>
>
> Well, with 2000 of them to fix, and each one having its own date/time
> (which should closely correspond to the datestamp in the last
> Received: header line), I think that 'vi' would take too long. If
> there were only 200 posts to fix, I'd just manually do it as described
> (since the time to post here, write/run scripts, etc. probably exceeds
> fixing 'em the brute force way.)
>
>
>
>>Another thought is, how important is the information to you that
>>exists in the 'From ' header entry? Delete the "bogus" entry or
>>leave it alone since it is quite apparent that your interests lie in
>>the data within the message and not by whom or when it was sent.
>
>
> Some mbox-capable mail clients use the datestamp info in the 'From '
> header line for sorting/presentation purposes. "The Bat", which I
> use, does so. The data in the 'From ' line should accurately
> correspond to the timestamp in the last Received: line and the source
> from the "From:" line.
>
> Now, one approach is simply to trash the current 'From ' lines and
> reconstruct them using the timestamp in the last 'Received:' line
> (or the 'Date:' line) and the source from the 'From:' line. This
> requires mbox header parsing when done by machine. That
> would be acceptable.
>
>
>
>>Alan,
>>
>>What was the point you were trying to convey in your response?
>>Usenet is full of trolls trolling for email addresses, as you might
>>know. I do not believe there is any RFC requirement for USENET to
>>use a valid email address (only exception might be the moderated
>>groups since they rely on the SMTP process to exchange data). Have a
>>look and correct me should I be wrong:
>>http://www.google.com/search?hl=en&q=RFC+NNTP&btnG=Google+Search
>
>
> Well, I probably should have set up an email address using fastmail or
> something, and then provide the email address in the body using a
> mangled form to foil spambots.
>
> I simply want to protect my privacy and keep my mailbox free of spam, which
> I believe is my right on Usenet, which I've used for almost
> twenty years now (there's not yet a law or general consensus that
> everybody posting to Usenet must use their real identity and traceback
> information -- it would not surprise me if a substantial number of
> people, many of whom post here, are at least pseudonymous.)
>
> The proof as to whether my message is acceptable or not should not lie
> in the use of anonymity or pseudonymity, but in the "content" of what
> is being said -- is it objective and rational, or is it inflammatory,
> off-topic, dangerous or even illegal (such as distributing
> child-porn). I do understand Alan's position, though, but don't
> believe that full anonymity is evil and somehow anti-social. The focus
> should be on the content, not the person. (Anyway, had I broken the
> law, law enforcement could easily track me down in minutes with a
> simple phone call -- I didn't take the final steps of covering my
> tracks where even the authorities could not locate me -- it's
> inconvenient for me to be so extreme in anonymity, and I don't plan to
> violate the law any time soon! Now, if I lived in a dictatorship, I
> certainly would be more extreme in my anonymity.)
>
> Anyway, I digress. This is comp.mail.misc, and not a newsgroup to
> discuss the legal and social aspects of anonymous posting. If Alan
> would like to further discuss this, let's move that discussion to a
> newsgroup relevant to such discussion. I'll be happy to participate
> to give a rational and objective perspective of why using varying
> levels of anonymity in posting to Usenet should be acceptable so
> long as the posts are on-topic, reasonable and rational, and do not
> violate the law.
>
> Mark
>
Mark,
Scripting the process might be much easier if you were to separate the
mbox formatted emails into a Maildir style layout, where each message is
contained within a single file.
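
As a rough illustration of that split, here is a small Python sketch using the standard-library mailbox module. It is only one possible way to do it: it writes plain numbered files rather than a real Maildir (no cur/new/tmp directories), and the function name, the six-digit numbering, and the .eml suffix are arbitrary choices made for this example. Each output file keeps its original 'From ' line so a later step can inspect or replace it.

```python
import mailbox
import os


def split_mbox(mbox_path, out_dir):
    """Write each message of an mbox file to out_dir/NNNNNN.eml; return the count."""
    os.makedirs(out_dir, exist_ok=True)
    box = mailbox.mbox(mbox_path, create=False)
    count = 0
    try:
        for key in box.iterkeys():
            count += 1
            # from_=True keeps the leading 'From ' line in the output file.
            raw = box.get_bytes(key, from_=True)
            with open(os.path.join(out_dir, f"{count:06d}.eml"), "wb") as out:
                out.write(raw)
    finally:
        box.close()
    return count
```

The numbering follows the order of the messages in the mbox file, so sorting the file names later gives back the original sequence.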
You can then script a process that reads in the header of a message
until it hits the first Received line, extracts the sender info from the
existing 'From ' line, extracts the date stamp, drops the existing
'From ' header entry, writes a new 'From ' header with the updated
information plus the headers seen thus far into a new message file, and
then pipes the rest of the message into that new file.
And you should be done.
The additional difficulty is getting the proper day of the week and
converting the timestamp from the Received line into local time. Both
issues can be resolved if you have the HTTP::Date perl module installed,
which can convert a date string into a unix timestamp (seconds since the
epoch). You can then use localtime to generate all the information you
need.
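
The same conversion can be sketched in Python, swapping HTTP::Date for the standard-library email.utils helpers. This is an illustration under assumptions, not a drop-in tool: both function names are made up for the example, and the date is taken to be whatever follows the last ';' in the Received: value. Formatting through asctime supplies the day of the week, and the use_localtime switch chooses between local time and UTC.

```python
import time
from email.utils import mktime_tz, parsedate_tz


def received_timestamp(received_value):
    """Return epoch seconds parsed from a Received: header value, or None."""
    # In a Received: field the date-time follows the last ';'.
    _, sep, date_part = received_value.rpartition(";")
    if not sep:
        return None
    parsed = parsedate_tz(date_part.strip())
    return None if parsed is None else mktime_tz(parsed)


def from_line_date(epoch_seconds, use_localtime=False):
    """Format epoch seconds in the asctime layout used on 'From ' lines."""
    convert = time.localtime if use_localtime else time.gmtime
    return time.asctime(convert(epoch_seconds))
```

With the default use_localtime=False the result does not depend on the time zone of the machine running the script, which may be preferable when the archive is processed on more than one host.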
Once you have the newly reformatted messages with the corrected 'From '
lines, you can recombine them into the mbox format.
AK
Re: Fixing mangled mbox "From " header lines?
on 16.07.2005 10:46:42 by Frank Slootweg
Mark wrote:
> Hello,
>
> I have an archive of a 10 year old public mailing list that I plan to
> import into GoogleGroups for archival and retrieval. There are over
> 27000 messages in the archive. It is in standard 'mbox' format.
>
> In preparation for uploading to Google, I've been doing a lot of
> cleanup of the archive -- finding duplicate and off-topic posts,
> fixing some mangled headers, removing excess EOL spaces, etc. The
> tools I've used for this cleanup are 'vi' and 'The Bat' email client.
>
> One problem I notice is that over 2000 messages have badly misdated
> 'From ' header fields (the first line in the header). The date in the
> field is essentially bogus (however, the data in the 'Date:' and the
> various 'Received:' fields look correct.)
>
> So, is there a tool or script which will fix the 'From ' lines?
Before going to much trouble, you may want to ask Google if they
actually *care* about 'From ' (*not* 'From: '). In a *News* article,
'From ' is mostly irrelevant [1]. It's 'From: ' and 'Date: ' which
count.
[1] For example in a rnews batchfile, which is the most common (file)
import format for News, 'From ' does not even occur. And in the 'save'
format of my newsreader (tin), it is the date/time the article was
saved, not when it was written/received/etc.