Reliable html2plain/text conversion using procmailrc

on 09.12.2006 23:13:48 by Oliver Pfeiffer

Hi,

I've seen many basic approaches that convert incoming mail from HTML to
plain text (text/plain) by piping the body through html2text, lynx or w3m.
But this kind of mechanism isn't sufficient, since most HTML mails are
wrapped in a complex content type (e.g. multipart/alternative) combined
with various character encodings (e.g. iso-8859-1) and transfer encodings
(e.g. quoted-printable).

Thus I am looking for a smart all-in-one script that handles these aspects
(the message format of RFC 822 plus the MIME extensions of RFCs 2045-2049)
and reliably converts any incoming mail, so that I can safely get rid of
this annoying HTML stuff.

Note: I am someone who does not want HTML in his inbox, because I hate
these large, blinking messages. On the other hand, I definitely want to
leave everybody who likes HTML free to keep sending it. I can't start
teaching other people about the badness of HTML messages with
auto-responders or similar didactic strategies, since many of the people I
correspond with are business contacts (e.g. customers). So I just want to
apply a sophisticated conversion in my own small world to make myself
happier. ;)

Does anybody know a reliable solution for this problem? My first thought
(if I can't find a dedicated tool for the job) is that this kind of
conversion needs a fully RFC 822/MIME-compliant parsing framework to decode
the message, convert the HTML parts (or drop them in the case of
multipart/alternative), and encode the message again, with robust handling
of all possible encodings and especially of attachments. The conversion
itself can easily be done with html2text (Martin Bayer), since this tool
provides a very accurate and flexible conversion. The most reliable RFC 822
parsing framework I know is JavaMail, provided by the SDN (Sun Developer
Network). With this framework it takes only a few lines of code to decode,
process and encode *any* RFC 822 message passed on the command line ...

But nobody really likes to fork a JRE instance for each message during
high-throughput procmail processing. So I would appreciate any comments on
this topic!
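
As a rough illustration of the "drop in case of multipart/alternative" step
described above, here is a minimal sketch that uses Python's standard email
package instead of JavaMail. It is not a complete converter: the function
name drop_html_alternatives and the SAMPLE message are made up for this
example, and the sketch only removes text/html parts that have a text/plain
sibling. Attachments and HTML-only mails are left untouched.

```python
from email import message_from_string, policy


def drop_html_alternatives(msg):
    """Drop text/html siblings wherever a text/plain alternative exists."""
    if not msg.is_multipart():
        return msg
    parts = msg.get_payload()
    if msg.get_content_type() == "multipart/alternative":
        # Only drop the HTML when a plain-text alternative is really there.
        if any(p.get_content_type() == "text/plain" for p in parts):
            parts = [p for p in parts if p.get_content_type() != "text/html"]
            msg.set_payload(parts)
    for part in parts:
        drop_html_alternatives(part)  # recurse into nested containers
    return msg


# Hypothetical test message, loosely modelled on a plain+HTML mail.
SAMPLE = """\
From: sender@example.com
To: me@example.com
Subject: demo
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

Gr=FC=DFe
--XYZ
Content-Type: text/html; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

<html><body><b>Gr=FC=DFe</b></body></html>
--XYZ--
"""

if __name__ == "__main__":
    parsed = message_from_string(SAMPLE, policy=policy.default)
    cleaned = drop_html_alternatives(parsed)
    print([part.get_content_type() for part in cleaned.walk()])
```

To run as a procmail filter, a script like this would read the message from
standard input and write the result to standard output instead of using the
built-in sample. Note that the container stays multipart/alternative with a
single remaining part; flattening it would be a further step.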

--
Grüße - Regards
Oliver Pfeiffer
ICQ-ID 84320006

Re: Reliable html2plain/text conversion using procmailrc

on 10.12.2006 00:14:51 by Alan Connor

On comp.mail.misc, "Oliver Pfeiffer" wrote:


> Hi,
>
> I've seen many basic approaches to convert incoming mails
> from html to plain/text by piping the body through html2text,
> lynx or w3m. But this kind of mechanism isn't sufficient
> since most html mails are wrapped by using a complex
> content-type (e.g. multipart/alternative) combined with various
> character-encodings (e.g. iso-8859-1) and transfer-encodings
> (e.g. quoted-printable).
>
> Thus I am looking for a smart all-in-one script, that
> incorporates these sophisticated aspects (as described in
> RFC822) and provides a reliable conversion of any incoming mail
> to safely get rid of these annoying html stuff.
>
> Note: I am the person who does not want to get html stuff
> in his inbox, because I hate these large and blinking
> messages. But on the other hand, I definitely want to provide
> the freedom to everybody - who like so - to always send html
> messages. I can't start teaching other people about the
> badness of html messages by using auto-responders or similar
> didactic strategies, since many of my communication parties are
> business contacts (e.g. customers). Thus I just want to apply a
> sophisticated conversion in my own small world to make me more
> happy. ;)
>
> Does anybody know a reliable solution for this problem? In
> my first thoughts (if I won't find a dedicated tool for this
> job) it quickly becomes obvious that this kind of conversion
> needs a fully RFC822 compatible parsing framework to decode the
> message, convert (or drop in case of multipart/alternative)
> the html parts, and encode the message again using a
> robust handling of all possible encodings and especially
> attachments. The main conversion can be easily done by using
> html2text (Martin Bayer) since this tool provides a very
> accurate and flexible conversion. The most reliable RFC822
> parsing framework I know is JavaMail provided by the SDN
> (Sun-Developer-Network). Using this framework it needs only a
> few lines of code to decode, process and encode *any* RFC822
> message passed on command line ...
>
> But nobody really like to fork a JRE task instance during a
> high throughput procmail processing. Thus I would appreciate
> any comments to this topic!

As soon as you find the tool, they'll change it all around again
to evade that tool. Because they knew about it before you did.
It's their job.

I use the basic conversion that you refer to above:

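:0
* ^Content-Type:.*html
{
    # Filter the body (b) through w3m to render the HTML as plain text.
    :0 bf
    | w3m -dump -cols 80 -T text/html

    # Filter the header (h) to relabel the message as text/plain.
    :0 hf
    | formail -I"Content-Type: text/plain"
}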

If that doesn't work, the mail is history. If the person
sending it is remotely important to me, I'll write them
back and tell them to try again with plain text and to
not send me attachments without asking first.

If they won't comply _they_ are history.

That's the problem with too many businesspersons: They
give up their standards in the pursuit of money.

And pretend that they aren't.

Alan

--
http://home.earthlink.net/~alanconnor
A Brief Introduction to Challenge-Response Systems:
http://home.earthlink.net/~alanconnor/cr.html

Re: [kook] Re: Reliable html2plain/text conversion using procmailrc

on 10.12.2006 00:14:55 by Alan Connor FGA

Thanks for your kookfart, Beavis.

--


Info about "Alan Connor"

Alan "The Usenet Beavis" Connor is a good friend of Bigfoot:
http://tinyurl.com/23r3f

A couple of years ago he was kidnapped and raped by Xena,
the Warrior Princess: http://tinyurl.com/2gjcy

Beavis believes that the MSBlast virus of yesteryear was explicitly
targeting him, for some inexplicable reason: http://tinyurl.com/ifrt

Beavis belongs to a UFO cult: http://tinyurl.com/2hhdx
Beavis's life in a UFO cult: http://tinyurl.com/24jqm
Beavis knows all about network security: http://tinyurl.com/5qqb6
And he's also a search engine expert: http://tinyurl.com/9pjnt


<1164724734.389844@nnrp2.phx1.gblx.net>
"But if you must know, Alans' name is Bruce Burhans, and he lives in
Bellingham WA. To his hippie friends he calls himself "Tom Littlefoot"
**Google Tom Littlefoot, Bruce Burhans and "Wildwood"**.

Bruce has some serious mental problems and spends a lot of time as an
in-patient at the big mental hospital in Bellingham, when he's not
hospitalized, he posts to usenet. In every group he posts to he comes off as
some sort of expert in the subject at hand, and when anyone disagrees (and
they will, he sees to that) he starts in on his trollery.

Again, Bruce is a true Professional Usenet Troll. It is his entertainment
and it's what he lives for."


http://www.pearlgates.net/nanae/kooks/ac/fga.shtml
http://groups.google.com/groups/profile?enc_user=MQ9uxRYAAAA X2tAp-itjMPAOxLgFwCc3_gRbb05PKyTO4L-MEqh3HQ&hl=en
http://www.pearlgates.net/nanae/kooks/ac/
http://linuxmafia.com/faq/Mail/challenge-response.html
http://www.spamcop.net/fom-serve/cache/329.html#CR
http://www.gatago.com/authors_pgs/13650.html
http://blog.bananasplit.info/?p=84
http://tinyurl.com/ifrt
http://tinyurl.com/3h6a5
http://tinyurl.com/ys6z4

Also in the headers for alan to read.

Re: Reliable html2plain/text conversion using procmailrc

on 10.12.2006 00:18:16 by keeling

Oliver Pfeiffer :
> Hi,
>
> I've seen many basic approaches to convert incoming mails from html to
> plain/text by piping the body through html2text, lynx or w3m. But this kind
> of mechanism isn't sufficient since most html mails are wrapped by using a
> complex content-type (e.g. multipart/alternative) combined with various
> character-encodings (e.g. iso-8859-1) and transfer-encodings (e.g. quoted-
> printable).
>
> Thus I am looking for a smart all-in-one script, that incorporates these

Google for mutt.octet.filter perhaps?


--
Any technology distinguishable from magic is insufficiently advanced.
(*) http://www.spots.ab.ca/~keeling Linux Counter #80292
- - http://www.faqs.org/rfcs/rfc1855.html Please, don't Cc: me.
Spammers! http://www.spots.ab.ca/~keeling/emails.html

Re: Reliable html2plain/text conversion using procmailrc

on 10.12.2006 03:57:34 by Sam

Usenet Beavis writes:

> As soon as you think that I ran out of kookfarts, I'll rip another
> one.

You can always rely on Beavis.

> I used to be a normal person, but now I'm just a Beavis.

And I hope that you never become normal again. I like smacking your bitch
up.

> And if you're not convinced that I'm Usenet's laughing stock,
> just read the Beavis FAQ, and your doubts will vanish completely.

Good idea.

> That's the problem with me: I was dropped on my head, as a child.

No need to state the obvious.

> Beavis
>
> --
> http://www.geocities.com/suhatrasabib
> A Brief Introduction to the Usenet Beavis
> http://www.pearlgates.net/nanae/kooks/ac/



Re: Reliable html2plain/text conversion using procmailrc

on 10.12.2006 12:00:18 by Alan Clifford

On Sat, 9 Dec 2006, Oliver Pfeiffer wrote:

OP>
OP> Note: I am the person who does not want to get html stuff in his inbox,
OP> because I hate these large and blinking messages. But on the other hand, I

There is often a text version in the email. Have you set your mail reader
to prefer the plain text version rather than the html version?

--
Alan

( If replying by mail, please note that all "sardines" are canned.
There is also a password autoresponder but, unless this a very
old message, a "tuna" will swim right through. )

Re: Reliable html2plain/text conversion using procmailrc

on 10.12.2006 12:10:55 by Oliver Pfeiffer

Alan Connor wrote in
news:slrnenmgpj.1ks.i3x9mdw@b29x3m.invalid:

> As soon as you find the tool, they'll change it all around again
> to evade that tool. Because they knew about it before you did.
> It's their job.

I do not want to convert SPAM messages (these are filtered out by
SpamAssassin here), I just want to convert incoming HTML mails to
Plain/Text. Most of these multipart/alternative mails are sent by Outlook
users to me and I'm sure that Microsoft won't change Outlook to evade a
potential HTML2PlainText conversion script here. ;)

--
Grüße - Regards
Oliver Pfeiffer
ICQ-ID 84320006

Re: Reliable html2plain/text conversion using procmailrc

on 10.12.2006 12:16:44 by Oliver Pfeiffer

"s. keeling" wrote in
news:slrnenmh1o.lrk.keeling@heretic.spots.ab.ca:

>> Thus I am looking for a smart all-in-one script, that incorporates
>> these
>
> Google for mutt.octet.filter perhaps?

This filter script looks like a generic converter that provides a preview
of any embedded content (including LaTeX, compressed archives and other
stuff). What I want instead is just to convert HTML messages safely to
plain text, so that they never waste space in my inbox and I never see
blinking phrases.

Is it possible to apply a similar conversion with the script you referred to?

--
Grüße - Regards
Oliver Pfeiffer
ICQ-ID 84320006

Re: Reliable html2plain/text conversion using procmailrc

on 10.12.2006 13:24:22 by Oliver Pfeiffer

Alan Clifford wrote in
news:Pine.LNX.4.60.0612101059030.23496@nard.clifford.ac:

> There is often a text version in the email. Have you set your mail
> reader to prefer the plain text version rather than the html version?

Sure, in the case of multipart/alternative I will see only the small
text/plain part, but the larger HTML part is stored in my inbox anyway, and
many messages don't provide both content types. So I want a safe conversion
that is applied to every message and gets rid of the HTML parts during mail
processing, long before the message reaches my mail software.
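
For the messages that carry only an HTML body, the transfer encoding and
character set have to be undone before any HTML-to-text step sees the
markup. The following is a minimal sketch of that order of operations,
assuming Python's standard email package. The tiny tag stripper is only a
stand-in for a real converter such as html2text or w3m, and the names
html_to_text, convert_html_only and SAMPLE are hypothetical.

```python
from email import message_from_string, policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Very small stand-in for html2text/w3m: keeps text, drops tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in ("br", "p", "div"):
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip() + "\n"


def convert_html_only(msg):
    """Rewrite a single-part text/html message as text/plain (UTF-8)."""
    if msg.get_content_type() == "text/html":
        html = msg.get_content()  # undoes quoted-printable/base64 and the charset
        msg.set_content(html_to_text(html))  # also replaces the Content-* headers
    return msg


# Hypothetical HTML-only message in iso-8859-1, quoted-printable.
SAMPLE = """\
From: sender@example.com
Subject: demo
MIME-Version: 1.0
Content-Type: text/html; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

<html><body><p>Gr=FC=DFe aus <b>K=F6ln</b></p></body></html>
"""

if __name__ == "__main__":
    converted = convert_html_only(message_from_string(SAMPLE, policy=policy.default))
    print(converted.get_content_type())
    print(converted.get_content())
```

The point of the sketch is the ordering: decode first, convert second,
re-encode last. Piping the raw body straight into a converter skips the
first step, which is where quoted-printable and non-ASCII mails go wrong.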

--
Grüße - Regards
Oliver Pfeiffer
ICQ-ID 84320006

Re: Reliable html2plain/text conversion using procmailrc

on 10.12.2006 21:20:37 by Frank Slootweg

Oliver Pfeiffer wrote:
> Alan Clifford wrote in
> news:Pine.LNX.4.60.0612101059030.23496@nard.clifford.ac:
>
> > There is often a text version in the email. Have you set your mail
> > reader to prefer the plain text version rather than the html version?
>
> Sure, in case of multipart/alternative I will see the small plain/text part
> only, but the larger HTML part is stored in my inbox anyway and many
> messages doesn't provide both content-types. Thus I want to apply a safe
> conversion that is applied to any message to get rid of HTML parts during
> the mail processing long before I receive the message in my mail software.

As Alan hinted at, this is really a function of your MUA ('mailer').

Anyway, for which platform is this? You posted with Xnews, i.e. on
(MS-)Windows. Is your MUA/mailer platform also (MS-)Windows? If so, then
please realize that even Outlook Express does what you want, i.e. show
only the text/plain part of multipart/alternative plain+html email and
strip the HTML from a text/html-only email. I sure hope your MUA/mailer,
whatever it is, isn't *worse* than OE!

> --
> Gr??e - Regards
    ^^

BTW, you may want to use a newsreader which emits the mandatory
"charset=..." specification. IIRC, there is a 'mimeproxy' or some such
plugin/addon for Xnews.

Re: Reliable html2plain/text conversion using procmailrc

on 14.12.2006 14:45:13 by nospam

On Sun, 10 Dec 2006, Oliver Pfeiffer wrote:

> I do not want to convert SPAM messages (these are filtered out by
> SpamAssassin here), I just want to convert incoming HTML mails to
> Plain/Text. Most of these multipart/alternative mails are sent by Outlook
> users to me and I'm sure that Microsoft won't change Outlook to evade a
> potential HTML2PlainText conversion script here. ;)

Possibly what you want is similar to what I wanted, and implemented in a
procmail rule. I wanted at the same time to get rid of the HTML
attachments or HTML parts of a mime multipart, and also get rid of
useless quoted text added by top posters. I do not want to store this
stuff in my folders.

I considered piping the message into a script from Pine (it's relatively
easy to write the script, but is not easy to tell pine to archive the
result in a folder) or using procmail at delivery time.

I ended up with two procmail steps, in one I "analyse" the message with
awk and decide which parts are to be removed, and in the next I get rid
of those records.

I also replace the removed part with a reminder, and keep a backup copy
of the original message (in a folder which is cleaned weekly).
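
The "replace the removed part with a reminder" idea can be sketched as
follows. This is not the awk-based procmail setup described here, just a
minimal illustration of the same idea using Python's standard email
package; the function name replace_html_parts, the wording of the reminder
and the SAMPLE message are assumptions made for the example.

```python
from email import message_from_string, policy


def replace_html_parts(msg, note="[HTML part removed by mail filter: {size} bytes]"):
    """Swap every text/html leaf part for a short text/plain reminder.

    Returns the number of parts that were replaced.
    """
    removed = 0
    for part in msg.walk():
        if part.is_multipart() or part.get_content_type() != "text/html":
            continue
        # Size of the decoded HTML, recorded in the reminder text.
        size = len(part.get_payload(decode=True) or b"")
        part.set_content(note.format(size=size) + "\n")
        removed += 1
    return removed


# Hypothetical message with a plain-text body and an HTML part.
SAMPLE = """\
From: sender@example.com
Subject: demo
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset="us-ascii"

See the attached page.
--XYZ
Content-Type: text/html; charset="us-ascii"

<html><body><blink>Hello</blink></body></html>
--XYZ--
"""

if __name__ == "__main__":
    message = message_from_string(SAMPLE, policy=policy.default)
    print(replace_html_parts(message), "HTML part(s) replaced")
    print(message.as_string())
```

Keeping a backup copy of the original, as described above, would still be
done outside such a script, for example by a separate delivery recipe that
runs before the filter.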

I have documented it in
http://sax.iasf-milano.inaf.it/~lucio/Procmail/noquotenohtml.html
(part of my broader setup described in
http://sax.iasf-milano.inaf.it/~lucio/Procmail/)

You might freely take inspiration from what I did (usual no-guarantee
clause etc. etc.)


--
----------------------------------------------------------------------
nospam@mi.iasf.cnr.it is a newsreading account used by more persons to
avoid unwanted spam. Any mail returning to this address will be rejected.
Users can disclose their e-mail address in the article if they wish so.