Convert some files from html to plaintext

on 11.11.2007 19:02:14 by lucavilla

I have many html files named like these:

c:\dir\femo-black.html
c:\dir\loren-white.html
c:\dir\spark-white.html
c:\dir\kim-black.html
c:\dir\paul-white.html

How can I convert only the files named "c:\dir\*-white.html" to
plaintext files named c:\dir\(original filename)-text.txt?

BTW do you know a better Perl module than HTML::FormatText
(http://search.cpan.org/~sburke/HTML-Format-2.04/lib/HTML/FormatText.pm)
to convert HTML to plaintext?

Re: Convert some files from html to plaintext

on 11.11.2007 19:37:03 by jurgenex

Luca Villa wrote:
> I have many html files named like these:
>
> c:\dir\femo-black.html
> c:\dir\loren-white.html
> c:\dir\spark-white.html
> c:\dir\kim-black.html
> c:\dir\paul-white.html
>
> How can I convert only the files named "c:\dir\*-white.html"

perldoc -f glob

> to plaintext files

Many ways, depending on what you consider the plaintext equivalent of an
HTML file. After all, HTML contains more information than plaintext, and
therefore a lossless conversion is not possible. One way would be to use
lynx with its -dump option, which writes the rendered page as text.
Another way is described in the Perl FAQ entry
"How do I remove HTML from a string?" (see "perldoc -q HTML").

> named c:\dir\(original filename)-text.txt?

That depends on how you generate the target text, e.g. by redirecting the
output of lynx to that file, or by writing to that file yourself, or ...
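
A minimal sketch of the file selection and naming step, using only core
Perl. The helper name target_name is made up for illustration, and the
sketch assumes that "(original filename)-text.txt" means the ".html"
suffix is dropped before "-text.txt" is appended; adjust the substitution
if the suffix should stay. Forward slashes are used in the pattern because
glob treats a backslash as an escape character.

```perl
use strict;
use warnings;

# Hypothetical helper: map "name-white.html" to "name-white-text.txt".
# Assumption: the ".html" suffix is replaced, not kept.
sub target_name {
    my ($html_path) = @_;
    (my $txt_path = $html_path) =~ s/\.html\z/-text.txt/i;
    return $txt_path;
}

# Select only the "*-white.html" files; an unmatched pattern yields no files.
my @sources = glob('c:/dir/*-white.html');

for my $source (@sources) {
    # Only the name mapping is shown here; the conversion itself is separate.
    printf "%s -> %s\n", $source, target_name($source);
}
```

The loop only prints the mapping, so it is safe to run before any
conversion code is plugged in.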

jue

Re: Convert some files from html to plaintext

on 11.11.2007 20:45:08 by lucavilla

> e.g. by redirecting the
> output of lynx to that file or by writing to that file or ...

Isn't there an equivalent of the Lynx rendering engine for Perl?
I know that "lynx -dump" does a good conversion, but I fear that
calling an external program thousands of times is a waste of
resources...
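
For comparison, a sketch of the in-process route: one Perl process loops
over the files and renders each with HTML::TreeBuilder and
HTML::FormatText, so no external program is started per file. The helper
name html_file_to_text, the margins, and the ".html" to "-text.txt"
naming are assumptions for illustration. Note that the HTML::FormatText
documentation states that table formatting is not implemented, so this
will not match the table layout of a text browser.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical helper: render one HTML file to a plain-text string.
sub html_file_to_text {
    my ($html_path) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($html_path)
        or die "Cannot parse $html_path: $!";
    # Margins are arbitrary example values.
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
    my $text = $formatter->format($tree);
    $tree->delete;    # free the parse tree before the next file
    return $text;
}

for my $source (glob('c:/dir/*-white.html')) {
    # Assumed naming: replace ".html" with "-text.txt".
    (my $target = $source) =~ s/\.html\z/-text.txt/i;
    open my $out, '>', $target or die "Cannot write $target: $!";
    print {$out} html_file_to_text($source);
    close $out or die "Cannot close $target: $!";
}
```

Character encoding is not handled here; files with non-ASCII content may
need an explicit encoding layer on the output handle.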

Re: Convert some files from html to plaintext

on 11.11.2007 20:55:58 by jurgenex

Luca Villa wrote:
>> e.g. by redirecting the
>> output of lynx to that file or by writing to that file or ...
>
> Isn't there an equivalent of the Lynx rendering engine for Perl?

Why would Perl do HTML rendering? Anyway, which part of "perldoc -q HTML"

How do I remove HTML from a string?

don't you understand?

jue

Re: Convert some files from html to plaintext

on 11.11.2007 21:02:54 by lucavilla

Jürgen Exner

the problem is that converting HTML to a good plain-text equivalent
is not a simple operation of "removing HTML from a string".

Think, for example, of an HTML table with columns of different widths,
etc.
Text-mode browsers like Lynx, Links, ELinks and w3m do a good job of
presenting HTML tables in plain text. I'm looking for something of
this quality...

Re: Convert some files from html to plaintext

on 12.11.2007 02:36:11 by unknown

Post removed (X-No-Archive: yes)