Convert some files from html to plaintext
on 11.11.2007 19:02:14 by lucavilla
I have many html files named like these:
c:\dir\femo-black.html
c:\dir\loren-white.html
c:\dir\spark-white.html
c:\dir\kim-black.html
c:\dir\paul-white.html
How can I convert only the files named "c:\dir\*-white.html" to
plaintext files named c:\dir\(original filename)-text.txt?
BTW do you know a better Perl module than HTML::FormatText (
http://search.cpan.org/~sburke/HTML-Format-2.04/lib/HTML/FormatText.pm)
to convert HTML to plaintext?
Re: Convert some files from html to plaintext
on 11.11.2007 19:37:03 by jurgenex
Luca Villa wrote:
> I have many html files named like these:
>
> c:\dir\femo-black.html
> c:\dir\loren-white.html
> c:\dir\spark-white.html
> c:\dir\kim-black.html
> c:\dir\paul-white.html
>
> How can I convert only the files named "c:\dir\*-white.html"
perldoc -f glob
> to plaintext files
Many ways, depending on what you consider the plaintext equivalent of an
HTML file. After all, HTML contains more information than plaintext and
therefore a lossless conversion is not possible. One way would be to use
lynx with the text-output option.
Another way is described in the Perl FAQ: "perldoc -q HTML"
"How do I remove HTML from a string?"
> named c:\dir\(original filename)-text.txt?
That depends on how you generate the target text, e.g. by redirecting the
output of lynx to that file, or by writing to that file, or ...
jue
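A minimal sketch of the file-selection and naming part discussed above, using only core Perl. The names target_name and convert_matching are made up for illustration, and the converter is passed in as a code reference so that any HTML-to-text method (lynx, a CPAN formatter, ...) can be plugged in. It assumes the pattern is written with forward slashes and contains no spaces, since glob splits its argument on whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: "c:/dir/loren-white.html" -> "c:/dir/loren-white-text.txt"
sub target_name {
    my ($path) = @_;
    (my $out = $path) =~ s/\.html\z/-text.txt/i;
    return $out;
}

# Hypothetical helper: convert every file matching $pattern.
# $converter is a code ref that takes an HTML string and returns plain text.
# Returns the list of files written.
sub convert_matching {
    my ($pattern, $converter) = @_;
    my @written;
    for my $in (glob $pattern) {
        # Slurp the whole HTML file.
        open my $ifh, '<', $in or die "Cannot read $in: $!";
        my $html = do { local $/; <$ifh> };
        close $ifh;

        # Write the converted text next to the source file.
        my $out = target_name($in);
        open my $ofh, '>', $out or die "Cannot write $out: $!";
        print {$ofh} $converter->($html);
        close $ofh or die "Cannot close $out: $!";
        push @written, $out;
    }
    return @written;
}

# Example run: perl convert.pl "c:/dir/*-white.html"
# The tag-stripping regex below is only a crude stand-in for a real converter.
if (@ARGV) {
    my $stand_in = sub { my ($html) = @_; $html =~ s/<[^>]*>//g; return $html };
    print "$_\n" for convert_matching($ARGV[0], $stand_in);
}
```

Only files matching the pattern are touched, so with "c:/dir/*-white.html" the *-black.html files are left alone.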
Re: Convert some files from html to plaintext
on 11.11.2007 20:45:08 by lucavilla
> e.g. by redirecting the
> output of lynx to that file or by writing to that file or ...
Isn't there an equivalent of the Lynx rendering engine for Perl?
I know that "lynx -dump" does a good conversion, but I fear that
calling an external program thousands of times is a waste of
resources...
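For comparison, an in-process conversion with the HTML::FormatText module mentioned in the first post could look roughly like the sketch below. It assumes HTML::TreeBuilder and HTML::FormatText are installed from CPAN; html_to_text is a made-up name, and the margin values are arbitrary. Note that, according to its documentation, HTML::FormatText does not implement table formatting, so this is not a Lynx-quality renderer.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN, not core
use HTML::FormatText;    # CPAN, not core

# Hypothetical helper: turn an HTML string into wrapped plain text
# without starting an external program.
sub html_to_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
    my $text = $formatter->format($tree);
    $tree->delete;       # release the parse tree
    return $text;
}

print html_to_text('<h1>Title</h1><p>Some <b>bold</b> text.</p>');
```

Because everything runs inside one Perl process, the per-file cost is parsing and formatting only, with no process start-up for each file.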
Re: Convert some files from html to plaintext
on 11.11.2007 20:55:58 by jurgenex
Luca Villa wrote:
>> e.g. by redirecting the
>> output of lynx to that file or by writing to that file or ...
>
> Isn't there an equivalent of the Lynx rendering engine for Perl?
Why would Perl do HTML rendering? Anyway, which part of "perldoc -q HTML"
How do I remove HTML from a string?
don't you understand?
jue
Re: Convert some files from html to plaintext
on 11.11.2007 21:02:54 by lucavilla
Jürgen Exner,
the problem is that converting HTML to a good equivalent in plain text
is not a simple operation of "removing HTML from a string".
Think, for example, of an HTML table with columns of different widths,
etc.
Text-mode browsers such as Lynx, Links, ELinks and w3m do a good job of
presenting HTML tables in plain text. I'm searching for something of
this quality...
Re: Convert some files from html to plaintext
on 12.11.2007 02:36:11 by unknown
Post removed (X-No-Archive: yes)