Remove all HTML tags

on 18.05.2011 15:02:16 by mickalo

Hello,

Is there a Perl module available, or a regex method, that will parse an HTML
formatted file then remove ALL the HTML elements so you end up with just the
text content of the file?

Any help/suggestions appreciated.

Mike(mickalo)Blezien
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Thunder Rain Internet Publishing
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
http://learn.perl.org/

Re: Remove all HTML tags

on 18.05.2011 15:13:20 by Shawn H Corey

On 11-05-18 09:02 AM, Mike Blezien wrote:
> Hello,
>
> Is there a Perl module available, or a regex method, that will parse an
> HTML formatted file then remove ALL the HTML elements so you end up with
> just the text content of the file?
>
> Any help/suggestions appreciated.

HTML::TreeBuilder loads HTML::Element, which has an as_text() method. Use
HTML::Element's look_down() to find the body, then call as_text() on it.

http://search.cpan.org/~jfearn/HTML-Tree-4.2/lib/HTML/TreeBuilder.pm
http://search.cpan.org/~jfearn/HTML-Tree-4.2/lib/HTML/Element.pm
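
The approach described above might look roughly like the following. This is
only a minimal sketch, assuming the HTML-Tree distribution is installed; the
helper name html_to_text and the sample markup are made up for illustration
and are not part of either module.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# html_to_text is a hypothetical helper name, not part of HTML::TreeBuilder.
sub html_to_text {
    my ($html) = @_;

    # Parse the markup into a tree of HTML::Element objects.
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Find the first <body> element; the parser normally supplies one
    # even when the input leaves it out.
    my $body = $tree->look_down( _tag => 'body' );

    # as_text() returns the concatenated text of the element's descendants.
    my $text = defined $body ? $body->as_text : '';

    # Release the tree explicitly once we are done with it.
    $tree->delete;

    return $text;
}

# Made-up sample input; a real script would read the HTML from a file.
my $sample = '<html><head><title>Demo</title></head>'
           . '<body><h1>Heading</h1><p>Some <b>bold</b> text.</p></body></html>';

print html_to_text($sample), "\n";
```

To read from a file instead, HTML::TreeBuilder->new_from_file($path) can take
the place of new_from_content(). Note that as_text() simply joins the text
pieces, so the heading and the paragraph come out on one line with nothing
between them.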

--
Just my 0.00000002 million dollars worth,
Shawn

Confusion is the first step of understanding.

Programming is as much about organization and communication
as it is about coding.

The secret to great software: Fail early & often.

Eliminate software piracy: use only FLOSS.


Re: Remove all HTML tags

on 21.05.2011 18:08:39 by Peter Scott

On Wed, 18 May 2011 09:13:20 -0400, Shawn H Corey wrote:
> On 11-05-18 09:02 AM, Mike Blezien wrote:
>> Is there a Perl module available, or a regex method, that will parse an
>> HTML formatted file then remove ALL the HTML elements so you end up
>> with just the text content of the file?
>
> HTML::TreeBuilder loads HTML::Element which has a method as_text(). Use
> HTML::Element::look_down() to find the body, then use as_text()
>
> http://search.cpan.org/~jfearn/HTML-Tree-4.2/lib/HTML/TreeBuilder.pm
> http://search.cpan.org/~jfearn/HTML-Tree-4.2/lib/HTML/Element.pm

That's the answer I would give. I would just add to the OP that what you
think the text content of a page ought to be may not match what this
returns. Text stripped of its formatting runs together, and for the majority
of pages the result is a useless mess. Usually more complex parsing is called
for, based on specific knowledge of the page. If all you want the text
content for is further machine processing such as checksums, concordance,
or indexing, though, then this is fine.
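
As one illustration of the "more complex parsing" idea, the text can be pulled
out element by element rather than in a single as_text() call on the body, so
that each heading, paragraph, or list item lands on its own line. This is a
rough sketch only: the helper name html_to_lines and the choice of tags are
assumptions, and a real page would need its own list.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# html_to_lines is a hypothetical helper name. It returns one string per
# heading, paragraph, or list item, in document order.
sub html_to_lines {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html);

    # In list context look_down() returns every element that satisfies the
    # criteria. The tag list here is an assumption; adjust it for the page.
    # Nested matches (e.g. a <p> inside an <li>) would be reported twice.
    my @blocks = $tree->look_down(
        sub { $_[0]->tag =~ /^(?:h[1-6]|p|li)$/ }
    );

    # Copy the text out before the tree is destroyed.
    my @lines = map { $_->as_text } @blocks;

    $tree->delete;
    return @lines;
}

# Made-up sample input for demonstration.
my $sample = '<html><body><h1>Title</h1>'
           . '<p>First paragraph.</p><p>Second paragraph.</p></body></html>';

print "$_\n" for html_to_lines($sample);
```

Text that sits outside the listed tags (table cells, bare text in a div, and
so on) is skipped by this sketch, which is exactly why the tag list has to be
chosen with the specific page in mind.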

--
Peter Scott
http://www.perlmedic.com/ http://www.perldebugged.com/
http://www.informit.com/store/product.aspx?isbn=0137001274
http://www.oreillyschool.com/courses/perl3/
