HTML text extraction

on 18.08.2009 10:37:41 by leledumbo

Usually, a website gives a preview of its articles by extracting some of the
first characters. This is easy if the article is plain text, but what if
it's HTML text? For instance, if I have the full text:

bla bla bla

item 1

item 2

item 3

and I take the first 40 characters, it would result in:

bla bla bla

item

As you can see, the tags are left incomplete, and that might break other
content below the preview. I need a way to solve this problem.

--
View this message in context: http://www.nabble.com/HTML-text-extraction-tp25020687p25020687.html
Sent from the PHP - General mailing list archive at Nabble.com.

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

Re: HTML text extraction

on 18.08.2009 10:41:07 by Ashley Sheridan

On Tue, 2009-08-18 at 01:37 -0700, leledumbo wrote:
> Usually, a website gives preview of its articles by extracting some of the
> first characters. This is easy if the article is a pure text, but what if
> it's a HTML text? For instance, if I have the full text:
>
> bla bla bla
>
> item 1
>
> item 2
>
> item 3
>
> and I take the first 40 characters, it would result in:
>
> bla bla bla
>
> item
>
> As you can see, the tags are incomplete and it might break other texts below
> it (I mean, other than this preview). I need a way to solve this problem.
>
You could do a couple of things:

* Extract all the content and use strip_tags() to remove all the
  HTML markup. In the example you gave, the result might look a bit
  odd, since the content suggests it was originally a list.
* Access the extracted content through the DOM and grab the
  textual content you need using node values. That way, you can
  limit the preview to a specific character count of content and,
  with a bit of work, preserve the original markup tags too.
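
As a rough illustration of the DOM route, here is a minimal sketch. The
function names copyTruncated() and htmlPreview() are made up for this
example, and the sample markup is an assumption, since the tags of the
original example did not survive in the archive. It counts only text
characters against the limit and copies the enclosing elements, so every
tag it opens is also closed. It assumes the dom and mbstring extensions.

```php
<?php
// Hypothetical helper: copy the children of $src into $dst until
// $remaining text characters have been used up.
function copyTruncated(DOMNode $src, DOMNode $dst, int &$remaining): void
{
    foreach ($src->childNodes as $child) {
        if ($remaining <= 0) {
            return;
        }
        if ($child instanceof DOMText) {
            // Only text counts against the limit, never the markup.
            $text = mb_substr($child->nodeValue, 0, $remaining, 'UTF-8');
            $remaining -= mb_strlen($text, 'UTF-8');
            $dst->appendChild($dst->ownerDocument->createTextNode($text));
        } elseif ($child instanceof DOMElement) {
            // Shallow clone keeps the tag name and attributes only.
            $copy = $dst->appendChild($child->cloneNode(false));
            copyTruncated($child, $copy, $remaining);
        }
    }
}

// Hypothetical helper: return an HTML preview with at most $limit
// characters of text and balanced tags.
function htmlPreview(string $html, int $limit): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy HTML
    // The XML declaration is a common trick to make loadHTML() read UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = $doc->createElement('div'); // detached container for the copy
    copyTruncated($wrapper, $out, $limit);

    $result = '';
    foreach ($out->childNodes as $node) {
        $result .= $doc->saveHTML($node);
    }
    return $result;
}

// Assumed sample markup: one paragraph followed by a three-item list.
$html = '<p>bla bla bla</p><ul><li>item 1</li><li>item 2</li><li>item 3</li></ul>';
echo htmlPreview($html, 15), "\n";
```

The serializer may insert line breaks between block-level tags, so the
output is best compared after normalising whitespace. Whitespace-only text
between tags also counts towards the limit in this sketch.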

Thanks,
Ash
http://www.ashleysheridan.co.uk

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

Re: HTML text extraction

on 18.08.2009 12:10:25 by Richard Heyes

Hi,

> ...

The easy way (Back to the Future 2 anyone...?) would be to use
strip_tags() first:

http://uk.php.net/strip_tags
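
A minimal sketch of that approach, for what it's worth. The function name
plainTextPreview() is made up for this example, and the sample markup is
an assumption, since the tags of the original example did not survive in
the archive. It pads each tag with a space before stripping, so that text
from adjacent elements does not run together, then collapses whitespace
and cuts to the limit. It assumes the mbstring extension.

```php
<?php
// Hypothetical helper: plain-text preview of an HTML fragment.
function plainTextPreview(string $html, int $limit): string
{
    // Put a space in front of every tag so "</p><ul>" leaves a gap
    // between words once the tags are gone.
    $text = strip_tags(str_replace('<', ' <', $html));
    // Turn entities such as &amp; back into characters before counting.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace into single spaces.
    $text = trim(preg_replace('/\s+/u', ' ', $text));
    return mb_substr($text, 0, $limit, 'UTF-8');
}

// Assumed sample markup: one paragraph followed by a three-item list.
$html = '<p>bla bla bla</p><ul><li>item 1</li><li>item 2</li><li>item 3</li></ul>';
echo plainTextPreview($html, 16), "\n";
```

Because the result contains no markup at all, it cannot leave a tag open.
The trade-off is that the list structure is lost in the preview.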

--
Richard Heyes
HTML5 graphing: RGraph - www.rgraph.net (updated 8th August)
Lots of PHP and Javascript code - http://www.phpguru.org

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

Re: HTML text extraction

on 22.08.2009 04:33:54 by Manuel Lemos

Hello,

on 08/18/2009 05:37 AM leledumbo said the following:
> Usually, a website gives preview of its articles by extracting some of the
> first characters. This is easy if the article is a pure text, but what if
> it's a HTML text? For instance, if I have the full text:
>
> bla bla bla
>
> item 1
>
> item 2
>
> item 3
>
> and I take the first 40 characters, it would result in:
>
> bla bla bla
>
> item
>
> As you can see, the tags are incomplete and it might break other texts below
> it (I mean, other than this preview). I need a way to solve this problem.

You may want to try these HTML parser classes. They can parse (and even
validate) HTML and return an array of tag or data elements. You can use
them to pick the first tags and data you need. Then you use the
RewriteElement function to regenerate the HTML.
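
To illustrate the general idea only, here is a small standalone sketch. It
does not use the classes linked below, and it is not their API. The
function name truncateTokens() and the sample markup are assumptions made
for this example. It splits the HTML into tag and text tokens with
preg_split(), keeps tokens until the text limit is reached, and then
closes whatever tags are still open. It assumes reasonably well-formed
input and the mbstring extension.

```php
<?php
// Hypothetical helper: keep the first $limit text characters of $html and
// close any tags left open. Not a validating parser.
function truncateTokens(string $html, int $limit): string
{
    // Split into tag tokens and text tokens, keeping the tags themselves.
    $tokens = preg_split(
        '/(<[^>]+>)/',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
    $void = ['br', 'hr', 'img', 'input', 'meta', 'link']; // never closed
    $open = []; // stack of currently open tag names
    $out = '';

    foreach ($tokens as $token) {
        if (preg_match('/^<[^>]+>$/', $token)) {
            // Tag token: track opening and closing tags on the stack.
            if (preg_match('/^<(\/?)([a-z][a-z0-9]*)/i', $token, $m)) {
                $name = strtolower($m[2]);
                if ($m[1] === '/') {
                    if (end($open) === $name) {
                        array_pop($open);
                    }
                } elseif (!in_array($name, $void, true)
                    && substr($token, -2) !== '/>') {
                    $open[] = $name;
                }
            }
            $out .= $token;
            continue;
        }
        // Text token: only text counts against the limit.
        $piece = mb_substr($token, 0, $limit, 'UTF-8');
        $limit -= mb_strlen($piece, 'UTF-8');
        $out .= $piece;
        if ($limit <= 0) {
            break;
        }
    }

    // Regenerate the closing tags for everything still open.
    while ($open) {
        $out .= '</' . array_pop($open) . '>';
    }
    return $out;
}

// Assumed sample markup: one paragraph followed by a three-item list.
$html = '<p>bla bla bla</p><ul><li>item 1</li><li>item 2</li><li>item 3</li></ul>';
echo truncateTokens($html, 15), "\n";
```

A real parser, such as the classes mentioned here, would cope with
malformed or hostile markup far better than a regular expression does.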

http://www.phpclasses.org/secure-html-filter

--

Regards,
Manuel Lemos

Find and post PHP jobs
http://www.phpclasses.org/jobs/

PHP Classes - Free ready to use OOP components written in PHP
http://www.phpclasses.org/
