Scan web pages and compose summary

on 17.01.2008 21:48:49 by solk

Hello.

I am looking for a way to read an HTML file and create
a short summary (like the one shown in Google results, for example),
which ought to be the first few lines of welcome text or so.

Does anyone have any idea how to do this? (I searched a lot,
but all I found was how to extract meta tags.)

Thanks

Re: Scan web pages and compose summary

on 17.01.2008 23:11:35 by adwatson

Well, the tricky part is that you'll need to decide what text to grab
and show from the file, which is why there's a meta description tag
for the purpose. I believe Google grabs the text surrounding a search
term and displays that if there's no meta description tag to use, so
if you're actually searching for a term you could do something like
that.
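
A rough sketch of that idea, written in Python purely for illustration (the names snippet, radius and _DescriptionAndText are made up for this example, and this is only a guess at the general approach, not how Google actually does it): use the meta description if the page has one, otherwise return the text surrounding the search term.

```python
import re
from html.parser import HTMLParser


class _DescriptionAndText(HTMLParser):
    """Collects the meta description and the visible text of a page."""

    SKIPPED = ("script", "style", "title")

    def __init__(self):
        super().__init__()
        self.description = None
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the text pieces and collapse runs of whitespace.
        return re.sub(r"\s+", " ", " ".join(self._chunks)).strip()


def snippet(html, term, radius=60):
    """Return the meta description, or else the text around `term`."""
    parser = _DescriptionAndText()
    parser.feed(html)
    parser.close()
    if parser.description:
        return parser.description.strip()
    text = parser.text()
    pos = text.lower().find(term.lower())
    if pos == -1:
        # Term not found: fall back to the start of the visible text.
        return text[: 2 * radius]
    start = max(0, pos - radius)
    end = min(len(text), pos + len(term) + radius)
    return text[start:end]
```

The radius is counted in characters on each side of the term, so the result may start or end in the middle of a word; trimming to word boundaries is left out to keep the sketch short.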


Re: Scan web pages and compose summary

on 18.01.2008 11:51:12 by Jensen Somers

Hello,

I can recommend Snoopy (http://snoopy.sourceforge.net/). It can
retrieve an entire web page, follow links and so on. The result is
the same HTML source you see when you do a "view source" in your web
browser. From there you can use substr() to jump to a certain section
of the source (e.g. right after the body tag), strip all HTML tags
and save the text output.
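
To make the steps concrete, here is a minimal sketch in Python rather than PHP, assuming the page source has already been fetched into a string. The function name first_lines and the max_chars parameter are invented for this example. It jumps to just after the body tag, drops script/style blocks and comments, strips the remaining tags and keeps the first couple of hundred characters.

```python
import html as html_lib
import re


def first_lines(page, max_chars=200):
    """Return roughly the first `max_chars` characters of visible body text."""
    # Jump to just after the opening <body ...> tag, if there is one.
    match = re.search(r"<body\b[^>]*>", page, flags=re.IGNORECASE)
    body = page[match.end():] if match else page

    # Drop script/style blocks and comments, whose contents are not page text.
    body = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", body)
    body = re.sub(r"(?s)<!--.*?-->", " ", body)

    # Strip the remaining tags, decode entities, collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", body)
    text = html_lib.unescape(text)
    text = re.sub(r"\s+", " ", text).strip()

    if len(text) <= max_chars:
        return text
    # Cut at the last space inside the limit so no word is split.
    cut = text[:max_chars]
    if " " in cut:
        cut = cut[: cut.rfind(" ")]
    return cut + "..."
```

Regex-based tag stripping is crude and will trip over unusual markup, but it is close in spirit to the approach described above. In PHP, strip_tags() and substr() would play the roles of the tag-stripping regular expression and the slicing here.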

- Jensen