"reverse templating" or "auto-meta-regex" module for automated screen-scrape lea

on 19.09.2007 04:09:09 by weston

I was discussing screen scraping with some acquaintances recently, and
they claimed they'd seen a website which allowed users to select
certain regions of a given page via a nice UI... and there was an app
behind this that would then learn from these selections to extract
data from corresponding regions of similar pages. "Reverse templating"
and "auto-meta-regex" were the terms we came up with, but there's
probably a better description. They also claimed there were perl
modules that did this same thing, but I haven't been able to locate
them on CPAN -- does anyone know what these might be?

Thanks!

Re: "reverse templating" or "auto-meta-regex" module for automated screen-scrape

on 20.09.2007 04:45:25 by David Steinbrunner

On 2007-09-18 22:09:09 -0400, Weston said:

> I was discussing screen scraping with some acquaintances recently, and
> they claimed they'd seen a website which allowed users to select
> certain regions of a given page via a nice UI... and there was an app
> behind this that would then learn from these selections to extract
> data from corresponding regions of similar pages. "Reverse templating"
> and "auto-meta-regex" were the terms we came up with, but there's
> probably a better description. They also claimed there were perl
> modules that did this same thing, but I haven't been able to locate
> them on CPAN -- does anyone know what these might be?

Apple created the Web Clip widget for Mac OS X 10.5, which is what
popped into my head when I read this. Basically, it lets you select a
region of a page that corresponds to a table, div or whatever and make
a widget out of it. They showed this off a long time ago and have yet
to release it, but someone created a knockoff right away:

Dash Clipping
http://www.fondantfancies.com/blog/3001239/

On the Perl side of things, when it comes to scraping I would say
HTML::TreeBuilder is your best friend. It allows you to parse down to
the table, div or whatever and play with what is inside of it.

But it sounds like you are looking for more. Maybe a UI that allows
you to select the table, div or whatever and then generates Perl code
that uses HTML::TreeBuilder to get you to where you selected in the UI.
Now that sounds fun. Something tells me someone could work towards
that using CamelBones to access the WebKit innards.
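
To make the "learn from a selection, then reuse it" idea a bit more
concrete, here is a rough sketch. It is written in Python with only the
standard library so it runs without installing anything, and it is not
HTML::TreeBuilder's API. The names Node, TreeParser, learn_address and
extract are all made up for illustration, and the sample text stands in
for whatever the user would have selected in a UI. It assumes the
similar pages share the same tag structure.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so we must not descend into them.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """Hypothetical minimal element: a tag plus mixed text/element content."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.content = []  # str and Node items, in document order

    def elements(self):
        """Child elements only, skipping text."""
        return [c for c in self.content if isinstance(c, Node)]

    def text(self):
        """All text inside this element, in document order."""
        return "".join(c if isinstance(c, str) else c.text()
                       for c in self.content)


class TreeParser(HTMLParser):
    """Builds a Node tree from reasonably well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.content.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore strays.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.content.append(data)


def parse(html):
    parser = TreeParser()
    parser.feed(html)
    parser.close()
    return parser.root


def learn_address(root, sample):
    """Child indexes leading to the deepest element containing sample."""
    if sample not in root.text():
        raise ValueError("sample text not found on the page")
    address, node = [], root
    while True:
        for i, child in enumerate(node.elements()):
            if sample in child.text():
                address.append(i)
                node = child
                break
        else:
            return address  # no child contains the sample: node is deepest


def extract(root, address):
    """Follow a learned address on another page; IndexError if it differs."""
    node = root
    for i in address:
        node = node.elements()[i]
    return node.text().strip()


TRAINING_PAGE = ("<html><body><h1>Widget</h1>"
                 "<div><p>Price: <b>$10</b></p></div></body></html>")
SIMILAR_PAGE = ("<html><body><h1>Gadget</h1>"
                "<div><p>Price: <b>$25</b></p></div></body></html>")

# "$10" plays the role of the region the user selected.
address = learn_address(parse(TRAINING_PAGE), "$10")
print(address)                                # [0, 0, 1, 0, 0]
print(extract(parse(SIMILAR_PAGE), address))  # $25
```

A bare list of child indexes is about the most brittle "template" you
could learn, since one extra element on the page shifts everything
after it. A real tool would probably combine it with tag names, ids or
classes.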

Sorry to those that are put off by the talk of Mac stuff... it is what
I know and where I play.

--
David Steinbrunner