Re: Architectural Question rgd LWP::UserAgent, WWW::Mechanize

on 04.03.2005 16:23:42 by jjl

On Sat, 5 Mar 2005, Robert Barta wrote:

> I am using WWW::Mechanize/LWP and some of their subclasses now for
> several things and I see an architectural problem I will be facing at
> some point in the future:
>
> For downstream developers (and for me) I need to offer a facility
> to choose a user agent which supports a number of features:
>
> - local caching
>
> - specialized cookie handling for specific web sites
>
> - scripting (controlling the user agent via a dedicated language
> and not via Perl method calls to WWW::Mechanize).
>
> - triggering of application specific code at particular events
> (page loaded, link selection, page unload)
>
> - maybe optional JavaScript/DOM coverage later
>
> Now much of this functionality is already there (I have implemented
> scripting recently), but somehow spread over several packages in
> incompatible ways. But for a downstream developer it is not possible
> to say something like this:
>
> my $ua = new LWP::UserAgent::Pluggable;
>
> $ua->add_plugin (new LWP::UserAgent::Plugins::Cache (size => '4M'));
> $ua->add_plugin (new LWP::UserAgent::Plugins::Scriptable (plan => ...));
> $ua->add_plugin (new LWP::UserAgent::Plugins::Hooks (
> 'http://specialsite/page' => sub { do something; }));
>
> Does this make sense?

Yes! Python's urllib2 works like this, so I'm sure looking at that is
well worth the time if you want something similar in Perl. I extended it
in a fairly simple way in Python 2.4, and it now works quite nicely to
support all kinds of things (cookies, auth (various flavours), http, ftp,
gopher etc., refresh handling, referer handling, http-equiv, redirection,
seek()-able responses, robots.txt observance...) using a single,
relatively simple, plugin handler interface. Caching (of both content and
connections) would naturally and easily fit into that. I recently noticed
that the yum package manager / urlgrabber developers have added more
features (what I assume are decent implementations of throttling,
persistent connections, mirror selection, etc.), I assume mostly using the
same plugin handler system (though they're pretty application-focused).
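
For readers who have not seen the handler interface, here is a minimal
sketch in present-day Python, where urllib2 lives on as urllib.request.
The 'canned:' scheme and the CannedHandler and RefererProcessor classes
are made up for illustration so that nothing touches the network; only
BaseHandler, OpenerDirector and addinfourl come from the standard library.

```python
import io
import urllib.request
import urllib.response
from email.message import Message


class CannedHandler(urllib.request.BaseHandler):
    """Protocol handler for the made-up 'canned:' scheme (illustration only)."""

    def canned_open(self, req):
        # The opener finds this method by its name: <scheme>_open.
        text = "%s (referer: %s)" % (req.full_url, req.get_header("Referer"))
        body = io.BytesIO(text.encode("ascii"))
        return urllib.response.addinfourl(body, Message(), req.full_url)


class RefererProcessor(urllib.request.BaseHandler):
    """Hypothetical pre-processor: stamps each request with the previous URL."""

    def __init__(self):
        self.previous = None

    def canned_request(self, req):
        # <scheme>_request methods run before the request is opened.
        if self.previous is not None:
            req.add_header("Referer", self.previous)
        self.previous = req.full_url
        return req


# A bare OpenerDirector, so only the two handlers above are involved.
opener = urllib.request.OpenerDirector()
opener.add_handler(CannedHandler())
opener.add_handler(RefererProcessor())

first_body = opener.open("canned://site/one").read()
second_body = opener.open("canned://site/two").read()
```

Each feature is a small object with specially named methods, and the
opener only dispatches on those names. A handler for real traffic would
define http_request / http_response (and the https_ variants) instead.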

There's no requirement to shoehorn everything into some elegant scheme in
order to enable customisation and re-use, though, is there? Module
designs need effort expended to keep them open and reusable, true, but
that doesn't mean (mythical) perfect genericity (although really generic
interfaces can sometimes be just the ticket and very useful, as with
urllib2's handlers). A few examples of where, despite urllib2's rather
nice handlers, I don't feel a need to fit into any grand generic
interface:

For cookie policy, I have (in ClientCookie, and now cookielib in Python
stdlib), CookiePolicy objects -- *not* a handler -- rather, each cookie
handler *has* a CookieJar, which *has* a CookiePolicy.
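
The has-a chain can be sketched with today's standard library, where
cookielib is called http.cookiejar. The blocked domain is a made-up
example; the classes and the is_blocked() method are stdlib.

```python
import http.cookiejar
import urllib.request

# The policy decides which cookies are acceptable (example domain is made up).
policy = http.cookiejar.DefaultCookiePolicy(blocked_domains=["ads.example.com"])

# The jar stores cookies and consults the policy it was given.
jar = http.cookiejar.CookieJar(policy=policy)

# The handler is only thin glue between the opener and the jar.
cookie_handler = urllib.request.HTTPCookieProcessor(jar)
opener = urllib.request.build_opener(cookie_handler)

blocked = policy.is_blocked("ads.example.com")
allowed = not policy.is_blocked("www.example.com")
```

Site-specific cookie rules then go into a CookiePolicy subclass, and the
handler layer stays untouched.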

Hooks as you describe might well be done best with explicit support from
standard handlers, I would guess (though I wouldn't know for sure until I
try). Mind you, I have a couple of useful debug handlers, e.g. for
printing redirected response bodies.
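
As a rough idea of what per-URL hooks could look like when written as an
ordinary handler, here is a sketch. HookProcessor, CannedHandler and the
'canned:' scheme are hypothetical names used only so the example runs
offline; this is not how any existing library implements hooks.

```python
import io
import urllib.request
import urllib.response
from email.message import Message


class CannedHandler(urllib.request.BaseHandler):
    """Serves a fixed body for the made-up 'canned:' scheme (no network)."""

    def canned_open(self, req):
        return urllib.response.addinfourl(
            io.BytesIO(b"<html>ok</html>"), Message(), req.full_url)


class HookProcessor(urllib.request.BaseHandler):
    """Hypothetical post-processor: calls a callback when a known URL loads."""

    def __init__(self, hooks):
        self.hooks = dict(hooks)  # URL -> callable taking (request, response)

    def canned_response(self, req, response):
        callback = self.hooks.get(req.full_url)
        if callback is not None:
            callback(req, response)
        return response  # always hand the response on to the caller


loaded = []  # filled in by the hook below
opener = urllib.request.OpenerDirector()
opener.add_handler(CannedHandler())
opener.add_handler(HookProcessor({
    "canned://specialsite/page": lambda req, resp: loaded.append(resp.url),
}))

opener.open("canned://specialsite/page")  # hook fires
opener.open("canned://othersite/page")    # no hook registered
```

This only covers a "page loaded" event. Link selection and page unload
need a browser object that knows about pages, which a plain handler
does not.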

Never tried scripting, but I don't see any obvious reason for wanting that
as a plugin handler in the urllib2 sense (FWIW, never looked at it, but I
know there's a scripting system based on urllib2 + my libraries (in turn
based in large part on ports from LWP), called PBP). I've not considered
more elaborate generic plugin systems that might offer the opportunity for
having eg. this kind of scripting as a plugin to some browser object (too
much else more valuable I could do first!), but maybe that'd be an
interesting idea to think about a bit.

In my port of WWW::Mechanize, I added simple methods back on top of the
urllib2 handler system, mostly for convenience of *removing* handlers
without rebuilding an opener object each time (eg.
Browser.handle_refresh(handle) -- where handle is a boolean arg). Works
fairly nicely, I think.
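
A minimal sketch of that kind of convenience method follows. TinyBrowser
and RefreshProcessor are hypothetical stand-ins, and hiding the opener
rebuild behind a boolean method is an assumption about one way to do it,
not a description of the actual port.

```python
import urllib.request


class RefreshProcessor(urllib.request.BaseHandler):
    """Stand-in for a refresh handler; the body is deliberately trivial."""

    def http_response(self, req, response):
        return response


class TinyBrowser:
    """Hypothetical wrapper that rebuilds its opener when a feature is toggled."""

    def __init__(self):
        self._features = {"refresh": RefreshProcessor}  # name -> handler class
        self._enabled = set()
        self._rebuild()

    def _rebuild(self):
        # Callers never see this step; they only flip switches.
        self.active = [self._features[name]() for name in sorted(self._enabled)]
        self.opener = urllib.request.build_opener(*self.active)

    def handle_refresh(self, handle):
        # Boolean switch, in the spirit of Browser.handle_refresh(handle).
        if handle:
            self._enabled.add("refresh")
        else:
            self._enabled.discard("refresh")
        self._rebuild()


browser = TinyBrowser()
browser.handle_refresh(True)
enabled_names = [type(h).__name__ for h in browser.active]
browser.handle_refresh(False)
disabled_names = [type(h).__name__ for h in browser.active]
```

The caller never touches build_opener directly, so switching a feature
off is as cheap to write as switching it on.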

I also started on JavaScript support. You need a browser model for that
(same goes for proper Referer handling, though eg. my
mechanize.HTTPRefererProcessor is written as an object that works just
like any other handler -- it just happens to use a Browser class in its
implementation), so the sort of handlers I refer to above aren't the main
issue. See DOMForm and python-spidermonkey here:

http://wwwsearch.sourceforge.net/


Enough rambling. Hope this helps stir you to write something interesting
and share it...


John

Architectural Question rgd LWP::UserAgent, WWW::Mechanize

on 05.03.2005 02:02:44 by rho

Hi,

I am using WWW::Mechanize/LWP and some of their subclasses now for
several things and I see an architectural problem I will be facing at
some point in the future:

For downstream developers (and for me) I need to offer a facility
to choose a user agent which supports a number of features:

- local caching

- specialized cookie handling for specific web sites

- scripting (controlling the user agent via a dedicated language
and not via Perl method calls to WWW::Mechanize).

- triggering of application specific code at particular events
(page loaded, link selection, page unload)

- maybe optional JavaScript/DOM coverage later

Now much of this functionality is already there (I have implemented
scripting recently), but somehow spread over several packages in
incompatible ways. But for a downstream developer it is not possible
to say something like this:

my $ua = new LWP::UserAgent::Pluggable;

$ua->add_plugin (new LWP::UserAgent::Plugins::Cache (size => '4M'));
$ua->add_plugin (new LWP::UserAgent::Plugins::Scriptable (plan => ...));
$ua->add_plugin (new LWP::UserAgent::Plugins::Hooks (
'http://specialsite/page' => sub { do something; }));

Does this make sense?

Sorry, if that has been discussed before.

\rho