RFC HTTP::Cache module

on 27.09.2004 18:25:50 by u1

Hi,

I have written a Perl module that I want to publish on CPAN. The module
implements a cache for HTTP requests. It provides a single method, get,
that fetches a URL via HTTP. The result of each get is stored in a
cache on disk, and if the same URL has been requested before, the
proper conditional headers (If-None-Match with the stored ETag, and
If-Modified-Since) are sent to the HTTP server. The server can then
respond that the object in the cache is up-to-date, and HTTP::Cache
will return the cached version of the data instead of fetching it from
the server. This speeds up the HTTP get and saves bandwidth for both
the server and the client.

The module is very simple to use. Simply create an HTTP::Cache object
and call the get method on the returned object to fetch a URL. You do
not have to care whether the data is fetched from the server or from the
cache (but you can find out if you want to know).


Sample usage:

my $c = HTTP::Cache->new( {
    BasePath  => "/tmp/cache", # Directory to store the cache in.
    MaxAge    => 8*24,         # How many hours should items be
                               # kept in the cache after they
                               # were last accessed?
                               # Default is 8*24.
    Verbose   => 1,            # Print messages to STDERR.
                               # Default is 0.
    UserAgent => "my-spider",  # The user-agent string to use.
                               # Default is "perl-http-cache".
} );

my( $content, $error ) = $c->get( $url );

if( defined( $content ) )
{
    # Data retrieved and stored in $content.
    # $error indicates whether the data was found in the cache (0),
    # was fetched from the server but equal to the cache (1),
    # or was fetched from the server and different from the
    # cache (2).
}
else
{
    print STDERR "Failed to fetch $url. " .
                 "Error returned by server: $error\n";
}

Does anyone object to putting this module on CPAN or is it redundant?
Is HTTP::Cache a good name for it?


Regards,


Mattias Holmlund
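
As a rough illustration of the MaxAge option in the sample above, expiry by
last access time could be sketched as follows. This is not the module's own
code: prune_cache is a hypothetical helper, and it assumes the cache is a flat
directory of files whose access time reflects the last cache hit (which may
not hold on filesystems mounted with noatime).

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: delete cache files whose last access is older
# than $max_age_hours. Returns the number of files removed.
sub prune_cache {
    my ( $base_path, $max_age_hours ) = @_;
    my $cutoff  = time() - $max_age_hours * 3600;
    my $removed = 0;

    opendir( my $dh, $base_path ) or die "Cannot open $base_path: $!";
    for my $name ( readdir $dh ) {
        my $file = File::Spec->catfile( $base_path, $name );
        next unless -f $file;              # skip ".", ".." and subdirectories
        my $atime = ( stat $file )[8];     # element 8 is the access time
        next unless $atime < $cutoff;
        if ( unlink $file ) {
            $removed++;
        }
        else {
            warn "Cannot remove $file: $!";
        }
    }
    closedir $dh;
    return $removed;
}
```

With the default from the sample, a call would look like
prune_cache( "/tmp/cache", 8*24 ).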

Re: RFC HTTP::Cache module

on 27.09.2004 18:58:15 by onave

I'm sure the real programmers on the list will chime in regarding the
redundancy issues, as I'm sure there are already plenty of proxy modules
around that might have caching ability, if not caching modules
directly. However, I haven't yet gotten around to exploring that domain
on CPAN, so I can't comment on it myself.

However, a few things to consider:

1) It sucks having to re-implement a subset of LWP::UserAgent parameters
in your module (like UserAgent). Even if you're simply passing them
along verbatim to the UserAgent constructor, you still have to provide
some documentation in your module, and you can't possibly cover all of
the params. You could simply say that params get passed through to
LWP::UserAgent, I suppose.

2) If you try to over-simplify the process, you eliminate the option of
using all less-simple-than-simply-calling-get() functionality in the
libwww module. Eventually people will want to be able to cache posts,
or check the http status code of the response, and other such things,
and you will be busy re-implementing everything that's already implemented.

How about instead of providing "get" methods and returning "content"
directly, you integrate properly into the libwww module and cache/return
HTTP::Response objects? You can still key on the url (ignoring the
parameters, unlike Apache::DBI), although POST content might need to be
part of the cache key.

Perhaps you could make HTTP::Cache one of those "magic" modules that if
you simply "use" it, or load it and set a global variable, caching
starts happening automagically (in the background you could override a
few pieces of libwww to insert the caching in the appropriate place -
should be fairly seamless).

3) Some global cache configuration options would be nice (instead of
per-request). You could look at squid as a model (squid being the
premier open source web caching application), but off the top of my head:

a) set a max-live time (global, or per mime-type, or per domain.... you
can get as fancy as you dream)
b) turn on/off depending on verb (like GET, POST) or if query-string
params detected
c) set default "expires" time if the web server doesn't offer one
d) whether or not to even bother trying to HEAD the url or just go
straight for the goods
e) yes, a user-agent string

-ofer
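
The cache-key idea from point 2 could be sketched roughly like this. It is
only an illustration: cache_key is a hypothetical helper, it keys on the full
URL plus the request method, and it folds in the request body so that two
POSTs with different content do not collide. Digest::MD5 ships with Perl; the
inputs are assumed to be byte strings.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: derive a cache key from the request method, the
# URL, and (for requests that carry one) the request body.
sub cache_key {
    my ( $method, $url, $body ) = @_;
    $body = '' unless defined $body;

    # Join with NUL so that field boundaries cannot be confused.
    return md5_hex( join "\0", uc $method, $url, $body );
}
```

The hex digest is also safe to use as a file name for an on-disk cache.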

Re: RFC HTTP::Cache module

on 27.09.2004 20:13:46 by u1

Thank you for your comments! I have responded inline.

On Mon, 2004-09-27 at 18:58, Ofer Nave wrote:
> 1) It sucks having to re-implement a subset of LWP::UserAgent parameters
> in your module (like UserAgent). Even if you're simply passing them
> along verbatim to the UserAgent constructor, you still have to provide
> some documentation in your module, and you can't possibly cover all of
> the params. You could simply say that params get passed through to
> LWP::UserAgent, I suppose.

Hmm, what about changing the UserAgent option to take an actual
LWP::UserAgent object instead? That gives complete flexibility with no
code duplication.
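
A minimal sketch of that idea, under assumptions: My::SimpleCache is a
hypothetical stand-in for the module, and the constructor accepts any object
that offers a request() method (as LWP::UserAgent does), building a default
agent only when none is supplied.

```perl
use strict;
use warnings;

{
    package My::SimpleCache;    # hypothetical stand-in, not the real module
    use Scalar::Util qw(blessed);

    sub new {
        my ( $class, $args ) = @_;
        my $ua = $args->{UserAgent};

        if ( !defined $ua ) {
            # No agent supplied: load LWP on demand and build a default one.
            require LWP::UserAgent;
            $ua = LWP::UserAgent->new( agent => 'perl-http-cache' );
        }
        elsif ( !( blessed($ua) && $ua->can('request') ) ) {
            die "UserAgent must be an object with a request() method\n";
        }

        return bless { ua => $ua, base_path => $args->{BasePath} }, $class;
    }

    # Accessor for the agent the cache will send its requests through.
    sub user_agent { return $_[0]{ua} }
}
```

The caller then configures proxies, timeouts, or the agent string on the
LWP::UserAgent object itself, and the cache module documents none of it.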

>
> 2) If you try to over-simplify the process, you eliminate the option of
> using all less-simple-than-simply-calling-get() functionality in the
> libwww module. Eventually people will want to be able to cache posts,
> or check the http status code of the response, and other such things,
> and you will be busy re-implementing everything that's already implemented.
>

In general, I think that only "simple" GET requests are possible to
cache. POST requests, anything involving cookies, etc. can most of the
time not be cached, since the response is generated dynamically and the
server does not implement proper cache control for these responses and
instead just says that the response is new every time.

The interface (i.e. "get") is the same as that provided by LWP::Simple,
but with the added bonus that you get access to any error-codes returned
by the server if you want to have it. I think this covers a majority of
the use cases. If anyone needs it, I can always add a more versatile
interface later that allows you to do more things (perhaps with complete
HTTP::Request and Response objects), but this kind of interface will
probably be too complicated in many cases.

> How about instead of providing "get" methods and returning "content"
> directly, you integrate properly into the libwww module and cache/return
> HTTP::Response objects? You can still key on the url (ignoring the
> parameters, unlike Apache::DBI), although POST content might need to be
> part of the cache key.
>
> Perhaps you could make HTTP::Cache one of those "magic" modules that if
> you simply "use" it, or load it and set a global variable, caching
> starts happening automagically (in the background you could override a
> few pieces of libwww to insert the caching in the appropriate place -
> should be fairly seamless).

The HTTP::Cache module is currently roughly 110 lines of code (not
counting documentation and blank lines). Integrating it into libwww
seems like a lot more work to me since I'm not familiar with the inner
workings of LWP.

>
> 3) Some global cache configuration options would be nice (instead of
> per-request). You could look at squid as a model (squid being the
> premier open source web caching application), but off the top of my head:
>
> a) set a max-live time (global, or per mime-type, or per domain.... you
> can get as fancy as you dream)
> b) turn on/off depending on verb (like GET, POST) or if query-string
> params detected
> c) set default "expires" time if the web server doesn't offer one
> d) whether or not to even bother trying to HEAD the url or just go
> straight for the goods
> e) yes, a user-agent string
>

All configuration is per HTTP::Cache object. This object can be used to
perform several get requests, so the configuration is not per request.

Currently, all requests are always checked against the HTTP server, so
there is actually nothing to configure regarding expiry times etc. If
the server thinks that the cached copy is up-to-date, we will use the
cache. What happens is that I send a normal HTTP request but include the
headers If-None-Match (carrying the stored ETag) and If-Modified-Since.
If the server thinks that the ETag is correct and/or the content has not
been modified since the date provided, it will return a response code
(304 Not Modified) saying that the cache is up-to-date. Otherwise, it
will return the complete response as normal. So there is no HEAD request
involved at all.
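
The revalidation step can be sketched roughly as below. In HTTP/1.1 the
stored ETag value travels in the If-None-Match request header, and an
unchanged resource is signalled with status 304 (Not Modified). Everything
else here is an assumption for illustration: the helper names, the keys of
the cache entry hash, and the mapping onto the 0/1/2 codes from the sample
usage.

```perl
use strict;
use warnings;

# Build the validator headers for a revalidation request from what was
# stored with the cached copy. The keys of %$entry are hypothetical.
sub conditional_headers {
    my ($entry) = @_;
    my %h;
    $h{'If-None-Match'}     = $entry->{etag}          if defined $entry->{etag};
    $h{'If-Modified-Since'} = $entry->{last_modified} if defined $entry->{last_modified};
    return %h;
}

# Decide what to hand back to the caller, given the cached entry and the
# status code and body of the server's reply.
sub resolve {
    my ( $entry, $status, $body ) = @_;

    # 304 Not Modified: the server sent no body, so reuse the cached copy.
    return ( $entry->{content}, 0 ) if $status == 304;

    # Anything other than 200 is reported as an error code, without content.
    return ( undef, $status ) unless $status == 200;

    # Full response: tell the caller whether it differs from the cache.
    my $same = defined $entry->{content} && $entry->{content} eq $body;
    return ( $body, $same ? 1 : 2 );
}
```

Only one round trip is needed either way: the same request either returns
304 with no body or the full new response.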

/Mattias

Re: RFC HTTP::Cache module

on 28.09.2004 20:48:54 by u1

Based on the feedback I have gotten from Ofer Nave and Nigel Horne, I
have realized that HTTP::Cache might be too generic a name for my
module, since it only handles simple GET requests. I will therefore
change the name of my module. If no one objects, I will upload my module
to CPAN as HTTP::SimpleCache on Thursday.

I think there is room for a more generic module for caching
HTTP-requests and responses. One could probably write such a module by
inheriting from LWP::UserAgent and making a new implementation of
simple_request. This would mean that the rest of LWP could be used as
normal with special headers, cookies, robots.txt processing etc.
However, I personally have very little interest in such a module right
now, and I don't think it's a good idea to write a module that I have no
intention of using. I will therefore leave it up to someone else and
publish my module as is (with the UserAgent suggestion from Ofer).

/Mattias


Re: RFC HTTP::Cache module

on 28.09.2004 20:51:44 by onave

That sounds like a great idea. It's still a useful module, after all.

Sorry I didn't have time to respond to your last email. In general, the
points were valid.

-ofer


Re: RFC HTTP::Cache module

on 30.09.2004 14:30:55 by merlyn

>>>>> "Mattias" == Mattias Holmlund writes:

Mattias> I have written a perl module that I want to publish on
Mattias> CPAN. The module implements a cache for http requests.

The module that I've designed a half dozen times in my head but never
implemented would simply overload LWP::UserAgent's "simple_request"
method with one that did caching transparently.

Maybe if you thought about this enough, you could do that. Then I
wouldn't have to finish writing the module that's in my head.

The advantage this has is that you could take *any* application that
ultimately uses LWP::UserAgent (and they pretty much all do, great job
Gisle!), and then mixin your module, and you'd get caching for free.
Such as:

## you add:
use LWP::UserAgent::TransparentCache qw(cache => parameters);
## to this:
use LWP::Simple;
my $content = get("foo");

and it's done. Or anything more complex. And it'd still work.

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095

Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!

Re: RFC HTTP::Cache module

on 30.09.2004 20:32:01 by u1

On Thu, 2004-09-30 at 14:30, Randal L. Schwartz wrote:
> The module that I've designed a half dozen times in my head but never
> implemented would simply overload LWP::UserAgent's "simple_request"
> method with one that did caching transparently.
>

Ok, I'll give it a shot. I experimented with how to overload
simple_request and came up with the following. Does it seem right so
far?

use strict;

use LWP::UserAgent;

BEGIN
{
    print "Installing\n";
    # Make LWP::Simple go through the full LWP::UserAgent code path.
    $LWP::Simple::FULL_LWP++;
    # Keep a reference to the original method before replacing it.
    my $org_simple_request = \&LWP::UserAgent::simple_request;

    {
        no warnings 'redefine';
        *LWP::UserAgent::simple_request = sub
        {
            my($self, $request, $arg, $size) = @_;

            print "Overloaded\n";
            return $org_simple_request->( $self, $request, $arg, $size );
        };
    }
}
1;

/Mattias

Re: RFC HTTP::Cache module

on 01.10.2004 08:23:09 by ville.skytta

On Thu, 2004-09-30 at 21:32, Mattias Holmlund wrote:
> On Thu, 2004-09-30 at 14:30, Randal L. Schwartz wrote:
> > The module that I've designed a half dozen times in my head but never
> > implemented would simply overload LWP::UserAgent's "simple_request"
> > method with one that did caching transparently.
> >
>
> Ok, I'll give it a shot. [...]

FYI, FWIW: this appeared 2 days ago in CPAN:
http://search.cpan.org/dist/LWP-UserAgent-WithCache/

Re: RFC HTTP::Cache module

on 01.10.2004 14:17:37 by merlyn

>>>>> "Mattias" == Mattias Holmlund writes:

Mattias> Ok, I'll give it a shot. I experimented with how to overload
Mattias> simple_request and came up with the following. Does it seem right so
Mattias> far?

Mattias> use strict;

Mattias> use LWP::UserAgent;

Mattias> BEGIN
Mattias> {
Mattias> print "Installing\n";
Mattias> $LWP::Simple::FULL_LWP++;

Mattias> my $org_simple_request = \&LWP::UserAgent::simple_request;

You need "require LWP::UserAgent" first, to ensure that this
routine has its original definition.

Mattias> {
Mattias> no warnings;
Mattias> *LWP::UserAgent::simple_request = sub
Mattias> {
Mattias> my($self, $request, $arg, $size) = @_;

Mattias> print "Overloaded\n";
Mattias> return &$org_simple_request( $self, $request, $arg, $size );
Mattias> }
Mattias> }
Mattias> }
Mattias> 1;

But the rest looks fine.

The most important part here is to be conservative. Cache only
that which is cachable, and pass everything else directly to
the original routine. Be as thin as you can.
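
A thin wrapper of that kind might look like the sketch below. To keep it
runnable without a network it wraps Demo::Agent, a made-up stand-in for
LWP::UserAgent whose requests are plain hash references; the in-memory
%cache and the "GET only, no extra arguments" rule are likewise assumptions,
and revalidation against the server is left out for brevity.

```perl
use strict;
use warnings;

{
    # Stand-in for LWP::UserAgent so the sketch runs without a network.
    package Demo::Agent;
    our $calls = 0;
    sub new { return bless {}, shift }

    sub simple_request {
        my ( $self, $request ) = @_;
        $calls++;    # count how often the "network" is really hit
        return { code => 200, content => "body of $request->{url}" };
    }
}

{
    my %cache;    # in-memory only; a real cache would live on disk
    my $orig = \&Demo::Agent::simple_request;

    no warnings 'redefine';
    *Demo::Agent::simple_request = sub {
        my ( $self, $request, @rest ) = @_;

        # Anything other than a plain GET goes straight to the original.
        return $orig->( $self, $request, @rest )
            if $request->{method} ne 'GET' || @rest;

        # Plain GET: answer from the cache, filling it on first use.
        return $cache{ $request->{url} } ||= $orig->( $self, $request );
    };
}
```

Callers of Demo::Agent do not change at all; they simply stop paying for
repeated GETs of the same URL.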


Re: RFC HTTP::Cache module

on 01.10.2004 17:01:19 by u1

Randal L. Schwartz wrote:

>>>>>>"Mattias" == Mattias Holmlund writes:
> Mattias> use strict;
>
> Mattias> use LWP::UserAgent;
>
> Mattias> BEGIN
> Mattias> {
> Mattias> print "Installing\n";
> Mattias> $LWP::Simple::FULL_LWP++;
>
> Mattias> my $org_simple_request = \&LWP::UserAgent::simple_request;
>
> You need "require LWP::UserAgent" first, to ensure that this
> routine has its original definition.
>

I have a use LWP::UserAgent a few lines up. Is that not sufficient?

>
> The most important part here is to be conservative. Cache only
> that which is cachable, and pass everything else directly to
> the original routine. Be as thin as you can.

Yes I agree. I will start off very conservatively and only cache GET. I
will also skip anything with a Range-header. I don't think I will
support content-callbacks in the first version either.

The cache will (at least initially) always ask the server if the data is
up-to-date, so I won't have to do any expiration calculations.
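
Those rules could be collected in one predicate, sketched here under
assumptions: is_cacheable is a hypothetical name, $request is expected to
behave like an HTTP::Request (method() and header() accessors), and
$content_cb stands for the optional content callback or file name that
simple_request accepts as its second argument.

```perl
use strict;
use warnings;

# Hypothetical predicate: true only for requests that the conservative
# first version would cache.
sub is_cacheable {
    my ( $request, $content_cb ) = @_;

    return 0 if uc( $request->method ) ne 'GET';      # only GET
    return 0 if defined $request->header('Range');    # no partial content
    return 0 if defined $content_cb;                  # no content callbacks
    return 1;
}
```

Everything for which this returns false would be handed unchanged to the
original simple_request.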

/Mattias

Re: RFC HTTP::Cache module

on 01.10.2004 17:03:05 by u1

Ville Skyttä wrote:
> FYI, FWIW: this appeared 2 days ago in CPAN:
> http://search.cpan.org/dist/LWP-UserAgent-WithCache/

Thanks, I hadn't seen that one. It doesn't work with LWP::Simple however
since you must create an LWP::UserAgent to use it. I think I'll go ahead
with HTTP::TransparentCache anyway.

/Mattias

Re: RFC HTTP::Cache module

on 02.10.2004 03:13:34 by merlyn

>>>>> "Mattias" == Mattias Holmlund writes:

Mattias> I have a use LWP::UserAgent a few lines up. Is that not sufficient?

I think the warranty on my Lasik Eye Surgery is just about up. :)

Yes, that'll do. Although, I'd use require rather than use, and do it
*inside* the sub, so it doesn't do the deed until actually called.

Mattias> Yes I agree. I will start off very conservatively and only cache
Mattias> GET. I will also skip anything with a Range-header. I don't think I
Mattias> will support content-callbacks in the first version either.

Mattias> The cache will (at least initially) always ask the server if the data
Mattias> is up-to-date, so I won't have to do any expiration calculations.

Cool. It's much easier to write code by proxy. :)
