Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 18:47:01 by Neil Gunton

Hi all,

I posted this to the Apache httpd users list, but no reply there, so I'm
posting here in the hopes that someone else who uses mod_perl with
mod_cache in a reverse proxy setup might have insight.

I am using Apache 2.2.9 (built from source) on Debian Lenny to run a
fairly large community LAMP (Perl, MySQL) site. I use the proxy and
cache of Apache to improve site performance - I have a front end proxy
build and a back-end mod_perl build, both on the same server currently.
I have been using this setup for years successfully, but most of that
time was using Apache 1.3, with mod_accel and mod_deflate from Igor
Sysoev. Since moving to Apache 2.2, I am using the stock caching.

The cache and front-end proxy help to serve images without bogging down
the heavy mod_perl processes, while also obviously caching the mod_perl
content. The site gets around 100,000 page requests or more per day. The
cache is set to 1000MB, with htcacheclean running in daemon mode,
interval 60 minutes (but looking at the performance charts, it seems to
be running constantly).

I am finding that the cache directories that mod_cache builds are very
large, and take a long time to traverse under ext2. There is currently
about 10 GB under the cache according to du, and it took 162 minutes
just to tell me that. Basically, htcacheclean is not keeping up. I'm
using three levels of directory. Htcacheclean also takes a long time to
process this if I try running it from cron nightly, during which time I
would see a huge spike in iowait on the server, and it would take upward
of 3 hours to complete. If I run htcacheclean in daemon mode, using the
-n (nice) option, then it doesn't seem to be able to keep up, the cache
just creeps up in size. If I take off the nice option, then it takes up
a lot more resources, to the point where I'm concerned it'll be
impacting the server performance by monopolising the disks.

So what I'm observing is that at least part of the problem appears to be
that the directory structure is just very, very big and wide and takes a
long time to traverse, even for basic system functions like du.

This leads to my main question, which is this: Would a different
filesystem, perhaps reiserfs, be better for this type of cache? I have
never used reiser before, but from reputation it seems to be designed
for handling many small files efficiently. I wonder if it would be any
easier for my system to traverse the directory and maintain the cache if
it was under reiser rather than ext.

If not that, then are there other filesystems which make it very
efficient to traverse wide directory structures?

I have a quad core server (AMD Opteron 265), with four 10k SCSI drives
set up in RAID0 (yeah I know it's risky, but everything is backed up
immediately via mysql replication, and I need the space and performance).

Thanks!

Neil

Re: Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 19:56:26 by Neil Gunton

Neil Gunton wrote:
> The cache and front-end proxy help to serve images without bogging down
> the heavy mod_perl processes, while also obviously caching the mod_perl
> content. The site gets around 100,000 page requests or more per day. The
> cache is set to 1000MB, with htcacheclean running in daemon mode,
> interval 60 minutes (but looking at the performance charts, it seems to
> be running constantly).
>
> I am finding that the cache directories that mod_cache builds are very
> large, and take a long time to traverse under ext2. There is currently
> about 10 GB under the cache according to du, and it took 162 minutes
> just to tell me that. Basically, htcacheclean is not keeping up. I'm
> using three levels of directory. Htcacheclean also takes a long time to
> process this if I try running it from cron nightly, during which time I
> would see a huge spike in iowait on the server, and it would take upward
> of 3 hours to complete. If I run htcacheclean in daemon mode, using the
> -n (nice) option, then it doesn't seem to be able to keep up, the cache
> just creeps up in size. If I take off the nice option, then it takes up
> a lot more resources, to the point where I'm concerned it'll be
> impacting the server performance by monopolising the disks.
>
> So what I'm observing is that at least part of the problem appears to be
> that the directory structure is just very, very big and wide and takes a
> long time to traverse, even for basic system functions like du.

Someone replied to me off-list suggesting using Squid instead of httpd
for the front-end caching reverse proxy. I guess that is a good question
- I use Apache for proxying mainly because I know apache quite well, and
like being able to use mod_rewrite and other neat features that httpd
gives. I've never used Squid. Does anyone have opinions there? Is Squid
better at managing its cache files in a sane (and efficient, i.e. no
100% iowait) fashion?

Does anyone run a 3-layer combination of Squid for cache, and then an
Apache front end proxy (no mod_cache) for its mod_rewrite capabilities,
and then the back-end mod_perl server?

I need mod_rewrite at some point for stuff like stopping image
hotlinking from other websites (people stealing my bandwidth by making
my server act as an image server for their forums, auctions etc), and
other access control stuff. I'll have to look into whether squid can do
all that.

I'm open to alternatives, if it turns out that Apache's mod_cache simply
isn't mature enough yet. I notice that some of the features of mod_cache
have not even been implemented yet, so maybe this module isn't really
ready for prime time yet? Opinions? Surely most people using mod_perl in
a production environment must be using some form of reverse proxy, since
it just makes so much sense from a server utilization point of view.

Thanks again,

Neil

Re: Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 20:25:18 by Perrin Harkins

On Mon, Nov 24, 2008 at 1:56 PM, Neil Gunton wrote:
> Someone replied to me off-list suggesting using Squid instead of httpd for
> the front-end caching reverse proxy. I guess that is a good question - I use
> Apache for proxying mainly because I know apache quite well, and like being
> able to use mod_rewrite and other neat features that httpd gives. I've never
> used Squid. Does anyone have opinions there?

I think you hit the main issue right there: squid is not apache and
you can't use the same tools with it. I also haven't seen any recent
benchmark suggesting squid performs better, but I'd like to run a set
of benchmarks on all the recent proxy servers to really sort this out.

> Does anyone run a 3-layer combination of Squid for cache, and then an Apache
> front end proxy (no mod_cache) for its mod_rewrite capabilities, and then
> the back-end mod_perl server?

That's a bad idea. Too much overhead.

> I need mod_rewrite at some point for stuff like stopping image hotlinking
> from other websites (people stealing my bandwidth by making my server act as
> an image server for their forums, auctions etc), and other access control
> stuff. I'll have to look into whether squid can do all that.

Squid can do a lot, but you have to learn it, and it's not as
comprehensive as apache.

One thing you didn't mention is why you're using mod_cache at all for
things not generated by mod_perl. Why don't you serve the static
files directly from your front-end server? That's the most common
setup I've seen, with proxying only for mod_perl requests.

- Perrin

Re: Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 20:32:44 by Neil Gunton

Perrin Harkins wrote:
> One thing you didn't mention is why you're using mod_cache at all for
> things not generated by mod_perl. Why don't you serve the static
> files directly from your front-end server? That's the most common
> setup I've seen, with proxying only for mod_perl requests.

Yes, I am only caching mod_perl content. I exclude things like the
static files and images. I cache mod_perl output for performance in
cases like slashdottings (or, these days, links from digg or reddit
etc). The problem is, the site gets so many page requests, that
htcacheclean just seems to be a little overwhelmed.

I'm looking at Squid right now, and have sent a message to their list to
see what they think. At first glance, Squid does seem to have a fairly
big list of configuration directives, so it's possible it might be able
to handle what I need. I'm open to switching, if it turns out that Squid
uses a more scalable cache pruning methodology. I'm a little sad to see
that Apache's mod_cache doesn't seem to even be complete yet - e.g.
directives like CacheGcInterval aren't implemented:

http://httpd.apache.org/docs/2.0/mod/mod_disk_cache.html#cachegcinterval

Maybe Squid is more mature in the caching department... dunno, but worth
a look. I'd appreciate any more experienced people here educating me if
this is wrong.

Thanks again,

Neil

Re: Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 20:42:42 by Neil Gunton

Neil Gunton wrote:
> http://httpd.apache.org/docs/2.0/mod/mod_disk_cache.html#cachegcinterval

Oops - sorry, I seem to have been looking at the 2.0 docs, rather than
the 2.2. In 2.2, it appears that CacheGCInterval has disappeared...

Now, looking at the 2.2 caching guide:

http://httpd.apache.org/docs/2.2/caching.html

The section on "Maintaining the Disk Cache" says you should use
htcacheclean, which is what I've been doing, and it doesn't seem to be
up to the job.

Neil

Re: Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 20:53:49 by Perrin Harkins

On Mon, Nov 24, 2008 at 2:42 PM, Neil Gunton wrote:
> The section on "Maintaining the Disk Cache" says you should use
> htcacheclean, which is what I've been doing, and it doesn't seem to be up to
> the job.

I can't speak to your filesystem question but you might consider
getting better disks. Either a RAID system or a SSD would help your
write speed and both are pretty cheap these days.

- Perrin

Re: Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 21:02:12 by Neil Gunton

Perrin Harkins wrote:
> On Mon, Nov 24, 2008 at 2:42 PM, Neil Gunton wrote:
>> The section on "Maintaining the Disk Cache" says you should use
>> htcacheclean, which is what I've been doing, and it doesn't seem to be up to
>> the job.
>
> I can't speak to your filesystem question but you might consider
> getting better disks. Either a RAID system or a SSD would help your
> write speed and both are pretty cheap these days.

I'm using 4x10k SCSI drives in RAID0 configuration currently, on an
Adaptec zero channel SmartRaid V controller. Filesystem is ext2.

Neil

Re: Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 21:08:13 by aw

Neil Gunton wrote:
[...]
Hi.
I am not really an expert on large websites, caches and so on, but in
our applications we are managing a large number of files.
One of the things we have learned over the years, is that even on modern
operating systems, having large numbers of entries in each directory is
an absolute performance killer.
This may or may not be relevant to your particular problem, but what is
the average number of entries you have *per directory* ?

Re: Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 21:16:59 by mpeters

Neil Gunton wrote:
> Perrin Harkins wrote:
>> On Mon, Nov 24, 2008 at 2:42 PM, Neil Gunton wrote:
>>> The section on "Maintaining the Disk Cache" says you should use
>>> htcacheclean, which is what I've been doing, and it doesn't seem to
>>> be up to
>>> the job.
>>
>> I can't speak to your filesystem question but you might consider
>> getting better disks. Either a RAID system or a SSD would help your
>> write speed and both are pretty cheap these days.
>
> I'm using 4x10k SCSI drives in RAID0 configuration currently, on an
> Adaptec zero channel SmartRaid V controller. Filesystem is ext2.

Well except for getting 15K disks you probably won't be able to get much more improvement from just
the hardware.

According to these benchmarks
(http://fsbench.netnation.com/new_hardware/2.6.0-test9/scsi/bonnie.html) ReiserFS handles deletes
much better than ext2 (10,015/sec vs 729/sec)

--
Michael Peters
Plus Three, LP

Re: Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 21:22:31 by mpeters

Michael Peters wrote:

> According to these benchmarks
> (http://fsbench.netnation.com/new_hardware/2.6.0-test9/scsi/bonnie.html)
> ReiserFS handles deletes much better than ext2 (10,015/sec vs 729/sec)

But these benchmarks (http://www.debian-administration.org/articles/388) say the following:

For quick operations on large file tree, choose Ext3 or XFS. Benchmarks from other authors have
supported the use of ReiserFS for operations on large number of small files. However, the present
results on a tree comprising thousands of files of various size (10KB to 5MB) suggest than Ext3 or
XFS may be more appropriate for real-world file server operations

But they both say don't use ext2 :)

--
Michael Peters
Plus Three, LP

Re: Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 21:37:29 by Perrin Harkins

On Mon, Nov 24, 2008 at 3:16 PM, Michael Peters wrote:
> Well except for getting 15K disks you probably won't be able to get much
> more improvement from just the hardware.

You don't think so? RAID and SSD can both improve your write
throughput pretty significantly.

- Perrin

Re: Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 21:40:53 by Neil Gunton

Michael Peters wrote:
> Michael Peters wrote:
>
> But these benchmarks (http://www.debian-administration.org/articles/388)
> say the following:
>
> For quick operations on large file tree, choose Ext3 or XFS.
> Benchmarks from other authors have
> supported the use of ReiserFS for operations on large number of small
> files. However, the present
> results on a tree comprising thousands of files of various size (10KB
> to 5MB) suggest than Ext3 or
> XFS may be more appropriate for real-world file server operations
>
> But they both say don't use ext2 :)

This may be a tangent, but my understanding is that the only real
difference between ext2 and ext3 is the journaling, which is related to
safety in the event of unclean shutdown rather than everyday
performance. If anything, in fact, ext3 performs a little worse than
ext2 because of the requirement to keep the journal (which means more
writes to the disk for updates). Otherwise, all the optimization
features such as dir_index are, I think, available for ext2 as well as
ext3. I have noticed that for SSD drives (e.g. the Asus Eee PC, which I
have), people recommend using ext2, since it's less likely to result in
the write fatigue that those drives experience over time (you only get
so many writes). And for laptops, ext2 results in fewer io writes.
Finally, I have noticed my iowait times go down since I moved from using
ext3 to ext2 on the server (previously I always used ext3, but for a
recent rebuild I switched to ext2 to see how it did).

Of course I may be wrong about all this, but my experience seems to
favor ext2 over ext3, at least for performance. Since I back everything
up on the server anyway (using RAID0, a necessity), I am more concerned
with performance than unclean shutdowns. In any case the server is in a
datacenter with UPS, so that is not so likely, though it did happen once
and I didn't lose any data even then.

Neil

Re: Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 21:46:12 by mpeters

Perrin Harkins wrote:
> On Mon, Nov 24, 2008 at 3:16 PM, Michael Peters wrote:
>> Well except for getting 15K disks you probably won't be able to get much
>> more improvement from just the hardware.
>
> You don't think so? RAID and SSD can both improve your write
> throughput pretty significantly.

He's already using RAID0, which should be the best performance of RAID since it doesn't have to use
any parity blocks/disks right? And from what I've seen about SSD (can't find a link now) filesystems
haven't caught up to it to make a real difference with one over the other. They do have much lower
power usage though (which is why they find their way into laptops).

--
Michael Peters
Plus Three, LP

Re: Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 21:47:07 by Neil Gunton

André Warnier wrote:
> Neil Gunton wrote:
> [...]
> Hi.
> I am not really an expert on large websites, caches and so on, but in
> our applications we are managing a large number of files.
> One of the things we have learned over the years, is that even on modern
> operating systems, having large numbers of entries in each directory is
> an absolute performance killer.
> This may or may not be relevant to your particular problem, but what is
> the average number of entries you have *per directory* ?

I'm not sure what the average number of files per directory is
currently. Is there a linux tool which gives that kind of statistic?

Looking at one random bucket, there were only 2 files in there.

I think the issue here is the large size of the directory tree itself -
simply traversing this seems to be a problem. I started off a du this
morning on that tree, at around 9am, and it's now after 12 midday and
the command is still not done yet. Meanwhile my iowait has doubled on
the server as a result. Obviously it's a lot of work just traversing
this tree, since du is not even doing any pruning, just walking the
directory tree. It makes me wonder if there's something wrong with my
system, though it seems ok in all other respects. I think this is just a
not-very-efficient datastructure, at least with respect to this
filesystem, hence my original question about reiserfs. I think I need
either a filesystem better suited to traversing large directory trees,
or else a different tool that keeps track of the cache in a different
manner.

Neil

Re: Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 21:59:44 by Holger Kipp

On Mon, Nov 24, 2008 at 03:37:29PM -0500, Perrin Harkins wrote:
> On Mon, Nov 24, 2008 at 3:16 PM, Michael Peters wrote:
> > Well except for getting 15K disks you probably won't be able to get much
> > more improvement from just the hardware.
>
> You don't think so? RAID and SSD can both improve your write
> throughput pretty significantly.

Using squid he could define one cache-directory for every disk,
so striping won't increase performance of the disks that much.
More important might be how the OS is caching write changes to
mitigate limited bandwidth (io) of the disks.

With ReiserFS I have seen some benchmarks that are not really in
favour, like

http://linuxgazette.net/122/TWDT.html#piszcz

and my experience with UFS2 (albeit on FreeBSD) was much better
than with Linux/ReiserFS on the same machine. Neither were tuned, though,
so ymmv.

Regards,
Holger Kipp

Re: Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 22:00:37 by John Hallam

On Mon, 24 Nov 2008, Neil Gunton wrote:

> I think the issue here is the large size of the directory tree itself -
> simply traversing this seems to be a problem. I started off a du this
> morning on that tree, at around 9am, and it's now after 12 midday and
> the command is still not done yet. Meanwhile my iowait has doubled on
> the server as a result.

Just a random thought... The O(n) directory search/traversal in
filesystems only hits you if you have directories with many many files in.
If your directories are like the one you sampled, with few items in, then
maybe you are thrashing one of the filesystem caches -- inodes, vnodes or
such -- while traversing the tree. I don't recall off-hand how you check
this, though looking at the output of iostat and vmstat would give you
some idea of where the traffic is in the VM and block IO subsystems.

Best wishes,

John

Re: Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 22:08:46 by Perrin Harkins

On Mon, Nov 24, 2008 at 3:46 PM, Michael Peters wrote:
> He's already using RAID0, which should be the best performance of RAID since
> it doesn't have to use any parity blocks/disks right?

Yes, I missed that. He could still improve the throughput by adding more disks.

> And from what I've
> seen about SSD (can't find a link now) filesystems haven't caught up to it
> to make a real difference with one over the other. They do have much lower
> power usage though (which is why they find their way into laptops).

We're talking high-end SSD, not the stuff they put in laptops. It's
fast, and you can make a RAID array of them, and it's within a
reasonable price range now.

A ton of RAM in the server might help too.

- Perrin

Re: Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 22:15:10 by Neil Gunton

Perrin Harkins wrote:
> A ton of RAM in the server might help too.

I've already got 4GB in there.

Well, the du just finished, it took 214 minutes to complete. I just took
a look at one of the directories in the cache. Now, I have it set for a
depth of 3, so I looked at d/d/d just randomly selected. Then I did a du
there. Here's the output:

server:/var/cache/www/d/d/d# du -h
4.0K ./2BykLs49Xm7cnV6MrWA.header.vary/Y/z/m
8.0K ./2BykLs49Xm7cnV6MrWA.header.vary/Y/z
12K ./2BykLs49Xm7cnV6MrWA.header.vary/Y
16K ./2BykLs49Xm7cnV6MrWA.header.vary
4.0K ./YFPZLpyo_NRtEUoJQQA.header.vary/k/a/y
8.0K ./YFPZLpyo_NRtEUoJQQA.header.vary/k/a
12K ./YFPZLpyo_NRtEUoJQQA.header.vary/k
16K ./YFPZLpyo_NRtEUoJQQA.header.vary
16K ./UM@uZ0AwL5n@QqLWnrA.header.vary/F/O/b
20K ./UM@uZ0AwL5n@QqLWnrA.header.vary/F/O
24K ./UM@uZ0AwL5n@QqLWnrA.header.vary/F
28K ./UM@uZ0AwL5n@QqLWnrA.header.vary
4.0K ./FrakgI6EKDUjb4dgMXQ.header.vary/G/N/n
8.0K ./FrakgI6EKDUjb4dgMXQ.header.vary/G/N
12K ./FrakgI6EKDUjb4dgMXQ.header.vary/G
16K ./FrakgI6EKDUjb4dgMXQ.header.vary
80K .

So you see, there are actually a lot more directories there than you
might assume based on a 3-level tree! I didn't know it was doing all
this as well, it makes more sense now that it would take a long time to
traverse - we're talking about a huge number of directories after you do
3 levels, one for each letter (large and small case) at each level, then
throw in those additional sub-levels... for EVERY leaf of the 3-level
tree, that's staggering. I need to look into the documentation for
mod_cache to see if there is something I need to tweak with this "vary"
stuff - maybe it's doing more than it has to, but I just don't know.
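
To put rough numbers on the tree size described above, here is a minimal Python sketch of the arithmetic. It assumes one character per directory level (as in the d/d/d path shown) and a 64-symbol alphabet (upper- and lower-case letters, digits, plus the `_` and `@` that appear in the listing). Both are assumptions drawn from the du output, not confirmed settings, and the function names are purely illustrative.

```python
# Assumption: the hash alphabet has 64 symbols (A-Z, a-z, 0-9, '_' and '@').
ALPHABET_SIZE = 64


def leaf_directories(levels, chars_per_level=1, alphabet_size=ALPHABET_SIZE):
    """Number of possible directories at the deepest hashed level."""
    return (alphabet_size ** chars_per_level) ** levels


def hashed_directories(levels, chars_per_level=1, alphabet_size=ALPHABET_SIZE):
    """Number of possible directories summed over all hashed levels."""
    per_level = alphabet_size ** chars_per_level
    return sum(per_level ** depth for depth in range(1, levels + 1))


def vary_directories(varied_entries, variants_per_entry, levels):
    """Upper bound on extra directories created for entries stored with Vary.

    Each entry gets one "<hash>.header.vary" directory, and each stored
    variant adds up to `levels` nested directories beneath it.
    """
    return varied_entries * (1 + variants_per_entry * levels)


if __name__ == "__main__":
    print("leaf directories:", leaf_directories(3))
    print("all hashed directories:", hashed_directories(3))
    print("extra directories in the sampled leaf:", vary_directories(4, 1, 3))
```

Under these assumptions there are 262144 possible leaf directories, and the four entries in the sampled leaf account for 16 extra directories, which matches the 16 subdirectory lines in the du listing.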

Neil

Re: Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 22:23:22 by Perrin Harkins

On Mon, Nov 24, 2008 at 4:15 PM, Neil Gunton wrote:
> Perrin Harkins wrote:
>>
>> A ton of RAM in the server might help too.
>
> I've already got 4GB in there.

Some desktop machines ship with that much these days. You could bump
it up to 16 or 32 (assuming it's 64-bit) pretty inexpensively and let
the VM system help you out.

A software change could be cheaper if it's simple, but if it requires
you to do a lot of rewriting you might save money by buying some RAM.

- Perrin

Re: Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 23:07:40 by Neil Gunton

Neil Gunton wrote:
> Well, the du just finished, it took 214 minutes to complete. I just took
> a look at one of the directories in the cache. Now, I have it set for a
> depth of 3, so I looked at d/d/d just randomly selected. Then I did a du
> there. Here's the output:
>
> server:/var/cache/www/d/d/d# du -h
> 4.0K ./2BykLs49Xm7cnV6MrWA.header.vary/Y/z/m
> 8.0K ./2BykLs49Xm7cnV6MrWA.header.vary/Y/z
> 12K ./2BykLs49Xm7cnV6MrWA.header.vary/Y
> 16K ./2BykLs49Xm7cnV6MrWA.header.vary
> 4.0K ./YFPZLpyo_NRtEUoJQQA.header.vary/k/a/y
> 8.0K ./YFPZLpyo_NRtEUoJQQA.header.vary/k/a
> 12K ./YFPZLpyo_NRtEUoJQQA.header.vary/k
> 16K ./YFPZLpyo_NRtEUoJQQA.header.vary
> 16K ./UM@uZ0AwL5n@QqLWnrA.header.vary/F/O/b
> 20K ./UM@uZ0AwL5n@QqLWnrA.header.vary/F/O
> 24K ./UM@uZ0AwL5n@QqLWnrA.header.vary/F
> 28K ./UM@uZ0AwL5n@QqLWnrA.header.vary
> 4.0K ./FrakgI6EKDUjb4dgMXQ.header.vary/G/N/n
> 8.0K ./FrakgI6EKDUjb4dgMXQ.header.vary/G/N
> 12K ./FrakgI6EKDUjb4dgMXQ.header.vary/G
> 16K ./FrakgI6EKDUjb4dgMXQ.header.vary
> 80K .
>
> So you see, there are actually a lot more directories there than you
> might assume based on a 3-level tree! I didn't know it was doing all
> this as well, it makes more sense now that it would take a long time to
> traverse - we're talking about a huge number of directories after you do
> 3 levels, one for each letter (large and small case) at each level, then
> throw in those additional sub-levels... for EVERY leaf of the 3-level
> tree, that's staggering. I need to look into the documentation for
> mod_cache to see if there is something I need to tweak with this "vary"
> stuff - maybe it's doing more than it has to, but I just don't know.

It seems like this might have something to do with mod_deflate, which I
am using in combination with mod_disk_cache. This page gives a clue that
there might be a problem with the way files are cached when these
modules are both enabled:

http://www.digitalsanctuary.com/tech-blog/general/apache-mod_deflate-and-mod_cache-issues.html

Seems like a very recent post (Nov 18th).

Any ideas? Seems like a big problem, if you're trying to use a reverse
proxy on a large dynamic site, and also optimize bandwidth by using
mod_deflate too.

Neil

Re: Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 23:19:07 by Adam Prime

Neil Gunton wrote:
>
> It seems like this might have something to do with mod_deflate, which
> I am using in combination with mod_disk_cache. This page gives a clue
> that there might be a problem with the way files are cached when these
> modules are both enabled:
>
> http://www.digitalsanctuary.com/tech-blog/general/apache-mod_deflate-and-mod_cache-issues.html
>
>
> Seems like a very recent post (Nov 18th).
>
> Any ideas? Seems like a big problem, if you're trying to use a reverse
> proxy on a large dynamic site, and also optimize bandwidth by using
> mod_deflate too.
>
> Neil
That does look like a big deal. If I were in your situation, I'd try
running with only mod_deflate, then only mod_cache, and see what
happens. There are benefits to running the reverse proxy alone (without
mod_cache), so that'd be the first scenario I'd try.

Adam

Re: Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 23:19:51 by mpeters

Adam Prime wrote:

> That does look like a big deal. If I were in your situation, I'd try
> running with only mod_deflate, then only mod_cache, and see what
> happens. There are benefits to running the reverse proxy alone (without
> mod_cache), so that'd be the first scenario I'd try.

Or split them up. If you have any static assets that can benefit from mod_deflate (Javascript, CSS,
etc) then put mod_deflate on the proxies and mod_perl, mod_cache on the backend.

--
Michael Peters
Plus Three, LP

Re: Best filesystem type for mod_cache in reverse proxy?

on 24.11.2008 23:54:42 by Neil Gunton

Neil Gunton wrote:
> Neil Gunton wrote:
> It seems like this might have something to do with mod_deflate, which I
> am using in combination with mod_disk_cache. This page gives a clue that
> there might be a problem with the way files are cached when these
> modules are both enabled:
>
> http://www.digitalsanctuary.com/tech-blog/general/apache-mod_deflate-and-mod_cache-issues.html

I have just been doing some experimentation on my development
workstation. It seems that with mod_deflate enabled, mod_cache doesn't
cache properly, or at least not as I would expect: I tested with two
browsers (Mozilla and Opera), both with no cookies related to the site, and
loading the same page from each. Both requests were passed through to
the back-end, i.e. were cached separately. This is with mod_deflate
enabled for html pages. So I disabled mod_deflate (just commented out
that one line), restarted the servers, cleared the caches of both
browsers and mod_cache, and tried again. This time, the first request
was passed through to the backend (as expected), but the second request,
from the other browser for the same page, was this time retrieved from
mod_cache. Also, the cache directories on the server end look a lot
simpler, I guess because the Vary header is no longer being set by
mod_deflate. This is very interesting, I'm going to do some more testing
on the production server, by clearing the mod_disk_cache cache and
disabling mod_deflate for a while to see how things run. I know the
content transmitted will be larger and thus slower for people on slow
connections, but right now I'm interested in seeing how this affects the
performance of htcacheclean, and even du - see if times for traversing
the directories get much better without all those extra Vary subdirs.
In any case, it would seem that the cache wasn't really working after
all, which might explain the large number of cache directories -
multiple versions of the same content. Yikes.

Neil

Re: Best filesystem type for mod_cache in reverse proxy?

on 25.11.2008 00:04:36 by aw

Neil Gunton wrote:
[...]
At the risk of stating the obvious, since you are talking about
mod_perl (and thus, I suppose, Perl): the basic module File::Find is a
good starting point for collecting all kinds of statistics about a file
hierarchy, such as the maximum and average number of levels, the number
of files per directory or per depth, sizes, etc.
You can easily build a script that will run regularly on your file
structure and take some snapshots over time.
Real numbers are generally a better base for optimisation than mere
impressions.
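
As a rough illustration of the kind of statistics script suggested here, the sketch below walks a tree and reports directory counts, file counts, entries per directory, and depth. The suggestion above is Perl's File::Find; this sketch swaps in Python's `os.walk`, which does the same kind of traversal. The function name `tree_stats` and the demo directory names are hypothetical.

```python
import os
import tempfile


def tree_stats(root):
    """Walk `root` and collect simple per-directory statistics."""
    root = os.path.abspath(root)
    base_depth = root.rstrip(os.sep).count(os.sep)
    dir_count = 0
    file_count = 0
    max_entries = 0
    max_depth = 0
    for dirpath, dirnames, filenames in os.walk(root):
        dir_count += 1
        file_count += len(filenames)
        # Entries in this directory: its subdirectories plus its files.
        max_entries = max(max_entries, len(dirnames) + len(filenames))
        depth = dirpath.rstrip(os.sep).count(os.sep) - base_depth
        max_depth = max(max_depth, depth)
    # Every directory except the root is itself an entry in its parent.
    total_entries = file_count + dir_count - 1
    return {
        "directories": dir_count,
        "files": file_count,
        "avg_entries_per_dir": total_entries / dir_count,
        "max_entries_in_a_dir": max_entries,
        "max_depth": max_depth,
    }


if __name__ == "__main__":
    # Demo on a throwaway tree shaped like the cache layout in this thread.
    with tempfile.TemporaryDirectory() as tmp:
        leaf = os.path.join(tmp, "d", "d", "d", "example.header.vary", "Y", "z", "m")
        os.makedirs(leaf)
        open(os.path.join(leaf, "example.header"), "w").close()
        print(tree_stats(tmp))
```

Run on a real cache root, this walk costs about as much I/O as du does, so on a tree of this size it is probably best scheduled off-peak.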

Re: Best filesystem type for mod_cache in reverse proxy?

on 25.11.2008 19:30:38 by Neil Gunton

Neil Gunton wrote:
> Neil Gunton wrote:
>> Neil Gunton wrote:
>> It seems like this might have something to do with mod_deflate, which
>> I am using in combination with mod_disk_cache. This page gives a clue
>> that there might be a problem with the way files are cached when these
>> modules are both enabled:
>>
>> http://www.digitalsanctuary.com/tech-blog/general/apache-mod_deflate-and-mod_cache-issues.html
>
> I have just been doing some experimentation on my development
> workstation. It seems that with mod_deflate enabled, mod_cache doesn't
> cache properly, or at least not as I would expect: I tested with two
> browsers (Mozilla and Opera), both with no cookies related to the site, and
> loading the same page from each. Both requests were passed through to
> the back-end, i.e. were cached separately. This is with mod_deflate
> enabled for html pages. So I disabled mod_deflate (just commented out
> that one line), restarted the servers, cleared the caches of both
> browsers and mod_cache, and tried again. This time, the first request
> was passed through to the backend (as expected), but the second request,
> from the other browser for the same page, was this time retrieved from
> mod_cache. Also, the cache directories on the server end look a lot
> simpler, I guess because the Vary header is no longer being set by
> mod_deflate. This is very interesting, I'm going to do some more testing
> on the production server, by clearing the mod_disk_cache cache and
> disabling mod_deflate for a while to see how things run. I know the
> content transmitted will be larger and thus slower for people on slow
> connections, but right now I'm interested in seeing how this affects the
> performance of htcacheclean, and even du - see if times for traversing
> the directories get much better without all those extra Vary subdirs.
> In any case, it would seem that the cache wasn't really working after
> all, which might explain the large number of cache directories -
> multiple versions of the same content. Yikes.

Well, that seemed to do the trick! So the caveat seems to be: Be careful
using both mod_deflate and mod_cache (mod_disk_cache specifically)
together if you have a large dynamic website that can generate a large
number of distinct pages. Mod_deflate produces a Vary header, which
forces mod_cache to store multiple versions of the same content. To
compound this, every version involves additional subdirs in the cache,
which makes traversing it in any fashion very, very time-consuming,
producing high iowait even for a fast 4-disk SCSI RAID0 setup.
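
As a rough illustration of the effect described above, here is a toy model of
a cache that keys each entry on the URL plus the request headers named in Vary.
It is only a sketch: it does not reproduce mod_disk_cache's on-disk layout, and
cache_key, count_variants, the URL and the header values are all made up for
the example. Which headers actually end up in Vary depends on the configuration.

```python
import hashlib

def cache_key(url, request_headers, vary_on):
    """Build a cache key from the URL plus the value of each header named in Vary."""
    parts = [url]
    for name in sorted(h.lower() for h in vary_on):
        parts.append(f"{name}={request_headers.get(name, '')}")
    return hashlib.md5("\n".join(parts).encode("utf-8")).hexdigest()

def count_variants(url, requests, vary_on):
    """Count how many distinct cache entries a set of requests would create."""
    return len({cache_key(url, headers, vary_on) for headers in requests})

# Two browsers asking for the same page (header values are illustrative only).
requests = [
    {"user-agent": "Mozilla/5.0", "accept-encoding": "gzip,deflate"},
    {"user-agent": "Opera/9.62", "accept-encoding": "deflate, gzip, x-gzip, identity, *;q=0"},
]

print(count_variants("/journal/page1", requests, vary_on=[]))              # no Vary header
print(count_variants("/journal/page1", requests, vary_on=["User-Agent"]))  # keyed per browser
```

With no Vary header both requests share one entry; once the key depends on a
per-browser header, every distinct header value adds another stored copy of
the same page.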

It took more than three hours just to delete the old cache.

Once I disabled mod_deflate, the new cache looks a lot cleaner - just
the three levels of directory that I specified in the config via
CacheDirLevels, and none of the extra .vary sub-levels.

Additionally, du now just takes a few seconds to traverse the cache,
which currently is set at 1GB. Htcacheclean seems to be keeping up well
in daemon mode, with -i -n options. The large, ongoing iowait on the
server has disappeared completely.

Web pages seem to render a little faster in the browser too. That may be
my imagination and/or placebo effect, but it might make sense if there
isn't that additional compression/decompression going on both ends.

The only downside is that people on extremely slow dialup connections
might notice longer download times for page text... but I have to wonder
if that's really an issue today. Back in 1998 perhaps you might care
about something being 20KB rather than 80KB, but surely not today. In
any case, don't dialup ISPs often implement their own compression now?

Anyway, hope that's helpful to anybody running large dynamic websites
behind a reverse proxy. Keep mod_cache, maybe think about ditching
mod_deflate. The combination does technically work, but for large
numbers of pages, it can make your cache size (and your iowait) explode.

Neil

Re: Best filesystem type for mod_cache in reverse proxy?

am 25.11.2008 19:55:22 von Perrin Harkins

On Tue, Nov 25, 2008 at 1:30 PM, Neil Gunton wrote:
> The only downside is that people on extremely slow dialup connections might
> notice longer download times for page text... but I have to wonder if that's
> really an issue today. Back in 1998 perhaps you might care about something
> being 20KB rather than 80KB, but surely not today. In any case, don't dialup
> ISPs often implement their own compression now?

Compressing is pretty important:
http://developer.yahoo.net/blog/archives/2007/07/high_performanc_3.html

I wonder if there's a way to make the mod_deflate Vary header a bit
saner, so it just reflects compressed or not, rather than every
possible User-Agent.
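
One way to picture the "compressed or not" idea is to collapse whatever
Accept-Encoding the browser sends into one of two classes and key the cache on
that class alone. The helper below is a hypothetical sketch, not a mod_deflate
or mod_cache feature, and it deliberately ignores wildcards and encodings other
than gzip.

```python
def encoding_class(accept_encoding):
    """Collapse an Accept-Encoding header value into 'gzip' or 'identity'."""
    for item in accept_encoding.split(","):
        token, _, params = item.strip().partition(";")
        token = token.strip().lower()
        if token not in ("gzip", "x-gzip"):
            continue
        # Treat an explicit q=0 as "this encoding is not acceptable".
        q = 1.0
        for param in params.split(";"):
            key, _, value = param.strip().partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if q > 0:
            return "gzip"
    return "identity"

# Very different header strings end up in the same class.
print(encoding_class("gzip,deflate"))
print(encoding_class("deflate, gzip, x-gzip, identity, *;q=0"))
print(encoding_class(""))
```

A cache keyed on this value would hold at most two variants per page, however
many different browsers visit.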

There are also alternative ways to cache pages, like pre-publishing
them as static files or doing page caching with mod_perl handlers that
intercept the request before the response phase and serve a cached
copy. It's very convenient to use mod_cache though.

- Perrin

Re: Best filesystem type for mod_cache in reverse proxy?

am 26.11.2008 08:46:29 von Raymond Wan

Hi

Neil Gunton wrote:
> Well, that seemed to do the trick! So the caveat seems to be: Be
> careful using both mod_deflate and mod_cache (mod_disk_cache
> specifically) together if you have a large dynamic website that can
> generate a large number of distinct pages. Mod_deflate produces a


This is probably a digression from your discussion, but I'm not sure if
any of you have used gzip + md5sum together before. I have, and it can
be annoying especially if you are playing with large data files like I
do. This is because gzip seems to (not 100% sure) store some time
information in the archive. So, if you create two archives of the same
files, they aren't identical...their md5sums do not match.

As deflate is essentially the same algorithm as gzip, it is somewhat the
same annoyance...
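
For what it's worth, the time information lives in the gzip header (a 4-byte
modification time field), so it can be shown directly. The sketch below uses
Python's gzip module and assumes Python 3.8 or later for the mtime argument;
the payload and timestamp values are arbitrary. It also shows that a plain zlib
stream, which wraps the same deflate data without that header field, comes out
identical on repeated runs.

```python
import gzip
import hashlib
import zlib

data = b"the same payload every time\n" * 100

# Same input, different header timestamps (the mtime values are arbitrary).
a = gzip.compress(data, mtime=1)
b = gzip.compress(data, mtime=2)
c = gzip.compress(data, mtime=1)

print(hashlib.md5(a).hexdigest() == hashlib.md5(b).hexdigest())  # timestamps differ
print(hashlib.md5(a).hexdigest() == hashlib.md5(c).hexdigest())  # timestamps match
print(gzip.decompress(a) == gzip.decompress(b))                  # payload is unaffected

# A zlib stream carries the same deflate data but has no timestamp field.
print(zlib.compress(data) == zlib.compress(data))
```

On the command line, gzip -n leaves the name and timestamp out of the header,
which is one way to get matching checksums for identical input.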


> Web pages seem to render a little faster in the browser too. That may
> be my imagination and/or placebo effect, but it might make sense if
> there isn't that additional compression/decompression going on both ends.
>
> The only downside is that people on extremely slow dialup connections
> might notice longer download times for page text... but I have to
> wonder if that's really an issue today. Back in 1998 perhaps you might
> care about something being 20KB rather than 80KB, but surely not
> today. In any case, don't dialup ISPs often implement their own
> compression now?


I had looked at the effect compression has on web pages a while ago.
Though not relevant to modperl, there is obviously a cost to compression
and since most HTML pages are small, sometimes it is hard to justify.
If users are downloading XML files of data, though, then that is of
course worth it...but one could argue that if you are making XML files
available for download, then wouldn't it be better to compress them
yourself rather than asking Apache to compress on the fly?

As for dialup, if I remember from those dark modem days :-), even many
of them had compression built in. In fact, I think they had some form
of the deflate/gzip/sliding window algorithm. And for those of us who
have tried gzipping an already-gzipped file, adding compression to
something that is already compressed is generally counter-productive...

Anyway, I don't think it is much of an issue...might be more helpful to
educate web page creators to not put MBs of images on a single page. :-)

Ray




>
> Anyway, hope that's helpful to anybody running large dynamic websites
> behind a reverse proxy. Keep mod_cache, maybe think about ditching
> mod_deflate. The combination does technically work, but for large
> numbers of pages, it can make your cache size (and your iowait) explode.

Re: Best filesystem type for mod_cache in reverse proxy?

am 26.11.2008 15:27:11 von mpeters

Raymond Wan wrote:

> I had looked at the effect compression has on web pages a while ago.
> Though not relevant to modperl, there is obviously a cost to compression
> and since most HTML pages are small, sometimes it is hard to justify.

Not to discredit the work you did researching this, but a lot of people are studying the same thing
and coming to different conclusions:

http://developer.yahoo.com/performance/rules.html

Yes, backend performance matters, but more and more we realize that the front end tweaks we can make
give a better performance for users.

Take Google as an example. The overhead of compressing their content and decompressing it on the
browser takes less time than sending the same content uncompressed over the network. I'd say the
same is true for most other applications too.

> As for dialup, if I remember from those dark modem days :-)

Even non dialup customers can benefit. Many "broadband" connections aren't very fast, especially in
rural places (I'm thinking large portions of the US).

But all this talk is really useless in the abstract. Take a tool like YSlow for a spin and see how
your sites perform with and without compression. Especially looking at the waterfall display.

--
Michael Peters
Plus Three, LP

Re: Best filesystem type for mod_cache in reverse proxy?

am 26.11.2008 17:14:54 von Raymond Wan

Hi Michael,


Michael Peters wrote:
> Raymond Wan wrote:
>> I had looked at the effect compression has on web pages a while ago.
>> Though not relevant to modperl, there is obviously a cost to
>> compression and since most HTML pages are small, sometimes it is hard
>> to justify.
>
> Not to discredit the work you did researching this, but a lot of
> people are studying the same thing and coming to different conclusions:
>
> http://developer.yahoo.com/performance/rules.html
>
> Yes, backend performance matters, but more and more we realize that
> the front end tweaks we can make give a better performance for users.
>
> Take Google as an example. The overhead of compressing their content
> and decompressing it on the browser takes less time than sending the
> same content uncompressed over the network. I'd say the same is true
> for most other applications too.


It's ok; I don't consider another opinion as discrediting my work. :-)
Actually, it was a while ago and it was only one aspect of my work and
in a smaller test bed. My fault for handwaving in my reply, though.

The point is actually the "sometimes"... My research was more in
general compression and web compression was only one aspect. My point
is if you take a one byte file and run gzip -9 on it (again, the same
algorithm as deflate), you get a 24 byte file. As you increase that
file size, you will reach a point where it becomes more beneficial to
compress. Though my example is both silly and pathological, it just
shows that there are cases when compression may not be beneficial. And
one can imagine the average file size of a web site to be some kind of
knob and as it turns (average file size increases as you go from site to
site), the benefits become more and more evident.
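
A small sketch of that knob, assuming Python's gzip module as a stand-in for
the command-line tool: it prints the original and compressed size for inputs
of growing length. The test sentence is made up and, being repeated verbatim,
compresses far better than real pages would. The command-line gzip also stores
the file name by default, so its byte counts differ slightly from these.

```python
import gzip

def gzip_size(payload):
    """Size in bytes of the payload after gzip at the highest compression level."""
    return len(gzip.compress(payload, compresslevel=9, mtime=0))

sentence = b"The quick brown fox jumps over the lazy dog. "

# Start with a single byte, then grow the input step by step.
for repeats in (0, 1, 4, 16, 64, 256):
    payload = sentence * repeats if repeats else b"x"
    print(len(payload), "->", gzip_size(payload))
```

The one-byte input comes out larger than it went in because of the fixed
header and trailer; the larger inputs come out smaller.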

For example, compressing an already compressed file is generally
pointless (if it was done right the first time). MP3, JPEG, GIF, etc.
are all file formats that have or may have compression incorporated.
PDFs can be compressed too if someone selected that option when creating
it. English text compresses well (25%, in general?) but two-byte
encodings such as Chinese and Japanese (I think) get around 40-50%
[handwaving again :-) there are more updated numbers out there]. Also,
compression works if it is a uniform file; if a web page has a mix of
text, images, etc., then each one has to be compressed individually.
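
To make the "already compressed" case concrete, the sketch below compares the
compression ratio of text-like input with that of random bytes. The random
bytes are only a stand-in for an already-compressed file such as a JPEG (an
assumption for the demo, not a measurement of real images), and the vocabulary
is made up.

```python
import random
import zlib

def ratio(payload):
    """Compressed size divided by original size (below 1.0 means it shrank)."""
    return len(zlib.compress(payload, 9)) / len(payload)

rng = random.Random(42)  # fixed seed so the run is repeatable

# Varied, text-like input built from a small vocabulary.
words = [b"apache", b"cache", b"proxy", b"deflate", b"header", b"perl", b"vary", b"disk"]
text = b" ".join(rng.choice(words) for _ in range(20000))

# Random bytes stand in for data that has already been compressed.
noise = bytes(rng.getrandbits(8) for _ in range(len(text)))

print(round(ratio(text), 3))
print(round(ratio(noise), 3))
```

The text shrinks to a fraction of its size, while the high-entropy input stays
at roughly its original size plus a little framing overhead.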

As for Google, you are right -- I can imagine why it would work well for
Google. However, I can also hypothesize that it might be a special
case. I presume you mean the results of a query. The result we get is
a list of results which all are related to each other. i.e., if you
searched for "apache2 modperl", we can expect those two words to be in
every result and the type of words to be similar from result to result
[they would all be computer-oriented]. As compression aims to reduce
redundancy, their results are perfect for it. Especially if

Anyway, what I wanted to say is that there ought to be instances when
compression is beneficial and when it isn't. I think it is fine to do
what the Yahoo site says and have it "on" by default; but if someone
examines the traffic and data and realizes it should be "off", that
isn't beyond reason.


>> As for dialup, if I remember from those dark modem days :-)
>
> Even non dialup customers can benefit. Many "broadband" connections
> aren't very fast, especially in rural places (I'm thinking large
> portions of the US).
>
> But all this talk is really useless in the abstract. Take a tool like
> YSlow for a spin and see how your sites perform with and without
> compression. Especially looking at the waterfall display.
>

Well, one good thing about deflate is that it is *fast*. Very fast.
So, while my silly one-byte file example shows there are exceptions, the
break-even size might be closer to one byte. :-)

One cost saving might be to pre-compress files, since deflate takes
more time to compress than to decompress, i.e., have them reside on
the server in compressed form. Of course, that poses many problems
and is one reason why things like Stacker didn't really catch on
(much)...
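
A minimal sketch of the pre-compression idea, assuming static files under a
document root and a ".gz" sibling written next to each one. The function name,
the suffix list and the naming scheme are assumptions for illustration; the
server still has to be configured separately to send those files with the
right Content-Encoding, which is not shown here.

```python
import gzip
import shutil
from pathlib import Path

def precompress(root, suffixes=(".html", ".css", ".js", ".xml")):
    """Write a .gz sibling next to each matching file under root; return the new paths."""
    written = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        target = path.with_name(path.name + ".gz")
        # Skip files whose compressed copy is already up to date.
        if target.exists() and target.stat().st_mtime >= path.stat().st_mtime:
            continue
        with path.open("rb") as src, gzip.open(target, "wb", compresslevel=9) as dst:
            shutil.copyfileobj(src, dst)
        written.append(target)
    return written
```

Run from cron or a publish step, this pays the compression cost once per file
change instead of once per request; it does nothing for dynamically generated
pages.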

Ray