combining multiple filtered files into a single response

On 19.11.2010 00:15:54 by Brian

Hello,

I've read through the documentation and searched the list archives and elsewhere
for information on how to do this properly, but I've come up short so I'm
turning to the users list for help.

What I'm trying to do is determine the proper way, modperl-style, to iterate the
process of passing a bucket brigade of file contents through an output filter
for a series of files, all within a single request/response cycle. And I want
to do it right.

So for example, start with a simple handler and filter, where the handler uses
add_output_filter() to add the filter and then calls sendfile() on a given file to
generate filtered output. In this example, imagine that the filter prefixes every line
of the file with the filename, so that the lines of the filtered output stream all
begin with "[filename]:". (For the real solution, though, the filter needs to be able
to operate on binary content.)
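
For concreteness, here's a rough sketch of that single-file example (package names are
made up, and the line-prefixing is kept deliberately naive):

# A rough sketch of the single-file example, assuming mod_perl 2.
# Package names are made up, and the line-prefixing is purely illustrative
# (a real filter would have to handle lines split across reads, and binary data).
use strict;
use warnings;

use Apache2::RequestRec ();
use Apache2::RequestIO ();
use Apache2::Filter ();
use Apache2::Const -compile => qw(OK);

package MyApp::PrefixFilter;

use constant BUFF_LEN => 8192;

sub handler {
    my $f = shift;
    my $name = $f->r->filename;        # label each output line with the file being served

    while ($f->read(my $buffer, BUFF_LEN)) {
        $buffer =~ s/^/[$name]: /mg;   # naive; assumes whole lines per read
        $f->print($buffer);
    }

    return Apache2::Const::OK;
}

package MyApp::SendOneFile;

sub handler {
    my $r = shift;

    $r->content_type('text/plain');
    $r->add_output_filter(\&MyApp::PrefixFilter::handler);
    $r->sendfile($r->filename);

    return Apache2::Const::OK;
}

1;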

Now imagine a request where a user submits three filenames through a form: "fileA",
"fileB" and "fileC". The goal is to have the handler perform the equivalent of
a sendfile() on each, have each pass through the filter in succession, and
output the concatenated content to the client connection in a single response.
The main handler would perform other functions too, like inserting header or
footer sections and possibly additional metadata at the file boundaries. The
response headers can all be set at the beginning. What is the proper way to do
this? Subrequests? (The documentation on subrequests is sparse and doesn't
suggest how you might combine multiple subrequests.) Special boundary buckets to
indicate the content switch?

I want to do this the right way but I'm not sure which modperl tools are right
for the job. I've thought of a few potential solutions which I'm not sure how to
implement.

One is to iterate over the filenames with subrequests (if this is even
possible/supported), so that each can be passed internally to a single request
as in the simple (single-file) handler described in the example above. If the
output of the subrequests can be captured then they can be combined into a
single response. That idea seems to be the cleanest, if not the most efficient.
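
If subrequests can be combined like that, I imagine the loop would look roughly like
this sketch (the "/fetch" URI and the hard-coded file list are placeholders); the part
I can't find documented is whether the output of run() can be combined cleanly into
the main response:

# Hypothetical sketch only: "/fetch", the file list, and the header/footer
# strings are all placeholders.
use strict;
use warnings;

use Apache2::RequestRec ();
use Apache2::RequestIO ();
use Apache2::SubRequest ();
use Apache2::Const -compile => qw(OK);

sub handler {
    my $r = shift;

    my @requested_files = ('fileA', 'fileB', 'fileC');   # however the names arrive

    $r->content_type('text/plain');
    $r->print("=== header section ===\n");               # placeholder

    for my $file (@requested_files) {
        # run() pushes the subrequest's (filtered) output into the response;
        # whether that can be combined cleanly here is exactly the open question.
        my $subr   = $r->lookup_uri("/fetch?name=$file");
        my $status = $subr->run;
    }

    $r->print("=== footer section ===\n");               # placeholder

    return Apache2::Const::OK;
}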

If that doesn't work, then I can imagine iterating over the files with calls to
sendfile() and using a modified filter to detect the file boundaries. However, since
the filter needs to be able to handle binary content, it can't do this by inspecting
the data itself (nor should it, since that's inefficient), but it could do so by
counting bytes, if it knows the sizes of the files ahead of time, or by using some
other out-of-band signal, like a "flush" bucket that indicates a file boundary.
However, that solution seems messy and error-prone.
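
To illustrate the out-of-band idea, I imagine the filter side would look roughly like
the sketch below; part of what makes it messy is that I can't guarantee flush buckets
appear only at the boundaries:

# Sketch of a brigade filter that treats FLUSH buckets as file boundaries.
# Fragile by design: flush buckets can show up for other reasons too.
use strict;
use warnings;

use Apache2::Filter ();
use APR::Brigade ();
use APR::Bucket ();
use Apache2::Const -compile => qw(OK);

sub boundary_filter {
    my ($f, $bb) = @_;
    my $ctx = $f->ctx || { file_index => 0 };

    for (my $b = $bb->first; $b; $b = $bb->next($b)) {
        if ($b->is_flush) {
            $ctx->{file_index}++;          # assume: the next data bucket starts a new file
        }
        elsif ($b->read(my $data)) {
            # per-file processing of $data for file number $ctx->{file_index}
            # would happen here
        }
    }

    $f->ctx($ctx);
    $f->next->pass_brigade($bb);
    return Apache2::Const::OK;
}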

I'd also like to avoid the last resort which would be to run a long process to
process each file, save them to a temporary directory, and then re-read them at
the end into a single output stream. This defeats the purpose because I'd like
to be able to start writing the output of the first filtered file to the client
as soon as it's processed.

Any advice? Really what I want is a response handler that can generate: [apache
headers] + HeaderSection + {sendfile(file1)->output_filter} +
{sendfile(file2)->output_filter} + ... + {sendfile(fileN)->output_filter} +
FooterSection. The trick is that the output_filter is doing a file-specific thing
for each file, and it can't operate on a single stream of data without being able
to recognize the file boundaries.
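
In other words, something shaped roughly like the sketch below, where
MyApp::BoundaryFilter and the "file_sizes" pnotes key are just stand-ins for whatever
boundary mechanism turns out to be right (note that the header/footer prints also pass
through the filter, so they'd have to be accounted for too):

# Rough sketch of the handler side.  MyApp::BoundaryFilter and the
# "file_sizes" pnotes entry are placeholders for the boundary mechanism.
use strict;
use warnings;

use Apache2::RequestRec ();
use Apache2::RequestIO ();
use Apache2::RequestUtil ();
use Apache2::Filter ();
use Apache2::Const -compile => qw(OK);

sub handler {
    my $r = shift;

    my @files = ('/data/fileA', '/data/fileB', '/data/fileC');   # from the form, after validation

    $r->content_type('application/octet-stream');

    # tell the (yet to be written) filter where each file ends
    $r->pnotes(file_sizes => [ map { -s $_ } @files ]);
    $r->add_output_filter(\&MyApp::BoundaryFilter::handler);

    $r->print("=== header section ===\n");          # HeaderSection placeholder

    for my $file (@files) {
        $r->print("--- $file ---\n");               # per-file metadata placeholder
        $r->sendfile($file);                        # contents stream through the filter
        $r->rflush;                                 # push what we have so far to the client
    }

    $r->print("=== footer section ===\n");          # FooterSection placeholder

    return Apache2::Const::OK;
}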

Thanks for your help,

Brian

Re: combining multiple filtered files into a single response

On 19.11.2010 00:53:37 by aw

Brian wrote:
....
> I'd also like to avoid the last resort which would be to run a long
> process to process each file, save them to a temporary directory, and
> then re-read them (one after the other) at the end (and send them out)
> into a single output stream. This defeats the purpose because I'd like
> to be able to start writing the output of the first filtered file to
> the client as soon as it's processed.
Why is that "the last resort"?
It seems to me to be the logical way of achieving what you want.
Anyway, when the user is posting the three files (as 3 "file" boxes in a form), this is
sent by the browser as one long stream (multipart-encoded, but still one long stream).
The server receives it as one long stream, parses it, saves each of these files to a
temporary file on disk, and then hands you a filehandle to each one of them.
Both apreq and CGI will do that for you.
That code is already developed, debugged, optimised, and tested a million times.
Why make your life more complicated than it is, and risk making mistakes re-developing
your own version of it?
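
For example, with apreq the uploaded files are already waiting for you as filehandles
on spooled temporary files, roughly like this:

# Typical apreq usage for uploaded files; the field handling here is generic.
use strict;
use warnings;

use Apache2::Request ();
use Apache2::Upload ();
use Apache2::Const -compile => qw(OK);

sub handler {
    my $r = shift;
    my $req = Apache2::Request->new($r);

    for my $field ($req->upload) {        # names of all file-upload fields
        my $upload = $req->upload($field);
        my $fh     = $upload->fh;         # filehandle on the spooled temp file
        my $name   = $upload->filename;   # file name as given by the client
        # ... read from $fh and do whatever is needed ...
    }

    return Apache2::Const::OK;
}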

In addition, you can't really start sending the result back to the client before the
client has finished sending the whole form (including all the uploaded files). The
client will not even start to read the server response before then.

Re: combining multiple filtered files into a single response

On 19.11.2010 01:10:31 by Brian

On 11/18/10 6:53 PM, André Warnier wrote:
>> I'd also like to avoid the last resort which would be to run a long process to
>> process each file, save them to a temporary directory, and then re-read them
>
> Why is that "the last resort" ?
> It seems to me to be the logical way of achieving what you want.
> Anyway, when the user is posting the three files (as 3 "file" boxes in a form),
> this is sent by the browser as one long stream

Sorry, maybe I wasn't clear: the client isn't sending the files; they are read locally
on the server and served to the client in a single output stream. The client simply
provides the FILENAMES in the request; the server has to do all the work, and it
already has all the data.

My point there was that if the client says "send file1+file2+file3", then I want the
server to be able to start outputting the data for file1 immediately, before file2 has
even been read from disk. But in addition the data needs to pass through an output
filter, and each file will pass through that same filter, so the filter needs to know
when it hits an EOF for one file (as opposed to the EOS for the whole response).

I thought a simple solution would be that in a sub-request the filter only processes
one file, so it sees a single stream; but then I have to combine the sub-requests at
the top level, and I don't know how to do that. Ideally I want something like this:

for my $file (@requested_files) {
    my $header = generate_header($file);
    my $footer = generate_footer($file);
    $r->write($header);
    my $subr = $r->lookup_uri("/fetch?name=$file", $myfilter);
    $r->write(...output of $subr...?);
    $r->write($footer);
}

It's a little more complicated than that, but hopefully you get the idea...


Brian

Re: combining multiple filtered files into a single response

On 19.11.2010 17:58:33 by Ryan Gies

On 11/18/2010 06:15 PM, Brian wrote:
> One is to iterate over the filenames with subrequests (if this is even
> possible/supported), so that each can be passed internally to a single
> request as in the simple (single-file) handler described in the
> example above. If the output of the subrequests can be captured then
> they can be combined into a single response. That idea seems to be the
> cleanest, if not the most efficient.
Although you could get them to work, I don't think sub-requests are your
answer. They run through all of the handler phases and are expected to
return full HTTP responses.
>
> If that doesn't work then I can imagine iterating over the files with
> calls to "sendfile()" and using a modified filter to guess at file
> boundaries. However since the filter needs to be able to handle
> binary content it can't do this by reading the data itself (nor should
> it, since that's inefficient), but it could do so by counting bytes if
> it knows the size of the files ahead of time, or some other
> out-of-band signal like a "flush" bucket that indicates a file
> boundary. However that solution seems messy and prone to error.
Because your out-of-band signal may be split across buckets, the
output-filter approach is probably not your answer either. Again, it
can be done, but it introduces [seemingly] unneeded complexity. I
would say the same for tracking boundaries according to their offset.

Unless there is some constraint, the most straightforward approach may
be to implement your routine so that it modifies the file contents as
they are read from disk:

send_headers();

$r->print($content_header);

foreach my $path (@files) {
    my $file = Your::FileFilter->new($path) or die;
    $file->open or die;
    while (my $buf = $file->read) {
        $r->print($buf);
    }
    $file->close or die;
    $r->rflush();
}

$r->print($content_footer);
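
Where Your::FileFilter is just a stand-in for a small wrapper class that applies your
per-file transformation as the data is read, something along these lines:

# Skeleton for the hypothetical Your::FileFilter wrapper used above
package Your::FileFilter;

use strict;
use warnings;

use constant BUFF_LEN => 8192;

sub new {
    my ($class, $path) = @_;
    return bless { path => $path, fh => undef }, $class;
}

sub open {
    my $self = shift;
    CORE::open(my $fh, '<:raw', $self->{path}) or return 0;
    $self->{fh} = $fh;
    return 1;
}

sub read {
    my $self = shift;
    my $n = CORE::read($self->{fh}, my $buf, BUFF_LEN);
    return undef unless $n;            # EOF (0) or error (undef)
    return $self->transform($buf);
}

sub transform {
    my ($self, $buf) = @_;
    return $buf;                       # identity; put your real processing here
}

sub close {
    my $self = shift;
    CORE::close($self->{fh}) if $self->{fh};
    return 1;
}

1;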

Re: combining multiple filtered files into a single response

On 19.11.2010 18:55:06 by Brian

On 11/19/10 11:58 AM, Ryan Gies wrote:
>> One is to iterate over the filenames with subrequests (if this is even
>> possible/supported), so that each can be passed internally to a single request
> Although you could get them to work, I don't think sub-requests are your answer.
> They run through all of the handler phases and are expected to return full HTTP
> responses.

I think that's right. I gave up on that idea after an exchange with André.

>> If that doesn't work then I can imagine iterating over the files with calls to
>> "sendfile()" and using a modified filter to guess at file boundaries.
> Because your out-of-band signal may be split across buckets, the output-filter
> approach is probably not your answer either. Once again it can be done, however
> introduces [seemingly] unneeded complexity. I would say the same for tracking
> boundaries according to their offset.

Well, there are lots of cases where you have to worry about data being split across
buckets (or even brigades) in an output filter, but there are known solutions for
that, namely maintaining filter context. The reason I don't trust custom in-band
signals is that the filter is handling binary data, so I can't predict reliably what
will pass through. Tracking by offset is more promising, even though, as you say, it
adds complexity and openings for error.
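
(The usual context-keeping pattern being roughly the following, with the filter's
running state carried in $f->ctx between invocations:)

# Generic context-keeping pattern for a streaming output filter
use strict;
use warnings;

use Apache2::Filter ();
use Apache2::Const -compile => qw(OK);

use constant BUFF_LEN => 8192;

sub handler {
    my $f = shift;
    my $ctx = $f->ctx || { bytes_seen => 0 };    # survives across invocations/brigades

    while ($f->read(my $buffer, BUFF_LEN)) {
        $ctx->{bytes_seen} += length $buffer;
        $f->print($buffer);
    }

    $f->ctx($ctx);
    return Apache2::Const::OK;
}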

> Unless there is some constraint, the most straight-forward approach may be to
> implement your routine to modify the file contents as they are read from disk:

Yeah, André came up with the same idea, but I'm hoping to avoid that for the same
reason I gave up on the idea of subrequests: efficiency (and re-use). I already have a
filter that does what I want and operates on bucket brigades; it's designed for binary
files, and in most cases it only needs to read the first few kilobytes of
multi-megabyte files before deciding that it can pass the rest through untouched. For
efficiency it's much better to skip the calls to read() and have the data read only
once, when it's written to the client, rather than multiple times, and into memory,
by the response handler.

So the only way I can think of to maintain this efficiency for multiple files in a
single stream is to have their filehandles go into bucket brigades in succession and
have the filter track the boundaries by offsets. I know I can't rely on brigade
boundaries or flush buckets, because a single file can be spread over multiple
brigades, and I'm not confident that I can control where flush buckets appear unless I
insert a filter directly in front of mine to strip them out except at the boundaries
(does anyone know whether flush buckets are predictable?). It's a bit messy, and I'm
still hoping someone here can offer a cleaner mechanism, but if not then I'll try that.
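
Concretely, I'm picturing something along these lines on the filter side: the handler
would stash the per-file sizes somewhere the filter can see them (pnotes in this
sketch, under a key I've just made up), and the filter counts bytes against the
current file:

# Sketch of the offset-tracking filter.  "file_sizes" is a made-up pnotes key
# (the handler would have to stash [ -s $_ ] for each file); it also assumes
# the byte counts cover everything the handler writes before/between the
# files, otherwise the offsets drift.
use strict;
use warnings;

use Apache2::Filter ();
use Apache2::RequestRec ();
use Apache2::RequestUtil ();
use Apache2::Const -compile => qw(OK);

use constant BUFF_LEN => 8192;

sub handler {
    my $f = shift;

    my $ctx = $f->ctx || {
        sizes     => [ @{ $f->r->pnotes('file_sizes') || [] } ],
        remaining => undef,     # bytes left in the current file
    };
    $ctx->{remaining} = shift @{ $ctx->{sizes} }
        unless defined $ctx->{remaining};

    while ($f->read(my $buffer, BUFF_LEN)) {
        while (length $buffer) {
            if (!defined $ctx->{remaining}) {
                # past the last declared file: pass trailing data through untouched
                $f->print($buffer);
                last;
            }
            my $take  = length($buffer) < $ctx->{remaining}
                      ? length($buffer) : $ctx->{remaining};
            my $chunk = substr($buffer, 0, $take, '');
            $ctx->{remaining} -= $take;

            # ... per-file processing of $chunk goes here ...
            $f->print($chunk);

            if ($ctx->{remaining} == 0) {
                # hit the end of the current file; move on to the next one
                $ctx->{remaining} = shift @{ $ctx->{sizes} };   # undef when none left
            }
        }
    }

    $f->ctx($ctx);
    return Apache2::Const::OK;
}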

Thanks,

Brian