storing binary files / memory limit

on 11.10.2008 18:26:53 by Tomas Vondra

Hi,

I'm kind of struggling with storing binary files inside the database.
I'd like to store them in BYTEA columns, but the problem is the files
are quite large (a few megabytes, usually), so the PHP memory_limit is
reached when escaping the data.

It does not matter whether I use the old-fashioned pg_* functions or the new
PDO extension - with BYTEA columns both behave the same. For example,
with a 16MB file (called input.data) and an 8MB memory_limit, the following
fails with an 'Allowed memory size exhausted' error, because all the data
read from the file has to be escaped at once:

------------------------------------------------------------------------

$pdo = new PDO('pgsql: ... ');

// data_table (id INTEGER, data BYTEA)
$stmt = $pdo->prepare('INSERT INTO data_table VALUES (:id, :data)');

$id = 1;
$input = fopen('/tmp/input.data', 'rb');

$stmt->bindParam(':id', $id, PDO::PARAM_INT);
$stmt->bindValue(':data', $input, PDO::PARAM_LOB);

// the whole stream is read and escaped in memory at this point,
// which is what exhausts the 8MB memory_limit
$stmt->execute();

------------------------------------------------------------------------

So far I've found two solutions, but neither meets all my requirements:

1) using LOBs (large objects) - with LOBs, streaming the data is possible,
i.e. you can do this:

------------------------------------------------------------------------
$pdo->beginTransaction();
$input = fopen('/tmp/input.data', 'rb');
$oid = $pdo->pgsqlLOBCreate();           // OID of the newly created LOB
$stream = $pdo->pgsqlLOBOpen($oid, 'w');
stream_copy_to_stream($input, $stream);  // streamed, never the whole file in memory
$pdo->commit();
------------------------------------------------------------------------

This is nice because it saves memory, but a serious disadvantage (for
me) is the lack of referential integrity - you can delete a LOB even
if it is still referenced from some table (i.e. its OID is stored there).
So it's basically the same as storing the data directly in the filesystem
and keeping just a path in the database.

2) using BYTEA columns and doing the 'streaming' on my own - store the
data as usual, but upload it in pieces (say 100kB each). This is
achieved by an initial INSERT (with the first 100kB of data), followed
by a number of UPDATEs with data = data || '... new data ...' appending
the rest. When reading, you have to do the same - read the data piece
by piece (a sketch of this approach follows below).

Yes, it's kind of dirty (especially for large files - it may result in
a lot of queries), but it preserves all the nice BYTEA column features
(no dangling / missing LOBs, etc.).
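
For illustration, a minimal sketch of this approach with the pg_* functions
might look as follows (the data_table schema is the one from the example
above; the connection string and the 100kB chunk size are just assumptions):

------------------------------------------------------------------------
$db = pg_connect('dbname=test');         // assumed connection string
$id = 1;
$chunkSize = 100 * 1024;                 // stays well below memory_limit

$input = fopen('/tmp/input.data', 'rb');

pg_query($db, 'BEGIN');

// initial INSERT with the first chunk
$chunk = pg_escape_bytea($db, fread($input, $chunkSize));
pg_query($db, "INSERT INTO data_table (id, data) VALUES ($id, '$chunk')");

// append the remaining chunks, one UPDATE per chunk
while (!feof($input)) {
    $chunk = pg_escape_bytea($db, fread($input, $chunkSize));
    pg_query($db, "UPDATE data_table SET data = data || '$chunk' WHERE id = $id");
}

pg_query($db, 'COMMIT');
fclose($input);

// reading works the same way, slice by slice (substring() is 1-based)
$offset = 1;
do {
    $res = pg_query($db, "SELECT substring(data FROM $offset FOR $chunkSize) AS part
                            FROM data_table WHERE id = $id");
    $part = pg_unescape_bytea(pg_fetch_result($res, 0, 'part'));
    // ... write $part to the output, e.g. echo $part; ...
    $offset += $chunkSize;
} while (strlen($part) == $chunkSize);
------------------------------------------------------------------------

Each statement handles only one chunk, so the PHP process never has to hold
(or escape) the whole file at once.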

Is there any other way to store large files in PostgreSQL?
These are the most important requirements:

- I do want to store the files inside the database, and I don't want to
store just the paths. It's quite difficult to combine transactional
(database) and non-transactional (filesystem) resources properly.

- I want to preserve as many 'nice' PostgreSQL features as possible
(referential integrity, for example).

- Storing / retrieving the files has to work with quite a small PHP
memory_limit setting (say 8MB). I can't (and don't want to) modify this
option (it's a shared server).

- Optimization is a serious criterion, as is reliability.

Thanks for all your recommendations.

regards
Tomas


Re: storing binary files / memory limit

on 11.10.2008 22:23:39 by Andrew McMillan

On Sat, 2008-10-11 at 18:26 +0200, Tomas Vondra wrote:
>
> Is there any other way to store large files in PostgreSQL?

No, not until there are functions that let you fopen() on the bytea
column.

Also, your "... || more_column" solution will generate large numbers of
dead rows and require frequent vacuuming.


> - Optimization is a serious criterion, as is reliability.

If you're using tables with very large columns, make sure you index on
every other column you're going to access it by. If PostgreSQL has to
resort to full-table scans on this table, and especially with a low
memory constraint, you could easily end up with it doing an on-disk sort
on a copy of the data.

If you *have* to store it in a table column (and it really isn't the
most efficient way of doing it) then create a separate table for it
which is just SERIAL + data.
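
For example, something along these lines (the names are made up):

   file_data (
       id   SERIAL PRIMARY KEY,
       data BYTEA
   )

with the metadata table holding just a reference to it (data_id INTEGER
REFERENCES file_data (id)), so the wide column never gets dragged into
scans or sorts on the metadata.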

Cheers,
Andrew McMillan.
------------------------------------------------------------------------
Andrew @ McMillan .Net .NZ Porirua, New Zealand
http://andrew.mcmillan.net.nz/ Phone: +64(272)DEBIAN
It is often easier to tame a wild idea than to breathe life into a
dull one. -- Alex Osborn

------------------------------------------------------------------------




Re: storing binary files / memory limit

on 11.10.2008 23:41:00 by Tomas Vondra

> On Sat, 2008-10-11 at 18:26 +0200, Tomas Vondra wrote:
>> Is there any other way to store large files in PostgreSQL?
>
> No, not until there are functions that let you fopen() on the bytea
> column.

:-(

> Also, your "... || more_column" solution will generate large numbers of
> dead rows and require frequent vacuuming.

Yes, I'm aware of this and it's one more deficiency of the solution. But
inserting the data is not so frequent (a few inserts a day), especially
when compared to reading (several hundred SELECTs).

>> - Optimization is a serious criterion, as is reliability.
>
> If you're using tables with very large columns, make sure you index on
> every other column you're going to access it by. If PostgreSQL has to
> resort to full-table scans on this table, and especially with a low
> memory constraint, you could easily end up with it doing an on-disk sort
> on a copy of the data.

Well, the schema is quite well optimized, I believe, so this shouldn't be
a problem. All the necessary indexes are in place, so the queries are
executed quite fast (we have comprehensive performance logging of
queries, so I'm quite sure about this).

But I'm not sure what you mean by 'low memory constraint' - the memory
limit I've been talking about is purely a PHP feature, so it only affects
inserting / reading and escaping / unescaping the data.

Anyway, 99% of the queries returning multiple rows do not touch the BYTEA
columns - those columns are referenced only in queries returning a single
row, so there is no problem with sorting / memory occupied by the
PostgreSQL backend process.

> If you *have* to store it in a table column (and it really isn't the
> most efficient way of doing it) then create a separate table for it
> which is just SERIAL + data.

I *want* to store it in a table column, because I'm not able to come up
with a better solution. As I understand it, creating a separate table to
store the binary data is almost the same as using plain BYTEA columns.
We don't have performance problems (thanks to proper indexes and the
TOAST architecture), and it does not solve the problem I described in my
original post (hitting the PHP memory limit).

OK, it would save a little bit of space when using the 'append' strategy
described in my previous post (data = data || new_data), but such a table
usually consists of a small amount of metadata plus a large amount of
binary data. So the space wasted on metadata in dead rows is negligible
compared to the space wasted on the BYTEA data in those same dead rows.

For example, one of our tables is used to store documents, so it has
roughly this structure:

Documents (
    id          SERIAL,
    created     DATE,
    title       VARCHAR(128),
    author      VARCHAR(128),
    description TEXT,   -- short (usually less than 2000 characters)
    data        BYTEA
)

The 'data' column may hold several megabytes, so the metadata
occupies less than 1% of the row.

If you know a better way to store binary data, please describe it here.
The only other way I'm aware of is a LOB - it solves the problem of
inserting the data (by streaming), but it has other disadvantages (no
referential integrity, etc.).

But creating a separate table for the binary data does look interesting
under one condition - the file has to be stored split into chunks. This
basically mimics the LOB storage (the pg_largeobject table). It still
does not give you true streaming, but it solves the problem with the PHP
memory limit and does not create a large number of dead rows.
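
Just to sketch what I mean (the names are made up, and Documents.id is
assumed to be the primary key):

   Document_chunks (
       document_id INTEGER REFERENCES Documents (id) ON DELETE CASCADE,
       chunk_no    INTEGER,
       data        BYTEA,   -- one piece of the file, say 100kB
       PRIMARY KEY (document_id, chunk_no)
   )

Each chunk gets its own INSERT and its own SELECT, so PHP never holds more
than one chunk in memory, appending new chunks never UPDATEs existing rows
(so no dead-row bloat), and the foreign key keeps the chunks tied to their
document.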

regards
Tomas


Re: storing binary files / memory limit

on 12.10.2008 09:25:07 by Andrew McMillan

On Sat, 2008-10-11 at 23:41 +0200, Tomas Vondra wrote:
> >
> > If you're using tables with very large columns, make sure you index on
> > every other column you're going to access it by. If PostgreSQL has to
> > resort to full-table scans on this table, and especially with a low
> > memory constraint, you could easily end up with it doing an on-disk sort
> > on a copy of the data.
>
> But I'm not sure what you mean by 'low memory constraint' - the memory
> limit I've been talking about is purely a PHP feature, so it only affects
> inserting / reading and escaping / unescaping the data.

In this case I'm not referring to PHP memory, but to PostgreSQL memory.
If you're on a memory-constrained shared system, then it's not just PHP
that will be configured for a smaller memory footprint...


> I *want* to store it in a table column

Yes, that's certainly what you seem to be saying. Personally I would
steer clear of storing many megabytes in a BYTEA column on a
memory-constrained system, but you're closer to the application and will
make your own decision.


> If you know a better way to store binary data, please describe it here.
> The only other way I'm aware of is LOB - it solves the problem of
> inserting data (by streaming), but has other disadvantages (no
> referential integrity, etc.)

Yes, your trade-off is essentially efficiency vs. referential integrity.
This is a common trade-off, and if you have tight control over how rows
will be inserted/deleted from your table then referential integrity is
merely a nice-to-have.

If people will be creating / deleting these things all over the
application, without the benefit of an API to do so, then referential
integrity obviously becomes much more important.

Cheers,
Andrew.
------------------------------------------------------------------------
andrew (AT) morphoss (DOT) com +64(272)DEBIAN
Writing is turning one's worst moments into money.
-- J.P. Donleavy

------------------------------------------------------------------------


