MIME vs. ZIP for file archives?

on 10.11.2005 19:31:14 by Jon Noring

I lead the OpenReader Publication Working Group (ORPWG) which is
developing an open-standards digital publication format. For the
technically-oriented specifics about OpenReader, refer to:

http://www.openreader.org/index.php?option=com_content&task=category&sectionid=34&id=53&Itemid=157

I have a sort of open-ended inquiry (and to some maybe an odd one)
regarding stand-alone, open standards file archive formats,
specifically comparing the use of ZIP versus a MIME-based approach
(to-be-defined) to create stand-alone archive files of file sets.

The issue is as follows:

We need to specify a way to encapsulate/wrap/contain all the files
associated with an OpenReader Publication (XML, CSS, images, etc.)
into a single file for distribution purposes.

Obviously, ZIP appears to be the way to accomplish this (if it's good
enough for Sun and Microsoft -- jar, OpenOffice/OpenDocument, Metro,
OfficeXML, etc. -- it should be good enough for us.) However, a few
are noting that ZIP is not a *true* open standard -- to some there are
still ambiguities about the legal status of the ZIP format which
might come back to bite us at a future time.

So, it's been proposed that instead we build our archive format based
on MIME specs (as defined by IETF RFCs 2045-2049) and other truly open
specifications (such as for file compression) as defined by IETF, W3C,
etc.

Apparently few have used a MIME-based approach to create stand-alone,
portable file archives (e.g., refer to the Wikipedia article on file
archive and compression formats.) The MIME specs are primarily used
for email transport and the like. There is one MIME-based file archive
use I've found: Ebook Technologies Inc. today uses something like what
was proposed to, but not approved by, the Open eBook Forum back in 1999.
Here's a link to that proposed 1999 specification:

http://web.archive.org/web/20000926004335/www.nuvomedia.com/oebff/OEBFile1DRAFT001.htm

(What is proposed for OpenReader would be similar to that, with
possible additions in the header for a Content-Length field to aid in
retrieval of files, and custom fields for assigning info like
checksum, a digital signature, etc. Of course all binary objects must
be embedded with a content-type of 'binary', not represented in
base64.)
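To make the parenthetical above concrete, here is a minimal Python sketch of one resource stored as text headers followed by raw bytes. It is an illustration under assumptions, not the OpenReader or OEB layout: multipart boundaries are left out, resources are located purely by Content-Length, raw bytes are marked with Content-Transfer-Encoding: binary (the MIME way to say "not base64"), and the X-Checksum-SHA256 field name and both function names are hypothetical.

```python
import hashlib


def write_part(name: str, media_type: str, payload: bytes) -> bytes:
    """Serialize one resource: ASCII headers, a blank line, then the raw bytes."""
    headers = [
        f"Content-Type: {media_type}",
        f"Content-Location: {name}",
        "Content-Transfer-Encoding: binary",
        f"Content-Length: {len(payload)}",
        # Hypothetical custom field carrying a checksum of the payload.
        f"X-Checksum-SHA256: {hashlib.sha256(payload).hexdigest()}",
    ]
    return ("\r\n".join(headers) + "\r\n\r\n").encode("ascii") + payload


def read_parts(blob: bytes):
    """Yield (header dict, payload) pairs, trusting Content-Length for sizes."""
    pos = 0
    while pos < len(blob):
        end = blob.index(b"\r\n\r\n", pos)
        lines = blob[pos:end].decode("ascii").split("\r\n")
        fields = dict(line.split(": ", 1) for line in lines)
        start = end + 4
        length = int(fields["Content-Length"])
        yield fields, blob[start:start + length]
        pos = start + length
```

Because the reader skips each payload by its declared length, a payload may contain any byte sequence, including one that looks like a header terminator.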

What I'd like to better understand are the relative *technical*
advantages and disadvantages between ZIP and a MIME-based approach for
file archives -- I'm especially interested in the technical comparison
relevant to user agent processing of such file archives especially for
rendering/presentation applications. Of course, tool set support is
one technical issue, but at the moment I'd rather focus on the
technical merits relating to user agents unwrapping and using the
content resources within the archive file. Note that OpenReader
Publications may be used on somewhat limited resource hardware (such
as handheld devices), so the resources required to access files within
the file archive is an important consideration.

For those interested, the public discussion of ORPWG is found at:

http://groups.yahoo.com/group/openreader-format/

Feel free to subscribe and contribute to the discussion.

Thanks!

Jon Noring

Re: MIME vs. ZIP for file archives?

on 10.11.2005 21:56:08 by Some Fred

Hi Jon,

What's the advantage of a new "openreader" format over e.g. PDF? 90% of the
world already uses PDF, which is also an open specification.

But to come back to your original question; I think ZIP is really a
*compression* format. It's arguable if this is what you need; to extract
individual files from a ZIP archive takes more time than extracting it from
an uncompressed file format.

Usually, the files that are needed by an ebook are already compressed (think
of e.g. jpeg pictures that are embedded in pages). The additional ZIP
compression doesn't bring much. Furthermore, if users want to have the
smallest file possible, they could always compress the openreader format
further using their favourite compression tool.

Especially if an openreader file would be edited a lot, it is favourable to have
the individual files in uncompressed form. This will speed up dealing with
the archive, since it will involve lots and lots of file updating
transactions.

I would just propose a simple and reliable new archive format, which
consists of:

- Small header, containing a signature (magic number) that makes it easily
recognisable. The header might also contain some meta-data like author,
copyright, etc.
- Index following the header. The index will contain a list of files in the
archive, with their position in the stream, size, and file name (make sure
to use 64-bit offsets, so the archive is suitable for file sizes > 4 GB)
- Files in the archive follow the index sequentially

Such a simple format has some advantages:
- No patent issues, since no patented technology is used. All plain and
simple
- Everyone with some experience can write a reader for it in no-time
- Suitable for partial (streaming) retrieval, since the index follows the
header directly, and one could optimize the content such that the files
needed for the first page follow as the first few files in the archive.
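The layout described above (signature, index, then the files in sequence) could be sketched roughly as follows in Python. Everything concrete here is an assumption made for illustration: the 8-byte signature value, the little-endian field widths, and the function names. The metadata fields mentioned for the header are left out.

```python
import io
import struct

MAGIC = b"ORDEMO01"  # hypothetical 8-byte signature


def pack_archive(files: dict[str, bytes]) -> bytes:
    """Signature, u32 file count, index entries, then the file bodies."""
    entries = [(name.encode("utf-8"), data) for name, data in files.items()]
    # Each index entry: u16 name length, name bytes, u64 offset, u64 size.
    index_len = sum(2 + len(name) + 16 for name, _ in entries)
    offset = len(MAGIC) + 4 + index_len  # absolute offset of the first body
    index, body = bytearray(), bytearray()
    for name, data in entries:
        index += struct.pack("<H", len(name)) + name
        index += struct.pack("<QQ", offset, len(data))
        body += data
        offset += len(data)
    return MAGIC + struct.pack("<I", len(entries)) + bytes(index) + bytes(body)


def read_index(stream) -> dict[str, tuple[int, int]]:
    """Read only the header and index; file bodies are not touched."""
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("signature not recognised")
    (count,) = struct.unpack("<I", stream.read(4))
    index = {}
    for _ in range(count):
        (name_len,) = struct.unpack("<H", stream.read(2))
        name = stream.read(name_len).decode("utf-8")
        index[name] = struct.unpack("<QQ", stream.read(16))
    return index


def read_file(stream, index, name: str) -> bytes:
    """Seek straight to one file using its (offset, size) index entry."""
    offset, size = index[name]
    stream.seek(offset)
    return stream.read(size)
```

Since the index sits directly after the signature, a reader that receives the archive as a stream knows every offset before the first file body arrives, which is the streaming property mentioned above.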

Nils Haeck
www.simdesign.nl

Re: MIME vs. ZIP for file archives?

on 10.11.2005 22:11:49 by DFS

Jon Noring wrote:

> I lead the OpenReader Publication Working Group (ORPWG) which is
> developing an open-standards digital publication format. For the
> technically-oriented specifics about OpenReader, refer to:

> The issue is as follows:

> We need to specify a way to encapsulate/wrap/contain all the files
> associated with an OpenReader Publication (XML, CSS, images, etc.)
> into a single file for distribution purposes.

I would stay away from MIME. MIME is complex and extremely easy to
misinterpret. There are endless security holes caused by malformed
MIME e-mails; we don't want to port all those problems to a document
format.

Why not use an open, obviously-unencumbered archiving format like
"tar" in conjunction with an open, obviously-unencumbered compression
format like "bzip2"?

I believe there are implementations for both tar and bzip2 on just about
any platform you'd care to name.
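As a rough illustration of the tar plus bzip2 suggestion, Python's standard tarfile module can write and read such an archive in a few lines. The function names are made up for this sketch.

```python
import io
import tarfile


def make_tar_bz2(files: dict[str, bytes]) -> bytes:
    """Pack in-memory files into a tar stream compressed as a whole with bzip2."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:bz2") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_member(blob: bytes, wanted: str) -> bytes:
    """Return one member's contents from a tar.bz2 blob."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:bz2") as tar:
        return tar.extractfile(wanted).read()
```

Note that here bzip2 wraps the whole tar stream, so reaching a member near the end means decompressing everything in front of it.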

> What I'd like to better understand are the relative *technical*
> advantages and disadvantages between ZIP and a MIME-based approach for
> file archives --

In my opinion, MIME has a huge technical disadvantage in that it's
complicated and sometimes ambiguous, and was designed as a hack to
let non-plain-text material travel safely over plain-ASCII-text SMTP
connections. It wasn't really designed for the job. tar, however,
is perfect for distributing a bunch of files in a single blob, because
that was exactly what it was designed to do.

Regards,

David.

Re: MIME vs. ZIP for file archives?

on 10.11.2005 23:04:08 by cr88192

"Jon Noring" wrote in message
news:2947n1ldji7m4ths6js9ha8gd9jcsdghg4@4ax.com...

>
> Apparently few have used a MIME-based approach to create stand-alone,
> portable file archives (e.g., refer to the Wikipedia article on file
> archive and compression formats.) The MIME specs are primarily used
> for email transport and the like. There is one MIME-based file archive
> use I've found: Ebook Technologies Inc. today uses something like was
> proposed to, but not approved by, the Open eBook Forum back in 1999.
> Here's a link to that proposed 1999 specification:
>
>
> http://web.archive.org/web/20000926004335/www.nuvomedia.com/oebff/OEBFile1DRAFT001.htm
>
> (What is proposed for OpenReader would be similar to that, with
> possible additions in the header for a Content-Length field to aid in
> retrieval of files, and custom fields for assigning info like
> checksum, a digital signature, etc. Of course all binary objects must
> be embedded with a content-type of 'binary', not represented in
> base64.)
>
I once did an archive format sort of like this. it was similar, but differed
in more subtle ways (it was not based on mime, rather, more on the mime-like
parts of http).

I could have made it a little more "standard", but I didn't care that much.
the format is simple enough that a person could likely just look at the file
contents and get the idea...

it had a content-length as well (actually, content length was pretty much
required for everything). a possible exception to this though was things
which were "chunked", which adopted the convention used in http 1.1 (a
separate length for each chunk, followed by the data).
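For readers who have not seen it, the HTTP/1.1 chunked convention referred to above looks roughly like this in Python: each chunk is a hexadecimal length line, the data, and a CRLF, with a zero-length chunk marking the end. This is a simplified sketch of the HTTP convention, not of the poster's archive format; chunk extensions and trailers are ignored, and the function names are invented.

```python
def encode_chunked(data: bytes, chunk_size: int = 4096) -> bytes:
    """Split data into length-prefixed chunks, ending with a zero-length chunk."""
    out = bytearray()
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        out += f"{len(chunk):X}\r\n".encode("ascii") + chunk + b"\r\n"
    out += b"0\r\n\r\n"
    return bytes(out)


def decode_chunked(blob: bytes) -> bytes:
    """Reassemble the payload from a chunked byte string."""
    out = bytearray()
    pos = 0
    while True:
        eol = blob.index(b"\r\n", pos)
        size = int(blob[pos:eol], 16)
        if size == 0:
            return bytes(out)
        start = eol + 2
        out += blob[start:start + size]
        pos = start + size + 2  # skip the CRLF that ends the chunk
```

The attraction is that a writer does not need to know the total length up front; the cost is that a reader cannot skip a chunked resource without walking every chunk.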

yeah, there were a few potentially frivolous features though (chunking and
multiplexing files, from what I remember), but oh well, these could be
omitted in the name of a simpler spec.

>
> What I'd like to better understand are the relative *technical*
> advantages and disadvantages between ZIP and a MIME-based approach for
> file archives -- I'm especially interested in the technical comparison
> relevant to user agent processing of such file archives especially for
> rendering/presentation applications. Of course, tool set support is
> one technical issue, but at the moment I'd rather focus on the
> technical merits relating to user agents unwrapping and using the
> content resources within the archive file. Note that OpenReader
> Publications may be used on somewhat limited resource hardware (such
> as handheld devices), so the resources required to access files within
> the file archive is an important consideration.
>
dunno.

on one hand, a format like this would be fairly easily extensible.
since the numeric fields are in ascii, you can have however much range is
desired (the bigger deal is the implementation, which may have a certain
upper limit, but not so much the format).

also, the format would be fairly simple to figure out by quick examination
(aka: "human readable"), which may or may not matter (not such a big deal vs
zip given how common zip is).

on another hand, zip is a lot more common, which may or may not matter.

a vague concern had been header size, but now I figure header size is
irrelevant probably vs the size of the contained files (and may still be
less than the size of zip headers in many cases).

an idea that had been considered before was that the headers could be compressed
as well (probably with a specialized version of lz77). now, however, I think
that the idea is not worth it in terms of the likely increase in
implementation complexity.

a possible issue: imo, zip is an ugly mess of a file format (ok, it is not
the only one, but many formats are a lot cleaner).

as for resources/..., not much significant difference should exist imo. it
may be slightly slower to parse textual headers vs. binary ones, but anymore
even for embedded devices this shouldn't matter much (a smaller number of
headers are processed not-so-often).

a bigger deal might be the lack of a central directory, maybe.
if it is necessary to randomly access the files, one would either need to
include one, or rebuild one via a pass over the archive headers (should be
both trivial and not unreasonably expensive given modern io devices). I
suspect mostly included central directories were needed in the days of
floppies to speed access, but are unlikely to be necessary on anything much
faster (pretty much everything).

if needed though, such a directory could easily enough be put at the
start/end of an archive though (putting it at the start probably makes the
most sense, since one can read just the first few headers and know the
general archive contents, if at the end, why bother? it can be rebuilt
probably at no extra cost directly from the headers it encounters on the
way...).
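The "rebuild a directory via a pass over the archive headers" idea can be sketched in Python as below. The header layout is an assumption for illustration only (text headers with Content-Location and Content-Length, a blank line, then the raw payload); make_part and scan_directory are hypothetical names.

```python
import io


def make_part(name: str, payload: bytes) -> bytes:
    """Build one resource in the assumed layout: headers, blank line, payload."""
    head = f"Content-Location: {name}\r\nContent-Length: {len(payload)}\r\n\r\n"
    return head.encode("ascii") + payload


def scan_directory(stream) -> dict[str, tuple[int, int]]:
    """Map each resource name to (payload offset, payload length)."""
    directory = {}
    while True:
        line = stream.readline()
        if not line:
            return directory  # clean end of archive
        fields = {}
        while line not in (b"\r\n", b""):
            key, _, value = line.decode("ascii").rstrip("\r\n").partition(": ")
            fields[key.lower()] = value
            line = stream.readline()
        length = int(fields["content-length"])
        directory[fields["content-location"]] = (stream.tell(), length)
        stream.seek(length, io.SEEK_CUR)  # skip the payload without reading it
```

The scan reads only header bytes and seeks past each payload, so its cost grows with the number of resources rather than with the archive size, provided the storage supports cheap seeks.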

I don't know.

I don't have much specific to say on the matter, so most of this is just my
opinions...

> For those interested, the public discussion of ORPWG is found at:
>
> http://groups.yahoo.com/group/openreader-format/
>
> Feel free to subscribe and contribute to the discussion.
>
> Thanks!
>
> Jon Noring
>

Re: MIME vs. ZIP for file archives?

on 11.11.2005 04:21:46 by Jon Noring

Nils wrote:

> What's the advantage of a new "openreader" format over e.g. PDF? 90%
> of the world already uses PDF, which is also an open specification.

Good question, but not exactly a discussion we should pursue in these
newsgroups as it is off-topic. The "technical" section at the
OpenReader.org site goes into the rationale:

http://www.openreader.org/index.php?option=com_content&task=category&sectionid=34&id=53&Itemid=157

Now to address some of your comments:

> But to come back to your original question; I think ZIP is really a
> *compression* format. It's arguable if this is what you need; to
> extract individual files from a ZIP archive takes more time than
> extracting it from an uncompressed file format.

Well, one important requirement is that all the content resources
(especially XML documents) be compressed. Of course, the supported
image types (PNG, JPG) are already compressed binaries.

Also note that if a DRM layer is optionally applied (not my
preference, but we are building this format for the publishing
community) then certain content documents will be encrypted, so there
is processing needed anyway.

> Usually, the files that are needed by an ebook are already
> compressed (think of e.g. jpeg pictures that are embedded in pages).
> The additional ZIP compression doesn't bring much.

Understood.

In ZIP, I believe it is possible to turn off compression for
particular files. I need to relook at the ZIP format, but I believe
compression is file-by-file, and can be turned off for particular
files.
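ZIP does record the compression method per entry, and Python's standard zipfile module exposes that choice directly. The sketch below stores image files uncompressed and deflates everything else; the extension test and the function name are assumptions made for the example.

```python
import io
import zipfile


def build_zip(resources: dict[str, bytes]) -> bytes:
    """Write a ZIP in memory, choosing the compression method per file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in resources.items():
            # Assumption: these extensions mark already-compressed content.
            already_compressed = name.lower().endswith((".png", ".jpg", ".jpeg"))
            method = zipfile.ZIP_STORED if already_compressed else zipfile.ZIP_DEFLATED
            zf.writestr(name, data, compress_type=method)
    return buf.getvalue()
```

Reading the archive back with zipfile.ZipFile and calling getinfo(name).compress_type shows which method each entry ended up with.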

> Furthermore, if users want to have the smallest file possible, they
> could always compress the openreader format further using their
> favourite compression tool.

Well, for distribution and retail reasons, and archival reasons, the
content must be compressed as it is encapsulated.

> Esp if an openreader file would be edited a lot, it is favourable to
> have the individual files in uncompressed form. This will speed up
> dealing with the archive, since it will involve lots and lots of
> file updating transactions.

OpenReader is not intended to be a word-processing format. It is an
end-user format. But since it is an open format, end-users may be
able to edit the content. That is why digital signatures will be
optionally supported, to aid in digital integrity for document types
where the author/publisher deem it is important.

> I would just propose a simple and reliable new archive format, which
> consists of:
>
> [snip]
>
> - Index following the header. The index will contain a list of files
> in the archive, with their position in the stream, size, and file
> name (make sure to use 64bit offsets, so the archive is suitable
> for file sizes > 4Gb)

Yes, we definitely need an index. ZIP offers a built-in index, while
with an MP-MIME approach we need to add an index with pointers to the
start/end of each resource, or at least to point to the header for
each resource.
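For comparison, this is roughly what ZIP's built-in index (the central directory) gives a reader, shown with Python's standard zipfile module. The helper names and the demo file names are invented for the sketch.

```python
import io
import zipfile


def make_demo_zip() -> bytes:
    """Build a tiny two-entry ZIP in memory for demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("content/ch1.xml", b"<chapter>one</chapter>")
        zf.writestr("content/ch2.xml", b"<chapter>two</chapter>")
    return buf.getvalue()


def describe(archive: bytes) -> list[tuple[str, int, int]]:
    """List (name, local header offset, uncompressed size) from the directory."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return [(i.filename, i.header_offset, i.file_size) for i in zf.infolist()]


def read_one(archive: bytes, name: str) -> bytes:
    """Decompress a single entry; other entries are not decompressed."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read(zf.getinfo(name))  # getinfo raises KeyError if absent
```

The directory lives at the end of the file, so a reader needs seekable storage (or the whole file) before it can look anything up.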

> Such a simple format has some advantages:

It is enticing to "roll our own", since then we can tailor it exactly
as we wish and not worry about patent issues (on the other hand,
with an established format like ZIP we may have more assurance that
it is patent-safe.)

On the downside with a "roll our own" solution we have to consider
available toolset support, both now and in the distant future. With
a "roll our own" we have to build the first encapsulator.

Anyway, much appreciation for your thoughts. We are collecting the
suggestions of many people, and each has provided valuable
perspectives and insights.

Jon

Re: MIME vs. ZIP for file archives?

on 11.11.2005 04:36:46 by Jon Noring

David F. Skoll wrote:
> Jon Noring wrote:

>> The issue is as follows:
>>
>> We need to specify a way to encapsulate/wrap/contain all the files
>> associated with an OpenReader Publication (XML, CSS, images, etc.)
>> into a single file for distribution purposes.

> I would stay away from MIME. MIME is complex and extremely easy to
> misinterpret. There are endless security holes caused by malformed
> MIME e-mails; we don't want to port all those problems to a document
> format.

Well, if we employ Multi-Part MIME (MP-MIME) for building a file
archive format, we would strictly define its use, so I'm not too
concerned with this.

But definitely something to keep in mind should we choose MP-MIME
as the base for our document format -- keep it simple and use only
the bare minimum of the RFC as needed.

> Why not use an open, obviously-unencumbered archiving format like
> "tar" in conjunction with an open, obviously-unencumbered
> compression format like "bzip2"?

I recall in our working group talking about 'tar', but most rejected
it because, from what I understand, it is not indexed. It is a
streaming format. We also have to consider how to add a DRM layer
in it. We have to consider some reading systems which because of
hardware resource limitations (and DRM when applied) cannot
pre-unpack everything -- thus files have to be randomly accessed and
opened up only when needed.

Of course, if I'm wrong in some of these perceptions about tar, let me
know!

(Btw, compression will be applied at the file level, so if we used
tar, the files would be compressed first, then tar'd. Hmmm, as I think
about it, I suppose an index could be appended at the end of the
tar, with pointers to the start and end bytes of each resource.)
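The idea in the parenthetical above (compress each file first, tar the results, keep an index of byte positions) might look like this in Python. It is a sketch under assumptions: zlib stands in for whatever per-file compression would be chosen, the function names are invented, and the index is returned separately rather than appended, since where and how to store it is exactly the open design question.

```python
import io
import tarfile
import zlib


def build_tar_with_index(files: dict[str, bytes]):
    """Return (tar bytes, index) where index maps name -> [data offset, size]."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:  # plain, uncompressed tar
        for name, data in files.items():
            packed = zlib.compress(data)  # per-file compression comes first
            info = tarfile.TarInfo(name=name)
            info.size = len(packed)
            tar.addfile(info, io.BytesIO(packed))
    blob = buf.getvalue()
    # Re-read the finished tar once to learn where each member's data starts.
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:") as tar:
        index = {m.name: [m.offset_data, m.size] for m in tar}
    return blob, index


def read_member(blob: bytes, index, name: str) -> bytes:
    """Random access: slice one member out by offset and decompress only it."""
    start, size = index[name]
    return zlib.decompress(blob[start:start + size])
```

With such an index a reading system never has to walk the tar headers; it jumps to one member and decompresses only that member.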

>> What I'd like to better understand are the relative *technical*
>> advantages and disadvantages between ZIP and a MIME-based approach
>> for file archives --

> In my opinion, MIME has a huge technical disadvantage in that it's
> complicated and sometimes ambiguous, and was designed as a hack to
> let non-plain-text material travel safely over plain-ASCII-text SMTP
> connections. It wasn't really designed for the job. tar, however,
> is perfect for distributing a bunch of files in a single blob,
> because that was exactly what it was designed to do.

Thanks for your input. As I noted in the reply to Nils's message, I
appreciate all the replies since they provide different perspectives
to look at the issue.

Jon

Re: MIME vs. ZIP for file archives?

on 11.11.2005 15:29:09 by DFS

Jon Noring wrote:

> I recall in our working group talking about 'tar', but most rejected
> it because, from what I understand, it is not indexed.

Ah, well, neither is MIME, so I wasn't aware of that requirement.

> We also have to consider how to add a DRM layer in it.

Well, right there, you've lost my interest. Please see my
article "Digital Media and the Disappearance of History" at
http://www.monitor.ca/monitor/issues/vol10iss4/feature5.html

I strongly urge everyone to stay away from helping anyone define
a standard that has anything to do with DRM.

Regards,

David.

Re: MIME vs. ZIP for file archives?

on 12.11.2005 06:35:19 by Tim Smith

In article ,
"Nils" wrote:
> What's the advantage of a new "openreader" format over e.g. PDF? 90% of the
> world already uses PDF, which is also an open specification.

OpenReader seems to be aimed at ebooks and such, and so a given document
needs to display well on a variety of different devices, ranging from
PDAs with small displays to desktop computers with large displays. I
don't think PDF works well for such a large range of display sizes.

--
--Tim Smith

Re: MIME vs. ZIP for file archives?

on 12.11.2005 16:34:41 by Jon Noring

Tim Smith wrote:
> Nils wrote:

>> What's the advantage of a new "openreader" format over e.g. PDF? 90% of the
>> world already uses PDF, which is also an open specification.

>OpenReader seems to be aimed at ebooks and such, and so a given document
>needs to display well on a variety of different devices, ranging from
>PDAs with small displays to desktop computers with large displays. I
>don't think PDF works well for such a large range of display sizes.

Exactly. OpenReader is designed to be "typeset" at the end-user's
side, not the publisher's side. However, publishers/authors may
include one or more CSS style sheet sets which govern the end-user
rendering, just like is done today with web pages. Depending upon the
user agent, end-users will have significant freedom to tweak the
default styling settings. We are even planning font embedding similar
to what PDF does. And we plan SVG and MathML support (plus eventual
support for XLink, XInclude, TEI, DocBook, NewsML, etc.)

Jon