[RFC] File::SplitStream - iterate over files >2GB when large file support unavailable


am 21.08.2005 00:37:24 von AJ

I am implementing a module and seek the community's input about its
suitability for placement on CPAN. File::SplitStream (I am open to
better names) is designed to be used when an OS supports large files,
but the Perl interpreter does not have large file support enabled
(specifically, Red Hat Linux did this for a while). It uses the Unix
split command to split the large file into <2GB chunks, then generates
an iterator to allow the calling routine to transparently read the file
chunks as if they were still one large file.

Below is a draft of the documentation for this module. I searched CPAN
and did not find anything similar. Your input and suggestions regarding
structure, functionality, and documentation improvements will be greatly
appreciated!


NAME
File::SplitStream - iterate over multiple files as if they were one
file. Optionally split a large file into smaller files before
iteration.

SYNOPSIS
use File::SplitStream;

# split a file into parts
my $filestream = new File::SplitStream;
$filestream->file('/path/to/inputfile');
$filestream->lines(19000000);
$filestream->genFileStream() || die("cannot generate filestream: $!");

--OR--

# or use a group of pre-existing files
my @inputfiles = qw(file01.txt file02.txt file03.txt);
my $filestream = new File::SplitStream;
$filestream->files(@inputfiles);

# regardless of how you set things up, you can
# now iterate over the files as if they're one file
while (my $line = $filestream->nextLine()->() ) {
...do stuff on each line of all of the files...
}

--OR--

# you can use a function call rather than instantiating an object
use File::SplitStream qw(genFileStream);

my $filestream = genFileStream('/path/to/inputfile', 19000000);
while ( my $line = $filestream->() ) {
...do stuff on each line of all of the files...
}

DESCRIPTION
File::SplitStream can be used to split a large text file (or
optionally,
to use a list of pre-existing files) and iterate over the files as if
they were a single file. This class is designed to help work with large
files (>2GB) when large file support is unavailable. Perhaps the
programmer does not have permission to recompile the available Perl
interpreter, or simply does not have the time. Regardless of the
reason, this module can help fill the gap when large file support is
unavailable.

In order for File::SplitStream to work properly, the Unix split and cat
commands should be in your $PATH. The split command is used to split up
the large file into more manageable chunks, while the cat command is
used to buffer input of the files. The number of lines in each of your
split files will depend on how much data is in each line. Shorter
lines will allow you to put many more lines into a file before it
crosses the 2GB barrier. Longer lines will require you to decrease the
lines/file value.
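As a back-of-the-envelope check, a safe lines-per-chunk value can be
estimated from the average line length. The helper below is a
hypothetical sketch for illustration only; it is not part of
File::SplitStream's API:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rough sizing: given an average line length in bytes, how many lines
# fit in one chunk while staying safely under the 2GB barrier?
# (Illustrative helper; the names here are not File::SplitStream's.)
sub lines_per_chunk {
    my ($avg_line_bytes, $safety) = @_;
    $safety ||= 0.9;                        # leave 10% headroom by default
    my $limit = 2 * 1024 * 1024 * 1024;     # the 2GB limit
    return int( ($limit * $safety) / $avg_line_bytes );
}

# For lines averaging 100 bytes:
my $n = lines_per_chunk(100);
print "$n lines per chunk\n";               # prints "19327352 lines per chunk"
```

A value like this would then be passed to lines() (or as the second
argument to genFileStream()).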

ACCESSOR METHODS
These accessor methods can be used directly or set by passing them to
the new() method.

(set to split a single file apart and iterate over the pieces)
file the file to split apart
lines maximum number of lines in each file chunk

(set to use a pre-existing set of files as a single file)
files reference to a list of files to iterate over

OTHER METHODS
new(%options)
Use the new() method to create a new File::SplitStream object. You will
need to do this to use the module in an object-oriented way. You can
pass options to the new() method to set the file(), lines(), and
files() values.

Examples:

# new File::SplitStream with no options
my $fss = new File::SplitStream;

# new File::SplitStream with options to split a single file
my $fss = new File::SplitStream(FILE => '/path/to/file', LINES => 1000000);

# new File::SplitStream with option to use a list of pre-existing files
my $fss = new File::SplitStream(FILES => ['/path/to/file1',
                                          '/path/to/file2',
                                          '/path/to/file3'
                                         ] );

init(%options)
If options are passed to new(), init() is invoked by new() to set the
appropriate object attributes given the options. Normally init() is
only invoked by new(), but it can be used to (re)set your
File::SplitStream object's attributes if you want.

Example:

my $fss = new File::SplitStream;
$fss->init(FILE => '/path/to/file', LINES => 15000000);

genFileStream($filepath, $number_of_lines)
The workhorse of File::SplitStream is the genFileStream()
method/function. It splits the large data file (if necessary) using the
Unix split command, then generates an iterator function to return each
line of the split files in order, transparently opening and closing the
split files as necessary. If you have specified a list of pre-existing
files, the iterator will open each in the order you gave.

In an object-oriented context, genFileStream() will take the values of
the file() and lines() accessors (or the files() accessor in the case
of pre-existing files) as its parameters. If you explicitly pass
parameters to genFileStream(), they will override the object's
attributes. In a procedural context, you must pass these parameters
explicitly.

In object-oriented style, genFileStream() will assign the iterator
function to the nextLine() accessor and return 1 to the calling
routine; this way the calling routine does not need yet another
variable to hold the "filestream." In procedural style, the iterator
will be returned to the calling routine.

When the data in all of the files has been exhausted, the iterator
function will return undef. If there is a problem generating the
iterator (usually a problem with the split), or a problem is
encountered while the split files are being read, the program will
die() with the error written to STDERR.

Examples:

# OO way
use File::SplitStream;
my $fss = new File::SplitStream;
$fss->file('/data/largefile.dat');
$fss->lines(1000000);
$fss->genFileStream();
while ( my $line = $fss->nextLine()->() ) {
...process the file...
}

# procedural
use File::SplitStream qw(genFileStream);
my $stream = genFileStream('/data/largefile.dat',1000000);
while ( my $line = $stream->() ) {
...process the file...
}
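For readers curious how such an iterator can be built, here is a
minimal closure-based sketch of the same idea. This is illustrative
only and is not File::SplitStream's actual implementation; the
make_file_iterator() name and the example filenames are made up:

```perl
use strict;
use warnings;

# Illustrative sketch: a closure that reads a list of files in order,
# returning one line per call and undef once every file is exhausted.
# (Not File::SplitStream's actual code.)
sub make_file_iterator {
    my @files = @_;
    my $fh;
    return sub {
        while (1) {
            unless ($fh) {
                return undef unless @files;   # every file has been read
                my $next = shift @files;
                open $fh, '<', $next
                    or die "Cannot open $next: $!";
            }
            my $line = <$fh>;
            return $line if defined $line;    # hand back the next line
            close $fh;
            undef $fh;                        # advance to the next file
        }
    };
}

# Usage (filenames are hypothetical split-output chunks):
# my $iter = make_file_iterator('part.aa', 'part.ab');
# while ( defined( my $line = $iter->() ) ) { ... }
```

Because each chunk is opened only when the previous one is exhausted,
no single filehandle ever has to address more than one sub-2GB file.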

EXPORT
None by default. You can import genFileStream() into your namespace if
you wish to use it in procedural style.

Re: [RFC] File::SplitStream - iterate over files >2GB when large file support unavailable

am 21.08.2005 10:49:19 von Brian McCauley

AJ wrote:

> I am implementing a module and seek the community's input about its
> suitability for placement on CPAN. File::SplitStream (I am open to
> better names) is designed to be used when an OS supports large files,
> but the Perl interpreter does not have large file support enabled
> (specifically, Red Hat Linux did this for awhile). It uses the Unix
> split command to split the large file into <2GB chunks, the generates an
> iterator to allow the calling routine to transparently read the file
> chunks as if they are still one large file.

This seems rather complex and involves very big temporary files.

Is there some problem with just doing...

open my $fh, '-|', 'cat', $huge_file or die "Cannot read $huge_file: $!";

Re: [RFC] File::SplitStream - iterate over files >2GB when large file support unavailable

am 21.08.2005 11:09:57 von AJ

Brian McCauley wrote:
> AJ wrote:
>
>> I am implementing a module and seek the community's input about its
>> suitability for placement on CPAN. File::SplitStream (I am open to
>> better names) is designed to be used when an OS supports large files,
>> but the Perl interpreter does not have large file support enabled
>> (specifically, Red Hat Linux did this for awhile). It uses the Unix
>> split command to split the large file into <2GB chunks, the generates
>> an iterator to allow the calling routine to transparently read the
>> file chunks as if they are still one large file.
>
>
> This seems rather complex and involves very big temporary files.
>
> Is there some problem with just doing...
>
> open my $fh, '-|', 'cat', $huge_file or die "Cannot read $huge_file: $!";
>
Yes. In my case, it didn't work. I received a 'File too large' error
after the input pipe passed the 2GB limit. Thus, this solution.
Obviously, it's not very pretty, but it does work.

Re: [RFC] File::SplitStream - iterate over files >2GB when large file support unavailable

am 21.08.2005 21:03:14 von xhoster

AJ wrote:
> I am implementing a module and seek the community's input about its
> suitability for placement on CPAN. File::SplitStream (I am open to
> better names) is designed to be used when an OS supports large files,
> but the Perl interpreter does not have large file support enabled
> (specifically, Red Hat Linux did this for awhile). It uses the Unix
> split command to split the large file into <2GB chunks, the generates an
> iterator to allow the calling routine to transparently read the file
> chunks as if they are still one large file.

I don't understand the need for this. It doesn't appear to implement
"seek" and "tell", only streaming. It has been a while since I've used
a small-file perl, but I never knew there was a problem in streaming large
files in the first place. I thought it was only seek and tell (and
truncate, and maybe other non-streaming things) which elicited the problem.

Xho


Re: [RFC] File::SplitStream - iterate over files >2GB when large file support unavailable

am 25.08.2005 10:57:01 von AJ

xhoster@gmail.com wrote:
> AJ wrote:
>
>>I am implementing a module and seek the community's input about its
>>suitability for placement on CPAN. File::SplitStream (I am open to
>>better names) is designed to be used when an OS supports large files,
>>but the Perl interpreter does not have large file support enabled
>>(specifically, Red Hat Linux did this for awhile). It uses the Unix
>>split command to split the large file into <2GB chunks, the generates an
>>iterator to allow the calling routine to transparently read the file
>>chunks as if they are still one large file.
>
>
> I don't understand the need for this. It doesn't appear to implement
> "seek" and "tell", only streaming. It has been a while since I've used
> a small-file perl, but I never knew there was a problem in streaming large
> files in the first place. I thought it was only seek and tell (and
> truncate, and maybe other non-streaming things) which elicited the problem.
>
> Xho
>

I can tell you that seek() and tell() are not the only things that don't
work when trying to access a large file without large file support
enabled. In my original case, merely trying to open the file in
question (~20GB size) yielded a "File too large" error immediately.
Rewriting my code to cat the file through a pipe worked until I read
past the 2GB threshold; at that point, the "File too large" error
resurfaced. This system is using an older OS whose perl was not
compiled with large file support enabled and, if given the chance, I
would have upgraded the Perl (and the OS, for that matter). But for
several reasons I am unable to do this. A solution similar to this
module (though not using the same code) seemed to provide the necessary
workaround. My thought was, if I experienced this problem, others might
too. It may be messy, since you're having to carve up a file and double
your required disk space, but it *works*, and in a situation like that,
*working* may be exactly what you need.

I should also point out the module does not *have* to split the original
file up; it can work from a list of files that are already separate for
whatever reason (autorotated log files come to mind). Sure, you can
just cat them, but what if their total size is >2GB? Without large file
support, the perl interpreter will give up after it has read past the
2GB threshold. This module will prevent that from happening. Again,
this is a very specific set of circumstances that ideally one would
avoid. But if you're in such a position, as I was recently, having a
module to give you a helping hand would be a very good thing.