[RFC] File::SplitStream - iterate over files >2GB when large file support unavailable
on 21.08.2005 00:37:24 by AJ
I am implementing a module and seek the community's input about its
suitability for placement on CPAN. File::SplitStream (I am open to
better names) is designed to be used when an OS supports large files,
but the Perl interpreter does not have large file support enabled
(specifically, Red Hat Linux did this for a while). It uses the Unix
split command to split the large file into <2GB chunks, then generates an
iterator to allow the calling routine to transparently read the file
chunks as if they are still one large file.
Below is a draft for the documentation for this module. I searched CPAN
and did not find anything similar. Your input and suggestions regarding
structure, functionality, and documentation improvements will be greatly
appreciated!
NAME
File::SplitStream - iterate over multiple files as if they were one
file. Optionally split a large file into smaller files before
iteration.
SYNOPSIS
use File::SplitStream;
# split a file into parts
my $filestream = File::SplitStream->new;
$filestream->file('/path/to/inputfile');
$filestream->lines(19000000);
$filestream->genFileStream() || die("cannot generate filestream: $!");
--OR--
# or use a group of pre-existing files
my @inputfiles = qw(file01.txt file02.txt file03.txt);
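my $filestream = File::SplitStream->new;
$filestream->files(\@inputfiles);    # files() takes a reference to a list
$filestream->genFileStream() || die("cannot generate filestream: $!");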
# regardless of how you set things up, you can
# now iterate over the files as if they're one file
# test with defined() so that a line such as "0" does not end the loop
while ( defined( my $line = $filestream->nextLine()->() ) ) {
    # ...do stuff on each line of all of the files...
}
--OR--
# you can use a function call rather than instantiating an object
use File::SplitStream qw(genFileStream);
my $filestream = genFileStream('/path/to/inputfile', 19000000);
while ( defined( my $line = $filestream->() ) ) {
    # ...do stuff on each line of all of the files...
}
DESCRIPTION
File::SplitStream can be used to split a large text file (or, optionally,
to use a list of pre-existing files) and iterate over the files as if
they were a single file. This class is designed to help work with large
files (>2GB) when large file support is unavailable. Perhaps the
programmer does not have permissions to recompile the available Perl
interpreter, or simply does not have the time. Regardless of reason,
this module can help fill in the gap when large file support is
unavailable.
In order for File::SplitStream to work properly, the Unix split and cat
commands should be in your $PATH. The split command is used to split up
the large file into more manageable chunks, while the cat command is
used to buffer input of the files. The number of lines in each of your
split files will depend on how much data is in each line. Shorter
lines will allow you to put many more lines into a file before it
crosses the 2GB barrier. Longer lines will require you to decrease the
lines/file value.
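As a rough illustration of the arithmetic above, a lines/file value can be derived from the longest line you expect. This is only a sketch: max_lines_per_chunk() is a hypothetical helper that is not part of File::SplitStream, and it assumes the limit is the largest offset a 32-bit signed file offset can hold (2**31 - 1 bytes).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of File::SplitStream): given the longest
# line you expect, in bytes including the newline, return a lines/file
# value that keeps every chunk below the 2GB limit.
sub max_lines_per_chunk {
    my ($max_line_bytes) = @_;
    die "line length must be positive\n" unless $max_line_bytes > 0;
    my $limit = 2**31 - 1;    # assumed limit: 32-bit signed file offset
    return int( $limit / $max_line_bytes );
}

printf "%d\n", max_lines_per_chunk(100);     # prints 21474836
printf "%d\n", max_lines_per_chunk(4096);    # prints 524287
```

Basing the estimate on the longest line rather than the average is the conservative choice, since it guarantees no chunk can exceed the limit.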
ACCESSOR METHODS
These accessor methods can be used directly or set by passing them to
the new() method.
(set to split a single file apart and iterate over the pieces)
file     the file to split apart
lines    maximum number of lines in each file chunk
(set to use a pre-existing set of files as a single file)
files    reference to a list of files to iterate over
OTHER METHODS
new(%options)
Use the new() method to create a new File::SplitStream object. You will
need to do this to use the module in an object-oriented way. You can
pass options to the new() method to set the file(), lines(), and files()
values.
Examples:
# new File::SplitStream with no options
my $fss = File::SplitStream->new;
# new File::SplitStream with options to split a single file
my $fss = File::SplitStream->new(FILE => '/path/to/file', LINES => 1000000);
# new File::SplitStream with option to use a list of pre-existing files
my $fss = File::SplitStream->new(FILES => ['/path/to/file1',
                                           '/path/to/file2',
                                           '/path/to/file3']);
init(%options)
If options are passed to new(), init() is invoked by new() to set the
appropriate object attributes given the options. Normally init() is only
invoked by new(), but it can be used to (re)set your File::SplitStream
object's attributes if you want.
Example:
my $fss = File::SplitStream->new;
$fss->init(FILE => '/path/to/file', LINES => 15000000);
genFileStream($filepath, $number_of_lines)
The workhorse of File::SplitStream is the genFileStream()
method/function. It splits the large data file (if necessary) using the
Unix split command, then generates an iterator function to return each
line of the split files in order, transparently opening and closing the
split files as necessary. If you have specified a list of pre-existing
files, the iterator will open each in the order you gave.
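The iterator idea described above can be sketched as a closure that keeps the list of remaining files and the current file handle. This is only an illustration under stated assumptions: make_line_iterator() is a hypothetical name, it opens the files directly rather than through cat, and the actual module may be implemented differently.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sketch, not the module's own code: return a closure that
# yields one line per call across several files, in the order given.
sub make_line_iterator {
    my @files = @_;    # files still to be read
    my $fh;            # handle for the file currently being read

    return sub {
        while (1) {
            if ( !defined $fh ) {
                my $file = shift @files;
                return unless defined $file;    # all files exhausted
                open( $fh, '<', $file ) or die "cannot open $file: $!";
            }
            my $line = readline($fh);
            return $line if defined $line;
            close($fh);    # end of this file; move on to the next one
            undef $fh;
        }
    };
}

# Usage: two small temporary files stand in for the split chunks.
my @paths;
for my $text ( "a\nb\n", "c\n" ) {
    my ( $tmp, $path ) = tempfile( UNLINK => 1 );
    print {$tmp} $text;
    close($tmp) or die "cannot close $path: $!";
    push @paths, $path;
}
my $next = make_line_iterator(@paths);
while ( defined( my $line = $next->() ) ) {
    print $line;
}
```

Because the closure returns undef only when every file is exhausted, the caller should test the result with defined() so that a line containing just "0" does not end the loop early.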
In an object-oriented context, genFileStream() will take the values of
the file() and lines() accessors (or the files() accessor in the case of
pre-existing files) as its parameters. If you explicitly pass
genFileStream() parameters, these will override the object's attributes.
In a procedural context, you must pass these parameters explicitly.
In object-oriented style, genFileStream() will assign the iterator
function to the nextLine() accessor and return 1 to the calling routine;
this way the calling routine does not need yet another variable to hold
the "filestream." In procedural style, the iterator will be returned to
the calling routine.
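One way a single routine could support both calling styles is to check whether its first argument is an object of its own class. The following is a sketch only: My::DualStyle and genStream() are hypothetical names, the iterator walks an in-memory list instead of reading files, and the real genFileStream() may detect its calling style differently.

```perl
package My::DualStyle;    # hypothetical package, for illustration only
use strict;
use warnings;
use Scalar::Util qw(blessed);

sub new {
    my ($class) = @_;
    return bless { ITER => undef }, $class;
}

# Accessor holding the iterator stored by genStream() in OO style.
sub nextLine { return $_[0]{ITER} }

sub genStream {
    # If the first argument is one of our objects, we were called as a method.
    my $self = ( blessed( $_[0] ) && $_[0]->isa(__PACKAGE__) ) ? shift(@_) : undef;
    my @items = @_;
    my $iter  = sub { return shift @items };    # undef once @items is empty

    return $iter unless $self;    # procedural style: hand back the iterator
    $self->{ITER} = $iter;        # OO style: store it and report success
    return 1;
}

package main;

# OO style
my $obj = My::DualStyle->new;
$obj->genStream( "x\n", "y\n" ) or die "cannot generate stream";
while ( defined( my $line = $obj->nextLine()->() ) ) {
    print $line;
}

# procedural style
my $iter = My::DualStyle::genStream("z\n");
print $iter->();
```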
When the data in all of the files has been exhausted, the iterator
function will return undef. If there is a problem generating the
iterator (usually a problem with the split), or a problem is encountered
while the split files are being read, the program will die() with the
error written to STDERR.
Examples:
# OO way
use File::SplitStream;
my $fss = File::SplitStream->new;
$fss->file('/data/largefile.dat');
$fss->lines(1000000);
$fss->genFileStream();
while ( defined( my $line = $fss->nextLine()->() ) ) {
    # ...process the file...
}
# procedural
use File::SplitStream qw(genFileStream);
my $stream = genFileStream('/data/largefile.dat', 1000000);
while ( defined( my $line = $stream->() ) ) {
    # ...process the file...
}
EXPORT
None by default. You can import genFileStream() into your namespace if
you wish to use it in procedural style.
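An export policy like this is commonly expressed with the core Exporter module. The sketch below assumes that approach; My::StreamSketch is a hypothetical package and its genFileStream() is only a stub, so this shows the export mechanics rather than the module's actual code.

```perl
package My::StreamSketch;    # hypothetical package, for illustration only
use strict;
use warnings;
require Exporter;
our @ISA       = qw(Exporter);
our @EXPORT    = ();                   # nothing is exported by default
our @EXPORT_OK = qw(genFileStream);    # exported only on request

# Stub standing in for the real function: returns an iterator that is
# already exhausted.
sub genFileStream {
    my ( $filepath, $number_of_lines ) = @_;
    return sub { return };
}

package main;

# In a real program this would be: use My::StreamSketch qw(genFileStream);
# import() is called by hand here because the package lives in this file.
My::StreamSketch->import(qw(genFileStream));

my $stream = genFileStream( '/path/to/inputfile', 19000000 );
print defined( $stream->() ) ? "got a line\n" : "stream exhausted\n";
```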