Parsing Huge File

on 13.08.2007 15:05:55 by Majnu

Hello,

I have a strange problem. I have a flat file of about 6 million rows,
each row of 600 bytes. I read the file line by line after opening like
this:

open(IN, "cat $InputFile |") or die "Failed to open the File";
##This was changed from open(IN, "< $InputFile") because Perl
outrightly refused to open the file.
while(<IN>) {......

The problem is that, at times, Perl just stops after reading 3,334,601
records. No error message printed. And this is not a problem which
always occurs. It just happens sporadically and is hence difficult to
track because, if I re-process the file, it gets read completely.

Would someone please shed light on how this could be happening? Is
this something related to memory?
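
For reference, a minimal way to instrument such a read loop so that an
early stop at least reports a reason might look like the sketch below
(the file name and variable names are illustrative, not taken from the
original script):

use strict;
use warnings;

my $InputFile = 'input.dat';    # illustrative path
open my $in, '<', $InputFile
    or die "Cannot open '$InputFile': $!";

my $count = 0;
while (my $line = <$in>) {
    $count++;
    # ... process $line here ...
}

# If the loop ended before end-of-file, say so and show the last error.
warn "Stopped after $count records before EOF: $!\n" unless eof($in);
print "Read $count records in total\n";
close $in or warn "Error closing '$InputFile': $!\n";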

Re: Parsing Huge File

on 13.08.2007 16:20:16 by Christian Winter

Majnu wrote:
> I have a strange problem. I have a flat file of about 6 million rows,
> each row of 600 bytes. I read the file line by line after opening like
> this:
>
> open(IN, "cat $InputFile |") or die "Failed to open the File";
> ##This was changed from open(IN, "< $InputFile") because Perl
> outrightly refused to open the file.

With which error message did Perl refuse to open the file?
Normally this shouldn't happen. If your numbers are correct,
even a Perl compiled without uselargefiles=define should be
able to open a file of under 2GB on any halfway recent platform.

> while(<IN>) {......
>
> The problem is that, at times, Perl just stops after reading 3,334,601
> records. No error message printed.

That's really strange. Getting no error message at all shouldn't
happen. Does your script still react to Ctrl+C at that point, or is the
whole Perl process hung? If it is hung, is there anything in the OS
kernel's log that gives a hint?

> And this is not a problem which
> always occurs. It just happens sporadically and is hence difficult to
> track because, if I re-process the file, it gets read completely.

And you're sure that the error _always_ occurs after exactly 3,334,601
records?

> Would someone please shed light on how this could be happening? Is
> this something related to memory?

This really sounds strange to me. Does the file reside on a local
disk (otherwise it could be some network fs issue)? Does it also
happen with a copy of the file in a different directory / partition
/ disk?

What platform (OS, Filesystems) are you running your script on, and
which Perl version are you using (output from "perl -V")?

-Chris

Re: Parsing Huge File

on 13.08.2007 17:14:41 by Ian Wilson

Majnu wrote:
> Hello,
>
> I have a strange problem. I have a flat file of about 6 million rows,
> each row of 600 bytes. I read the file line by line after opening like
> this:
>
> open(IN, "cat $InputFile |") or die "Failed to open the File";
> ##This was changed from open(IN, "< $InputFile") because Perl
> outrightly refused to open the file.

I'd write that as

open my $in, '<', $InputFile
or die "Failed to open '$InputFile' because $!";

Your `cat` doesn't seem to be doing anything useful.
I always print information about the cause of the error ($!).

I'd get Perl to say exactly *why* it "refused to open the file" before
introducing further complications like `cat`.

> while(<IN>) {......

while(<$in>) { ...

>
> The problem is that, at times, Perl just stops after reading 3,334,601
> records. No error message printed. And this is not a problem which
> always occurs. It just happens sporadically and is hence difficult to
> track because, if I re-process the file, it gets read completely.
>
> Would someone please shed light on how this could be happening? Is
> this something related to memory?
>

I guess that would depend on what sort of processing you are doing.

Perhaps a side effect of your script changes the run-time environment
for the re-run, e.g. it creates a log file that didn't exist and that
your script expects to be present. Maybe there are resource conflicts
with other running processes. Still, it is unusual to get no messages
at all; have you been assiduous in checking for errors in all
statements that can potentially report run-time errors?
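
As an illustration of that kind of defensive checking, a short sketch
along these lines (the handle and file names are made up, not taken
from the original script) makes most failures visible:

use strict;
use warnings;

my $InputFile  = 'input.dat';     # illustrative
my $OutputFile = 'output.dat';    # illustrative

open my $in, '<', $InputFile
    or die "Cannot open '$InputFile': $!";
open my $out, '>', $OutputFile
    or die "Cannot create '$OutputFile': $!";

while (my $line = <$in>) {
    # ... transform $line as needed ...
    print {$out} $line
        or die "Write to '$OutputFile' failed: $!";
}

close $in  or die "Error while reading '$InputFile': $!";
close $out or die "Error while writing '$OutputFile': $!";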

Re: Parsing Huge File

on 13.08.2007 18:53:16 by xhoster

Majnu wrote:
> Hello,
>
> I have a strange problem. I have a flat file of about 6 million rows,
> each row of 600 bytes. I read the file line by line after opening like
> this:
>
> open(IN, "cat $InputFile |") or die "Failed to open the File";
> ##This was changed from open(IN, "< $InputFile") because Perl
> outrightly refused to open the file.

Instead of just doing silly things to circumvent the problem, perhaps you
should figure out why Perl refused to open the file. That is what $! is
for. If you put electrical tape over the "oil pressure" light on your
dashboard, instead of fixing the problem, you shouldn't be surprised when
your car stops working.

Xho


Re: Parsing Huge File - must use version of perl compiled for huge files.

on 14.08.2007 04:40:28 by Joe Smith

Majnu wrote:
> Hello,
>
> I have a strange problem. I have a flat file of about 6 million rows,
> each row of 600 bytes. I read the file line by line after opening like
> this:
>
> open(IN, "cat $InputFile |") or die "Failed to open the File";
> ##This was changed from open(IN, "< $InputFile") because Perl
> outrightly refused to open the file.
> > while(<IN>) {......
>
> The problem is that, at times, Perl just stops after reading 3,334,601
> records.

You neglected to mention which version of perl and which OS, but I
can see the cause of both of your problems.

% perl -le 'print 600*3_334_601'
2000760600

% perl -V | grep large
useperlio=define d_sfio=undef uselargefiles=define

You cannot process more than 2 gigabytes (can't even open a file
bigger than 2 GB) when using a version of perl that expects file
sizes to fit into a signed 32-bit int.

Check your 'perl -V'; I'm sure it is _not_ compiled with 'uselargefiles=define'.
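
A more targeted variant of the same check asks perl for just that one
configuration value; on a build without large file support it typically
reports 'undef' instead of 'define':

% perl -V:uselargefiles
uselargefiles='define';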

You need to upgrade to perl v5.8.x immediately.

-Joe

Re: Parsing Huge File

on 14.08.2007 04:43:56 by Joe Smith

Ian Wilson wrote:

>> open(IN, "cat $InputFile |") or die "Failed to open the File";
>> ##This was changed from open(IN, "< $InputFile")
>
> Your `cat` doesn't seem to be doing anything useful.

It is useful if your perl doesn't handle large files: reading through
the pipe lets a program that expects 32-bit file sizes get through the
first 2GB of data instead of blowing up immediately at open(). It is a
kludge.
-Joe
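
If the cat kludge is used at all, the list form of pipe open at least
avoids going through a shell and makes open failures reportable; a
rough sketch, not code from the thread:

open my $in, '-|', 'cat', $InputFile
    or die "Cannot run cat on '$InputFile': $!";

while (my $line = <$in>) {
    # ... process $line ...
}

# For a pipe, close() also fails if cat itself exited with an error.
close $in or die "cat on '$InputFile' failed: $! (exit status $?)";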

Re: Parsing Huge File - must use version of perl compiled for huge files.

on 17.08.2007 10:26:13 by Majnu

On Aug 14, 7:40 am, Joe Smith wrote:
> Majnu wrote:
> > The problem is that, at times, Perl just stops after reading 3,334,601
> > records.
>
> % perl -le 'print 600*3_334_601'
> 2000760600
>
> Check your 'perl -V'; I'm sure it is _not_ compiled with 'uselargefiles=define'.
>
> You need to upgrade to perl v5.8.x immediately.
>
> -Joe

Thanks for the replies.

Yes, the Perl version seems to be the problem here. Though the perl
available on the system was 5.8, within the script a 5.2 interpreter
was used, which, unfortunately, was not compiled with the uselargefiles
option.
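
A small guard at the top of the script would have caught that mismatch
right away; a sketch (the shebang path is illustrative, not taken from
the original script):

#!/usr/bin/perl
use 5.008;     # refuse to run under an interpreter older than 5.8
use Config;

# Also refuse to run if this perl was built without large file support.
die "perl $] at $^X lacks uselargefiles; files over 2 GB will fail\n"
    unless ($Config{uselargefiles} || '') eq 'define';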