diff says memory exhausted need help with perl

on 28.03.2008 22:26:56 by tc314

I've got two large, similar files with one word per line, and both are
sorted.
Each file has a few words not in the other.
I typically identify the unique words in each file using diff, grep, and cut.
When the files are too big (2 GB), diff dies with "memory exhausted".

I want to search for the unique words in file1, but I might need to
ping-pong since neither file is a superset of the other.
I don't want to be limited by physical RAM, as the file sizes exceed
RAM.

I assume I'm not the first to have this problem.
Can someone point me to Perl code?
TIA


--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
http://learn.perl.org/

Re: diff says memory exhausted need help with perl

on 28.03.2008 23:54:41 by Lawrence Statton

If you're using GNU diff (i.e. the diff that comes with most Linuces),
--speed-large-files might help you, without having to jump through a
Perl hoop.

--L



Re: diff says memory exhausted need help with perl

on 29.03.2008 01:03:54 by krahnj

tc314@hotmail.com wrote:
> I've got two large, similar files with one word per line, and both are
> sorted.
> Each file has a few words not in the other.
> I typically identify the unique words in each file using diff, grep, and cut.
> When the files are too big (2 GB), diff dies with "memory exhausted".
>
> I want to search for the unique words in file1, but I might need to
> ping-pong since neither file is a superset of the other.
> I don't want to be limited by physical RAM, as the file sizes exceed
> RAM.
>
> I assume I'm not the first to have this problem.
> Can someone point me to Perl code?

This appears to do what you require:

#!/usr/bin/perl
use warnings;
use strict;

# Both files must be sorted in plain byte order (as Perl's lt compares).
my ( $file1, $file2 ) = ( 'file1', 'file2' );

open my $F1, '<', $file1 or die "Cannot open '$file1': $!";
open my $F2, '<', $file2 or die "Cannot open '$file2': $!";

# Read the first line of each file; undef marks end of file.
my $first  = <$F1>;
my $second = <$F2>;

# Walk both files in step, holding only one line of each in memory.
while ( defined $first or defined $second ) {
    if ( not defined $second or ( defined $first and $first lt $second ) ) {
        print "$file1: $first";     # word is only in file1
        $first = <$F1>;
    }
    elsif ( not defined $first or $first gt $second ) {
        print "$file2: $second";    # word is only in file2
        $second = <$F2>;
    }
    else {                          # same word in both files
        $first  = <$F1>;
        $second = <$F2>;
    }
}

__END__
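
The merge above can be tried without multi-gigabyte inputs. The sketch
below is an illustration only: only_in_each is a hypothetical helper
name, and the two in-memory strings stand in for file1 and file2. It
collects results in arrays so they can be inspected; for files larger
than RAM, print each line as it is found instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: walk two sorted filehandles in step and return
# references to the lines found only in the first and only in the second.
sub only_in_each {
    my ( $fh1, $fh2 ) = @_;
    my ( @only1, @only2 );
    my $first  = <$fh1>;
    my $second = <$fh2>;
    while ( defined $first or defined $second ) {
        if ( not defined $second or ( defined $first and $first lt $second ) ) {
            push @only1, $first;
            $first = <$fh1>;
        }
        elsif ( not defined $first or $first gt $second ) {
            push @only2, $second;
            $second = <$fh2>;
        }
        else {    # same line in both inputs
            $first  = <$fh1>;
            $second = <$fh2>;
        }
    }
    return ( \@only1, \@only2 );
}

# In-memory stand-ins (made-up data) for the two sorted word files.
my $words1 = "apple\nbanana\ncherry\n";
my $words2 = "banana\ncherry\ndate\n";
open my $fh1, '<', \$words1 or die "Cannot open in-memory file: $!";
open my $fh2, '<', \$words2 or die "Cannot open in-memory file: $!";

my ( $only1, $only2 ) = only_in_each( $fh1, $fh2 );
print "file1: $_" for @$only1;    # prints "file1: apple"
print "file2: $_" for @$only2;    # prints "file2: date"
```

Because each loop iteration advances at least one filehandle, the
number of iterations is bounded by the total number of lines in both
inputs.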



John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall


Re: diff says memory exhausted need help with perl

on 29.03.2008 01:08:43 by tc314

On Mar 28, 6:54 pm, lawre...@cluon.com (Lawrence Statton) wrote:
> If you're using GNU diff (i.e. the diff that comes with most Linuces),
> --speed-large-files might help you, without having to jump through a
> perl hoop.
>
> --L

Problems:
1) It runs out of memory with 8 GB of files and 2 GB of RAM.
2) It assumes a number of lines (3999) because it doesn't know whether it
will find a difference in one line or a million lines.
(2b: This goes against the *nix pipe concept because it then pushes this
unwieldy block to the next stage, 'cut', rather than gracefully streaming
from pipe to pipe.)
3) The heuristic approach is an imprecise solution to an exact
problem.
It doesn't work perfectly every time.

For most files the simple bash scripts are clean, self-documenting, and
fine.
It's natural in Perl.
I'm battling syntax and trying to avoid physical RAM issues entirely.

Thanks



Re: diff says memory exhausted need help with perl

on 29.03.2008 20:11:30 by Rob Dixon

tc314@hotmail.com wrote:
>
> On Mar 28, 6:54 pm, lawre...@cluon.com (Lawrence Statton) wrote:
>> If you're using GNU diff (i.e. the diff that comes with most Linuces),
>> --speed-large-files might help you, without having to jump through a
>> perl hoop.
>>
>> --L
>
> Problems:
> 1) It runs out of memory with 8 GB of files and 2 GB of RAM.
> 2) It assumes a number of lines (3999) because it doesn't know whether it
> will find a difference in one line or a million lines.
> (2b: This goes against the *nix pipe concept because it then pushes this
> unwieldy block to the next stage, 'cut', rather than gracefully streaming
> from pipe to pipe.)
> 3) The heuristic approach is an imprecise solution to an exact
> problem.
> It doesn't work perfectly every time.
>
> For most files the simple bash scripts are clean, self-documenting, and
> fine.
> It's natural in Perl.
> I'm battling syntax and trying to avoid physical RAM issues entirely.

The diff utility is a general-purpose application that generates a
minimal edit (list of changes) to transform one file into another. If
both files are sorted, the problem reduces enormously, and using diff is
overkill. Please take a look at the code that John posted. If it doesn't
do what you require in a tiny fraction of the time taken by your current
method, I will be astonished.
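
A single-pass comparison like that relies on both files really being in
the order Perl's lt operator expects: plain character-code order, which
can differ from what sort produces under a non-C locale. Here is a
minimal sketch of a constant-memory check. The name first_unsorted_line
is a hypothetical helper, and the sample data is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the number of the first line that sorts
# before its predecessor, or 0 if the whole input is in ascending order.
# Only the previous line is kept, so memory use does not grow with file size.
sub first_unsorted_line {
    my ($fh) = @_;
    my $previous;
    my $line_number = 0;
    while ( my $line = <$fh> ) {
        $line_number++;
        return $line_number if defined $previous and $line lt $previous;
        $previous = $line;
    }
    return 0;
}

# In-memory stand-ins (made-up data) for a sorted and an unsorted file.
my $sorted   = "apple\nbanana\ncherry\n";
my $unsorted = "apple\ncherry\nbanana\n";
open my $good, '<', \$sorted   or die "Cannot open in-memory file: $!";
open my $bad,  '<', \$unsorted or die "Cannot open in-memory file: $!";

print first_unsorted_line($good), "\n";    # prints 0
print first_unsorted_line($bad),  "\n";    # prints 3
```

If the check reports a line number, re-sorting the file in byte order
before comparing is one option.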

Rob



(For those interested, the algorithm employed by diff has a performance
between O(N) (no changes) and O(N^2). It is documented at
http://www.xmailserver.org/diff2.pdf)


Re: diff says memory exhausted need help with perl

on 29.03.2008 20:14:28 by Rob Dixon

tc314@hotmail.com wrote:
>
> On Mar 28, 6:54 pm, lawre...@cluon.com (Lawrence Statton) wrote:
>>
>> If you're using GNU diff (i.e. the diff that comes with most Linuces),
>> --speed-large-files might help you, without having to jump through a
>> perl hoop.
>
> Problems:
[snip]
>
> 3) The heuristic approach is an imprecise solution to an exact
> problem. It doesn't work perfectly every time.
>
[snip]

Come to think of it:

What heuristic approach?

Rob

