Manipulating large text files

on 20.12.2007 21:06:12 by my0373

Hi All,

Hopefully a nice easy question. Apologies for the cross-posting: I made
a newbie mistake by posting in comp.os.linux, so I'm trying to attain
salvation by posting here instead :)

I have 8x16GB files that are basically just giant lists.

Each row contains only one word.

I need to compare each file against every other file and list
duplicates.

I've done a sort -u on each file, so I know the entries within each one
are unique; I just don't know how to list the duplicates amongst the
files. It's a case of needing to know which duplicates there are, rather
than just cat'ing them together and running a sort -u.

I've written a Perl script that will do it... eventually, but it's going
to take about 2 weeks to finish and, tbh, I'm not sure my disks will
last that long!

I have no access to a database, and it's an isolated network.

Any suggestions?

P.S. Apologies if I've broken any rules; it's the first time I've tried
Usenet :) and I couldn't find a comp.os.linux.fiddlingwithbigfiles
forum!

Re: Manipulating large text files

on 20.12.2007 23:44:02 by Rikishi 42

On 2007-12-20, my0373@googlemail.com wrote:
> I have 8x16gb files that are basically just giant lists.
Sixteen-gigabyte text files?????
Jeezus H. Tapdancing Christ...
What kind of application is this?

> each row contains only one word.
> I need to compare each file against every other file and list
> duplicates.
Right.

> I've done a sort -u on the file and I know that each one is unique, I
> just don't know how to list duplicates amongst the files. Its a case
> of needing to know which duplicates there are, rather than just
> cat'ing them together and running a sort -u.
I'm amazed that sort would even handle files this big. Especially in a
reasonable amount of time.

> I've written a perl script that will do it... eventually but its going to
> take about 2 weeks to finish and tbh, i'm not sure my disks will last that
> long!
Can you give us a clue as to what you came up with?
How about sharing some info on the files? How long are these words?

> Any suggestions ?
I can think of one approach. Must think it out, though.

Let us know a bit more about the problem, please.



--
There is an art, it says, or rather, a knack to flying.
The knack lies in learning how to throw yourself at the ground and miss.
Douglas Adams

Re: Manipulating large text files

on 21.12.2007 01:49:45 by Icarus Sparry

On Thu, 20 Dec 2007 12:06:12 -0800, my0373@googlemail.com wrote:

> Hi All,
>
> Hopefully a nice easy question, apologies for the cross posting I made a
> newbie mistake by posting in comp.os.linux, i'm trying to attain
> salvation by posting here instead :)
>
> I have 8x16gb files that are basically just giant lists.
>
> each row contains only one word.
>
> I need to compare each file against every other file and list
> duplicates.
>
> I've done a sort -u on the file and I know that each one is unique, I
> just don't know how to list duplicates amongst the files. Its a case of
> needing to know which duplicates there are, rather than just cat'ing
> them together and running a sort -u.

Sorry, I don't understand what you are asking for exactly. Can you give
us an example of the input and output you are looking for? For example,
if you had 3 input files:

file1:
cat
dog
horse
pig

file2:
dog
man

file3:
bird
dog
pig

Are the files sorted?

> I've written a perl script that will do it... eventually but its going
> to take about 2 weeks to finish and tbh, i'm not sure my disks will last
> that long!
>
> I have no access to a database, and its an isolated network.
>
> Any suggestions ?
>
> P.S. Apologies if i've broken any rules, its the first time i've tried
> usenet :) and I couldn't find an comp.os.linux.fiddlingwithbigfiles
> forum !

If the files are sorted, then you can find the duplicates between a pair
of files using the "comm" command. To find all the duplicates you will
need to run comm 28 times, which is not too bad, but will read about
900GB.
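
For example, the 28 runs could be driven from the shell along these
lines (an untested sketch; it assumes the eight files are called
file1 ... file8, that they are already sorted as comm requires, and
that the names contain no sed metacharacters; comm -12 prints only the
lines two files have in common):

#!/bin/sh
# Untested sketch: run comm once for every pair of files (28 pairs for
# 8 files) and print each shared word, tagged with the pair of names.
# The outer loop's list is fixed before the first shift, so shifting
# inside only narrows what the inner loop sees.
set -- file1 file2 file3 file4 file5 file6 file7 file8
for a
do
    shift
    for b
    do
        comm -12 "$a" "$b" | sed "s|^|$a $b: |"
    done
done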

If the files are sorted, however, you can use

sort -m file1 file2 file3 file4 file5 file6 file7 file8 | uniq -d > dups

which will "only" read 256Gb (assuming that dups is fairly small). This
will give you a list of duplicated words. Then you can use "comm -3 dups
file1", "comm -3 dups file2" etc for another 256Gb to tell you in which
files the duplicated words are )if that is what you want).
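
With the three example files above, for instance, the first stage

sort -m file1 file2 file3 | uniq -d

would print just

dog
pig

since those are the only words that appear in more than one of the files.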

Do you expect there to be a lot of duplicates, or only a few?

Re: Manipulating large text files

on 21.12.2007 22:43:01 by Rikishi 42

On 2007-12-21, Icarus Sparry wrote:
> Sorry, I don't understand what you are asking for exactly. Can you give
> us an example of the input and output you are looking for? For example if
> you had 3 input files

> Are the files sorted?

He did say the files were sorted with sort -u. So they're sorted, and each
one doesn't contain doubles.

>> I've written a perl script that will do it... eventually but its going
>> to take about 2 weeks to finish and tbh, i'm not sure my disks will last
>> that long!

> If the files are sorted, then you can find the duplicates between a pair
> of files using the "comm" command. To find all the duplicates you will
> need to run comm 28 times, which is not too bad, but will read about
> 900Gb.
>
> If the files are sorted however you can use
>
> sort -m file1 file2 file3 file4 file5 file6 file7 file8 | uniq -d > dups
>
> which will "only" read 256Gb (assuming that dups is fairly small).
You might be right, but I don't get it. Why would you read 256GB?
If you read all 8 files of 16GB each, you only read 128GB. They're cat'ed
together for the sort, perhaps? Mmm.
>This
> will give you a list of duplicated words. Then you can use "comm -3 dups
> file1", "comm -3 dups file2" etc for another 256Gb to tell you in which
> files the duplicated words are )if that is what you want).
Surely, here we'd only read 128 GB, plus 8x the size of the dupes?

> Do you expect there to be a lot of duplicates, or only a few?
What I'd like to know.

I'd like a few examples of such 'words', too.

I'm still working on this, but I think it can be done by reading the files
only once, therefore limiting the reads to 8x16GB=128GB. The origin of the
doubles would be included.

I wish the OP would please post a few lines of example words...


--
There is an art, it says, or rather, a knack to flying.
The knack lies in learning how to throw yourself at the ground and miss.
Douglas Adams

Re: Manipulating large text files

on 22.12.2007 03:43:01 by Icarus Sparry

On Fri, 21 Dec 2007 22:43:01 +0100, Rikishi 42 wrote:

> On 2007-12-21, Icarus Sparry wrote:
>> Sorry, I don't understand what you are asking for exactly. Can you give
>> us an example of the input and output you are looking for? For example
>> if you had 3 input files
>
>> Are the files sorted?
>
> He did say the files where sorted with sort -u. So they're sorted and
> each one doesn't contains doubles.

He said that he had run sort -u on the files, so he knew the files had no
duplicates within themselves. However, he did not say that he had kept
these files.

>>> I've written a perl script that will do it... eventually but its going
>>> to take about 2 weeks to finish and tbh, i'm not sure my disks will
>>> last that long!
>
>> If the files are sorted, then you can find the duplicates between a
>> pair of files using the "comm" command. To find all the duplicates you
>> will need to run comm 28 times, which is not too bad, but will read
>> about 900Gb.
>>
>> If the files are sorted however you can use
>>
>> sort -m file1 file2 file3 file4 file5 file6 file7 file8 | uniq -d >
>> dups
>>
>> which will "only" read 256Gb (assuming that dups is fairly small).
> You might be right, but I don't get it. Why would you read 256GB? If you
> read all 8 files of 16GB each, you only read 128GB. They're cat'ed
> together for the sort, perhaps? Mmm.

No, this is me not being able to do mental arithmetic correctly. You do
*NOT* want to cat the files together if you are using the "sort -m"
approach.

>>This
>> will give you a list of duplicated words. Then you can use "comm -3
>> dups file1", "comm -3 dups file2" etc for another 256Gb to tell you in
>> which files the duplicated words are )if that is what you want).
> Surely, here we'd only read 128 GB, plus 8x the size of the dupes?

Yes. Having made the mistake once, I reused the same mental apparatus to
get the same wrong result a second time.

>> Do you expect there to be a lot of duplicates, or only a few?
> What I'd like to know.
>
> I'd like a few examples of such 'words', too.
>
> I'm still working on this, but I think it can be done by read the files
> only once, therefore limiting the reads to 8x16GB=128GB. The origin of
> the doubles would be included.

Of course it can, but not with standard unix utilities. The problem was
underspecified, but essentially all you need is to read one record from
each file and sort these 8 into order. If the first record is duplicated,
then write whatever output you want about the duplicates (it is not clear
from the description whether the OP wanted to know which files they come
from:
::>I've done a sort -u on the file and I know that each one is
::>unique, I just don't know how to list duplicates amongst the
::>files. Its a case of needing to know which duplicates there
::>are, rather than just cat'ing them together and running a
::>sort -u.
), then discard the first record and its duplicates (if any) and replace
them from the files they came from. This is one linear pass over the 8
input files.
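
A dedicated program is the clean way to do that, but the same pass can be
approximated with the stock tools by first tagging every word with the
name of the file it came from, through a FIFO so that each tagged stream
is still sorted and "sort -m" can do the merging, leaving the bookkeeping
to awk. A rough, untested sketch; it assumes that neither the words nor
the file names contain whitespace, that the file names contain no sed
metacharacters, and that each input really is sorted (in the collating
order sort will use) and duplicate-free within itself:

#!/bin/sh
# For every input file, write "word filename" lines into a FIFO; each
# tagged stream stays sorted because the underlying file already is.
dir=$(mktemp -d) || exit 1
fifos=
n=0
for f in "$@"
do
    n=$((n + 1))
    mkfifo "$dir/$n"
    sed "s|\$| $f|" "$f" > "$dir/$n" &
    fifos="$fifos $dir/$n"
done
# Merge the tagged streams on the word field alone, then report every
# word that shows up under more than one file name.
sort -m -k1,1 $fifos |
awk '
    $1 != word { if (count > 1) print word ":" files
                 word = $1; files = ""; count = 0 }
    { files = files " " $2; count++ }
    END { if (count > 1) print word ":" files }
'
rm -rf "$dir"

Each input is read exactly once, and nothing bigger than one run of
identical words is ever held in memory.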

The question you have to decide is how the cost of writing this program
compares to the cost of writing the shell script I outlined. The cost
will have many factors: the time taken to write the program, the time it
takes to run, how often it must be run, what deadlines have to be met, etc.

#!/bin/sh
sort -m "$@" | uniq -d > dups
for i
do
comm -3 "$i" dups | sed "s|^|$i:|"
done

takes very little time to write and one is happy that it is reasonably
bug free. (OK, there are problems if one of the input files is called
"dups", there are problems if any of the input files has a pipe symbol in
its name, if someone sets IFS to a weird value, if there are other
programs called "sort", "uniq", "comm" or "sed" in the PATH that don't do
what they should, if /bin/sh is not roughly a Bourne or Posix shell...)
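
One more caveat, and this is only a guess at what is actually wanted:
comm -3 prints the lines the two files do *not* share, so the loop above
really lists, for each file, its non-duplicated words plus (indented) the
duplicated words it is missing. If the goal is the duplicated words each
file does contain, comm -12, which prints only the lines common to both
files, gives that directly:

for i
do
    comm -12 dups "$i" | sed "s|^|$i:|"
done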

> I wish the OP would please post a few lines of example words...

Re: Manipulating large text files

on 25.12.2007 19:54:35 by Bruce Barnett

"my0373@googlemail.com" writes:

> I've done a sort -u on the file and I know that each one is unique, I
> just don't know how to list duplicates amongst the files. Its a case
> of needing to know which duplicates there are, rather than just
> cat'ing them together and running a sort -u.

> I've written a perl script that will do it... eventually but its
> going to take about 2 weeks to finish and tbh, i'm not sure my disks
> will last that long!

If each file is sorted, then write a program to read one word from
each of the three files. Look for dups. Then find the word that is
first in the list (lexically), and read the next word from the
corresponding file.

If the same word is in two files (a duplicate in files A and B), and
these words are listed before the third word (in C), then advance by
reading one word from both A and B.

This way it only takes one pass through the files. No sorting. No
large memory issues. One pass through each of the three files.



--
Posted via a free Usenet account from http://www.teranews.com