I have a huge text file with 1000 columns and about 1 million rows,
and I need to transpose this text file so that rows become columns and
columns become rows. (In case you are curious, this is a genotype file.)
Can someone recommend an easy and efficient way to transpose such a
large dataset, hopefully with Perl?
Thank you very much!
Jie
Re: how to transpose a huge text file
on 08.08.2007 23:09:05 by Mirco Wahab
Jie wrote:
> I have a huge text file with 1000 columns and about 1 million rows,
> and I need to transpose this text file so that row become column and
> column become row. (in case you are curious, this is a genotype file).
>
> Can someone recommend me an easy and efficient way to transpose such a
> large dataset, hopefully with Perl ?
Is this a fixed format? Does each row have exactly the same number of columns?
Does each row have a "line ending" character?
How does that beast look? Normally, you'd read it in one stroke
(1,000 columns x 10^6 rows ==> 1 GB) and dump out the new rows via substr()
offsets (R times rowlength + C), sequentially.
But without having more info, I could only guess ...
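For what it's worth, a guess at that slurp-and-substr() idea could look like
this (untested sketch; it assumes single-character fields separated by one
space, identical row lengths, and enough RAM to hold the whole file):

use strict;
use warnings;

my ($in_name, $out_name) = @ARGV;
open my $in,  '<', $in_name  or die "$in_name: $!";
open my $out, '>', $out_name or die "$out_name: $!";

my $data   = do { local $/; <$in> };      # slurp the whole file
my $rowlen = 1 + index $data, "\n";       # bytes per row, "\n" included
my $nrows  = length($data) / $rowlen;
my $ncols  = $rowlen / 2;                 # "X X ... X\n" => 2 bytes per field

for my $c (0 .. $ncols - 1) {
    my @field;
    for my $r (0 .. $nrows - 1) {
        push @field, substr $data, $r * $rowlen + 2 * $c, 1;
    }
    print $out join(' ', @field), "\n";   # one output row per input column
}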
Regards
M.
Re: how to transpose a huge text file
on 09.08.2007 01:53:35 by Jim Gibson
In article <1186606584.143555.119720@k79g2000hse.googlegroups.com>, Jie wrote:
> I have a huge text file with 1000 columns and about 1 million rows,
> and I need to transpose this text file so that row become column and
> column become row. (in case you are curious, this is a genotype file).
>
> Can someone recommend me an easy and efficient way to transpose such a
> large dataset, hopefully with Perl ?
So you want a file with 1000 rows and 1 million columns? Ouch!
First thing to try is to attempt to read the entire file into memory at
once. Then, as Mirco suggested, use substr to extract columns from each
row.
How big are your fields? Just one character?
If you can't read the whole file in at once, you can consider other
options:
1. If the rows are all the same length, you can use seek() to move
around in the file and read the data for each column from every row.
It's going to be slow, though. (perldoc -f seek)
2. You can write out each field with its column number and row number
prepended to another file, e.g. "col:row:data\n". Then sort that file
by column number and row number using an external sort (a sketch of this
is given below). However, your temporary file is going to be very big.
Unix sort can handle very large files, though.
3. You might consider putting your data into a database and making the
appropriate queries.
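A sketch of the sort-based idea in option 2 (the file names are placeholders;
whitespace-separated fields and a Unix sort that understands -k are assumed):

use strict;
use warnings;

my ($in_name, $tmp, $sorted, $out_name) =
    ('BIG.txt', 'decorated.txt', 'sorted.txt', 'transposed.txt');

# Pass 1: decorate every field with its column and row number.
open my $in,  '<', $in_name or die "$in_name: $!";
open my $dec, '>', $tmp     or die "$tmp: $!";
my $row = 0;
while (<$in>) {
    my @f = split;
    print $dec "$_\t$row\t$f[$_]\n" for 0 .. $#f;   # col, row, value
    $row++;
}
close $dec;

# Pass 2: let the system sort do the heavy lifting: by column, then by row.
system("sort -k1,1n -k2,2n $tmp > $sorted") == 0 or die "sort failed: $?";

# Pass 3: every run of lines sharing a column number becomes one output row.
open my $srt, '<', $sorted   or die "$sorted: $!";
open my $out, '>', $out_name or die "$out_name: $!";
my $cur = -1;
while (<$srt>) {
    chomp;
    my ($col, undef, $val) = split /\t/;
    if ($col != $cur) {
        print $out "\n" if $cur >= 0;
        $cur = $col;
    } else {
        print $out ' ';
    }
    print $out $val;
}
print $out "\n";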
Jie wrote:
> I have a huge text file with 1000 columns and about 1 million rows,
> and I need to transpose this text file so that row become column and
> column become row. (in case you are curious, this is a genotype file).
>
> Can someone recommend me an easy and efficient way to transpose such a
> large dataset, hopefully with Perl ?
Before doing that, I would spend a few minutes considering whether the time
would be better spent in making a more permanent solution, perhaps by
structuring the data so you can jump through it at will using seek or
sysseek, rather than always scanning it.
Anyway, I'd do something combining Perl and the system sort command
something like this:
Re: how to transpose a huge text file
on 09.08.2007 04:00:41 by Ilya Zakharevich
[A complimentary Cc of this posting was sent to
Jie ], who wrote in article <1186606584.143555.119720@k79g2000hse.googlegroups.com>:
>
> I have a huge text file with 1000 columns and about 1 million rows,
> and I need to transpose this text file so that row become column and
> column become row. (in case you are curious, this is a genotype file).
>
> Can someone recommend me an easy and efficient way to transpose such a
> large dataset, hopefully with Perl ?
If your CRTL allows opening 1000 output files, read a line, and
append the entries to the corresponding files. Then cat the files
together.
If your CRTL allows opening only 32 output files, you need 3 passes,
not 2. First break into 32 files, 32 columns per file; then repeat the
breaking for the 32 generated files. Again, you get 1000 output files;
cat them together.
Hope this helps,
Ilya
P.S. If all your output data should fit into memory, use scalars
instead of files (preallocate scalars to be extra safe: $a =
'a'; $a x= 4e6; $a = '' preallocates 4MB of buffer for a
variable).
Read file line-by-line, appending to 1000 strings in memory.
Then write them out to a file.
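As a sketch of that in-memory variant (it assumes the transposed result
really does fit in memory and that fields are whitespace-separated):

use strict;
use warnings;

my $ncols = 1000;
my @out   = ('') x $ncols;
# Optional preallocation, as in the P.S. above:
# for (@out) { $_ = 'a'; $_ x= 4e6; $_ = '' }

open my $in, '<', $ARGV[0] or die "$ARGV[0]: $!";
while (<$in>) {
    my @f = split;
    $out[$_] .= "$f[$_] " for 0 .. $#f;   # append field to its column string
}
print "$_\n" for @out;                    # each string is one transposed row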
Re: how to transpose a huge text file
on 09.08.2007 04:52:46 by paduille.4061.mumia.w+nospam
On 08/08/2007 09:00 PM, Ilya Zakharevich wrote:
> [A complimentary Cc of this posting was sent to
> Jie
> ], who wrote in article <1186606584.143555.119720@k79g2000hse.googlegroups.com>:
>> [...]
>> Can someone recommend me an easy and efficient way to transpose such a
>> large dataset, hopefully with Perl ?
>
> If your CRTL allows opening a 1000 output files, read a line, and
> append the entries into corresponding files. Then cat the files
> together.
>
> If your CRTL allows opening only 32 output files, you need 3 passes,
> not 2. [...]
FileCache might also be useful here.
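For the record, FileCache keeps a limited pool of handles open and
transparently reopens a file in append mode when a handle has to be
recycled, so the many-output-files scheme could be written roughly like
this (a sketch; the file names are made up):

use FileCache maxopen => 200;   # keep at most ~200 handles open at once
no strict 'refs';               # the classic FileCache interface uses the
                                # path string itself as the filehandle

while (<STDIN>) {
    my @f = split;
    for my $col (0 .. $#f) {
        my $path = "col$col.txt";
        cacheout $path;               # open, or silently reopen for append
        print $path "$f[$col] ";
    }
}
# afterwards, concatenate col0.txt .. col999.txt in numeric order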
Re: how to transpose a huge text file
on 09.08.2007 05:20:14 by Petr Vileta
Jie wrote:
> I have a huge text file with 1000 columns and about 1 million rows,
> and I need to transpose this text file so that row become column and
> column become row. (in case you are curious, this is a genotype file).
>
> Can someone recommend me an easy and efficient way to transpose such a
> large dataset, hopefully with Perl ?
>
> Thank you very much!
>
> Jie
Maybe you will regard my solution as bizarre, but ...
1) open your file for reading
2) open 1000 files for writing
3) read a row of your file
4) parse the columns
5) write the 1st column to the 1st file, the 2nd column to the 2nd file etc.,
but without the end-of-line "\n"
6) goto 3 until end of file
7) close the file you read
8) write "\n" to each of the 1000 files and close them all
9) simply join all the files into one using a system command
(cat 0.txt 1.txt ... > final.txt, or on Windows: copy 0.txt + 1.txt + ... final.txt)
This solution does not consume much memory and should be relatively quick.
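In Perl, that recipe might look roughly like this (a sketch; it assumes the
system allows about 1000 open handles, and it writes a space after each value
so the columns stay separated):

use strict;
use warnings;

my $ncols = 1000;
open my $in, '<', 'BIG.txt' or die $!;               # 1)

my @fh;
for my $c (0 .. $ncols - 1) {                        # 2)
    open $fh[$c], '>', "$c.txt" or die "$c.txt: $!";
}

while (<$in>) {                                      # 3) and 6)
    my @f = split;                                   # 4)
    print { $fh[$_] } "$f[$_] " for 0 .. $#f;        # 5)
}
close $in;                                           # 7)

for my $c (0 .. $ncols - 1) {                        # 8)
    print { $fh[$c] } "\n";
    close $fh[$c];
}
                                                     # 9)
system('cat ' . join(' ', map { "$_.txt" } 0 .. $ncols - 1) . ' > final.txt');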
BTW: I'm curious about the reactions of this programmer community :-)
--
Petr Vileta, Czech republic
(My server rejects all messages from Yahoo and Hotmail. Send me your mail
from another non-spammer site please.)
Re: how to transpose a huge text file
on 09.08.2007 12:39:20 by bugbear
Jie wrote:
> I have a huge text file with 1000 columns and about 1 million rows,
> and I need to transpose this text file so that row become column and
> column become row. (in case you are curious, this is a genotype file).
>
> Can someone recommend me an easy and efficient way to transpose such a
> large dataset, hopefully with Perl ?
This is very analogous to the problem
of rotating a (much larger than memory) photograph.
BugBear
Re: how to transpose a huge text file
on 09.08.2007 12:53:45 by Ilya Zakharevich
[A complimentary Cc of this posting was sent to
Mumia W. ], who wrote in article <13bl688igm6eb03@corp.supernews.com>:
> On 08/08/2007 09:00 PM, Ilya Zakharevich wrote:
> > [A complimentary Cc of this posting was sent to
> > Jie
> > ], who wrote in article <1186606584.143555.119720@k79g2000hse.googlegroups.com>:
> >> [...]
> >> Can someone recommend me an easy and efficient way to transpose such a
> >> large dataset, hopefully with Perl ?
> >
> > If your CRTL allows opening a 1000 output files, read a line, and
> > append the entries into corresponding files. Then cat the files
> > together.
> >
> > If your CRTL allows opening only 32 output files, you need 3 passes,
> > not 2. [...]
>
> FileCache might also be useful here.
I do not think so. What gave you this idea? You want to open files 1e9
times?
Puzzled,
Ilya
Re: how to transpose a huge text file
on 09.08.2007 18:01:09 by Jim Gibson
In article , Petr Vileta wrote:
> Jie wrote:
> > I have a huge text file with 1000 columns and about 1 million rows,
> > and I need to transpose this text file so that row become column and
> > column become row. (in case you are curious, this is a genotype file).
> >
> > Can someone recommend me an easy and efficient way to transpose such a
> > large dataset, hopefully with Perl ?
> >
> > Thank you very much!
> >
> > Jie
> Maybe you will regard my solution as bizzare, but ...
>
> 1) open your file for read
> 2) open 1000 files for write
> 3) read row of your file
> 4) parse columns
> 5) write 1st column to 1st file, 2nd column to 2nd file etc. but without end
> of line "\n"
> 6) goto 3 until end of file
> 7) close readed file
> 8) write "\n" to each of 1000 files and close all
> 9) simply join all files to one using system command (cp 0.txt + 1.txt ....
> > final.txt)
>
> This solution not consumpt much memory and should be relatively quick.
Open files do consume a bit of memory: file control blocks,
input/output buffers, etc.
>
> BTW: I'm curious to reactions of this programmer community :-)
That is a fine approach, except for the practical matter that most
operating systems will not allow a normal user to have 1000 files open
at one time. The limit is 256 for my current system (Mac OS 10.4):

jgibson 34% ulimit -n
256
On Aug 8, 1:56 pm, Jie wrote:
> I have a huge text file with 1000 columns and about 1 million rows,
> and I need to transpose this text file so that row become column and
> column become row. (in case you are curious, this is a genotype file).
>
> Can someone recommend me an easy and efficient way to transpose such a
> large dataset, hopefully with Perl ?
>
> Thank you very much!
>
> Jie
If you're on UNIX and the columns are fixed length or delimited,
you may want to consider using the cut command inside a loop.
Loop from 1 to 1000 to process each column:
cut -f<N> <input file> | perl -e 'while (<>) {chomp;print;print "|";}print "\n"' >> <output file>
Re: how to transpose a huge text file
on 09.08.2007 18:38:14 by Ilya Zakharevich
[A complimentary Cc of this posting was sent to
Jim Gibson ], who wrote in article <090820070901097436%jgibson@mail.arc.nasa.gov>:
> > This solution not consumpt much memory and should be relatively quick.
>
> Open files do consume a bit of memory: file control blocks,
> input/output buffers, etc.
Peanuts if you do not open a million files.
> > BTW: I'm curious to reactions of this programmer community :-)
>
> That is a fine approach, except for the practical matter that most
> operating systems will not allow a normal user to have 1000 files open
> at one time. The limit is 256 for my current system (Mac OS 10.4):
>
> jgibson 34% ulimit -n
> 256
This is not the limit imposed by your operating system, just the
limit suggested by one of the ancestors of your program. Try raising
ulimit -n; when it fails, that will indicate the limit given by the OS.
Hope this helps,
Ilya
Re: how to transpose a huge text file
on 09.08.2007 21:32:33 by Jie
Hi, Thank you so much for all the responses.
First, here is a sample dataset, but the real one is much bigger, with
1,000 columns instead of 14.
http://www.humanbee.com/BIG.txt
I could think of two ways to transpose this file.
Option 1: write each line out as a column and append, something like below

open IN, "<BIG.txt";
open OUT, ">transposed_file.txt";
while (<IN>) {
    # append ??????????????
}

Option 2: generate a huge 2-dimensional array and write it out the other way

$row = 0;
while (<IN>) {
    $big_ARRAY[$row++] = [ split / / ];
}
But I really doubt that either will work. So, can someone please throw
some ideas and hopefully some code out here?!
Thank you!!
Jie
Re: how to transpose a huge text file
on 09.08.2007 22:28:38 by RedGrittyBrick
bugbear wrote:
> Jie wrote:
>> I have a huge text file with 1000 columns and about 1 million rows,
>> and I need to transpose this text file so that row become column and
>> column become row. (in case you are curious, this is a genotype file).
>>
>> Can someone recommend me an easy and efficient way to transpose such a
>> large dataset, hopefully with Perl ?
>
> This is very analagous to the problems
> of rotating a (much larger than memory) photograph.
>
Are you suggesting
encode as GIF
invoke Image::Magick rotate 90
decode
?
Feasible* but a little weird :-)
--
RGB
* Assuming the number of distinct genotype elements is less than the max
number of color values in a GIF (256). Otherwise try PNG?
Re: how to transpose a huge text file
on 09.08.2007 22:48:00 by Mirco Wahab
Jie wrote:
> But I really doubt that either will work. So, can someone please throw
> some idea and hopefully code here?!
PLEASE give the slightest hint on what
this file *looks like*. If you could
bring up an example of .. say the first
10 rows => the left 20 and the right 20
columns in each, so everybody can guess
what you're talking about.
Regards
M.
Re: how to transpose a huge text file
on 10.08.2007 06:01:57 by Jie
Hi, I already posted a sample dataset here http://www.humanbee.com/BIG.txt
, as I mentioned in the previous message..........
Re: how to transpose a huge text file
on 10.08.2007 17:33:30 by Ted Zlatanov
On Wed, 08 Aug 2007 20:56:24 -0000 Jie wrote:
J> I have a huge text file with 1000 columns and about 1 million rows,
J> and I need to transpose this text file so that row become column and
J> column become row. (in case you are curious, this is a genotype
J> file).
J> Can someone recommend me an easy and efficient way to transpose such
J> a large dataset, hopefully with Perl ?
I think your file-based approach is inefficient. You need to do this
with a database. They are built to handle this kind of data; in fact
your data set is not that big (looks like 10GB at most). Once your data
is in the database, you can generate output any way you like, or do
operations directly on the contents, which may make your job much
easier.
You could try SQLite as a DB engine, but note the end of
http://www.sqlite.org/limits.html which says basically it's not designed
for large data sets. Consider PostgreSQL, for example (there are many
others in the market, free and commercial).
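As a rough illustration of the loading step (DBD::SQLite here only because it
needs no server; the table layout and file names are my own invention, and
batching the commits is what keeps the 10^9 inserts bearable):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=genotype.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });

# One table row per cell of the original matrix.
$dbh->do('CREATE TABLE geno (row INTEGER, col INTEGER, val TEXT)');
my $ins = $dbh->prepare('INSERT INTO geno (row, col, val) VALUES (?, ?, ?)');

open my $in, '<', 'BIG.txt' or die $!;
my $row = 0;
while (<$in>) {
    my @f = split;
    $ins->execute($row, $_, $f[$_]) for 0 .. $#f;
    $dbh->commit unless ++$row % 1000;      # commit in batches
}
$dbh->commit;
$dbh->do('CREATE INDEX geno_col ON geno (col, row)');
$dbh->commit;

# One transposed line is just one column of the original file:
my $sel = $dbh->prepare('SELECT val FROM geno WHERE col = ? ORDER BY row');
$sel->execute(0);                           # column 0, for example
while (my ($val) = $sel->fetchrow_array) { print "$val " }
print "\n";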
To avoid the 1000-open-files solution, you can do the following:
my $size;
my $big = "big.txt";
my $brk = "break.txt";
open F, '<', $big or die "Couldn't read from $big file: $!";
open B, '>', $brk or die "Couldn't write to $brk file: $!";
while (<F>)
{
chomp;
my @data = split ' '; # you may want to ensure the size of @data is the same every time
$size = scalar @data; # but $size will have the LAST size of @data
print B join("\n", @data), "\n";
}
close F;
close B;
Basically, this converts an MxN matrix into an (MN)x1 one.
Let's assume you will just have 1000 columns for this example.
Now you can write each transposed output line by looking in break.txt,
reading every line, chomp()ing it, and appending it to your current output
line if its line number is divisible by 1000 (so lines 0, 1000, 2000, etc.
will match). Write "\n" to end the current output line.
Now you can do the next output line, which requires lines 1, 1001, 2001,
etc. You can reopen the break.txt file or just seek to the beginning.
I am not writing out the whole thing because it's tedious and I think
you should consider a database instead. It could be optimized, but
you're basically putting lipstick on a pig when you spend your time
optimizing the wrong solution for your needs.
Ted
Re: how to transpose a huge text file
on 10.08.2007 17:52:09 by anno4000
Ted Zlatanov wrote in comp.lang.perl.misc:
> On Wed, 08 Aug 2007 20:56:24 -0000 Jie wrote:
>
> J> I have a huge text file with 1000 columns and about 1 million rows,
> J> and I need to transpose this text file so that row become column and
> J> column become row. (in case you are curious, this is a genotype
> J> file).
[Nice single-file solution snipped]
> $size = scalar @data; # but $size will have the LAST size of @data
Useless use of scalar() here.
Anno
Re: how to transpose a huge text file
on 10.08.2007 18:39:28 by Ted Zlatanov
On 10 Aug 2007 15:52:09 GMT anno4000@radom.zrz.tu-berlin.de wrote:
a> Ted Zlatanov wrote in comp.lang.perl.misc:
>> $size = scalar @data; # but $size will have the LAST size of @data
a> Useless use of scalar() here.
I like to make scalar context explicit. IMHO it makes code more
legible. It's my style, and AFAIK it doesn't cause problems (I also
like to say "shift @_" and "shift @ARGV" to be explicit as to what I'm
shifting).
I'll be the first to admit my style is peculiar, e.g. single-space
indents, but at least I'm consistent :)
Ted
Re: how to transpose a huge text file
on 10.08.2007 18:40:41 by xhoster
Jie wrote:
> But I really doubt that either will work. So, can someone please throw
> some idea and hopefully code here?!
Hi Jie,
We've already thrown out several ideas. Some take a lot of memory, some
take a lot of file-handles, some need to re-read the file once for each
column.
You haven't really commented on the suitability of any of these methods,
and the new information you provided is very minimal. So I wouldn't expect
to get many more ideas just by asking the same question again!
What did you think of the ideas we already gave you? Do they exceed your
system's memory? Do they exceed your file-handle limit? Do they just take
too long?
Xho
Re: how to transpose a huge text file
on 10.08.2007 22:52:50 by xhoster
Jie wrote:
> I have a huge text file with 1000 columns and about 1 million rows,
> and I need to transpose this text file so that row become column and
> column become row. (in case you are curious, this is a genotype file).
>
> Can someone recommend me an easy and efficient way to transpose such a
> large dataset, hopefully with Perl ?
>
> Thank you very much!
Out of curiosity, I made this fairly general purpose program for
transposing files (as long as they are tab separated, and rectangular). It
will revert to using multiple passes if it can't open enough temp files to
do it all in one pass. There are more efficient ways of doing it in that
case, but they are more complicated and I'm lazy. On my computer, it seems
to be even faster than reading all the data into memory and building
in-memory strings, and of course it uses a lot less memory.
It doesn't work to seek() on ARGV, so I had to open the (single) input
file explicitly instead. Writes to STDOUT.
use strict;
use warnings;
open my $in, $ARGV[0] or die "$ARGV[0] $!";
my @cols=split /\t/, scalar <$in>;
my $cols=@cols;
my $i=0; ## the first unprocessed column
while ($i<$cols) {
my @fh;
my $j;
## open as many files as the fd limit will let us
foreach ($j=$i;$j<$cols; $j++) {
open $fh[@fh], "+>",undef or do {
die "$j $!" unless $!{EMFILE};
pop @fh;
last;
};
};
$j--;
##warn "working on columns $i..$j";
seek $in,0,0 or die $!;
while (<$in>) { chomp;
my $x=0;
print {$fh[$x++]} "\t$_" or die $! foreach (split/\t/)[$i..$j];
}
foreach my $x (@fh) {
seek $x,0,0 or die $!;
$_=<$x>;
s/^\t//; # chop the unneeded leading tab
print "$_\n"
}
$i=$j+1;
}
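Usage would be something like: perl transpose.pl input.tab > output.tab
(the file names are just placeholders; the input must be tab-separated and
rectangular).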
Xho
Re: how to transpose a huge text file
on 10.08.2007 23:11:57 by Ilya Zakharevich
[A complimentary Cc of this posting was sent to
Ted Zlatanov ], who wrote in article :
> To avoid the 1000-open-files solution, you can do the following:
> while ()
> {
> chomp;
> my @data = split ' '; # you may want to ensure the size of @data is the same every time
> $size = scalar @data; # but $size will have the LAST size of @data
> print B join("\n", @data), "\n";
> }
This is a NULL operation. You just converted " " to "\n".
Essentially, nothing changed. [And $size is not used.]
> Now you can write each inverted output line by looking in break.txt,
> reading every line, chomp() it, and append it to your current output
> line if it's divisible by 1000 (so 0, 1000, 2000, etc. will match).
> Write "\n" to end the current output line.
Good. So what you suggest is 1000 passes over a 4GB file. Good luck!
Hope this helps,
Ilya
Re: how to transpose a huge text file
on 11.08.2007 00:03:41 by Ted Zlatanov
On Fri, 10 Aug 2007 21:11:57 +0000 (UTC) Ilya Zakharevich wrote:
IZ> [A complimentary Cc of this posting was sent to
IZ> Ted Zlatanov
IZ> ], who wrote in article :
>> To avoid the 1000-open-files solution, you can do the following:
>> while ()
>> {
>> chomp;
>> my @data = split ' '; # you may want to ensure the size of @data is the same every time
>> $size = scalar @data; # but $size will have the LAST size of @data
>> print B join("\n", @data), "\n";
>> }
IZ> This is a NULL operation. You just converted " " to "\n".
IZ> Essentially, nothing changed.
I disagree, but it's somewhat irrelevant, see at end...
IZ> [And $size is not used.]
It's necessary later when you are jumping $size lines forward (I used
1000 in the example later). It's also handy for checking whether @data has
the right size compared to the last line. Sorry I didn't mention that.
>> Now you can write each inverted output line by looking in break.txt,
>> reading every line, chomp() it, and append it to your current output
>> line if it's divisible by 1000 (so 0, 1000, 2000, etc. will match).
>> Write "\n" to end the current output line.
IZ> Good. So what you suggest, is 1000 passes over a 4GB file. Good luck!
I suggested a database, actually. I specifically said I don't recommend
doing this with file operations. I agree my file-based approach isn't
better than what you and others have suggested, but it does avoid the
multiple open files, and it has low memory usage.
Ted
Re: how to transpose a huge text file
on 11.08.2007 00:28:34 by Ilya Zakharevich
[A complimentary Cc of this posting was sent to
Ted Zlatanov ], who wrote in article :
> >> Now you can write each inverted output line by looking in break.txt,
> >> reading every line, chomp() it, and append it to your current output
> >> line if it's divisible by 1000 (so 0, 1000, 2000, etc. will match).
> >> Write "\n" to end the current output line.
>
> IZ> Good. So what you suggest, is 1000 passes over a 4GB file. Good luck!
>
> I suggested a database, actually.
And why do you think this would decrease the load on head seeks?
Either the data fits in memory (then database is not needed), or it is
read from disk (which would, IMO, imply the same amount of seeks with
database as with any other file-based operation).
One needs not a database, but a program with built-in caching
optimized for non-random access to 2-dimensional arrays. AFAIK,
imagemagick is mostly memory-based. On the other side of spectrum,
GIMP is based on tile-caching algorithms; if there were a way to
easily hook into this algorithm (with no screen display involved), one
could handle much larger datasets.
Yet another way might be compression; suppose that there are only
(e.g.) 130 "types" of entries; then one can compress the matrix into
1GB of data, which should be handled easily by almost any computer.
Hope this helps,
Ilya
Re: how to transpose a huge text file
on 11.08.2007 02:31:01 by Ted Zlatanov
On Fri, 10 Aug 2007 22:28:34 +0000 (UTC) Ilya Zakharevich wrote:
IZ> [A complimentary Cc of this posting was sent to
IZ> Ted Zlatanov
IZ> ], who wrote in article :
>> >> Now you can write each inverted output line by looking in break.txt,
>> >> reading every line, chomp() it, and append it to your current output
>> >> line if it's divisible by 1000 (so 0, 1000, 2000, etc. will match).
>> >> Write "\n" to end the current output line.
>>
IZ> Good. So what you suggest, is 1000 passes over a 4GB file. Good luck!
>>
>> I suggested a database, actually.
IZ> And why do you think this would decrease the load on head seeks?
IZ> Either the data fits in memory (then database is not needed), or it is
IZ> read from disk (which would, IMO, imply the same amount of seeks with
IZ> database as with any other file-based operation).
Look, databases are optimized to store large amounts of data
efficiently. You can always create a hand-tuned program that will do
one task (e.g. transposing a huge text file) well, but you're missing
the big picture: future uses of the data. I really doubt the only thing
anyone will ever want with that data is to transpose it.
IZ> One needs not a database, but a program with build-in caching
IZ> optimized for non-random access to 2-dimensional arrays. AFAIK,
IZ> imagemagick is mostly memory-based. On the other side of spectrum,
IZ> GIMP is based on tile-caching algorithms; if there were a way to
IZ> easily hook into this algorithm (with no screen display involved), one
IZ> could handle much larger datasets.
You and everyone else are overcomplicating this.
Rewrite the original input file for fixed-length records. Then you just
need to seek to a particular offset to read a record, and the problem
becomes transposing a matrix piece by piece. This is fairly simple.
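A sketch of that piece-by-piece reading (the field width, column and row
counts below are assumptions about the rewritten fixed-length file; done
this naively it costs one seek per cell):

use strict;
use warnings;

my ($fldlen, $ncols, $nrows) = (2, 1000, 1_000_000);   # assumed layout
my $reclen = $fldlen * $ncols + 1;                      # +1 for the "\n"

open my $in, '<', 'fixed.dat' or die $!;
for my $c (0 .. $ncols - 1) {
    my @field;
    for my $r (0 .. $nrows - 1) {
        seek $in, $r * $reclen + $c * $fldlen, 0 or die "seek: $!";
        read $in, my $buf, $fldlen or die "read: $!";
        push @field, $buf;
    }
    print join(' ', @field), "\n";      # one transposed row per input column
}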
IZ> Yet another way might be compression; suppose that there are only
IZ> (e.g.) 130 "types" of entries; then one can compress the matrix into
IZ> 1GB of data, which should be handled easily by almost any computer.
You need 5 bits per item: it has 16 possible values ([ACTG]{2}), plus
"--".
A database table, to come back to my point, would store these items as
enums. Then you, the user, don't have to worry about the bits per item
in the storage, and you can just use the database.
Ted
Re: how to transpose a huge text file
on 11.08.2007 03:17:08 by xhoster
Ted Zlatanov wrote:
> On Fri, 10 Aug 2007 22:28:34 +0000 (UTC) Ilya Zakharevich
> wrote:
>
> IZ> [A complimentary Cc of this posting was sent to
> IZ> Ted Zlatanov
> IZ> ], who wrote in article
> :
> >> >> Now you can write each inverted output line by looking in
> >> >> break.txt, reading every line, chomp() it, and append it to your
> >> >> current output line if it's divisible by 1000 (so 0, 1000, 2000,
> >> >> etc. will match). Write "\n" to end the current output line.
> >>
> IZ> Good. So what you suggest, is 1000 passes over a 4GB file. Good
> luck!
> >>
> >> I suggested a database, actually.
>
> IZ> And why do you think this would decrease the load on head seeks?
> IZ> Either the data fits in memory (then database is not needed), or it is
> IZ> read from disk (which would, IMO, imply the same amount of seeks with
> IZ> database as with any other file-based operation).
>
> Look, databases are optimized to store large amounts of data
> efficiently.
For some not very general meanings of "efficiently", sure. They generally
expand the data quite a bit upon storage; they aren't very good at straight
retrieval unless you have just the right index structures in place and your
queries have a high selectivity; most of them put a huge amount of effort
into transactionality and concurrency which may not be needed here but
imposes a high overhead whether you use it or not. One of the major
gene-chip companies was very proud that in one of their upgrades, they
started using a database instead of plain files for storing the data. And
then their customers were very pleased when in a following upgrade they
abandoned that, and went back to using plain files for the bulk data and
using the database just for the small DoE metadata.
> You can always create a hand-tuned program that will do
> one task (e.g. transposing a huge text file) well, but you're missing
> the big picture: future uses of the data. I really doubt the only thing
> anyone will ever want with that data is to transpose it.
And I really doubt that any single database design is going to support
everything that anyone may ever want to do with the data, either.
>
> IZ> One needs not a database, but a program with build-in caching
> IZ> optimized for non-random access to 2-dimensional arrays. AFAIK,
> IZ> imagemagick is mostly memory-based. On the other side of spectrum,
> IZ> GIMP is based on tile-caching algorithms; if there were a way to
> IZ> easily hook into this algorithm (with no screen display involved), one
> IZ> could handle much larger datasets.
>
> You and everyone else are overcomplicating this.
>
> Rewrite the original input file for fixed-length records.
Actually, that is just what I initially recommended.
> Then you just
> need to seek to a particular offset to read a record, and the problem
> becomes transposing a matrix piece by piece. This is fairly simple.
I think you are missing the big picture. Once you make a seekable file
format, that probably does away with the need to transpose the data in the
first place--whatever operation you wanted to do with the transposition can
probably be done on the seekable file instead.
Xho
Re: how to transpose a huge text file
on 11.08.2007 05:51:27 by Ilya Zakharevich
[A complimentary Cc of this posting was sent to
Ted Zlatanov ], who wrote in article :
> On Fri, 10 Aug 2007 22:28:34 +0000 (UTC) Ilya Zakharevich wrote:
> IZ> And why do you think this would decrease the load on head seeks?
> IZ> Either the data fits in memory (then database is not needed), or it is
> IZ> read from disk (which would, IMO, imply the same amount of seeks with
> IZ> database as with any other file-based operation).
> Look, databases are optimized to store large amounts of data
> efficiently.
Words words words. You can't do *all the things* efficiently.
Databases are optimized for some particular access patterns. I doubt
that even "good databases" are optimized for *this particular* access
pattern. And, AFAIK, MySQL is famous for its lousiness...
> You can always create a hand-tuned program that will do
> one task (e.g. transposing a huge text file) well, but you're missing
> the big picture: future uses of the data. I really doubt the only thing
> anyone will ever want with that data is to transpose it.
If the transposed form is well tuned for further manipulation (of
which we were not informed), then databasing looks like overkill.
If not, then indeed it may be worth it.
> You and everyone else are overcomplicating this.
>
> Rewrite the original input file for fixed-length records. Then you just
> need to seek to a particular offset to read a record, and the problem
> becomes transposing a matrix piece by piece. This is fairly simple.
Sure. Do 1e9 seeks in your spare time...
> IZ> Yet another way might be compression; suppose that there are only
> IZ> (e.g.) 130 "types" of entries; then one can compress the matrix into
> IZ> 1GB of data, which should be handled easily by almost any computer.
>
> You need 5 bits per item: it has 16 possible values ([ACTG]{2}), plus
> "--".
> A database table, to come back to my point, would store these items as
> enums. Then you, the user, don't have to worry about the bits per item
> in the storage, and you can just use the database.
Of course one does not care about anything - IF the solution using the
database is going to give an answer during the following month. Which
I doubt...
Hope this helps,
Ilya
Re: how to transpose a huge text file
on 11.08.2007 17:08:15 by Jie
I really appreciate all the responses here!!
I did not jump in because I want to test each proposed solution first
before providing any feedback.
I will definitely try to understand and test all the solutions and let
everybody know what I got.
Thank you guys so much!
Jie
Re: how to transpose a huge text file
on 11.08.2007 17:41:48 by Ted Zlatanov
On Sat, 11 Aug 2007 03:51:27 +0000 (UTC) Ilya Zakharevich wrote:
IZ> Words words words. You can't do *all the things* efficiently.
IZ> Databases are optimized for some particular access patterns. I doubt
IZ> that even "good databases" are optimized for *this particular* access
IZ> pattern.
You mean "SELECT column FROM table" (which is what you need in order to
write each line of the transposed file)? I think that's a pretty common
access pattern, and would perform well in most databases.
Ted
Re: how to transpose a huge text file
on 11.08.2007 18:00:06 by Ted Zlatanov
On 11 Aug 2007 01:17:08 GMT xhoster@gmail.com wrote:
x> Ted Zlatanov wrote:
>> Look, databases are optimized to store large amounts of data
>> efficiently.
x> For some not very general meanings of "efficiently", sure.
Actually, it's the general meaning that I had in mind. For specific
efficiency and optimization, databases are not always the right tool.
For instance, a database will never perform as quickly as a fixed-offset
file for individual seeks to retrieve a record. Based on the OP's data,
I think a database is the right solution.
x> One of the major gene-chip companies was very proud that in one of
x> their upgrades, they started using a database instead of plain files
x> for storing the data. And then their customers were very pleased
x> when in a following upgrade they abandoned that, and went back to
x> using plain files for the bulk data and using the database just for
x> the small DoE metadata.
I can give you many examples where databases improved a business; I've
worked with large sets of data to do data analysis and storing it in a
database was ridiculously better than the equivalent work on a flat-file
database. For the OP's data set, I think database storage is a
reasonable, cost-efficient, and maintainable solution.
>> You can always create a hand-tuned program that will do
>> one task (e.g. transposing a huge text file) well, but you're missing
>> the big picture: future uses of the data. I really doubt the only thing
>> anyone will ever want with that data is to transpose it.
x> And I really doubt that any single database design is going to support
x> everything that anyone may ever want to do with the data, either.
My point is that the single-task program to transpose a file will be
much less useful than the database setup for general use. You're taking
the argument to an absurd extreme ("everything that anyone may ever
want").
x> I think you are missing the big picture. Once you make a seekable
x> file format, that probably does away with the need to transpose the
x> data in the first place--whatever operation you wanted to do with the
x> transposition can be probably be done on the seekable file instead.
Usually the rewrite is for another program that wants that data as
input, so no, you can't do "whatever" operation on the seekable file
without rewriting the consumer program's input routine.
Ted
Re: how to transpose a huge text file
on 12.08.2007 02:45:03 by xhoster
Ted Zlatanov wrote:
> On 11 Aug 2007 01:17:08 GMT xhoster@gmail.com wrote:
>
>
> x> One of the major gene-chip companies was very proud that in one of
> x> their upgrades, they started using a database instead of plain files
> x> for storing the data. And then their customers were very pleased
> x> when in a following upgrade they abandoned that, and went back to
> x> using plain files for the bulk data and using the database just for
> x> the small DoE metadata.
>
> I can give you many examples where databases improved a business;
Well I should hope you could, and so can I. I just don't think from what
we have seen that this is particularly likely to be one of those cases.
>
> x> And I really doubt that any single database design is going to support
> x> everything that anyone may ever want to do with the data, either.
>
> My point is that the single-task program to transpose a file will be
> much less useful than the database setup for general use. You're taking
> the argument to an absurd extreme ("everything that anyone may ever
> want").
While it may or may not be useful for purposes invisible to us, I don't
think it will be very useful for the one desired task which we *do* know
about.
> x> I think you are missing the big picture. Once you make a seekable
> x> file format, that probably does away with the need to transpose the
> x> data in the first place--whatever operation you wanted to do with the
> x> transposition can be probably be done on the seekable file instead.
>
> Usually the rewrite is for another program that wants that data as
> input, so no, you can't do "whatever" operation on the seekable file
> without rewriting the consumer program's input routine.
If the program you are trying to use only accepts one input format and you
are unwilling to change it, then what is putting the data it needs into a
database rather than into the necessary format going to get you? If I have
to make gigantic transposed file, I'd far rather start with the given text
file than with a database containing that data.
Xho
Re: how to transpose a huge text file
on 12.08.2007 08:19:37 by Ilya Zakharevich
[A complimentary Cc of this posting was sent to
Ted Zlatanov ], who wrote in article :
> On Sat, 11 Aug 2007 03:51:27 +0000 (UTC) Ilya Zakharevich wrote:
>
> IZ> Words words words. You can't do *all the things* efficiently.
> IZ> Databases are optimized for some particular access patterns. I doubt
> IZ> that even "good databases" are optimized for *this particular* access
> IZ> pattern.
>
> You mean "SELECT column FROM table" (which is what you need in order to
> write each line of the transposed file)? I think that's a pretty common
> access pattern, and would perform well in most databases.
I have very little experience with databases. However, my hunch
(based on my experience with other types of software) is that they are
not as optimized as an optimistic point of view may imply.
Doing one "SELECT column FROM table" (which results in a megaarray)
may not be *that* slow - given quick enough hardware. Doing it 1000
times would unravel all the missing optimizations in the
implementation of the database.
Hope this helps,
Ilya
Re: how to transpose a huge text file
on 12.08.2007 14:29:22 by Ted Zlatanov
On Sun, 12 Aug 2007 06:19:37 +0000 (UTC) Ilya Zakharevich wrote:
IZ> [A complimentary Cc of this posting was sent to
IZ> Ted Zlatanov
IZ> ], who wrote in article :
>> You mean "SELECT column FROM table" (which is what you need in order to
>> write each line of the transposed file)? I think that's a pretty common
>> access pattern, and would perform well in most databases.
IZ> I have very little experience with databases. However, my hunch
IZ> (based on my experience with other types of software) is that they are
IZ> not as optimized as an optimistic point of view may imply.
I respect your experience and skills you have demonstrated repeatedly in
this newsgroup and in Perl's source code base. I think you may want to
investigate database technology further; software like PostgreSQL and
SQLite has gathered significant traction and respect, and is adaptable
to many technical tasks.
IZ> Doing one "SELECT column FROM table" (which results in a megaarray)
IZ> may be not *that* slow - given quick enough hardware. Doing it 1000
IZ> times would unravel all the missing optimizations in the
IZ> implementation of the database.
You don't have to get all the results of a SELECT at once. You can
fetch each row of the result set, which does not generate a large array.
I don't know the internal mechanics of result set management in the
various databases on the market, nor do I want to; from practical
experience with large result sets under Oracle I can tell you that
performance is quite good there. Do you want benchmarks to be convinced?
Ted
Re: how to transpose a huge text file
on 13.08.2007 01:39:00 by Ilya Zakharevich
[A complimentary Cc of this posting was sent to
Ted Zlatanov ], who wrote in article :
> On Sun, 12 Aug 2007 06:19:37 +0000 (UTC) Ilya Zakharevich wrote:
> IZ> Doing one "SELECT column FROM table" (which results in a megaarray)
> IZ> may be not *that* slow - given quick enough hardware. Doing it 1000
> IZ> times would unravel all the missing optimizations in the
> IZ> implementation of the database.
>
> You don't have to get all the results of a SELECT at once. You can
> fetch each row of the result set, which does not generate a large array.
Fetching by rows is trivial (with the data at hand). It is fetching by
columns which leads to a very hard-to-optimize signature of disk access.
> I don't know the internal mechanics of result set management in the
> various databases on the market, nor do I want to; from practical
> experience with large result sets under Oracle I can tell you that
> performance is quite good there. Do you want benchmarks to be convinced?
Sure, I would be very much interested.
E.g., on a machine with 512MB of memory, create a 1e3 x 1e6 array with 5-bit
entries. Then access all its columns one-by-one, then all its rows
one-by-one. This would mimic the problem at hand.
Thanks,
Ilya
Re: how to transpose a huge text file
on 13.08.2007 14:51:28 by bugbear
RedGrittyBrick wrote:
> bugbear wrote:
>> Jie wrote:
>>> I have a huge text file with 1000 columns and about 1 million rows,
>>> and I need to transpose this text file so that row become column and
>>> column become row. (in case you are curious, this is a genotype file).
>>>
>>> Can someone recommend me an easy and efficient way to transpose such a
>>> large dataset, hopefully with Perl ?
>>
>> This is very analagous to the problems
>> of rotating a (much larger than memory) photograph.
>>
>
> Are you suggesting
> encode as GIF
> invoke Image::Magik rotate 90
> decode
> ?
>
> Feasible* but a little weird :-)
>
No, I'm suggesting some of the approaches
and concepts used when rotating images
might be applicable.
BugBear
Re: how to transpose a huge text file
on 13.08.2007 21:12:01 by Ilya Zakharevich
[A complimentary Cc of this posting was sent to
Jie ], who wrote in article <1186844895.953259.318160@q4g2000prc.googlegroups.com>:
>
> I really appreciate all the reponses here!!
>
> I did not jump in because I want to test each proposed solution first
> before providing any feedback.
*If* you have 512 MB of memory, and *if* each of your 1e6 x 1e3
entries is compressible to 1 byte, then the following solution should
be the optimal of what is proposed:
a) Preallocate an array of 334 strings of length 1e6 bytes;
b1) Assign '' to each string;
c1) Read file line-by-line, for each line: split; ignore all results
but the first 333; compress each of 333 results to 1 byte; append
each byte to the corresponding string;
d1) When read: for each of the corresponding 333 strings: uncompress
each byte, print space-separated.
b2,c2,d2): Likewise with results 334..666;
b3,c3,d3): Likewise with results 667..1000.
This is 3 passes through the original file. If you skip compression,
you can do the same with 10 passes - which should also not be slow with
contemporary hard drives...
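A sketch of that compressed 3-pass variant (untested; it assumes every row
has all 1000 whitespace-separated fields and that there are at most 256
distinct entry strings, which is what makes the 1-byte packing work):

use strict;
use warnings;

my $in_name = 'BIG.txt';
my $ncols   = 1000;
my $chunk   = 334;                  # columns handled per pass
my (%code, @token);                 # entry string <-> 1-byte code

open my $out, '>', 'transposed.txt' or die $!;

for (my $first = 0; $first < $ncols; $first += $chunk) {
    my $last = $first + $chunk - 1;
    $last = $ncols - 1 if $last > $ncols - 1;
    my @col = ('') x ($last - $first + 1);          # a), b) preallocate/reset

    open my $in, '<', $in_name or die $!;           # c) one full pass
    while (<$in>) {
        my @f = (split)[$first .. $last];
        for my $i (0 .. $#f) {
            my $c = $code{ $f[$i] };
            unless (defined $c) {
                die "more than 256 distinct entries\n" if @token == 256;
                $c = $code{ $f[$i] } = scalar @token;
                push @token, $f[$i];
            }
            $col[$i] .= chr $c;                     # compress to 1 byte
        }
    }
    close $in;

    for my $s (@col) {                              # d) uncompress and print
        print $out join(' ', map { $token[ord $_] } split //, $s), "\n";
    }
}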