Out of memory in vec

Out of memory in vec

am 10.08.2007 06:02:36 von paduille.4061.mumia.w+nospam

I have 512mb RAM and 800mb swap, Perl 5.8.4 (and 5.9.4) under i386 Linux.

I decided to try out Jie Huang's idea of transposing a large array of
"bioinfomatics" data. Since I don't know what bioinfomatics data is, I
just assumed that they were sequences of the amino acids A, C, G, and T.

And since Jie never gave a sample of his/her data, I created a program
to do so:

#!/usr/bin/perl
use strict;
use warnings;

my @bases = qw/A C T G/;
my $rows = shift() || 5;
my $cols = shift() || 10;

for my $n (1 .. ($rows * $cols)) {
my $base = $bases[rand @bases];
print $base;
print "\n" if (0 == ($n % $cols));
}
print "\n";

__END__

I then went to writing the transposition program. A gigabyte (1e6 rows ×
1000 columns) is a lot of memory, so I wanted to avoid forcing each item
to take an entire byte, and since A, C, T and G are only four distinct
values, an entire byte is not needed to encode them, so I opted to use
four bits for each item and the vec() function:

#!/usr/bin/perl
use strict;
use warnings;
use Fatal qw/open close/;
use constant WIDTH => 4;

my $infile = shift() || 'data/bases-small';
my $outfile = 'out';
my %bases = (A => 0, C => 1, G => 2, T => 3);
my %rbases = reverse %bases;

my $buffer = '';
open my $ifh, '<', $infile;
open my $ofh, '>', $outfile;

my $offset = 0;
my $maxset = 0;
my $reclen;
my $maxrow = -1;

while (<$ifh>) {
$reclen = length($_)-1 unless defined $reclen;
while (/([ACTG])/g) {
vec($buffer, $offset++, WIDTH) = $bases{$1};
}
$maxrow++;
}

$maxset = $offset;
system(ps => 'up', $$);

for my $col (0 .. $reclen-1) {
for my $row (0 .. $maxrow-1) {
my $base = vec($buffer, $col + ($row * $reclen), WIDTH);
print $ofh $rbases{$base};
}
print $ofh "\n";
}

print $ofh "\n";


close $ofh;
close $ifh;

__END__

This program works with a small file of 10MB, but it falls over with
"Out of memory!" for the one gigabyte file. I shouldn't be all too
surprised if vec() a gigabyte of data, but is there some documented
limit on vec?

I absolutely never had a chance to run out of swap while running this
program. It ran for about 20 minutes, consumed over 230MB of total
memory, then aborted.

Re: Out of memory in vec

am 10.08.2007 06:44:14 von paduille.4061.mumia.w+nospam

On 08/09/2007 11:02 PM, Mumia W. wrote:
> [...] I shouldn't be all too
> surprised if vec() [can't handle] a gigabyte of data, but is there some documented
> limit on vec?
>

That's what I meant to say.

Re: Out of memory in vec

am 10.08.2007 09:35:48 von anno4000

Mumia W. wrote in comp.lang.perl.misc:

[...]

> I then went to writing the transposition program. A gigabyte (1e6 rows ×
> 1000 columns) is a lot of memory, so I wanted to avoid forcing each item
> to take an entire byte, and since A, C, T and G are only four distinct
> values, an entire byte is not needed to encode them, so I opted to use
> four bits for each item and the vec() function:

Why four? You only need two bits to encode four bases.

[...]

> This program works with a small file of 10MB, but it falls over with
> "Out of memory!" for the one gigabyte file. I shouldn't be all too
> surprised if vec() a gigabyte of data, but is there some documented
> limit on vec?

No. Like any Perl string a vec can be as big as your memory allows.

Anno

Re: Out of memory in vec

am 10.08.2007 18:26:17 von Ted Zlatanov

On 10 Aug 2007 07:35:48 GMT anno4000@radom.zrz.tu-berlin.de wrote:

a> Mumia W. wrote in comp.lang.perl.misc:
a> [...]

>> I then went to writing the transposition program. A gigabyte (1e6 rows ×
>> 1000 columns) is a lot of memory, so I wanted to avoid forcing each item
>> to take an entire byte, and since A, C, T and G are only four distinct
>> values, an entire byte is not needed to encode them, so I opted to use
>> four bits for each item and the vec() function:

a> Why four? You only need two bits to encode four bases.

"--" was also allowed as data besides a pair of letters, so you have

[ACGT][ACGT] = 4 bits (which is what Mumia means by "item", I think)
plus "--" as a value = 5 bits

Ted

Re: Out of memory in vec

am 10.08.2007 20:22:09 von paduille.4061.mumia.w+nospam

On 08/10/2007 11:26 AM, Ted Zlatanov wrote:
> On 10 Aug 2007 07:35:48 GMT anno4000@radom.zrz.tu-berlin.de wrote:
>
> a> Mumia W. wrote in comp.lang.perl.misc:
> a> [...]
>
>>> I then went to writing the transposition program. A gigabyte (1e6 rows ×
>>> 1000 columns) is a lot of memory, so I wanted to avoid forcing each item
>>> to take an entire byte, and since A, C, T and G are only four distinct
>>> values, an entire byte is not needed to encode them, so I opted to use
>>> four bits for each item and the vec() function:
>
> a> Why four? You only need two bits to encode four bases.
>
> "--" was also allowed as data besides a pair of letters, so you have
>
> [ACGT][ACGT] = 4 bits (which is what Mumia means by "item", I think)
> plus "--" as a value = 5 bits
>
> Ted

I started programming before Jie submitted any sample data, so I thought
of [ACGT] as an item. I would've liked to use three bits, but vec() does
not seem to allow that.

This program
----------------------
my $buffer = '';
vec($buffer, 0, 3) = 7;
----------------------
produces "Illegal number of bits in vec at ...", so I'm stuck with using
four bits.

Now that Jie has give us some data, I can find a way to get [ACGT][ACGT]
into a single four-bit group.

Re: Out of memory in vec

am 11.08.2007 05:47:09 von paduille.4061.mumia.w+nospam

On 08/10/2007 01:22 PM, Mumia W. wrote:
> On 08/10/2007 11:26 AM, Ted Zlatanov wrote:
>> On 10 Aug 2007 07:35:48 GMT anno4000@radom.zrz.tu-berlin.de wrote:
>>
>> a> Why four? You only need two bits to encode four bases.
>>
>> "--" was also allowed as data besides a pair of letters, so you have
>>
>> [ACGT][ACGT] = 4 bits (which is what Mumia means by "item", I think)
>> plus "--" as a value = 5 bits
>>
>> Ted
> [...]
>
> Now that Jie has give us some data, I can find a way to get [ACGT][ACGT]
> into a single four-bit group.

Duh. Sorry Anno. Sorry Ted. Of course I only need two bits for [ACGT]

I guess it's obvious that I haven't needed to count in binary for a while
:-(

But the "_" characters do throw a monkey wrench into my plans.

I wrote another version of the program that used bytes. It took over 80
minutes to complete and consumed 930MB of memory--mostly swap.

However, I'll still try to get a vec() version working because keeping
the data out of the swap space will probably speed things up by 100 times.

One half of 930MB is 465MB. If I exit Gnome and don't use my PC for
browsing the web while the program is running, I'll probably have enough
RAM for the program to run without using any swap. But I've got to get
vec() working.

PS.
Thanks again Anno. I changed the program to use two bits, and it worked.

Re: Out of memory in vec

am 11.08.2007 08:12:41 von paduille.4061.mumia.w+nospam

On 08/10/2007 11:26 AM, Ted Zlatanov wrote:
>
> "--" was also allowed as data besides a pair of letters, so you have
>
> [ACGT][ACGT] = 4 bits (which is what Mumia means by "item", I think)
> plus "--" as a value = 5 bits
>
> Ted

Alright, I admit I cheated :-)

Five bits were needed, so I used five bits. No, I didn't hack the vec()
function in the Perl core. I just used two buffers:

#!/usr/local/bin/perl5.9.4
use strict;
use warnings;
use Fatal qw/open close/;
use Data::Dumper;

my $infile = shift() || 'BIG.txt';
my $outfile = 'out';

# A C G T _
my %bases = (
0,'AA',1,'AC',2,'AG',3,'AT',4,'A-',5,'CA',6,'CC',
7,'CG',8,'CT',9,'C-',10,'GA',11,'GC',12,'GG',13,'GT',
14,'G-',15,'TA',16,'TC',17,'TG',18,'TT',19,'T-',
20,'-A',21,'-C',22,'-G',23,'-T',24,'--',
);
my %rbases = reverse %bases;

my $buffer1 = '';
my $buffer2 = '';
my $maxcols = 0;
my $maxrows = 0;
my $pos = 0;

my (%inp, %out);
open $inp{hand}, '<', $infile;
open $out{hand}, '>', $outfile;

scalar readline $inp{hand};
my $startpos = tell($inp{hand});

while (readline $inp{hand}) {
my @f = split; shift @f;

bpstore($_, $pos++) for @f;
# print "@f\n";

$maxcols = @f if @f > $maxcols;
$maxrows++;
# last if $inp{lines}++ > 4;
}

select $out{hand};

for my $col (0 .. $maxcols-1) {
for my $row (0 .. $maxrows-1) {
# print "($row, $col) ";
my $npos = $col + ($row * $maxcols);
my $bp = $bases{bpfetch($npos)};
print "$bp ";
}
print "\n";
}

close $inp{hand};
close $out{hand};

system(ps => 'up', $$);
# system(cat => $outfile);

##############################

sub bpstore {
my ($bp, $pos) = @_;
$bp = $rbases{$bp};
vec($buffer1, $pos, 4) = 0xF & $bp;;
vec($buffer2, $pos, 1) = 1 & ($bp >> 4);
}

sub bpfetch {
my $pos = shift;
my $bp = vec($buffer1, $pos, 4);
$bp = $bp | (vec($buffer2, $pos, 1) << 4);
}

__END__

This works on Jie's submitted data, and it takes less than two minutes
to run and requires only 10MB of memory. I'm happy about that, but I
suspect it may encounter problems when dealing with over a gigabyte of
input data.

Re: Out of memory in vec

am 11.08.2007 18:09:50 von Ted Zlatanov

On Sat, 11 Aug 2007 01:12:41 -0500 "Mumia W." wrote:

MW> Alright, I admit I cheated :-)

MW> Five bits were needed, so I used five bits. No, I didn't hack the
MW> vec() function in the Perl core. I just used two buffers:

....

MW> This works on Jie's submitted data, and it takes less than two minutes
MW> to run and requires only 10MB of memory. I'm happy about that, but I
MW> suspect it may encounter problems when dealing with over a gigabyte of
MW> input data.

Cool solution. I thought about using a separate list to store the "--"
instances, so the storage would be 4 bits per pair plus a sorted list of
offsets (or hash of offsets, for faster lookups). I think your solution
is better, though.

Ted

Re: Out of memory in vec

am 11.08.2007 18:54:27 von rvtol+news

Mumia W. schreef:

> I have 512mb RAM and 800mb swap,


Yeah, millibits won't get you far.
;-)

--
Affijn, Ruud

"Gewoon is een tijger."

Re: Out of memory in vec

am 11.08.2007 19:13:17 von rvtol+news

Mumia W. schreef:

> my %bases = (
> 0,'AA',1,'AC',2,'AG',3,'AT',4,'A-',5,'CA',6,'CC',
> 7,'CG',8,'CT',9,'C-',10,'GA',11,'GC',12,'GG',13,'GT',
> 14,'G-',15,'TA',16,'TC',17,'TG',18,'TT',19,'T-',
> 20,'-A',21,'-C',22,'-G',23,'-T',24,'--',
> );
> my %rbases = reverse %bases;

Maybe a base-5 approach will give somebody some ideas:

0 => 00 => --
1 => 01 => -A
2 => 02 => -C
3 => 03 => -G
4 => 04 => -T
5 => 10 => A-
6 => 11 => AA
7 => 12 => AC
8 => 13 => AG
9 => 14 => AT
A => 20 => B-
B => 21 => BA
C => 22 => BC
D => 23 => BG
E => 24 => BT
F => 30 => G-
G => 31 => GA
H => 32 => GC
I => 33 => GG
J => 34 => GT
K => 40 => T-
L => 41 => TA
M => 42 => TC
N => 43 => TG
O => 44 => TT

Maybe a combination of a sparse array, index being the result of
t/-ACGT/01234/, is faster than a hash lookup?

--
Affijn, Ruud

"Gewoon is een tijger."

Re: Out of memory in vec

am 11.08.2007 21:29:22 von Sherm Pendley

"Dr.Ruud" writes:

> Mumia W. schreef:
>
>> I have 512mb RAM and 800mb swap,
>
>
> Yeah, millibits won't get you far.
> ;-)

Every little bit helps... :-)

sherm--

--
Web Hosting by West Virginians, for West Virginians: http://wv-www.net
Cocoa programming in Perl: http://camelbones.sourceforge.net

Re: Out of memory in vec

am 12.08.2007 00:56:03 von paduille.4061.mumia.w+nospam

On 08/11/2007 11:54 AM, Dr.Ruud wrote:
> Mumia W. schreef:
>
>> I have 512mb RAM and 800mb swap,
>
>
> Yeah, millibits won't get you far.
> ;-)
>

Yeah, I keep forgetting. :-)

--
It's a damn poor imagination that can only capitalize scientific
measurement units one way.