Re: Running out of memory? -- revised program

on 21.07.2007 23:15:32 by mpetersen

Both Brian and Bill -- thanks immensely. I'm learning a lot in the process,
and have over the last year just from reading your and others' postings. I
understood most of the comments and really appreciate the advice. The
problem still persists -- I think I know where it is coming from, but not
how to fix it.

First, a narrower question. I am not sure I completely follow your
comment on local and global variables. If I declare a variable inside a
loop (e.g. my $newvariable), it will not be available outside the loop
(which is good if I don't need it, so it won't consume memory). Looking at
your edits of my program -- this makes sense.
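
A minimal sketch of the scoping rule described above, using made-up values. The variable name $newvariable comes from the post; @seen and $compiles are hypothetical names added only for illustration.

```perl
use strict;
use warnings;

my @seen;                        # declared outside the loop, so it survives the loop
for my $i (1 .. 3) {
    my $newvariable = $i * 10;   # a fresh lexical on every pass through the loop
    push @seen, $newvariable;
}

# Outside the loop $newvariable is not declared. Under "use strict" any code
# that refers to it fails to compile, which the string eval below demonstrates.
my $compiles = eval q{ my $copy = $newvariable; 1 } ? 1 : 0;

print "@seen\n";                 # values copied out while the variable existed
print "compiles outside the loop: $compiles\n";
```

Note that going out of scope only releases the variable itself; whatever it pointed to is freed only if nothing else still refers to it.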

Ok, now for my persistent problem. As the program runs, I can see it use
more and more memory -- until it crashes. I think (and could be wrong) that
the program is not deleting the tree when it is done. I will enclose
the program below, but let me explain what I have done. The program will
eventually read different input files -- but for testing it uses the same
input file over and over. At the moment (see below) the commands
my $root = HTML::TreeBuilder->new;
$root->parse($doc);
$root->eof();
are in the loop. I have tried to include
$root->delete();
at the end of the loop, but with no effect.

If I move the commands
my $root = HTML::TreeBuilder->new;
$root->parse($doc);
$root->eof();
outside the loop -- I don't have a memory problem. So I think the program
is not releasing the memory of the old tree when it builds the new one. I
can't keep the $root->parse($doc) command outside the loop, because in real
use the program will read different files and build a tree for each one.
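
One general way a tree can outlive its "my" variable in Perl is circular references: Perl frees data by reference counting, so a parent and child that point at each other are never released unless the links are broken by hand. The sketch below illustrates that mechanism with a hypothetical Node class. It is not HTML::TreeBuilder's actual code, and whether this is the cause of the growth described above is an assumption, not something the thread confirms.

```perl
use strict;
use warnings;

{
    # Hypothetical Node class: every child stores a reference to its parent,
    # and every parent stores references to its children (a cycle).
    package Node;
    our $live = 0;                  # count of Node objects currently alive

    sub new {
        my ($class, $parent) = @_;
        my $self = bless { parent => $parent, children => [] }, $class;
        push @{ $parent->{children} }, $self if $parent;
        $live++;
        return $self;
    }

    sub delete {                    # break the cycles explicitly
        my ($self) = @_;
        $_->delete for @{ $self->{children} };
        %$self = ();
    }

    sub DESTROY { $live-- }
}

# Trees built and dropped without delete: the cycles keep them alive.
for (1 .. 3) {
    my $root = Node->new;
    Node->new($root) for 1 .. 2;
}
my $leaked = $Node::live;

# Same loop, but the cycles are broken before $root goes out of scope.
for (1 .. 3) {
    my $root = Node->new;
    Node->new($root) for 1 .. 2;
    $root->delete;
}
my $leaked_with_delete = $Node::live - $leaked;

print "still alive without delete: $leaked\n";
print "still alive with delete:    $leaked_with_delete\n";
```

In this toy model the explicit delete is what lets each iteration's tree be freed. If a real tree keeps growing even with delete in the loop, it may be worth checking whether something else (an array of tables, a saved element) still holds a reference to part of it.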

P.S.
I couldn't figure out the commands
my @vals = map {s/[,$ =]//g} @col_asset[0,-1];
print join(",", @vals), "\n";
If you could direct me to a manual, that would be fine as well.


Program ----------

use strict;
use warnings;
use HTML::TreeBuilder;

my $txtfile = 'D:/res/edgar/10k/2178_0000002178-06-000013.txt';
my $csvfile = 'D:/res/edgar/match/test2.csv';


# open the CSV file for writing

open OUT, ">$csvfile" or die "create csv: $!($^E)";
select ((select (OUT), $| = 1)[0]); # unbuffer CSV write

# open the text file for reading
open IN, $txtfile or die "open $txtfile: $!($^E)";
my $doc = join '', <IN>; # read file into $doc variable
close IN;

my $total = 0;
while ($total <= 3000) {

    my $asset_s  = 0;
    my $asset_s2 = 0;

    my $root = HTML::TreeBuilder->new;
    $root->parse($doc);
    $root->eof();

    OUTER_LOOP:
    foreach my $table ($root->find_by_tag_name('TABLE')) { # put tables into array, then put each one in $table
        my $txt = $table->as_text_trimmed;
        next if ($txt !~ /total asset/is || $txt !~ /(\d|,){4,12}/is); # skip items not of interest
        my @col_asset; # my @col_asset = ();
        foreach my $row ($table->find_by_tag_name('tr')) {
            next if $row->as_text_trimmed !~ /^total asset/i; # skip rows not of interest
            foreach my $column ($row->find_by_tag_name('td')) {
                my $col_text = $column->as_text_trimmed;
                if ($col_text =~ /[\d,\.]{4,12}/) {
                    push @col_asset, $col_text if $col_text =~ /([\d,\.]{4,12})/;
                }
            }
            $asset_s  = $col_asset[0];
            $asset_s2 = $col_asset[-1];
            last;
        }

        $asset_s  =~ s/[,$ =]//g; # drop ',', '$', ' ', & '='
        $asset_s2 =~ s/[,$ =]//g;
        last OUTER_LOOP; # only do 1st table
    }

    $total++;
    print OUT "$asset_s $asset_s2 $total \n";

}
close OUT;

__END__


_______________________________________________
ActivePerl mailing list
ActivePerl@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs

Re: Running out of memory? -- revised program

on 22.07.2007 03:45:05 by Bill Luebkert

Mitchell A. Petersen wrote:
>
> P.S.
> I couldn't figure out the commands
> my @vals = map {s/[,$ =]//g} @col_asset[0,-1];

Take the first and last elements of the array and remove ',$ =' from them
(element by element) and store in new array.

> print join(",", @vals), "\n";
> If you could direct me to a manual, that would be fine as well.

Most everything you need is in perlfunc man page or one of the several
RE (regular expression) man pages.
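
A small sketch of the slice / map / join combination discussed above, with made-up sample values. One detail worth knowing: s/// returns the number of substitutions it made, not the edited string, so a map block that ends in a bare s/// yields counts. The copy-then-return form below is one common way to get the cleaned strings instead. The names @ends, @counts and $line are hypothetical.

```perl
use strict;
use warnings;

my @col_asset = ('$1,234,567', '99', '$2,345,678');   # made-up sample values

# Array slice: just the first and last elements.
my @ends = @col_asset[0, -1];

# A bare s/// as the last expression makes map return substitution counts.
my @copy   = @ends;
my @counts = map { s/[,\$ =]//g } @copy;

# Copy each element, edit the copy, and return the copy.
my @vals = map { (my $v = $_) =~ s/[,\$ =]//g; $v } @ends;

# join glues the list together with the given separator.
my $line = join(',', @vals);

print "counts: @counts\n";
print "values: $line\n";
```

Editing a copy also leaves the original array untouched, since $_ inside map is an alias to the original element.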

> Program ----------
>
> my $total = 0;
> while ($total <= 3000) {

You need to remove the above loop - it's causing you to possibly
re-run the entire process multiple times. If you want to limit
total to 3000, do it inside your other loop (eg:
last if $total > 3000).

use strict;
use warnings;
use HTML::TreeBuilder;

my $txtfile = 'D:/res/edgar/10k/2178_0000002178-06-000013.txt';
my $csvfile = 'D:/res/edgar/match/gcu_unchecked3_junk.csv';

# open the CSV file for writing

open OUT, ">$csvfile" or die "create csv: $!($^E)";
select ((select (OUT), $| = 1)[0]); # unbuffer CSV write

# open the text file for reading

open IN, $txtfile or die "open $txtfile: $!($^E)";
my $doc = join '', <IN>;
close IN;

my $root = HTML::TreeBuilder->new;
$root->parse($doc);
$root->eof();

# get tables in HTML

my @tables = $root->find_by_tag_name('TABLE');

# foreach table in input file

my $total = 0;
foreach my $table (@tables) {

    my $txt = $table->as_text_trimmed;

    # skip items not of interest

    next if ($txt !~ /total asset/is || $txt !~ /(\d|,){4,12}/is);

    my @col_asset = ();
    foreach my $row ($table->find_by_tag_name('tr')) {

        # skip rows not of interest

        next if $row->as_text_trimmed !~ /^total asset/i;

        # foreach column in row

        foreach my $column ($row->find_by_tag_name('td')) {

            my $col_text = $column->as_text_trimmed;

            # if asset figure, save it

            if ($col_text =~ /[\d,\.]{4,12}/) {
                push @col_asset, $col_text;
            }
        }
        last; # skip rows after total asset if any
    }
    $total++;

    # print the totals
    # (s/// returns a count, so edit a copy of each value and return the copy)

    my @vals = map { (my $v = $_) =~ s/[,$ =]//g; $v } @col_asset[0,-1];
    print join (',', @vals), "\n";

    # uncomment one of these as appropriate:
    # last; # only do 1st table ??????
    # last if $total > 3000; # or 3000 lines ??????
}
close OUT;

__END__

