Re: Running out of memory? -- sample program
on 20.07.2007 05:13:55 by mpetersen
At 08:32 AM 7/19/2007, you wrote:
>Mitchell A. Petersen wrote:
> > I am new to perl, so this may be a dumb question.
> >
> > I have written a perl program that reads firms' 10-Ks (their financial
> > disclosures) looking for their total assets. Some of the files are in HTML,
> > so I use HTML::TreeBuilder and HTML::TableContentParser. Some of the files
> > are in text, so I use a regular expression to find the row that says
> > "total assets" and then scan across to find the number. I have copied the
> > files to my hard disk to speed up the process. The program searches through
> > each file sequentially, then writes out a line of output for each file. The
> > program crashes when the output file reaches 32,768 bytes. I have changed
> > the files I feed the program in case this is a problem, and it still
> > crashes at 32,768 bytes.
> >
> > I am running the perl program through a DOS window (cmd.exe window) under
> > Windows Vista. If there is a smarter way to do this, I'd love to hear. The
> > output file does not contain any data until the program crashes. I included
> > the command $|++, which I thought would cause the print buffer to flush to
> > the output file -- but this doesn't seem to be working.
> >
> > If there is other info that I should add, please let me know. Thanks.
>
>First, if you aren't putting a newline out every line - add that.
>Make sure you're closing each input file after scanning it.
I was writing out a new line every time, but didn't close the read file
after each use. I added this -- but it didn't solve the problem.
>If that doesn't help, create a complete program snippet that fails as
>you describe (you may not need to read the files if you can reproduce
>it without parsing the files - just write the output as you currently
>are using some static data and see if that fails for you). Assuming
>you can reproduce the error, post that snippet.
Processing the text files is not the problem -- when I read only these the
program doesn't crash. It is the HTML files that are causing the problem.
The program snippet that processes the HTML files is:
use warnings;
use strict;
use LWP::Simple;
use HTML::TreeBuilder;
use HTML::TableContentParser;

my ($asset_s,$asset_s2,@col_asset,@column,$column,@rows,$row,$total,$yes);
@col_asset = undef;
@column = undef;
@rows = undef;
$asset_s = 0;
$asset_s2 = 0;
$total = 0;

open (WRITE1, ">\\res\\edgar\\match\\gcu_unchecked3_junk.csv");
my $old_fh = select(WRITE1);
$| = 1;
select($old_fh);

unless (open (READ2, "d:\\res\\edgar\\10k\\2178_0000002178-06-000013.txt")) {
    next;
}
my $doc = join '', <READ2>;

while ($total <= 3000) {
    my $root = HTML::TreeBuilder->new;
    $root->parse($doc);
    $root->eof();

    my @tables = undef;
    @tables = $root->find_by_tag_name('TABLE');
    foreach my $table (@tables) {
        if (($table->as_text_trimmed =~ /total asset/is) &&
            ($table->as_text_trimmed =~ /(\d|,){4,12}/is)) {
            @rows = $table->find_by_tag_name('tr');
            foreach $row (@rows) {
                if ($row->as_text_trimmed =~ /^total asset/i) {
                    @column = $row->find_by_tag_name('td');
                    foreach $column (@column) {
                        if ($column->as_text_trimmed =~ m/((\d|,|\.){4,12})/) {
                            $yes = $column->as_text_trimmed;
                            push (@col_asset, "$yes");
                        }
                    }
                    $asset_s = $col_asset[1];
                    $asset_s2 = $col_asset[-1];
                    last;
                }
            }
            $asset_s =~ s/(,|$| |=)//g;
            $asset_s2 =~ s/(,|$| |=)//g;
            last;
        }
    }
    print WRITE1 "$asset_s,$asset_s2\n";
    $total++
}

close(READ2);
Thanks for the advice.
Mitchell
_______________________________________________
ActivePerl mailing list
ActivePerl@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
RE: Running out of memory? -- sample program
on 20.07.2007 12:03:47 by Brian Raven
Mitchell A. Petersen wrote:
> Processing the text files is not the problem -- when I read only these the
> program doesn't crash. It is the HTML files that are causing the problem.
> The program snippet that processes the HTML files is:
>
> use warnings;
> use strict;
Good start.
> use LWP::Simple;
I can't see where this is used.
> use HTML::TreeBuilder;
> use HTML::TableContentParser;
I can't see where this is used either.
>
> my ($asset_s,$asset_s2,@col_asset,@column,$column,@rows,$row,$total,$yes);
> @col_asset = undef;
> @column = undef;
> @rows = undef;
Did you realise that the above arrays are not empty? They contain a
single entry whose value is undef. If this is intentional, it seems
strange to me, and therefore warrants a comment explaining why.
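To illustrate the point about "= undef", here is a minimal sketch (the variable names are made up for the example) showing that assigning undef to an array leaves one element in it, while assigning the empty list, or simply declaring the array with "my", leaves it empty:

```perl
use strict;
use warnings;

# Hypothetical names, for illustration only.
my @with_undef = undef;   # list assignment of a single undef value
my @empty      = ();      # the empty list: a truly empty array
my @declared;             # a freshly declared "my" array is also empty

# An array in scalar context gives its element count.
my $n_with_undef = scalar @with_undef;
my $n_empty      = scalar @empty;
my $n_declared   = scalar @declared;

print "with undef: $n_with_undef, empty: $n_empty, declared: $n_declared\n";
```

This would also explain why the posted code reads $col_asset[1] for the first match: index 0 is taken by the undef.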
> $asset_s = 0;
> $asset_s2 = 0;
> $total = 0;
Avoid global variables as much as possible; declare variables in the
minimum necessary scope. After looking at the rest of the code I can see
no reason for any of these variables to be global, and in some cases no
real need for the variable at all. In fact, one of those globals may be
the cause of your problem.
> open (WRITE1, ">\\res\\edgar\\match\\gcu_unchecked3_junk.csv");
You should always check the result of open.
> my $old_fh = select(WRITE1);
> $| = 1;
> select($old_fh);
The above is not strictly necessary, as your output lines are terminated
with "\n" which should cause a flush.
> unless (open (READ2, "d:\\res\\edgar\\10k\\2178_0000002178-06-000013.txt")) {
>     next;
> }
Although you appear to be checking whether the open worked, that 'next'
means you are effectively ignoring the result of that check.
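As a rough sketch of what checking the result of open can look like when several input files are processed: the read_files helper and the file names below are hypothetical, not part of the posted program. A failed open is reported with the reason from $! and the file is skipped, and "next" is legitimate here because it sits inside a loop.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: returns the contents of each readable file
# and the names of the files that could not be opened.
sub read_files {
    my @names = @_;
    my (%content, @failed);
    foreach my $fn (@names) {
        my $fh;
        unless (open $fh, "<", $fn) {
            warn "Skipping $fn: $!\n";   # report the failure instead of hiding it
            push @failed, $fn;
            next;                         # valid: we are inside a loop
        }
        local $/;                         # slurp mode for this iteration only
        $content{$fn} = <$fh>;
        close $fh;
    }
    return (\%content, \@failed);
}

# Demonstration with one real file and one missing file.
my $dir  = tempdir(CLEANUP => 1);
my $good = "$dir/good.txt";
my $bad  = "$dir/missing.txt";
open my $out, ">", $good or die "Failed to open $good: $!\n";
print $out "total assets 1,234\n";
close $out;

my ($content, $failed) = read_files($good, $bad);
print "read ", scalar(keys %$content), " file(s), skipped ", scalar(@$failed), "\n";
```

For a single mandatory file, "open ... or die" is the simpler choice, since there is nothing sensible to continue with.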
> my $doc = join '', <READ2>;
>
> while ($total <= 3000) {
for (1..3000) {
>     my $root = HTML::TreeBuilder->new;
>     $root->parse($doc);
>     $root->eof();
Why the eof call? I checked the documentation and found out why it's
necessary. I've learnt something from your post. Thanks for that.
>
>     my @tables = undef;
>     @tables = $root->find_by_tag_name('TABLE');
>     foreach my $table (@tables) {
array variable not necessary:
foreach my $table ($root->find_by_tag_name('TABLE')) {
>         if (($table->as_text_trimmed =~ /total asset/is) &&
>             ($table->as_text_trimmed =~ /(\d|,){4,12}/is)) {
>             @rows = $table->find_by_tag_name('tr');
>             foreach $row (@rows) {
Similarly for @rows...
>                 if ($row->as_text_trimmed =~ /^total asset/i) {
>                     @column = $row->find_by_tag_name('td');
>                     foreach $column (@column) {
.... and @column
>                         if ($column->as_text_trimmed =~ m/((\d|,|\.){4,12})/) {
>                             $yes = $column->as_text_trimmed;
>                             push (@col_asset, "$yes");
This could be a problem: I can't see where @col_asset is reset, so it
continues to grow in size throughout program execution. If you declared
@col_asset in the smallest necessary scope you would have avoided that.
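A small sketch of the difference, using made-up values in place of the numbers parsed from a table row: an array declared once outside the loop keeps everything ever pushed onto it, while an array declared inside the loop starts empty on every pass.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the numbers found in one table row.
my @row_values = ("1,234", "5,678");

# Version 1: array declared once, outside the loop, and never reset.
my @global_matches;
for (1 .. 3) {
    push @global_matches, @row_values;
}
my $global_count = scalar @global_matches;   # grows with every iteration

# Version 2: array declared inside the loop, so each pass starts empty.
my $scoped_count;
for (1 .. 3) {
    my @matches;
    push @matches, @row_values;
    $scoped_count = scalar @matches;          # size after a single pass
}

print "global: $global_count, scoped: $scoped_count\n";
```

Over thousands of files the first pattern accumulates every matched value for the life of the program, which is the kind of steady growth described above.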
>                         }
>                     }
>                     $asset_s = $col_asset[1];
>                     $asset_s2 = $col_asset[-1];
>                     last;
>                 }
>             }
>             $asset_s =~ s/(,|$| |=)//g;
>             $asset_s2 =~ s/(,|$| |=)//g;
>             last;
>         }
>     }
>     print WRITE1 "$asset_s,$asset_s2\n";
>     $total++
> }
>
> close(READ2);
The close would be better placed immediately after you have read the whole file.
Fixing the above and a few minor style issues looks like:
-------------------------------------------------
use strict;
use warnings;
use HTML::TreeBuilder;

my $ofn = "/res/edgar/match/gcu_unchecked3_junk.csv";
my $ifn = "d:/res/edgar/10k/2178_0000002178-06-000013.txt";

open my $ofd, ">", $ofn or die "Failed to open $ofn: $!\n";
my $doc = slurp($ifn);

for (1..3000) {
    my $root = HTML::TreeBuilder->new;
    $root->parse($doc);
    $root->eof();

  OUTER_LOOP:
    foreach my $table ($root->find_by_tag_name('TABLE')) {
        my $txt = $table->as_text_trimmed;
        if (($txt =~ /total asset/i) && ($txt =~ /[\d,]{4,12}/)) {
            foreach my $row ($table->find_by_tag_name('tr')) {
                if ($row->as_text_trimmed =~ /^total asset/i) {
                    # Declared here so it starts empty for every matching row.
                    my @col_asset;
                    foreach my $column ($row->find_by_tag_name('td')) {
                        my $txt = $column->as_text_trimmed;
                        push @col_asset, $txt if $txt =~ /([\d,\.]{4,12})/;
                    }
                    # s/// returns a count, so clean a copy and return the copy.
                    my @vals = map { (my $v = $_) =~ s/[,\$ =]//g; $v } @col_asset[0,-1];
                    # Write to the output file, not to STDOUT.
                    print $ofd join(",", @vals), "\n";
                    last OUTER_LOOP;
                }
            }
        }
    }
}
close $ofd;

# Read a whole file into a single string.
sub slurp {
    my $fn = shift;
    open my $fd, "<", $fn or die "Failed to open $fn: $!\n";
    local $/;
    my $data = <$fd>;
    close $fd;
    return $data;
}
-------------------------------------------------
I can't test it as I don't have any data, but it compiles.
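One detail of the clean-up step is easy to get wrong, so here is a small sketch with made-up cell values. In Perl, s/// returns the number of substitutions made rather than the modified string, so using it as the last expression in a map block collects counts (and edits the source array in place). Cleaning a copy and returning that copy gives the strings instead.

```perl
use strict;
use warnings;

# Hypothetical sample values, as they might appear in table cells.
my @cells = ('$1,234', '5,678 ');

# s/// returns the number of substitutions, so this collects counts
# and modifies @copy_a in place.
my @copy_a = @cells;
my @counts = map { s/[,\$ ]//g } @copy_a;

# Editing a copy and returning it yields the cleaned strings,
# leaving @cells untouched.
my @cleaned = map { (my $v = $_) =~ s/[,\$ ]//g; $v } @cells;

print "counts: @counts\n";
print "cleaned: @cleaned\n";
```

On Perl 5.14 or later the /r modifier, as in s/[,\$ ]//gr, returns the modified copy directly and is another way to write the second form.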
HTH
-- 
Brian Raven
=========================================
Atos Euronext Market Solutions Disclaimer
=========================================
The information contained in this e-mail is confidential and solely for the
intended addressee(s). Unauthorised reproduction, disclosure, modification,
and/or distribution of this email may be unlawful.
If you have received this email in error, please notify the sender immediately
and delete it from your system. The views expressed in this message do not
necessarily reflect those of Atos Euronext Market Solutions.
Atos Euronext Market Solutions Limited - Registered in England & Wales with
registration no. 3962327. Registered office address at 25 Bank Street London
E14 5NQ United Kingdom.
Atos Euronext Market Solutions SAS - Registered in France with registration
no. 425 100 294. Registered office address at 6/8 Boulevard Haussmann 75009
Paris France.
=========================================
RE: Running out of memory? -- sample program
on 20.07.2007 14:43:15 by Brian Raven
Bill Luebkert wrote:
> Brian Raven wrote:
>>
>> The above is not strictly necessary, as your output lines are
>> terminated with "\n" which should cause a flush.
>
> With block buffering that wouldn't be true. Blocks would be written
> after BUFSIZ characters have been written - not after each newline.
True enough. My brain must have still been in STDOUT mode.
Thanks Bill.
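For anyone following along, a minimal sketch of the buffering behaviour discussed here; the file names are made up and the files live in a temporary directory. Output to a file is normally block buffered, so a short line stays in memory until the buffer fills or the handle is closed. Note that $| only affects the currently selected handle, which is why the select/$|/select dance is needed; the autoflush method from IO::Handle does the same thing for one specific handle.

```perl
use strict;
use warnings;
use IO::Handle;                 # provides the autoflush method on file handles
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# Default (buffered) handle: a short line normally stays in the buffer.
my $buffered_file = "$dir/buffered.csv";
open my $buf_fh, ">", $buffered_file or die "Failed to open $buffered_file: $!\n";
print $buf_fh "1234,5678\n";
my $buffered_size = (-s $buffered_file) || 0;   # size on disk before close

# Autoflushed handle: every print is written out immediately.
my $flushed_file = "$dir/flushed.csv";
open my $flush_fh, ">", $flushed_file or die "Failed to open $flushed_file: $!\n";
$flush_fh->autoflush(1);                         # same effect as select + $| = 1
print $flush_fh "1234,5678\n";
my $flushed_size = (-s $flushed_file) || 0;

close $buf_fh;
close $flush_fh;

print "before close: buffered=$buffered_size bytes, autoflushed=$flushed_size bytes\n";
```

The exact buffer size depends on the Perl build, so the "before close" size of the buffered file should be treated as typical behaviour rather than a guarantee.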
-- 
Brian Raven