Problem with blocking

on 25.04.2006 02:00:50 by oliver.block

Hello,

I am using HTML::Parser to extract hyperlinks from a web page. I wrote a
module, MyParser, which is based on HTML::Parser.

Now I want to ensure that the parser has finished its work before I query the
links. I have a method getHyperlinks which returns a hash containing the
links. Nothing complicated:

$p->parse;
my %links = $p->getHyperlinks;

I thought that the parser might still be parsing, so I added a boolean variable
$isParsing which is set to true by the start_document handler and set to
false by the end_document handler.

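sub getHyperlinks {
    my $self = shift;
    # Busy-wait until the end_document handler clears the flag.
    while($isParsing) { }
    # Note the sigil: $self, not the bareword self.
    return %{$self->{HYPERLINKS}};
}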

Now the program seems to block. What is the best way to ensure that the
parser has finished extracting all links without blocking the whole
program?

Best Regards,

Oliver

Re: Problem with blocking

on 25.04.2006 04:58:02 by Andy

>
> sub getHyperlinks {
> my $self = shift;
> while($isParsing) { }
> return %{$self->{HYPERLINKS}};
> }

The while loop is empty. Nothing can change the value of $isParsing.
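
To illustrate the point: in a single-threaded Perl program, a callback-driven parser calls its handlers from inside the parse call, so no handler can run while an empty loop spins afterwards. The sketch below is only a toy; toy_parse is a made-up stand-in, not HTML::Parser, and the handler names merely mirror the ones in the question.

```perl
use strict;
use warnings;

my $isParsing = 0;
my @events;

# Hypothetical stand-in for a callback-driven parser: it calls its
# handlers itself, one after another, before it returns.
sub toy_parse {
    my ($text, %handler) = @_;
    $handler{start_document}->();
    $handler{word}->($_) for split ' ', $text;
    $handler{end_document}->();
    return;
}

toy_parse(
    'one two three',
    start_document => sub { $isParsing = 1 },
    word           => sub { push @events, "$_[0] (isParsing=$isParsing)" },
    end_document   => sub { $isParsing = 0 },
);

# Control only arrives here after toy_parse has returned, so the flag
# is already false and a wait loop would have nothing to wait for.
print "$_\n" for @events;
print "after parse: isParsing=$isParsing\n";
```

If the flag were still true at this point, no other code would be running that could clear it, which is why the empty loop can never end.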

You may want to look at any of the many existing link-extraction
modules on CPAN.

For that matter, if you want to just fetch a page and return a list
of links, it can be as simple as:

use WWW::Mechanize;
my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get( "http://myurl" );
for my $link ( $mech->links ) {
    print $link->url, "\n";
}

xoxo,
Andy


--
Andy Lester => andy@petdance.com => www.petdance.com => AIM:petdance

Re: Problem with blocking

on 25.04.2006 15:45:28 by oliver.block

On Tuesday, 25 April 2006 04:58, Andy Lester wrote:
> > sub getHyperlinks {
> > my $self = shift;
> > while($isParsing) { }
> > return %{$self->{HYPERLINKS}};
> > }
>
> The while loop is empty. Nothing can change the value of $isParsing.

That's right! At least not in the while loop.

The value is changed in the start_document handler (true) and the
end_document handler (false). The loop is only there to make sure the parser
is no longer parsing.

But possibly I got confused because I have been working with ithreads,
locking, and semaphores over the last few days!? :)

...
$p->parse;
my %links = $p->getHyperlinks;
...

Is there a chance that the calling program calls $p->getHyperlinks while the
parser is still parsing the page? Or is $p->getHyperlinks only called after
$p->parse has returned (though without a value)?

I am sorry if I am confusing anyone.

Best Regards,

Oliver

Re: Problem with blocking

on 25.04.2006 16:16:11 by Andy

>>> sub getHyperlinks {
>>> my $self = shift;
>>> while($isParsing) { }
>>> return %{$self->{HYPERLINKS}};
>>> }
>>
>> The while loop is empty. Nothing can change the value of $isParsing.
>
> That's right! At least not in the while loop.
>

Once you get into that while loop, you can never get out.
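
For comparison, here is a minimal sketch of a MyParser without the wait loop, assuming the module subclasses HTML::Parser with the version 3 API. The handler name _on_start and the sample HTML are invented for illustration. One detail that may matter here: according to the HTML::Parser documentation, the end_document event is triggered by $p->eof, so a flag cleared only in an end_document handler stays true if eof is never called.

```perl
use strict;
use warnings;

package MyParser;
use parent 'HTML::Parser';

sub new {
    my ($class, %args) = @_;
    my $self = $class->SUPER::new(api_version => 3, %args);
    $self->{HYPERLINKS} = {};
    # Handlers are invoked synchronously from within parse() and eof().
    $self->handler(start => \&_on_start, 'self, tagname, attr');
    return $self;
}

# Hypothetical handler: record the href of every <a> start tag.
sub _on_start {
    my ($self, $tag, $attr) = @_;
    return unless $tag eq 'a' && defined $attr->{href};
    $self->{HYPERLINKS}{ $attr->{href} } = 1;
    return;
}

# No flag and no loop: the caller invokes this after parse/eof return.
sub getHyperlinks {
    my $self = shift;
    return %{ $self->{HYPERLINKS} };
}

package main;

my $html = '<p><a href="http://example.com/a">A</a> <a href="/b">B</a></p>';
my $p = MyParser->new;
$p->parse($html);
$p->eof;    # signal end of input; flushes buffered text
my %links = $p->getHyperlinks;
print "$_\n" for sort keys %links;
```

Because parse and eof do not return until the handlers have run, the links are complete by the time getHyperlinks is called.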

--
Andy Lester => andy@petdance.com => www.petdance.com => AIM:petdance