HTML::TokeParser

on 16.10.2005 19:07:19 by DVH

Hi,

I'm trying to get tokeparser to fetch a series of hyperlinks and print the
URL followed by the link text.

The following script ("eurofeed.pl") gives me "Can't coerce array into hash
at eurofeed.pl line 31"

Line 31 is "if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink')
{"

The HTML looks like this:

=======================================

href="pressReleasesAction.do?reference=EPSO/05/06">

My link text here

---------------------------------------------

My script looks like this:

#!/usr/bin/perl -w

use strict;
use LWP::Simple;
use HTML::TokeParser;
use XML::RSS;

my $content =
    get( "http://europa.eu.int/rapid/recentPressReleasesAction.do?guiLanguage=en&hits=500" ) or die $!;

my $stream = HTML::TokeParser->new( \$content ) or die $!;

my ($tag, $headline, $url);

while ( $tag = $stream->get_tag("a") ) {
    if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink') {
        $url = $tag->[2]{href} || "--";
        $headline = $stream->get_trimmed_text('/a');
        print $url;
        print $headline;
    }
}

-----------------------------------------------------------

I think the problem lies in the ordering of tags, but that's as far as I've
got with working out what's wrong.

Re: HTML::TokeParser

on 16.10.2005 19:35:54 by Stephen Hildrey

DVH wrote:
> I'm trying to get tokeparser to fetch a series of hyperlinks and print the
> URL followed by the link text.
>
> The following script ("eurofeed.pl") gives me "Can't coerce array into hash
> at eurofeed.pl line 31"
>
> Line 31 is "if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink')

You probably want ->[1] rather than ->[2]

Regards,
Steve
--
Stephen Hildrey
E-mail: steve@uptime.org.uk / Tel: +442071931337
Jabber: steve@jabber.earth.li / MSN: foo@hotmail.co.uk

Re: HTML::TokeParser

on 16.10.2005 19:45:38 by it_says_BALLS_on_your forehead

DVH wrote:
> Hi,
>
> I'm trying to get tokeparser to fetch a series of hyperlinks and print the
> URL followed by the link text.
>
> The following script ("eurofeed.pl") gives me "Can't coerce array into hash
> at eurofeed.pl line 31"
>
> Line 31 is "if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink')
> {"

after searching on CPAN for HTML::TokeParser, and looking at the
$p->get_tag( @tags ) method,
it looks like:

The tag information is returned as an array reference in the same form
as for $p->get_token above, but the type code (first element) is
missing. A start tag will be returned like this:

[$tag, $attr, $attrseq, $text]
The tagname of end tags are prefixed with "/", i.e. end tag is returned
like this:

["/$tag", $text]

....so you get an array reference back. why are you adding {class} into
your code?

Re: HTML::TokeParser

on 16.10.2005 20:02:52 by it_says_BALLS_on_your forehead

it_says_BALLS_on_your forehead wrote:
> after searching on CPAN for HTML::TokeParser, and looking at the
> $p->get_tag( @tags ) method,
> it looks like:
>
> The tag information is returned as an array reference in the same form
> as for $p->get_token above, but the type code (first element) is
> missing. A start tag will be returned like this:
>
> [$tag, $attr, $attrseq, $text]
> The tagname of end tags are prefixed with "/", i.e. end tag is returned
> like this:
>
> ["/$tag", $text]
>
> ...so you get an array reference back. why are you adding {class} into
> your code?

ahh, my mistake...

use HTML::TokeParser;
my $p = HTML::TokeParser->new(shift || "index.html");

while (my $token = $p->get_tag("a")) {
    my $url  = $token->[1]{href} || "-";
    my $text = $p->get_trimmed_text("/a");
    print "$url\t$text\n";
}

...yeah, you need to look at index 1, not index 2.
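
For reference, here is a minimal sketch of the original loop with the index changed from 2 to 1 and the class filter kept. The sample markup is an assumption: only the href and the class name come from the thread, and the second link is made up to show the filter skipping it.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup; only the href and class come from the thread.
my $html = <<'HTML';
<a class="docSel-titleLink" href="pressReleasesAction.do?reference=EPSO/05/06">
  My link text here
</a>
<a href="other.do">Not a title link</a>
HTML

my $stream = HTML::TokeParser->new( \$html ) or die "Cannot parse HTML";

my @links;
while ( my $tag = $stream->get_tag('a') ) {
    # Element 1 is the attribute hash ref; element 2 is the attribute order.
    my $attr = $tag->[1];
    next unless defined $attr->{class} and $attr->{class} eq 'docSel-titleLink';

    my $url      = $attr->{href} || '--';
    my $headline = $stream->get_trimmed_text('/a');
    push @links, [ $url, $headline ];
    print "$url\t$headline\n";
}
```

Checking `defined $attr->{class}` first avoids an "uninitialized value" warning for links that have no class attribute.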

Re: HTML::TokeParser

on 16.10.2005 23:21:43 by DVH

Stephen Hildrey wrote in message
news:1129484153.30203.0@doris.uk.clara.net...
> DVH wrote:
> > I'm trying to get tokeparser to fetch a series of hyperlinks and print the
> > URL followed by the link text.
> >
> > The following script ("eurofeed.pl") gives me "Can't coerce array into hash
> > at eurofeed.pl line 31"
> >
> > Line 31 is "if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink')
>
> You probably want ->[1] rather than ->[2]

I did. I had thought it would be tag[2] because I was looking for the third
tag within those brackets, but obviously not.
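
The index counts the elements of the array that get_tag returns for one tag, not tags in the document. A small sketch that unpacks the four documented elements may make this concrete; the one-line sample markup is made up for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up one-line sample; any start tag with attributes would do.
my $html = '<a class="docSel-titleLink" href="x.do">text</a>';
my $p    = HTML::TokeParser->new( \$html ) or die "Cannot parse HTML";

# A start tag comes back as [$tag, $attr, $attrseq, $text].
my $tag = $p->get_tag('a');
my ( $name, $attr, $attrseq, $text ) = @$tag;

print "name:    $name\n";                           # the tag name
print "attr:    ", ref $attr, "\n";                 # HASH: attribute name => value
print "attrseq: ", ref $attrseq, " (@$attrseq)\n";  # ARRAY: attribute names in source order
print "text:    $text\n";                           # the start tag as it appeared in the source
```

Because element 2 is an array reference, treating it as a hash with `$tag->[2]{class}` fails. The exact wording of the error depends on the Perl version; newer perls are likely to report "Not a HASH reference" rather than "Can't coerce array into hash".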

Thank you, that now works. I have a couple more questions (ah they always
do...)

Firstly, the HTML puts a lot of whitespace in the middle of the hrefs. Is
there a reasonably simple way of getting rid of that? The site is at
http://europa.eu.int/rapid/recentPressReleasesAction.do?guiLanguage=en&hits=10
if you need to see it.
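
One simple option, sketched here on the assumption that whitespace is never meaningful inside these particular URLs, is to delete every run of whitespace from the href with a substitution. The sample value is hypothetical; the real attribute values on the site may look different.

```perl
use strict;
use warnings;

# Hypothetical href as it might come out of the parser, with a line break
# and stray spaces inside the attribute value.
my $href = "pressReleasesAction.do?\n    reference=EPSO/05/06  ";

# Copy the value, then remove every run of whitespace from the copy.
( my $clean = $href ) =~ s/\s+//g;

print "$clean\n";
```

If a URL could legitimately contain encoded spaces, those would appear as %20 or + rather than literal whitespace, so this substitution would leave them alone.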

Secondly, I'm working towards following those hrefs and then parsing the
text I find there. Would I be better off using WWW::Mechanize to do this?
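
Whichever module does the fetching, the hrefs on the page are relative, so they need resolving against the page URL first. The sketch below does that with the URI module; the fetch itself is left as a comment so the example runs without network access. Plain LWP::Simple may well be enough when the pages only need downloading, while WWW::Mechanize tends to help more once forms or cookies are involved, but that is a judgment call rather than a rule.

```perl
use strict;
use warnings;
use URI;

# Page URL and relative href as they appear in the thread.
my $base = 'http://europa.eu.int/rapid/recentPressReleasesAction.do?guiLanguage=en&hits=10';
my $href = 'pressReleasesAction.do?reference=EPSO/05/06';

# Resolve the relative href against the URL of the page it was found on.
my $abs = URI->new_abs( $href, $base )->as_string;
print "$abs\n";

# Fetching would then look something like this (requires LWP::Simple):
#   use LWP::Simple qw(get);
#   my $page = get($abs);
#   defined $page or die "Cannot get <$abs>\n";
```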

Thanks again for your help.

Re: HTML::TokeParser

on 16.10.2005 23:21:44 by DVH

it_says_BALLS_on_your forehead wrote in message
news:1129485772.266262.220750@g43g2000cwa.googlegroups.com...
> ahh, my mistake...
> use HTML::TokeParser;
> $p = HTML::TokeParser->new(shift||"index.html");
>
> while (my $token = $p->get_tag("a")) {
> my $url = $token->[1]{href} || "-";
> my $text = $p->get_trimmed_text("/a");
> print "$url\t$text\n";
> }
>
> ...yeah, you need to look at index 1, not index 2.
>

Thanks. It works with [1].

Re: HTML::TokeParser

on 16.10.2005 23:41:20 by 1usa

"DVH" wrote in
news:diug96$jfj$1@nwrdmz02.dmz.ncs.ea.ibs-infra.bt.com:

> Firstly, the HTML puts a lot of whitespace in the middle of the hrefs.

ITYM "the HTML contains".

> Is there a reasonably simple way of getting rid of that? The site is
> at
> http://europa.eu.int/rapid/recentPressReleasesAction.do?guiLanguage=en&hits=10
> if you need to see it.
>
> Secondly, I'm working towards following those hrefs and then
> parsing the text I find there. Would I be better off using
> WWW::Mechanize to do this?

#!/usr/bin/perl

use strict;
use warnings;

use Data::Dumper;
use HTML::LinkExtractor;
use LWP::Simple;

my $url = q{http://europa.eu.int/rapid/recentPressReleasesAction.do?guiLanguage=en};
my $html = get $url;

die "Cannot get <$url>\n" unless $html;

my $lx = HTML::LinkExtractor->new;
$lx->parse(\$html);

for my $link ( @{ $lx->links } ) {
    # Not every link has a class attribute, so check before comparing.
    if (defined $link->{class} and $link->{class} eq 'docSel-formatLink') {
        print Dumper $link;
    }
}

__END__

--
A. Sinan Unur <1usa@llenroc.ude.invalid>
(reverse each component and remove .invalid for email address)

comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

Re: HTML::TokeParser

on 19.10.2005 22:23:52 by DVH

A. Sinan Unur <1usa@llenroc.ude.invalid> wrote in message
news:Xns96F1B3F245A6asu1cornelledu@127.0.0.1...

Sorry for getting back to you three days late, but thanks to both of you.

Re: HTML::TokeParser

on 19.10.2005 22:34:49 by 1usa

"DVH" wrote in news:dj6a0n$7a8$1@nwrdmz01.dmz.ncs.ea.ibs-infra.bt.com:

> A. Sinan Unur <1usa@llenroc.ude.invalid> wrote in message
> news:Xns96F1B3F245A6asu1cornelledu@127.0.0.1...
....
> Sorry for getting back to you three days late, but thanks to both
> of you.

You are welcome. Hope it helped.

Sinan

