Website scraper

on 24.09.2005 10:59:29 by DVH

Hi,

I've been working through a Perl tutorial on using HTML::TokeParser, and
trying to adapt the example script it gives.
http://www.perl.com/pub/a/2001/11/15/creatingrss.html

The script is meant to scrape headlines from the BBC website and put them
into an RSS feed. It looks for CSS tags, then extracts the text nearby. I've
modified it because the tags in the example don't match the tags on the site
any more, but the script is still sticking at a certain point.

I *think* it's sticking here:

$headline = $stream->get_trimmed_text('/b') \
if ($tag->[1]{class} =~ /^h[12]$/);

I don't understand what that backslash is doing at the end of the first
line. And I don't see where the loop following the "if" in the second line
actually begins - shouldn't it begin with a curly bracket?

Any advice gratefully received.

DVH.

Re: Website scraper

on 24.09.2005 11:41:19 by Stephen Hildrey

DVH wrote:
> I've been working through a Perl tutorial on using HTML::TokeParser, and
> trying to adapt the example script it gives.
> http://www.perl.com/pub/a/2001/11/15/creatingrss.html

Note: that article was written in 2001. Screen-scrapers are notoriously
fragile - they often break in response to even the slightest change in
the target.

> The script is meant to scrape headlines from the BBC website and put them
> into an RSS feed. It looks for CSS tags, then extracts the text nearby. I've
> modified it because the tags in the example don't match the tags on the site
> any more, but the script is still sticking at a certain point.

As above, IIRC the BBC has changed their news page since 2001.

> I *think* it's sticking here:
>
> $headline = $stream->get_trimmed_text('/b') \
> if ($tag->[1]{class} =~ /^h[12]$/);
>
> I don't understand what that backslash is doing at the end of the first
> line.

I think the author got mixed up between Perl and shell scripting - where
'\' is used to continue across newlines. That line should be:

$headline = $stream->get_trimmed_text('/b')
if ($tag->[1]{class} =~ /^h[12]$/);

> And I don't see where the loop following the "if" in the second line
> actually begins - shouldn't it begin with a curly bracket?

It's an example of Perl's "statement if (cond)" syntax. So, just as you
can say:

if (foo) { bar; }

you can also say:

bar if (foo);

Consequently, the above scraper code is *exactly* the same as:

if ($tag->[1]{class} =~ /^h[12]$/)
{
$headline = $stream->get_trimmed_text('/b');
}

It's just a matter of preference and readability.
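
A minimal sketch of that equivalence, using made-up class names rather than anything from the real page (the list and array names are purely illustrative):

```perl
use strict;
use warnings;

# Hypothetical class values, standing in for $tag->[1]{class}.
my @classes = ('h1', 'h2', 'body');
my (@modifier, @block);

for my $class (@classes) {
    # Statement-modifier form: "statement if (cond);"
    push @modifier, $class if ($class =~ /^h[12]$/);

    # Block form: "if (cond) { statement; }"
    if ($class =~ /^h[12]$/) {
        push @block, $class;
    }
}

print "modifier: @modifier\n";
print "block:    @block\n";
```

Both arrays should end up with the same contents, since the two forms test the same condition and run the same statement.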

HTH,
Steve
--
Stephen Hildrey
E-mail: steve@uptime.org.uk / Tel: +442071931337
Jabber: steve@jabber.earth.li / MSN: foo@hotmail.co.uk

Re: Website scraper

on 24.09.2005 16:24:43 by Tad McClellan

[ Newsgroups trimmed. I don't do the alt.* hierarchy ]

Stephen Hildrey wrote:
> DVH wrote:

>> $headline = $stream->get_trimmed_text('/b') \
>> if ($tag->[1]{class} =~ /^h[12]$/);
>>
>> I don't understand what that backslash is doing at the end of the first
>> line.
>
> I think the author got mixed up between Perl and shell scripting - where
> '\' is used to continue across newlines.

So, the backslash at the end of the line is escaping the newline that
follows it (but there is no need to escape that newline, so it does
not do anything that is useful).

> $headline = $stream->get_trimmed_text('/b')
> if ($tag->[1]{class} =~ /^h[12]$/);
>
> > And I don't see where the loop following the "if" in the second line
^^^^^^^^
> > actually begins - shouldn't it begin with a curly bracket?
>
> It's an example of Perl's "statement if (cond)" syntax.

Note that there is no "loop" in the code the OP showed.

--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas

Re: Website scraper

on 24.09.2005 17:11:14 by Stephen Hildrey

Tad McClellan wrote:
> Stephen Hildrey wrote:
>>DVH wrote:
>>>$headline = $stream->get_trimmed_text('/b') \
>>> if ($tag->[1]{class} =~ /^h[12]$/);
>>>
>>>I don't understand what that backslash is doing at the end of the first
>>>line.
>>
>>I think the author got mixed up between Perl and shell scripting - where
>>'\' is used to continue across newlines.
>
> So, the backslash at the end of the line is escaping the newline that
> follows it (but there is no need to escape that newline, so it does
> not do anything that is useful).

No. This is Perl - the backslash is a syntax error:

$ cat > backslash.pl << _EOF && perl backslash.pl
> use strict;
> use warnings;
> my $foo = "foo" \
> if (1);
> _EOF
syntax error at backslash.pl line 3, near "my ="
Execution of backslash.pl aborted due to compilation errors.

>>$headline = $stream->get_trimmed_text('/b')
>> if ($tag->[1]{class} =~ /^h[12]$/);
>>
>>>And I don't see where the loop following the "if" in the second line
>
> ^^^^^^^^
>
>>>actually begins - shouldn't it begin with a curly bracket?
>>
>>It's an example of Perl's "statement if (cond)" syntax.
>
> Note that there is no "loop" in the code the OP showed.

Good spot. I assume he meant "block".

OP: if you are still experiencing difficulties with the code, do post
back - I'm sure we'll be able to help :-)

Steve
--
Stephen Hildrey
E-mail: steve@uptime.org.uk / Tel: +442071931337
Jabber: steve@jabber.earth.li / MSN: foo@hotmail.co.uk

Re: Website scraper

on 24.09.2005 17:25:12 by 1usa

"DVH" wrote in
news:dh34hg$d0i$1@nwrdmz01.dmz.ncs.ea.ibs-infra.bt.com:

> The script is meant to scrape headlines from the BBC website and put
> them into an RSS feed.

http://news.bbc.co.uk/rss/newsonline_world_edition/americas/rss.xml

--
A. Sinan Unur <1usa@llenroc.ude.invalid>
(reverse each component and remove .invalid for email address)

comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

Re: Website scraper

on 24.09.2005 17:36:23 by Stephen Hildrey

A. Sinan Unur wrote:
> "DVH" wrote in
> news:dh34hg$d0i$1@nwrdmz01.dmz.ncs.ea.ibs-infra.bt.com:

>>The script is meant to scrape headlines from the BBC website and put
>>them into an RSS feed.

> http://news.bbc.co.uk/rss/newsonline_world_edition/americas/rss.xml

A valid point, well made :-)

Still, I think scraping is a useful technique to be aware of - I read
that same article myself, and since have found numerous uses for
scraping [1]:

1. Being paid to write news aggregators,
2. Getting text-message notifications in response to various ebay
events,
3. Being able to enjoy a night in the pub, despite $airline having
lost my luggage (scrape the lost-luggage tracking site, send SMS
hourly :-) )

The possibilities are endless!

Steve

[1] - yes, this may be a bit of a "grey area" in some AUPs/ToSs - YMMV.

--
Stephen Hildrey
E-mail: steve@uptime.org.uk / Tel: +442071931337
Jabber: steve@jabber.earth.li / MSN: foo@hotmail.co.uk

Re: Website scraper

on 24.09.2005 17:46:16 by 1usa

Stephen Hildrey wrote in news:1127576183.17734.0
@damia.uk.clara.net:

> A. Sinan Unur wrote:
>> "DVH" wrote in
>> news:dh34hg$d0i$1@nwrdmz01.dmz.ncs.ea.ibs-infra.bt.com:
>
>>>The script is meant to scrape headlines from the BBC website and put
>>>them into an RSS feed.
>
>> http://news.bbc.co.uk/rss/newsonline_world_edition/americas/rss.xml
>
> A valid point, well made :-)
>
> Still, I think scraping is a useful technique to be aware of - I read
> that same article myself,

Agreed. I was doing it before I knew it was called scraping. Even MS did
it (in the form of being able to import data from HTML tables into Excel
given a page URL).

If the RSS feed exists in the first place, why not go ahead and use it
without mucking about with the internals of some HTML code?

Sinan

--
A. Sinan Unur <1usa@llenroc.ude.invalid>
(reverse each component and remove .invalid for email address)

comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

Re: Website scraper

on 24.09.2005 17:58:55 by Stephen Hildrey

A. Sinan Unur wrote:
> If the RSS feed exists in the first place, why not go ahead and use it
> without mucking about with the internals of some HTML code?

1. If the HTML exposes some information not present in the RSS.
2. Note - there wasn't a BBC RSS feed at the time that article
was written (November 2001).
3. The OP says he wants to "adapt the example script", so I don't
know that he is even going to use it to scrape the BBC.

But I agree in principle with your point - in-house RSS feeds generated
from back-end data sources are far more robust than a bespoke solution
that is based on data acquired through an attempt to reverse a
data-to-presentation transform.

Steve
--
Stephen Hildrey
E-mail: steve@uptime.org.uk / Tel: +442071931337
Jabber: steve@jabber.earth.li / MSN: foo@hotmail.co.uk

Re: Website scraper

on 24.09.2005 18:22:20 by Matt Garrish

"Stephen Hildrey" wrote in message
news:1127574674.36019.0@demeter.uk.clara.net...
> Tad McClellan wrote:
>> Stephen Hildrey wrote:
>>>DVH wrote:
>>>>$headline = $stream->get_trimmed_text('/b') \
>>>> if ($tag->[1]{class} =~ /^h[12]$/);
>>>>
>>>>I don't understand what that backslash is doing at the end of the first
>>>>line.
>>>
>>>I think the author got mixed up between Perl and shell scripting - where
>>>'\' is used to continue across newlines.
>>
>> So, the backslash at the end of the line is escaping the newline that
>> follows it (but there is no need to escape that newline, so it does
>> not do anything that is useful).
>
> No. This is Perl - the backslash is a syntax error:
>
> $ cat > backslash.pl << _EOF && perl backslash.pl
> > use strict;
> > use warnings;
> > my $foo = "foo" \
> > if (1);
> > _EOF
> syntax error at backslash.pl line 3, near "my ="
> Execution of backslash.pl aborted due to compilation errors.
>

You shouldn't make claims you can't substantiate...

my $time = localtime(\
time);
print $time;

The interpreter can usually understand what you're trying to do, but in your
case you broke the line before the conditional "if" and it's not going to
assume that's what you meant so it gives an error. You can, however, break
it anywhere else:

my $foo = \
"foo" if (1);

or

my $foo = "foo" if
(1);

Matt
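
A caveat on the examples above, offered as a sketch and not as a definitive reading of the parser: where Perl expects a term, a backslash is the reference-taking operator, not a line continuation. The snippets compile because whitespace (including a newline) is allowed between the backslash and its operand, but what gets stored is a reference. The variable names here are only for illustration:

```perl
use strict;
use warnings;

my ($plain, $with_backslash);

# No backslash: a plain string is assigned.
$plain = "foo" if (1);

# Backslash followed by a newline: this parses as \"foo",
# so a reference to the string is assigned instead.
$with_backslash = \
    "foo" if (1);

print ref($plain) || 'not a reference', "\n";   # prints "not a reference"
print ref($with_backslash), "\n";               # prints "SCALAR"
print ${$with_backslash}, "\n";                 # prints "foo"
```

By the same reasoning, localtime(\ time) would be handed a reference to the current time and not the time value itself, so the date it prints is unlikely to be today's.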

Re: Website scraper

on 24.09.2005 18:24:14 by Matt Garrish

"Matt Garrish" wrote in message
news:_2fZe.307$Bi.106745@news20.bellglobal.com...
>
> my $foo = "foo" if

my $foo = "foo" if \

That's what I get for typing...

Matt

Re: Website scraper

on 24.09.2005 18:45:34 by Stephen Hildrey

Matt Garrish wrote:
> "Stephen Hildrey" wrote in message
> news:1127574674.36019.0@demeter.uk.clara.net...
>
>>Tad McClellan wrote:
>>
>>>Stephen Hildrey wrote:
>>>
>>>>DVH wrote:
>>>>
>>>>>$headline = $stream->get_trimmed_text('/b') \
>>>>> if ($tag->[1]{class} =~ /^h[12]$/);
>>>>>
>>>>>I don't understand what that backslash is doing at the end of the first
>>>>>line.
>>>>
>>>>I think the author got mixed up between Perl and shell scripting - where
>>>>'\' is used to continue across newlines.
>>>
>>>So, the backslash at the end of the line is escaping the newline that
>>>follows it (but there is no need to escape that newline, so it does
>>>not do anything that is useful).
>>
>>No. This is Perl - the backslash is a syntax error:
>>
>> $ cat > backslash.pl << _EOF && perl backslash.pl
>> > use strict;
>> > use warnings;
>> > my $foo = "foo" \
>> > if (1);
>> > _EOF
>> syntax error at backslash.pl line 3, near "my ="
>> Execution of backslash.pl aborted due to compilation errors.
>>
>
>
> You shouldn't make claims you can't substantiate...
>
> my $time = localtime(\
> time);
> print $time;
>
> The interpreter can usually understand what you're trying to do, but in your
> case you broke the line before the conditional "if" and it's not going to
> assume that's what you meant so it gives an error. You can, however, break
> it anywhere else:

Sorry if I was ambiguous - I was trying to maintain the structure of the
code in the OP's example, and not talking about the general case.

Steve
--
Stephen Hildrey
E-mail: steve@uptime.org.uk / Tel: +442071931337
Jabber: steve@jabber.earth.li / MSN: foo@hotmail.co.uk

Re: Website scraper

on 24.09.2005 19:21:43 by Matt Garrish

"Stephen Hildrey" wrote in message
news:1127580334.17298.1@ersa.uk.clara.net...
> Matt Garrish wrote:
>>
>> The interpreter can usually understand what you're trying to do, but in
>> your case you broke the line before the conditional "if" and it's not
>> going to assume that's what you meant so it gives an error. You can,
>> however, break it anywhere else:
>
> Sorry if I was ambiguous - I was trying to maintain the structure of the
> code in the OP's example, and not talking about the general case.
>

Clarity is key. I read your comment as a reference to Perl syntax in
general. The OP's does cause a compilation error, as you were alluding to.

Matt

Re: Website scraper

on 24.09.2005 19:41:20 by DVH

Stephen Hildrey wrote in message
news:1127574674.36019.0@demeter.uk.clara.net...

>
> OP: if you are still experiencing difficulties with the code, do post
> back - I'm sure we'll be able to help :-)

Thanks Stephen.

I removed the backslash, and tidied up a couple of other obvious bugs. My
script now runs through the HTML and successfully creates a well-formatted
RSS file. It's an empty file though, so I think the next stage is to look at
the order of the tags and make sure the script can actually find what it's
looking for.

It isn't immediately obvious how to do this, so I may indeed come back...
thanks for the offer.

[I'm doing this because I want to scrape other sites which don't have an RSS
feed - as you mention elsewhere in the thread, there are numerous uses for
this sort of scraping. But it seemed logical to start with the technique
described in the tutorial].
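
One possible way to check the tag order is to dump every candidate tag with its class attribute before trying to extract anything. This is only a sketch: the sample markup, the choice of 'div' and 'p' as the tags to inspect, and the variable names are assumptions, not what the BBC page or the tutorial script actually uses. It relies on HTML::TokeParser's new(), get_tag() and get_trimmed_text().

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup; a real page would be fetched or saved first.
my $html = <<'HTML';
<div class="h1"><b>First headline</b></div>
<p class="body">Some text</p>
<div class="h2"><b>Second headline</b></div>
HTML

my $stream = HTML::TokeParser->new(\$html)
    or die "Cannot create parser: $!";

my @headlines;
while (my $tag = $stream->get_tag('div', 'p')) {
    # $tag->[0] is the tag name, $tag->[1] is the attribute hash.
    my $class = defined $tag->[1]{class} ? $tag->[1]{class} : '(none)';
    print "<$tag->[0]> class=$class\n";

    # Same test as the tutorial line: only classes h1 or h2 count.
    push @headlines, $stream->get_trimmed_text('/b')
        if ($class =~ /^h[12]$/);
}

print "Found: $_\n" for @headlines;
```

If the dump shows no tags with the expected class, the regex or the tag names passed to get_tag() are the first things to adjust.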

Re: Website scraper

on 26.09.2005 15:17:53 by Glenn Jackman

At 2005-09-24 11:11AM, Stephen Hildrey wrote:
> No. This is Perl - the backslash is a syntax error:
>
> $ cat > backslash.pl << _EOF && perl backslash.pl
> > use strict;
> > use warnings;
> > my $foo = "foo" \
> > if (1);
> > _EOF
> syntax error at backslash.pl line 3, near "my ="
> Execution of backslash.pl aborted due to compilation errors.

No, your shell is doing variable substitution:

$ foo=bar cat > foo.pl << _EOF
> my $foo = `date`;
> _EOF
$ cat foo.pl
my bar = Mon Sep 26 09:15:37 EDT 2005;

That's the perl syntax error you're seeing.

If you want to use shell here-docs to type perl programs, single-quote
your delimiter:

$ foo=bar cat > foo.pl << '_EOF'
> my $foo = `date`;
> _EOF
$ cat foo.pl
my $foo = `date`;
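
The shell can also be taken out of the picture entirely. The following is a rough sketch that compiles both layouts with a string eval; the snippet text and the %result hash are illustrative, and the exact error message Perl prints is deliberately not shown because it varies between versions:

```perl
use strict;
use warnings;

# The OP's layout: a backslash after a complete expression, before "if".
my $with_backslash = <<'CODE';
my $foo;
$foo = "foo" \
    if (1);
1;
CODE

# The same statement with the backslash removed.
my $without_backslash = <<'CODE';
my $foo;
$foo = "foo"
    if (1);
1;
CODE

my %result;
for my $case ([with => $with_backslash], [without => $without_backslash]) {
    my ($label, $code) = @$case;
    my $ok = eval $code;    # string eval compiles and runs the snippet
    $result{$label} = $ok ? 1 : 0;
    print "$label backslash: ",
        ($ok ? "compiles" : "fails to compile"), "\n";
}
```

With the single-quoted here-docs nothing is interpolated, so any failure reported for the first snippet comes from Perl itself and not from shell substitution.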

--
Glenn Jackman
NCF Sysadmin
glennj@ncf.ca
