#1: Scraping Multiple sites

Posted on 2010-10-03 03:03:59 by Russell Dias

I'm currently stuck on a little problem. I'm using cURL in conjunction
with DOMDocument and XPath to scrape data from a couple of websites.
Please note that this is only for personal and educational purposes.

Right now I have 5 independent scripts (one for each of 5 websites)
that run via crontab every 12 hours. However, as you may have guessed,
this is a scalability nightmare. If my list of websites to scrape
grows, I have to create another independent script and add another
cron entry for it.

My knowledge of OOP is fairly basic as I have just gotten started with
it. However, could anyone perhaps suggest a design pattern that would
suit my needs? My solution would be to create an abstract class for
the web crawler and then simply extend it for each website I add.
However, as I said, my experience with OOP is almost non-existent,
so I have no idea how this would scale. I want this 'crawler' to be
one application that runs from a single cron job, rather than having
n scripts (one per website) and having to manually create a cron entry
each time.
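
A minimal sketch of the abstract-class idea described above, written in
Python because it is the alternative named in this post (the same shape
maps onto a PHP abstract class). Every name here (SiteScraper,
ExampleTitles, run_all) and the example.com URL are hypothetical, and
the regex stands in for real DOM/XPath parsing only to keep the sketch
free of third-party dependencies.

```python
import re
from abc import ABC, abstractmethod
from urllib.request import urlopen


class SiteScraper(ABC):
    """Shared fetch logic lives here; each site supplies its own parse()."""

    url = ""  # hypothetical attribute: each subclass sets the page it scrapes

    def fetch(self):
        # Shared download step (the role cURL plays in the PHP scripts).
        with urlopen(self.url, timeout=30) as response:
            return response.read().decode("utf-8", errors="replace")

    @abstractmethod
    def parse(self, html):
        """Extract the wanted items from one page of HTML."""

    def run(self):
        return self.parse(self.fetch())


class ExampleTitles(SiteScraper):
    url = "http://example.com/"  # placeholder, not a site from this thread

    def parse(self, html):
        # Stand-in for DOM/XPath extraction: collect the text of <h1> tags.
        return re.findall(r"<h1[^>]*>(.*?)</h1>", html, flags=re.S)


def run_all(scrapers):
    """Single entry point for one cron job: run every registered scraper."""
    results = {}
    for scraper in scrapers:
        name = type(scraper).__name__
        try:
            results[name] = scraper.run()
        except Exception as exc:  # one failing site should not stop the rest
            results[name] = "failed: {}".format(exc)
    return results
```

With this layout, adding a website means adding one subclass and one
entry in the list passed to run_all(), for example
run_all([ExampleTitles()]), while the cron entry stays the same.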

Or does anyone have any experience with this sort of thing and could
maybe offer some advice?

I'm not limited to using PHP either; however, due to hosting
constraints, Python would most likely be my only other alternative.

Any help would be appreciated.

Cheers,
Russell

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php


#2: Re: Scraping Multiple sites

Posted on 2010-10-03 06:21:45 by Chris H

Are the sites that you are crawling so different as to justify
maintaining a separate chunk of code for each one? I would try to avoid
having any site-specific code; otherwise, scaling your application to
support even a hundred sites would mean maintaining hundreds of
overlapping pieces of functionality, which would be a logistical
nightmare. Unless you simply want to do this for educational reasons...

My suggestion would be to attempt to create an application that can crawl
all the sites, without specifics for each one. You could fire it with a
single cron job and give it a list of the URLs you want it to hit. It can
crawl one URL, record the findings, move to the next, and repeat.
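
A rough sketch of that loop, in Python since the original poster named it
as an option. It assumes the "findings" are just each page's title and
link targets, which is a guess at what a site-agnostic crawler might
record; LinkAndTitleParser, download and crawl are hypothetical names,
and only the standard library is used.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkAndTitleParser(HTMLParser):
    """Generic extraction: works on any page, with no per-site rules."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def download(url):
    # Default fetcher; any callable taking a URL and returning HTML will do.
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(urls, fetch=download):
    """Crawl one URL, record the findings, move to the next, repeat."""
    findings = []
    for url in urls:
        try:
            parser = LinkAndTitleParser()
            parser.feed(fetch(url))
            findings.append({
                "url": url,
                "title": parser.title.strip(),
                "links": parser.links,
            })
        except Exception as exc:  # record the failure and keep going
            findings.append({"url": url, "error": str(exc)})
    return findings
```

The single cron job would then call crawl() with the URL list, which
could be read from a plain text file so that adding a site needs no code
change.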


Chris.
