Using Curl to replicate a site

on 10.12.2009 16:55:34 by Ashley Sheridan

Hi,

I need to replicate a site on another domain, and in this case an
iframe won't really do, as I need to remove some of the graphics, etc.
around the content. The owner of the site has asked for it to be
duplicated. Unfortunately, because of the CMS he uses (which is owned by
his hosting company), I need a way to replicate the site as a microsite
on an existing domain while keeping it always up to date.

I'm fine using Curl to grab the site, and even alter the content that is
returned, but I was thinking about a caching mechanism. Has anyone any
suggestions on this?

Thanks,
Ash
http://www.ashleysheridan.co.uk



Re: Using Curl to replicate a site

on 10.12.2009 17:10:18 by Robert Cummings

Sounds like you're creating a proxy with post processing/caching on the
forwarded content. It should be fairly straightforward to direct page
requests to your proxy app, then make the remote request, and
post-process, cache, then send to the browser. The only gotcha will be
for forms if you do caching.

Cheers,
Rob.
--
http://www.interjinn.com
Application and Templating Framework for PHP

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

Re: Using Curl to replicate a site

on 10.12.2009 17:15:18 by Ashley Sheridan

The only forms are processed on another site, so there's nothing I can
really do about that, as they return to the original site.

How would I go about doing what you suggested, though? I'd assumed I
would use Curl, but your email suggests not to?

Thanks,
Ash
http://www.ashleysheridan.co.uk



Re: Using Curl to replicate a site

on 10.12.2009 17:19:47 by Joseph Thayne

If the site can be a few minutes behind (say 15-30 minutes), then what
I recommend is to create a caching script that will update the necessary
files if the md5 checksum has changed at all (or a specified time period
has passed). Then store those files locally, and run local copies of the
files. Your performance will be much better than if you have to request
the page from another server every time. You could run this script
every 15-30 minutes depending on your needs via a cron job.
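
As a rough illustration of that idea, here is a minimal PHP sketch of
a refresh step that rewrites a local copy only when the md5 checksum of
the fetched page differs. The function names refresh_page() and
curl_fetch(), the file paths, and the timeout are assumptions made up
for this example, not anything from the thread.

```php
<?php
// Hypothetical helper: refresh one cached page if its content changed.
// $fetch is any callable returning the page body, or false on failure.
function refresh_page(string $url, string $cacheFile, callable $fetch): bool
{
    $body = $fetch($url);
    if ($body === false || $body === '') {
        return false; // keep the old copy if the remote request failed
    }
    // Compare checksums so the file is rewritten only on a real change.
    if (is_file($cacheFile) && md5_file($cacheFile) === md5($body)) {
        return false;
    }
    // Write to a temp file, then rename, so readers never see a partial file.
    $tmp = $cacheFile . '.tmp';
    file_put_contents($tmp, $body);
    rename($tmp, $cacheFile);
    return true;
}

// Hypothetical cURL-based fetcher to pass in as $fetch.
function curl_fetch(string $url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_FAILONERROR, true);    // treat HTTP >= 400 as failure
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed timeout, in seconds
    return curl_exec($ch);
}
```

A cron entry along the lines of `*/15 * * * * php /path/to/refresh.php`
(path hypothetical) could then call
refresh_page($url, $file, 'curl_fetch') for each page to be mirrored.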

Joseph

Re: Using Curl to replicate a site

on 10.12.2009 17:23:54 by Robert Cummings

Ashley Sheridan wrote:
> The only forms are processed on another site, so there's nothing I can
> really do about that, as they return to the original site.
>
> How would I go about doing what you suggested, though? I'd assumed I
> would use Curl, but your email suggests not to?

Nope, wasn't suggesting not to. You can use many techniques, but cURL is
probably the most robust. The best way to facilitate this, IMHO, is to
have a rewrite rule that directs all traffic for the proxy site to your
application. Then rewrite the REQUEST_URI to point to the page on the
real domain. Then check your cache for the content and if empty use cURL
to retrieve the content, apply your post-processing (to strip out what
you don't want and apply a new page layout or whatever), then cache (if
not already cached) the content (this can be a simple database table
with the request URI and a timestamp), then output the content.
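
The flow described above could be sketched roughly as follows. This is
only an illustration under several assumptions: it swaps the database
table for a plain file cache keyed by the request URI, and the names
UPSTREAM, CACHE_TTL, cache_path(), fetch_remote(), post_process() and
serve(), the origin hostname, and the one-hour lifetime are all
invented for the example.

```php
<?php
const UPSTREAM  = 'http://original.example.com'; // assumed origin site
const CACHE_TTL = 3600;                           // assumed lifetime, in seconds

// Map a request URI to a cache file (file cache instead of a DB table).
function cache_path(string $cacheDir, string $requestUri): string
{
    return rtrim($cacheDir, '/') . '/' . md5($requestUri) . '.html';
}

// Fetch a page with cURL; returns the body, or false on failure.
function fetch_remote(string $url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_FAILONERROR    => true,
        CURLOPT_TIMEOUT        => 10,
    ]);
    return curl_exec($ch);
}

// Placeholder: strip unwanted markup and apply the new layout here.
function post_process(string $html): string
{
    return $html;
}

// Return the page for $requestUri, from cache if fresh, else from upstream.
// $cacheDir must already exist and be writable.
function serve(string $requestUri, string $cacheDir, callable $fetch, int $ttl = CACHE_TTL)
{
    clearstatcache();
    $file = cache_path($cacheDir, $requestUri);
    if (is_file($file) && time() - filemtime($file) < $ttl) {
        return file_get_contents($file); // fresh cache hit
    }
    $html = $fetch(UPSTREAM . $requestUri);
    if ($html === false) {
        // Upstream failed: fall back to a stale copy if there is one.
        return is_file($file) ? file_get_contents($file) : false;
    }
    $html = post_process($html);
    file_put_contents($file, $html, LOCK_EX);
    return $html;
}
```

With a rewrite rule sending every request to one script, that script
might simply do
`echo serve($_SERVER['REQUEST_URI'], __DIR__ . '/cache', 'fetch_remote');`.
Passing the fetcher in as a callable keeps the caching logic testable
without touching the network.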

Cheers,
Rob.
--
http://www.interjinn.com
Application and Templating Framework for PHP

Re: Using Curl to replicate a site

on 10.12.2009 17:25:27 by Robert Cummings

Joseph Thayne wrote:
> If the site can be a few minutes behind (say 15-30 minutes), then what
> I recommend is to create a caching script that will update the necessary
> files if the md5 checksum has changed at all (or a specified time period
> has passed). Then store those files locally, and run local copies of the
> files. Your performance will be much better than if you have to request
> the page from another server every time. You could run this script
> every 15-30 minutes depending on your needs via a cron job.

Use URL rewriting or capture 404 errors to handle the proxy request. No
need to download and cache the entire site if everyone is just
requesting the homepage.

Cheers,
Rob.
--
http://www.interjinn.com
Application and Templating Framework for PHP

Re: Using Curl to replicate a site

on 10.12.2009 17:25:52 by Ashley Sheridan

Yeah, I was going to use the page request to trigger the caching
mechanism, as it's unlikely that all pages will be equally popular.
I'll let you all know how it goes!

Thanks,
Ash
http://www.ashleysheridan.co.uk



Re: Using Curl to replicate a site

on 11.12.2009 16:48:04 by Ashley Sheridan

Well, I got it working just great in the end. Aside from the odd issue
with relative URLs used to reference images and JavaScript files, which
I had to sort out, everything seems to be working fine and is live. I've
got it on a 12-hour refresh, as the site will probably not be changing
very often at all. Thanks for all the pointers!
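
For anyone hitting the same relative-URL problem, one possible
approach (not necessarily the one used here) is to rewrite relative
src attributes against the URL of the original page before caching the
HTML. The function names resolve_url() and absolutize_src() are
hypothetical, and the sketch deliberately ignores "../" normalisation
and stylesheet href attributes, which would need the same treatment.

```php
<?php
// Resolve $url against the page it appeared on ($pageUrl).
function resolve_url(string $url, string $pageUrl): string
{
    // Absolute (scheme:), protocol-relative (//host) or empty: leave alone.
    if ($url === '' || preg_match('#^([a-z][a-z0-9+.-]*:|//)#i', $url)) {
        return $url;
    }
    $p = parse_url($pageUrl);
    $origin = $p['scheme'] . '://' . $p['host']
            . (isset($p['port']) ? ':' . $p['port'] : '');
    if ($url[0] === '/') {
        return $origin . $url; // root-relative path
    }
    // Document-relative path: prepend the directory of the current page.
    $path = $p['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $url;
}

// Make every quoted src="..." attribute in $html absolute.
function absolutize_src(string $html, string $pageUrl): string
{
    return preg_replace_callback(
        '/\bsrc=(["\'])(.*?)\1/i',
        function (array $m) use ($pageUrl): string {
            return 'src=' . $m[1] . resolve_url($m[2], $pageUrl) . $m[1];
        },
        $html
    ) ?? $html;
}
```

Only src attributes are touched so that ordinary page links keep
pointing at the microsite rather than back at the original domain.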

Thanks,
Ash
http://www.ashleysheridan.co.uk


