Any ideas how to read a url that's changed by the server?

on 20.08.2007 17:48:23 by CSTechie

I apologize, but I posted this in the php general forum earlier and
realized that this is the more appropriate forum. Hopefully there's a
coder here who has done this in the past.

I've got code that uses CURL to go to a web page to read the data.

When I type in www.website.com, the server automatically adds a
session variable to the url. I need to be able to read that session
variable. Then I will use that session variable to input into a new
CURL session.

Any ideas how I can do this?

If I use code like this:

// find out the domain:
$domain = $_SERVER['HTTP_HOST'];
// find out the path to the current file:
$path = $_SERVER['SCRIPT_NAME'];

It gives me the host and path of where my script is sitting on my server
rather than the values for the web site that I'm trying to read.

Any ideas?

Thanks for your time!

Re: Any ideas how to read a url that's changed by the server?

on 20.08.2007 20:01:47 by Andy Hassall

On Mon, 20 Aug 2007 15:48:23 -0000, TechieGrl wrote:

>I apologize, but I posted this in the php general forum earlier and
>realized that this is the more appropriate forum. Hopefully there's a
>coder here who has done this in the past.
>
>I've got code that uses CURL to go to a web page to read the data.
>
>When I type in www.website.com, the server automatically adds a
>session variable to the url. I need to be able to read that session
>variable. Then I will use that session variable to input into a new
>CURL session.

As in, it redirects to something like http://example.com/?SESSIONID=blah ?

In which case, tell cURL to follow redirects:

http://uk.php.net/manual/en/function.curl-setopt.php

with option CURLOPT_FOLLOWLOCATION.

Then read the "effective URL" from the handle with:

http://uk.php.net/manual/en/function.curl-getinfo.php

with option CURLINFO_EFFECTIVE_URL.

You should then be able to extract the session ID from that using your choice
of text matching function.
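
A minimal sketch of the steps above, assuming the server puts the ID in a
query parameter named SESSIONID as in the example URL. The function names
fetch_effective_url() and extract_session_id() are made up for illustration,
and the regex is only one possible choice of text matching:

```php
<?php
// Sketch only: follow redirects and report the URL we finally landed on.
// Returns null if the transfer fails.
function fetch_effective_url(string $startUrl): ?string
{
    $ch = curl_init($startUrl);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow Location: redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    $body = curl_exec($ch);
    if ($body === false) {
        return null;
    }
    return curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
}

// Pull the value of one query-string parameter out of a URL with a regex.
// The default parameter name SESSIONID is an assumption taken from the example URL.
function extract_session_id(string $url, string $param = 'SESSIONID'): ?string
{
    $pattern = '/[?&]' . preg_quote($param, '/') . '=([^&#]+)/';
    return preg_match($pattern, $url, $m) === 1 ? urldecode($m[1]) : null;
}
```

Typical use would be extract_session_id(fetch_effective_url('http://example.com/')),
with a null check on the result of the fetch first.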

--
Andy Hassall :: andy@andyh.co.uk :: http://www.andyh.co.uk
http://www.andyhsoftware.co.uk/space :: disk and FTP usage analysis tool

Re: Any ideas how to read a url that's changed by the server?

on 20.08.2007 22:57:01 by CSTechie

> You should then be able to extract the session ID from that using your choice
> of text matching function.
>


Now that I've extracted the unique key from the url, I'm having
problems opening the actual pages. This unique key appears to be
their session variable, so their server treats my first hit as
session 1 and assigns it an id. Then when I send in a new url with
that session id, the server treats it as a new session, and the
original session id is no longer valid.

Any thoughts?

Re: Any ideas how to read a url that's changed by the server?

on 20.08.2007 23:12:44 by luiheidsgoeroe

On Mon, 20 Aug 2007 22:57:01 +0200, TechieGrl wrote:

>
>> You should then be able to extract the session ID from that using your
>> choice
>> of text matching function.
>>
>
>
> Now that I've extracted the unique key from the url, I'm finding that
> I'm having problems opening up the actual pages. This unique key
> appears to be their session variable and so now their server views the
> first hit to the server as session 1 and assigns it an id. Then when I
> send in a new url with the session id, it views it as being a new
> session and so that unique session id is no longer valid.
>
> Any thoughts?

Probably you need to use cookies.



//create sort of 'anonymous' cookiefile in temporary directory
//(tempnam() needs a directory and a filename prefix)
$cookiefile = tempnam(sys_get_temp_dir(), 'curl');

//initialize curl
$c = curl_init();

//follow redirects
curl_setopt($c,CURLOPT_FOLLOWLOCATION,true);

//store & retrieve cookies
curl_setopt($c,CURLOPT_COOKIEFILE,$cookiefile);
curl_setopt($c,CURLOPT_COOKIEJAR,$cookiefile);

//some people still think referrer should be checked, hideous:
curl_setopt($c,CURLOPT_AUTOREFERER,true);

//set the url
curl_setopt($c, CURLOPT_URL, "http://www.example.com/");

//and go:
curl_exec($c);

//and close
curl_close($c);

//and delete cookiefile
unlink($cookiefile);

--
Rik Wasmus

Re: Any ideas how to read a url that's changed by the server?

on 21.08.2007 00:16:05 by CSTechie

> //set the url
> curl_setopt($c, CURLOPT_URL, "http://www.example.com/");


Following the example as you have it worked great and gave me the
initial page information! But the problem is that I am not sure how
to get to the page that I really need given how the url is created.

I need to hit this page - www.example.com/sessionId/page.html

My initial thought is to go to the main web site - www.example.com.
When I go to that site, I'm automatically redirected to a page that
has the session variable inserted into the url - www.example.com/sessionId/page.html

page.html is actually where the data is that I'm grabbing.

It seems as if I need to send in 2 CURLOPT_URL values, but that's
where the session variable becomes a problem because it now thinks
that I have 2 separate sessions.

Maybe I'm approaching this all wrong.

Re: Any ideas how to read a url that's changed by the server?

on 21.08.2007 02:56:04 by luiheidsgoeroe

On Tue, 21 Aug 2007 00:16:05 +0200, TechieGrl wrote:

>
>> //set the url
>> curl_setopt($c, CURLOPT_URL, "http://www.example.com/");
>
>
> Following the example as you have it worked great and gave me the
> initial page information! But the problem is that I am not sure how
> to get to the page that I really need given how the url is created.
>
> I need to hit this page - www.example.com/sessionId/page.html
>
> My initial thought is to go to the main web site - www.example.com.
> When I go to that site, I'm automatically redirected to a page that
> has the session variable inserted into the url -
> www.example.com/sessionId/page.html
>
> page.html is actually where the data is that I'm grabbing.
>
> It seems as if I need to send in 2 CURLOPT_URL values, but that's
> where the session variable becomes a problem because it now thinks
> that I have 2 separate sessions.

Requesting and discarding several pages before you reach the 'real' data
shouldn't be a problem when done like this.

> Maybe I'm approaching this all wrong.

If you have a cookie with a session-id, you probably don't need it in the URL
(it might be required though; I don't know which site this is).
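
A rough sketch of that idea: two requests on the same cURL handle, so any
cookies from the first hit carry over to the second. It assumes the session
ID is the first path segment of the redirected URL (as in
www.example.com/sessionId/page.html); session_segment() and
fetch_in_session() are hypothetical names. Instead of a cookie file it sets
CURLOPT_COOKIEFILE to an empty string, which turns on cURL's in-memory
cookie handling without reading or writing a file:

```php
<?php
// Assumption: the first path segment of the URL we were redirected to is the session ID.
function session_segment(string $effectiveUrl): ?string
{
    $path = parse_url($effectiveUrl, PHP_URL_PATH);
    if (!is_string($path)) {
        return null; // no path at all, e.g. http://www.example.com
    }
    $segments = explode('/', trim($path, '/'));
    return $segments[0] !== '' ? $segments[0] : null;
}

// Hypothetical helper: visit the front page, then fetch $page within the same session.
function fetch_in_session(string $baseUrl, string $page): ?string
{
    $c = curl_init();
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($c, CURLOPT_COOKIEFILE, ''); // empty string: keep cookies in memory only

    // First request: the body is discarded, we only want cookies and the ending URL.
    curl_setopt($c, CURLOPT_URL, $baseUrl);
    if (curl_exec($c) === false) {
        return null;
    }
    $id = session_segment(curl_getinfo($c, CURLINFO_EFFECTIVE_URL));
    if ($id === null) {
        return null;
    }

    // Second request on the SAME handle, so the server sees one session.
    curl_setopt($c, CURLOPT_URL, rtrim($baseUrl, '/') . '/' . $id . '/' . ltrim($page, '/'));
    $html = curl_exec($c);
    return $html === false ? null : $html;
}
```

If the site rotates the ID on every response, the ending URL would need to
be rechecked after each request rather than only once.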

--
Rik Wasmus

Re: Any ideas how to read a url that's changed by the server?

on 21.08.2007 15:44:12 by CSTechie

> Requesting and discarding several pages before you enter the 'real' data
> shouldn't be a problem like this.
>
>
> If you have a cookie with a session-id, you probably don't need it in the URL
> (might be required though, I don't know which site).


Here's an example of a redirect - not the same site that I'm using,
but you can see what happens here.

When I type in http://my.opera.com, I am redirected to http://my.opera.com/community

Then when I click on a link, I go to a page that includes "community"
in the url - http://my.opera.com/community/blog/2007/08/17/member-of-the-week


I need to get from my.opera.com to the last url, but if the word
"community" was actually a changing session ID, then I would need to
check for that each time prior to getting to the page I really want,
member-of-the-week.

Does that make sense?

Re: Any ideas how to read a url that's changed by the server?

on 21.08.2007 16:01:07 by luiheidsgoeroe

On Tue, 21 Aug 2007 15:44:12 +0200, TechieGrl wrote:

>
>> Requesting and discarding several pages before you enter the 'real' data
>> shouldn't be a problem like this.
>>
>>
>> If you have a cookie with a session-id, you probably don't need it in the
>> URL
>> (might be required though, I don't know which site).
>
>
> Here's an example of a redirect - not the same site that I'm using,
> but you can see what happens here.
>
> When I type in http://my.opera.com, I am redirected to
> http://my.opera.com/community
>
> Then when I click on a link, I go to a page that includes "community"
> in the url -
> http://my.opera.com/community/blog/2007/08/17/member-of-the-week
>
>
> I need to get from my.opera.com to the last url, but if the word
> "community" was actually a changing session ID, then I would need to
> check for that each time prior to getting to the page I really want,
> member-of-the-week.
>
> Does that make sense?

Could very well be. It all depends on how they implemented the session. If
you enable cookies in CURL, on most sites you'll just use the cookies,
without having to check the url. If the site enforces a GET session-id, you'll
have to check for it & keep adding it to subsequent requests (rechecking
for changes, etc).

As said, you'll have to use curl_getinfo() to check for the ending URL, and
possibly use curl_setopt() to get some headers which might be important.
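
One way to get at those headers, as a sketch rather than the only option:
CURLOPT_HEADER puts the response headers in front of the body, and
CURLINFO_HEADER_SIZE says where the headers end. The helper names
fetch_with_headers() and redirect_chain() are invented for this example:

```php
<?php
// Fetch a URL and return its raw headers and body separately (null on failure).
function fetch_with_headers(string $url): ?array
{
    $c = curl_init($url);
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($c, CURLOPT_HEADER, true); // include response headers in the output
    $out = curl_exec($c);
    if ($out === false) {
        return null;
    }
    // Covers the headers of every response, including the redirects that were followed.
    $size = curl_getinfo($c, CURLINFO_HEADER_SIZE);
    return ['headers' => substr($out, 0, $size), 'body' => substr($out, $size)];
}

// Collect every Location: header from the raw header text, in the order received.
function redirect_chain(string $rawHeaders): array
{
    preg_match_all('/^Location:\s*(.+?)\s*$/mi', $rawHeaders, $m);
    return $m[1];
}
```

Running redirect_chain() on the 'headers' part shows each hop the server
sent, which helps when the session ID appears in an intermediate redirect.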

Useful functions here are also parse_url() & parse_str() for the returned
(ending) url. And if it doesn't work, check with a 'normal' browser what
redirects/headers get sent (Fiddler for MSIE & LiveHTTPHeaders for FF come
to mind), copy that to curl, and remove them again one by one until you're
left with the ones that really matter. It's all about discovering
(knowing/asking (would be fastest...)) what the actual inner workings of
the site are.
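
For instance, a small sketch of taking the ending URL apart with those two
functions. describe_url() is a hypothetical helper, and which path segment
or query parameter actually holds the session ID depends on the site:

```php
<?php
// Split a URL into its non-empty path segments and its query parameters.
function describe_url(string $url): array
{
    $parts = parse_url($url);
    $path = (is_array($parts) && isset($parts['path'])) ? $parts['path'] : '';

    $query = [];
    if (is_array($parts) && isset($parts['query'])) {
        parse_str($parts['query'], $query); // fills $query with name => value pairs
    }

    $segments = array_filter(explode('/', $path), function ($s) {
        return $s !== '';
    });

    return ['segments' => array_values($segments), 'query' => $query];
}
```

With a path-style ID the value would be looked up in 'segments', with a GET
session-id it would be in 'query'.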

Keep in mind that CURL works great as long as the site doesn't use
javascript for some critical browsing/displaying/session functions. If it
does, you're in for a very painstaking translation of the critical
javascript code to the actual actions, which may well break in future
with even a minimal change in the setup of the site.
--
Rik Wasmus