user agents

on 01.12.2004 10:25:51 by zed.lopez

I just found some behavior that surprised me.

LWP::Simple's perldoc says:

The user agent created by this module will identify itself as
"LWP::Simple/#.##" (where "#.##" is the libwww-perl version number)
and will initialize its proxy defaults from the environment (by
calling $ua->env_proxy).

But it doesn't mention that the get method ends up not using it,
allowing _trivial_http_get to write its own User Agent string.

get results in this in a weblog:
69.109.167.40 - - [01/Dec/2004:00:41:54 -0800] "GET / HTTP/1.0" 200
44222 "-" "lwp-trivial/1.40"

getprint results in this:
69.109.167.40 - - [01/Dec/2004:00:41:59 -0800] "GET / HTTP/1.1" 200
44222 "-" "LWP::Simple/5.79"

I just went through some hair-pulling debugging 'cause getprint was
working where get was failing, apparently because the site's robots.txt was
allowing one and blocking the other. It's also
striking that they use different HTTP versions.

I'd like to suggest these differences be documented. Does anyone know
why _trivial_http_get uses its own user agent and HTTP version?

Zed

Re: user agents

on 01.12.2004 10:35:13 by gisle

Zed Lopez writes:

> I'd like to suggest these differences be documented.

I agree this is wrong. Do you want to suggest a doc patch?

> Does anyone know why _trivial_http_get uses its own user agent and
> HTTP version?

Because it is a totally different client implementation with its own
bugs and limitations. You can force always using the full LWP client
implementation by importing $ua from LWP::Simple.

Regards,
Gisle
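
For readers following along, a minimal sketch of the import described above. Listing `$ua` in the import list is the only step taken from the thread; the URL in the commented-out lines is a placeholder, and the exact agent string printed depends on the installed libwww-perl version.

```perl
use strict;
use warnings;

# Importing $ua alongside get() is what, per the thread, makes LWP::Simple
# route get() through the full LWP::UserAgent client.
use LWP::Simple qw(get $ua);

# $ua is the shared LWP::UserAgent object that LWP::Simple created.
print "agent: ", $ua->agent, "\n";

# Placeholder URL (an assumption for illustration); uncomment to fetch.
# my $content = get('http://www.example.com/');
# print defined $content ? length($content) . " bytes\n" : "request failed\n";
```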

HTML-Parser-3.41

on 01.12.2004 10:52:22 by gisle

HTML-Parser-3.41 is available from CPAN. The major news is that
HTML::Parser should now do the right thing with Unicode strings and
that the compile time option to enable Unicode entities is gone.
There is a new 'utf8_mode' that allows saner parsing of raw undecoded
UTF-8. The Unicode support is only available if you use perl-5.8 or
better.
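
A small sketch of what 'utf8_mode' is for, assuming HTML-Parser 3.41 or later on perl 5.8+. The sample input and the variable names are made up for illustration; the idea is that entities in 'dtext' get expanded as UTF-8 bytes so they stay compatible with the raw undecoded text around them.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Raw, undecoded UTF-8 bytes: "cafe" with e-acute, followed by two entities.
my $bytes = "<p>caf\xC3\xA9 &amp; &eacute;</p>";

my $text = '';
my $p = HTML::Parser->new(
    api_version => 3,
    # 'dtext' is the text with entities decoded; it may arrive in pieces,
    # so the handler appends each piece.
    text_h => [ sub { $text .= shift }, 'dtext' ],
);

# Expand entities as UTF-8 bytes so they match the surrounding raw text.
$p->utf8_mode(1);

$p->parse($bytes);
$p->eof;

# Print the collected text as hex bytes to make the encoding visible.
print join(' ', map { sprintf '%02x', ord } split //, $text), "\n";
```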

Other noteworthy recent changes:

- content parsed in literal mode

- <script> and <style> skip quoted strings when looking for
  matching end tag

- if no matching end tag is found for <script>, <style>, <xmp>,
  <title>, <textarea> then generate one where the next tag
  starts.

- will decode unterminated entities in 'dtext', i.e. "foo&nbspbar"
  becomes "foo\xA0bar".

Enjoy!

Re: user agents

on 02.12.2004 09:18:44 by zed.lopez

On 01 Dec 2004 01:35:13 -0800, Gisle Aas <gisle@activestate.com> wrote:
> Zed Lopez <zed.lopez@gmail.com> writes:
> > I'd like to suggest these differences be documented.
>
> I agree this is wrong. Do you want to suggest a doc patch?

I'm working on the doc patch... would it be considered desirable to
document that a user can get get() to drive HTTP::Request by setting
$LWP::Simple::FULL_LWP to a true value? Or that one can use get_old()
to drive HTTP::Request?

Obviously, no one wants to add a lot of complexity to a ::Simple
module, but right now the behavior includes: the user agent and HTTP
version are subject to change if an HTTP proxy is in use or if the
requested page does a redirect. And there's no way to code around that
within this module's public interface.

The choices seem to be to direct users to live with that and to avoid
LWP::Simple if they can't, or to document at least one of $FULL_LWP
and get_old().

Zed

Re: user agents

on 02.12.2004 09:39:56 by gisle

Zed Lopez <zed.lopez@gmail.com> writes:

> On 01 Dec 2004 01:35:13 -0800, Gisle Aas <gisle@activestate.com> wrote:
> > Zed Lopez <zed.lopez@gmail.com> writes:
> > > I'd like to suggest these differences be documented.
> >
> > I agree this is wrong. Do you want to suggest a doc patch?
>
> I'm working on the doc patch... would it be considered desirable to
> document that a user can get get() to drive HTTP::Request by setting
> $LWP::Simple::FULL_LWP to a true value? Or that one can use get_old()
> to drive HTTP::Request?
>
> Obviously, no one wants to add a lot of complexity to a ::Simple
> module, but right now the behavior includes: the user agent and HTTP
> version are subject to change if an HTTP proxy is in use or if the
> requested page does a redirect. And there's no way to code around that
> within this module's public interface.

It is documented (barely) that the module exports the variable '$ua'.
A side effect of importing this variable is that this forces the full
LWP::UserAgent implementation to be used, otherwise settings on the
$ua object would have no effect. I want to declare this as the
official interface to force this and not document either get_old or
$FULL_LWP.

Regards,
Gisle

Re: user agents

on 03.12.2004 09:49:45 by gisle

Mattias Holmlund <u1@m1.holmlund.se> writes:

> Gisle Aas wrote:
>
> >It is documented (barely) that the module exports the variable '$ua'.
> >A side effect of importing this variable is that this forces the full
> >LWP::UserAgent implementation to be used, otherwise settings on the
> >$ua object would have no effect. I want to declare this as the
> >official interface to force this and not document either get_old or
> >$FULL_LWP.
>
> In HTTP::Cache::Transparent, I override LWP::UserAgent::simple_request
> and set $LWP::Simple::FULL_LWP to 1 to make sure that the full
> LWP::UserAgent implementation is always used. I do this without
> actually loading LWP::Simple, since HTTP::Cache::Transparent doesn't
> need LWP::Simple and some users may prefer to use LWP::UserAgent
> instead of LWP::Simple.
>
> With your proposal of an "official interface", I have to load
> LWP::Simple to be able to achieve the same effect. This is not a big
> issue, but it will consume memory unnecessarily. Do you think I should
> change this in HTTP::Cache::Transparent?

I would just keep it as it is. This seems reasonable as long as you are
aware that you are playing outside of the documented interface and are
prepared to deal with the consequences if the internals of
LWP::Simple should change in the future.

Regards,
Gisle
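
As an illustration of the "settings on the $ua object" mentioned in the thread, here is a hedged sketch that imports `$ua` and configures it before any request is made. The agent name `MyFetcher/0.1` and the 10-second timeout are hypothetical values picked for the example, not anything recommended in the discussion.

```perl
use strict;
use warnings;

# Import $ua together with the functions that should share its settings.
use LWP::Simple qw(get getprint $ua);

# Hypothetical agent name and timeout, chosen only for illustration.
$ua->agent('MyFetcher/0.1');
$ua->timeout(10);

# Both values now live on the one LWP::UserAgent object behind LWP::Simple.
print "agent:   ", $ua->agent, "\n";
print "timeout: ", $ua->timeout, "\n";
```

With `$ua` imported this way, get() and getprint() should go through the same client object, so a site's logs would see one agent string rather than two.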