Stopping robots searching particular page
on 11.09.2007 23:41:56 by dorayme
A website is on a server. Just one or two of the pages are not
for public consumption. They are not top secret and no big harm
would be done if it was not 100% possible, but it would be best
if they did not come up in search engines. (A sort of provision
by a company for making some files available to those who have
the address. Company does not want password protection; but I am
considering persuading them).
What is the simplest and most effective way of stopping robots
searching particular HTML pages on a server? Am looking for an
actual example and clear instructions. Getting confused by
looking at http://www.searchtools.com/index.html, though doubtless
I will get less confused after much study.
--
dorayme
Re: Stopping robots searching particular page
on 12.09.2007 00:12:26 by jkorpela
Scripsit dorayme:
> What is the simplest and most effective way of stopping robots
> searching a particular html pages on a server.
Put the following into the head part of each of those pages:

<meta name="robots" content="noindex">
Replace "noindex" by "noindex, nofollow" if you also want to stop robots
from following any links on the page (i.e. from finding new indexable pages
through it).
This follows the de-facto standard (Robots Exclusion Standard) that has long
been obeyed by well-behaved indexing robots. And there's not much you
can do about ill-behaved robots.
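For concreteness, a complete head section along these lines might look like
the sketch below; the charset and title values are placeholders for
illustration only, not part of the recommendation:

    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    <meta name="robots" content="noindex">
    <title>Reseller price list</title>
    </head>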
--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/
Re: Stopping robots searching particular page
on 12.09.2007 00:18:20 by Tina Peters
"dorayme" wrote in message
news:doraymeRidThis-705494.07415612092007@news-vip.optusnet.com.au...
>A website is on a server. Just one or two of the pages are not
> for public consumption. They are not top secret and no big harm
> would be done if it was not 100% possible, but it would be best
> if they did not come up in search engines.
If it's not linked to any other webpage, in any way, it shouldn't be
spidered.
--Tina
--
AxisHOST.com - cPanel Hosting
BuyAVPS.com - VPS Accounts
Serving the web since 1997
Re: Stopping robots searching particular page
on 12.09.2007 00:47:56 by jkorpela
Scripsit Tina Peters:
> If its not linked to any other webpage, in any way, it shouldn't be
> spidered.
Yet it may be spidered. Actually, it would be an interesting exercise in a
course on web issues to ask the students to list 10 possible situations
where the page might be spidered.
And to make the task a little more difficult, let's exclude the perhaps most
obvious scenario: someone who knows the page address submits it to a search
engine via its "Add URL" form.
--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/
Re: Stopping robots searching particular page
on 12.09.2007 02:14:38 by dorayme
In article <2sEFi.222058$rx.205394@reader1.news.saunalahti.fi>,
"Jukka K. Korpela" wrote:
> Scripsit dorayme:
>
> > What is the simplest and most effective way of stopping robots
> > searching a particular html pages on a server.
>
> Put the following into the head part of each of those pages:
>
> <meta name="robots" content="noindex">
>
> Replace "noindex" by "noindex, nofollow" if you also want to stop robots
> from following any links on the page (i.e. from finding new indexable pages
> through it).
>
> This follows the de-facto standard (Robots Exclusion Standard) that has long
> been obeyed by any well-behaving indexing robots. And there's not much you
> can do to the ill-behaving robots.
Thank you. This is the level of exclusion that I want. Job done.
--
dorayme
Re: Stopping robots searching particular page
on 12.09.2007 02:30:17 by Sherm Pendley
dorayme writes:
> What is the simplest and most effective way of stopping robots
> searching a particular html pages on a server.
There are two popular "standards" (neither of which is a standard in
the formal sense). One uses <meta> elements in your HTML, and the
other uses separate robots.txt files. Both are described here:
Both approaches depend on cooperative robots. For uncooperative robots,
all you can do is shout "klaatu barada nikto" and hope for the best.
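For the robots.txt route, a minimal sketch is a plain-text file named
robots.txt placed in the site root; the paths below are hypothetical
examples, not anything from this thread:

    User-agent: *
    Disallow: /private/pricelist.html
    Disallow: /private/terms.html

Note that robots.txt is itself publicly readable, so listing a page there
also advertises its URL to anyone who cares to look.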
sherm--
--
Web Hosting by West Virginians, for West Virginians: http://wv-www.net
Cocoa programming in Perl: http://camelbones.sourceforge.net
Re: Stopping robots searching particular page
on 12.09.2007 02:58:13 by dorayme
In article ,
Sherm Pendley wrote:
> dorayme writes:
>
> > What is the simplest and most effective way of stopping robots
> > searching a particular html pages on a server.
>
> There are two popular "standards" (neither of which is a standard in
> the formal sense). One uses <meta> elements in your HTML, and the
> other uses separate robots.txt files. Both are described here:
>
>
>
> Both approaches depend on cooperative robots. For uncooperative robots,
> all you can do is shout "klaatu barada nikto" and hope for the best.
>
Thanks. If I get any reports of the pages concerned being found
now that I have gone the meta route, I will look further into the
robots.txt approach.
(Actually, sherm, I started reading about this last before
posting my question, got restless and slightly confused and
thought, I know what to do, I will pop my head above the trench
line a mo and see if something comes back from alt.html to make
this thing stop buzzing around my brain. I know, it was a bit
reckless. But who dares... you know...
I also have a search engine on the particular site concerned and
they have various masking procedures I have since looked into.)
--
dorayme
Re: Stopping robots searching particular page
on 12.09.2007 11:28:09 by TravisNewbury
On Sep 11, 6:47 pm, "Jukka K. Korpela" wrote:
> > If its not linked to any other webpage, in any way, it shouldn't be
> > spidered.
> Yet it may be spidered. Actually, it would be an interesting exercise in a
> course on web issues to ask the students list down 10 possible situations
> where the page might be spidered.
Well that's just a stupid assignment. The students might actually be
forced to learn something from it. What the heck is your problem
suggesting something where a student could learn...
Re: Stopping robots searching particular page
on 12.09.2007 11:55:40 by Ben C
On 2007-09-11, Jukka K. Korpela wrote:
> Scripsit Tina Peters:
>
>> If its not linked to any other webpage, in any way, it shouldn't be
>> spidered.
>
> Yet it may be spidered. Actually, it would be an interesting exercise in a
> course on web issues to ask the students list down 10 possible situations
> where the page might be spidered.
>
> And to make the task a little more difficult, let's exclude the perhaps most
> obvious scenario: someone who knows the page address submits it to a search
> engine via its "Add URL" form.
1. Someone posts the URL to a newsgroup.
2. You forget to turn off the webserver's AutoIndex or similar, so the
spider can just navigate its way to the url going through auto
generated directory indexes.
What are the other 8?
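For scenario 2, directory listings can usually just be switched off at the
server. In Apache, for instance, something along these lines in the main
server configuration (or a bare "Options -Indexes" line in that directory's
.htaccess) does it; the directory path here is a made-up example:

    <Directory "/var/www/example/private">
        Options -Indexes
    </Directory>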
Re: Stopping robots searching particular page
on 12.09.2007 12:12:24 by jkorpela
Scripsit Ben C:
> 1. Someone posts the URL to a newsgroup.
> 2. You forget to turn off the webserver's AutoIndex or similar, so the
> spider can just navigate its way to the url going through auto
> generated directory indexes.
>
> What are the other 8?
To mention some other scenarios of having a page indexed without having been
linked to from any other web page*), here's one relatively obvious one and
one imaginary though realistic (we know such things are being done with
email addresses for spamming purposes):
3. The page _was_ linked to from another page.
4. An indexing robot generates URLs automatically, more or less at random,
and tries them. It might for example try servers known to exist and append
to the server name some strings that are known to be common for web pages,
like /help.htm, /news.html....
*) Of course an author cannot prevent linking by others. You tell the URL to
your friend, who tells it to his pal, who sets up a link. But this common
way of getting indexed against your will falls outside the current exercise.
--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/
Re: Stopping robots searching particular page
on 12.09.2007 12:16:23 by Dylan Parry
Jukka K. Korpela wrote:
>> 1. Someone posts the URL to a newsgroup.
>> 2. You forget to turn off the webserver's AutoIndex or similar, so the
>> spider can just navigate its way to the url going through auto
>> generated directory indexes.
>>
> 3. The page _was_ linked to from another page.
>
> 4. An indexing robot generates URLs automatically, more or less at random,
> and tries them. It might for example try servers known to exist and append
> to the server name some strings that are known to be common for web pages,
> like /help.htm, /news.html....
5. Someone visits your page[1] and has the Google Toolbar (or other
similar things) installed and reporting back to Google about the sites
they are visiting, thus allowing Google to add the site to their index.
____
[1] How they got the URL in the first place might be an issue here, but
it could be that you personally gave it to them or that it was written
down somewhere that wasn't necessarily an online resource (business card
etc).
--
Dylan Parry
http://electricfreedom.org | http://webpageworkshop.co.uk
The opinions stated above are not necessarily representative of
those of my cats. All opinions expressed are entirely your own.
Re: Stopping robots searching particular page
on 12.09.2007 16:04:19 by Nick Theodorakis
On Sep 12, 5:55 am, Ben C wrote:
>
> >
> 2. You forget to turn off the webserver's AutoIndex or similar, so the
> spider can just navigate its way to the url going through auto
> generated directory indexes.
At least one robot does this. I have a template page (definitely not
mentioned anywhere else) in a subdirectory that seems to get spidered
by the yahoo slurp robot.
Nick
--
Nick Theodorakis
nick_theodorakis@hotmail.com
Re: Stopping robots searching particular page
on 12.09.2007 16:56:15 by Adrienne Boswell
Gazing into my crystal ball I observed dorayme
writing in news:doraymeRidThis-
705494.07415612092007@news-vip.optusnet.com.au:
> A website is on a server. Just one or two of the pages are not
> for public consumption. They are not top secret and no big harm
> would be done if it was not 100% possible, but it would be best
> if they did not come up in search engines. (A sort of provision
> by a company for making some files available to those who have
> the address. Company does not want password protection; but I am
> considering persuading them).
>
> What is the simplest and most effective way of stopping robots
> searching a particular html pages on a server. Am looking for an
> actual example and clear instructions. Getting confused by
> looking at http://www.searchtools.com/index.html though doubtless
> I will get less confused after much study.
>
1. Robots exclusion; you can name a particular file, e.g. backoffice.asp
2. Meta route (in my experience, not quite as reliable as the first)
--
Adrienne Boswell at Home
Arbpen Web Site Design Services
http://www.cavalcade-of-coding.info
Please respond to the group so others can share
Re: Stopping robots searching particular page
on 12.09.2007 22:22:18 by Ed Mullen
dorayme wrote:
> A website is on a server. Just one or two of the pages are not
> for public consumption. They are not top secret and no big harm
> would be done if it was not 100% possible, but it would be best
> if they did not come up in search engines. (A sort of provision
> by a company for making some files available to those who have
> the address. Company does not want password protection; but I am
> considering persuading them).
>
> What is the simplest and most effective way of stopping robots
> searching a particular html pages on a server. Am looking for an
> actual example and clear instructions. Getting confused by
> looking at http://www.searchtools.com/index.html though doubtless
> I will get less confused after much study.
>
Why not just put it in a password-protected directory?
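With Apache, for example, that is typically HTTP Basic authentication; a
sketch for the directory's .htaccess (or the corresponding Directory block),
assuming an .htpasswd file has already been created with the htpasswd
utility, and with the realm name and file path made up for illustration:

    AuthType Basic
    AuthName "Reseller documents"
    AuthUserFile /home/example/.htpasswd
    Require valid-user

Basic authentication sends credentials essentially in the clear, so over
plain HTTP it keeps out robots and casual visitors rather than determined
attackers.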
--
Ed Mullen
http://edmullen.net
http://mozilla.edmullen.net
http://abington.edmullen.net
I used to be schizophrenic, but we're all right now.
Re: Stopping robots searching particular page
on 13.09.2007 00:04:45 by dorayme
In article ,
Ed Mullen wrote:
> it would be best
> > if they did not come up in search engines. (A sort of provision
> > by a company for making some files available to those who have
> > the address. Company does not want password protection; but I am
> > considering persuading them).
> >
> > What is the simplest and most effective way of stopping robots
> > searching a particular html pages on a server.
> >
>
> Why not just put it in a password-protected directory?
I guess because it puts up a hurdle for the company and the
particular companies to which they need to communicate this
address. People forget passwords and it is extra work to be
transmitting password information. I understand the reluctance on
this occasion. But see above.
[I am working on a psychologically based scheme at the moment,
Ed, in consultation with my psychologist, to make pages that have
a level of natural repugnance. The level must be such that people
with no real need or interest in the purpose of the page will
flee from it quickly whereas those with a task that requires the
resources to be found on that page will persist till they get
them. At the crudest level, perhaps a picture of a dead
decomposing rat at the top? Animated gif of fumes emanating from
it? Embedded horrible dead rat sounds? If you care to invest in
the further development of this promising new scheme, please send
$10.]
--
dorayme
Re: Stopping robots searching particular page
on 13.09.2007 00:07:52 by John Clayton
"dorayme" wrote in message
news:doraymeRidThis-8EEFDD.10143812092007@news-vip.optusnet.com.au...
> In article <2sEFi.222058$rx.205394@reader1.news.saunalahti.fi>,
> "Jukka K. Korpela" wrote:
>
>> Scripsit dorayme:
>>
>> > What is the simplest and most effective way of stopping robots
>> > searching a particular html pages on a server.
>>
>> Put the following into the head part of each of those pages:
>>
>> <meta name="robots" content="noindex">
>>
>> Replace "noindex" by "noindex, nofollow" if you also want to stop robots
>> from following any links on the page (i.e. from finding new indexable
>> pages
>> through it).
Would this also help answer the recent, earlier question "how to prevent
spiders from indexing 'mailto' addresses"?
Just asking.
John
Re: Stopping robots searching particular page
on 13.09.2007 04:01:02 by Ed Mullen
dorayme wrote:
> In article ,
> Ed Mullen wrote:
>
>> it would be best
>>> if they did not come up in search engines. (A sort of provision
>>> by a company for making some files available to those who have
>>> the address. Company does not want password protection; but I am
>>> considering persuading them).
>>>
>>> What is the simplest and most effective way of stopping robots
>>> searching a particular html pages on a server.
>>>
>> Why not just put it in a password-protected directory?
>
> I guess because it puts up a hurdle for the company and the
> particular companies to which they need to communicate this
> address. People forget passwords and it is extra work to be
> transmitting password information. I understand the reluctance on
> this occasion. But see above.
But, most browsers have the ability to "remember" logon info so it's a
case of "do it once". Geez, how hard is that? Set up an example and
show them. I have two different sites with protected pages/files. My
Mozilla-based browsers remember the logon info just fine. I click on a
link/favorite/bookmark, the logon pop-up comes up, I click OK.
>
> [I am working on a psychologically based scheme at the moment,
> Ed, in consultation with my psychologist, to make pages that have
> a level of natural repugnance. The level must be such that people
> with no real need or interest in the purpose of the page will
> flee from it quickly whereas those with a task that requires the
> resources to be found on that page will persist till they get
> them. At the crudest level, perhaps a picture of a dead
> decomposing rat at the top? Animated gif of fumes emanating from
> it? Embedded horrible dead rat sounds? If you care to invest in
> the further development of this promising new scheme, please send
> $10.]
>
I doubt that decomposing rats will be a sufficiently universal
deterrent. In fact, I'm not sure you can settle on any image that will,
say, tick off, what? 80% of viewers? 90%? Now, if you could be certain
that everyone was browsing with sound on and the volume set to max, well
.... ooooo, baby! Then we got something!
--
Ed Mullen
http://edmullen.net
http://mozilla.edmullen.net
http://abington.edmullen.net
Give me ambiguity or give me something else.
Re: Stopping robots searching particular page
on 13.09.2007 04:51:08 by dorayme
In article <-bidnZXVP9N9BHXbnZ2dnUVZ_uSgnZ2d@comcast.com>,
Ed Mullen wrote:
> But, most browsers have the ability to "remember" logon info so it's a a
> case of "do it once". Geez, how hard is that? Set up an example and
> show them. I have two different sites with protected pages/files. My
> Mozilla-based browsers remember the logon info just fine. I click on a
> link/favorite/bookmark, the logon pop-up comes up, I click OK.
I knew someone would take this line. I remind you that I said
that I am considering so persuading in my original post. That is
point one. And yes, I am aware of some browsers having such
facilities, I would be personally lost without them or the Mac
keychain. But step back, Ed, and see why I am only considering
persuading and not headlong rushing into it. You are a young man,
full of natural enthusiasms; I am a 570-year-old Martian,
reserved, restrained, conservative, not the least pushy.
You are basically asking me to persuade not only the company to
change browsers but also to persuade them to persuade their
clients/suppliers (all over the world, rich and poor countries)
who need the resources on the page concerned to make sure they
have the appropriate browsers. How hard is that? It is much
harder than me not doing anything but sticking in the meta thing
that JK said on the nice web page I made for them and now sitting
back with pleasant thoughts of sorting out pictures of the dog I
walk, of all the gorgeous pics from babyhood to married of some
family members, of a new screen (cheap from Dell) for my desk and
getting ready to go and have a swim on a Sydney beach this avo
(have you any idea how lovely Sydney smells and feels today,
jasmine and clear blue sky... Almost a caricature of spring,
except it is real).
[not a snowflake in sight - Whack!]
--
dorayme
Re: Stopping robots searching particular page
on 13.09.2007 06:05:46 by Ed Mullen
dorayme wrote:
> In article <-bidnZXVP9N9BHXbnZ2dnUVZ_uSgnZ2d@comcast.com>,
> Ed Mullen wrote:
>
>> But, most browsers have the ability to "remember" logon info so it's a a
>> case of "do it once". Geez, how hard is that? Set up an example and
>> show them. I have two different sites with protected pages/files. My
>> Mozilla-based browsers remember the logon info just fine. I click on a
>> link/favorite/bookmark, the logon pop-up comes up, I click OK.
>
> I knew someone would take this line I remind you that I said
> that I am considering so persuading in my original post. That is
> point one. And yes, I am aware of some browsers having such
> facilities, I would be personally lost without them or the Mac
> keychain. But step back, Ed, and see why I am only considering
> persuading and not headlong rushing into it. You are a young man,
> full of natural enthusiasms, I am a 570 year old martian,
> reserved, restrained, conservative, not the least pushy.
>
> You are basically asking me to persuade not only the company to
> change browsers but also to persuade them to persuade their
> clients/suppliers (all over the world, rich and poor countries)
> who need the resources on the page concerned to make sure they
> have the appropriate browsers. How hard is that? It is much
> harder than me not doing anything but sticking in the meta thing
> that JK said on the nice web page I made for them and now sitting
> back with pleasant thoughts of sorting out pictures of the dog I
> walk, of all the gorgeous pics from babyhood to married of some
> family members, of a new screen (cheap from Dell) for my desk and
> getting ready to go and have a swim on a Sydney beach this avo
> (have you any idea how lovely Sydney smells and feels today,
> jasmine and clear blue sky... Almost a caricature of spring,
> except it is real).
>
> [not a snowflake in sight - Whack!]
>
I gotta go get a drink. I read it, I (sorta) got it, and now my head
hurts so much ...
Do it or don't do it. It is a solution. If you or your client don't
like it, fine. Your choice. But, it's simple, it exists, and, let's
face it, if it's a commercial app? "Use of this site/facility requires
...." And it is NOT onerous.
Ok, I'm wandering downstairs now ...
--
Ed Mullen
http://edmullen.net
http://mozilla.edmullen.net
http://abington.edmullen.net
An ounce of practice is worth more than tons of preaching. - Mohandas Gandhi
Re: Stopping robots searching particular page
on 13.09.2007 07:00:30 by jkorpela
Scripsit John Clayton:
>>> <meta name="robots" content="noindex">
> Would this also help answer the recent, earlier question "how to
> prevent spiders from indexing 'mailto' addresses"?
It would prevent well-behaved robots from indexing the page at all, but
robots that collect addresses for spamming can hardly be expected to be
well-behaved.
(If you want to prevent spammers from getting your email address at any cost,
get rid of all email addresses you have and don't ever get one. That's the only
method that actually works for the purpose. If you just want to use email
for something useful, find out an optimal way of doing spam filtering. Do
_not_ make this your visitors' problem.)
--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/
Re: Stopping robots searching particular page
on 13.09.2007 13:38:43 by Nick Theodorakis
On Sep 12, 9:01 pm, Ed Mullen wrote:
[...]
>
> I doubt that decomposing rats will be a sufficiently universal
> deterrence. In fact, I'm not sure you can settle on any image that will,
> say, tick off, what? 80% of viewers? 90%? ...
>
Goatse should do the trick.
Nick
--
Nick Theodorakis
nick_theodorakis@hotmail.com
contact form:
http://theodorakis.net/contact.html
Re: Stopping robots searching particular page
on 13.09.2007 17:27:13 by Philip Semanchuk
In article ,
Dylan Parry wrote:
> Jukka K. Korpela wrote:
>
> >> 1. Someone posts the URL to a newsgroup.
> >> 2. You forget to turn off the webserver's AutoIndex or similar, so the
> >> spider can just navigate its way to the url going through auto
> >> generated directory indexes.
> >>
> > 3. The page _was_ linked to from another page.
> >
> > 4. An indexing robot generates URLs automatically, more or less at random,
> > and tries them. It might for example try servers known to exist and append
> > to the server name some strings that are known to be common for web pages,
> > like /help.htm, /news.html....
>
> 5. Someone visits your page[1] and has the Google Toolbar (or others
> similar things) installed and reporting back to Google about the sites
> they are visiting, thus allowing Google to add the site to their index.
6. Someone sends the URL in an email via a mail service (like GMail)
that's also related to a search engine.
--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
Re: Stopping robots searching particular page
on 13.09.2007 17:29:17 by Philip Semanchuk
In article ,
Sherm Pendley wrote:
> dorayme writes:
>
> > What is the simplest and most effective way of stopping robots
> > searching a particular html pages on a server.
>
> There are two popular "standards" (neither of which is a standard in
> the formal sense). One uses <meta> elements in your HTML, and the
> other uses separate robots.txt files. Both are described here:
>
>
This technique has worked for me given the same parameters of success
that dorayme described.
> Both approaches depend on cooperative robots. For uncooperative robots,
> all you can do is shout "klaatu barada nikto" and hope for the best.
AFAICT all of the major search engines are well-behaved in this regard.
--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
Re: Stopping robots searching particular page
on 14.09.2007 00:26:31 by dorayme
In article ,
Ed Mullen wrote:
> I gotta go get a drink. I read it, I (sorta) got it, and now my head
> hurts so much ...
>
> Do it or don't do it. It is a solution. If you or your client don't
> like it, fine. Your choice. But, it's simple, it exists, and, let's
> face it, if it's a commercial app? "Use of this site/facility requires
> ..." And it is NOT onerous.
>
> Ok, I'm wandering downstairs now ...
You come back right up here young man and listen to me some
more... I have not even started...
--
dorayme