This may not be big news for you, but for me it was the first time I came across this issue. Normally my job is to make a website as visible as possible, but this time I was asked to remove some information that is not meant to be public.
Okay, easy: I’ll modify the robots.txt …
The robots.txt is a file that search engines read to find out what they are allowed to crawl and index on a website.
The syntax is quite easy:
To allow spiders unlimited access to your entire site:
User-agent: *
Disallow:
To block search engines from the entire website:
User-agent: *
Disallow: /
Notice the importance of detail: the single forward slash in the disallow command blocks the entire site from being indexed by the search engines. The forward slash covers the entire directory of files in that domain.
To disallow specific directories or files:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /includes/
Disallow: /pdf/admin.pdf
Extracted from thekarchergroup.
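Before uploading rules like these, it can be worth checking that they block what you expect. Here is a minimal sketch in Python using the standard library's urllib.robotparser. The www.example.com host, the is_allowed helper and the logo.png, index.html and brochure.pdf paths are made up for illustration, and a real crawler may treat edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The "specific directories or files" rules quoted above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /includes/
Disallow: /pdf/admin.pdf
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical host and paths, used only to exercise the rules.
    for path in ("/index.html", "/images/logo.png",
                 "/pdf/admin.pdf", "/pdf/brochure.pdf"):
        url = "http://www.example.com" + path
        print(path, "allowed" if is_allowed(ROBOTS_TXT, url) else "blocked")
```

With these rules, /index.html and /pdf/brochure.pdf come out as allowed while /images/logo.png and /pdf/admin.pdf come out as blocked. Keep in mind that this only tells you what a well-behaved spider is asked to skip; it does not protect the files themselves.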
Now I could tell Google and other search engines not to index the files I wanted, but I still had an issue: how can I tell search engines to remove those files from their cache?
For Google it’s only a matter of some copy-pasting. In Google Webmasters you can go to Diagnostic >> URL Removals. Then, within a day or two, your content will be removed from their cache and their index.
For other search engines it is not so easy: it seems that we have to wait until they re-index the website before the change takes place.
In the case of sensitive information that is not well protected, it may be easier to remove the files from your server completely until the cache is refreshed, or to use your .htaccess to make the files unreadable to visitors.
Or is there another way to clear the cache of Yahoo and Live?
Ahmet
I’ve never understood why people would want to block a search engine from their website.
If you don’t want a search engine seeing a page, then I would have thought you wouldn’t want Joe Public looking either, so couldn’t you just make it secure?
Sure I would, but if my files are secure I don’t even need to block search engines: they simply can’t access them. In this case I was working on files that became private over time, and once a file is in a search engine’s cache it becomes difficult to hide.