This might not be big news for you, but for me it was the first time I came across this issue. Normally my job is to make a website as visible as possible, but this time I was asked to remove some information that was not meant to be public.
Okay, easy I’ll modify the robots.txt …
So the robots.txt is a file read by search engines to know what they are allowed to display from a website.
The syntax is quite easy:
To allow spiders unlimited access to your entire site:
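The standard directive for this is an empty Disallow rule applied to all user agents:

```
User-agent: *
Disallow:
```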
To block search engines from the entire website:
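Here the Disallow rule contains a single forward slash:

```
User-agent: *
Disallow: /
```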
Notice the importance of detail: the single forward slash in the disallow command blocks the entire site from being indexed by the search engines. The forward slash covers the entire directory of files in that domain.
To disallow specific directories or files:
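A sketch of what this looks like (the directory and file names here are just examples, not real paths):

```
User-agent: *
# Block an entire directory (example path)
Disallow: /private/
# Block a single file (example path)
Disallow: /internal-report.html
```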
Extracted from thekarchergroup.
Now I could tell Google and the other search engines not to index the files I wanted hidden, but I still had an issue: how can I tell search engines to remove those files from their cache?
For Google it’s only a matter of some copy-paste. In Google Webmaster Tools you can go to Diagnostic >> URL Removals. Then within a day or two your content will be removed from the cache and their index.
For other search engines it is not so easy: it seems you have to wait until they re-index the website before the change takes effect.
Maybe in the case of sensitive information that is not well protected, it’s easier to remove it completely from your server until the caches are refreshed, or use your .htaccess to prevent users from reading the files.
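For the .htaccess option, a minimal sketch with Apache could look like this (the filename is just an example):

```
# Deny all HTTP access to a specific file (example name)
<Files "internal-report.html">
    Order allow,deny
    Deny from all
</Files>
```

Users (and crawlers) hitting that file will then get a 403 Forbidden response until you lift the block.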
Or is there another way to clear the cache of Yahoo and Live?