Replies

Here’s the thing.... Crawled versions of the website (e.g. archive.org/Wayback Machine cached pages) were not necessarily scrubbed. They may never have existed, depending on how the page was coded to begin with: you can tell every bot that will listen (archive.org being one of many that do) not to crawl your domain, or just certain pages, using robots.txt and the robots meta tag. I tried explaining this (with references and such) to another freeper a while back, but ran into a wall of denial from that person. If you would like, I can try to find that discussion so you can see the citations I made with all the pertinent info.
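
For anyone who wants to see how a well-behaved bot reads a robots.txt file, here is a rough sketch using Python's standard urllib.robotparser module. The sample file contents, the example.com URLs, and the is_allowed helper are all made up for illustration, and I am assuming "ia_archiver" as the user-agent name the archive.org crawler has gone by. It only shows what a compliant bot would decide; nothing forces a bot to obey.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, made up for illustration.
# "ia_archiver" is assumed here to be the archive.org crawler's user-agent.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant bot with this user-agent may fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ia_archiver", "http://example.com/index.html"),
        ("Googlebot", "http://example.com/index.html"),
        ("Googlebot", "http://example.com/private/page.html"),
    ]:
        print(agent, url, is_allowed(SAMPLE_ROBOTS_TXT, agent, url))
```

With a file like this one, a compliant archive crawler would skip the whole domain, while other bots would skip only the /private/ pages. If the file does not exist at all, there is nothing telling any bot to stay away.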

Here are two citations off the cuff:
http://www.javascriptkit.com/howto/robots2.shtml

http://www.thesitewizard.com/archive/robotstxt.shtml

There was no robots.txt file associated with this page - I checked. It is caching the index page and all related subpages. I was able to locate caches of all the linked pages. They existed, but the domain was crawled again, for some reason, by all three of the primary search engines on 10/2. All three search engines showing up on the same day when there are billions of pages to crawl?!?!?