Free Republic
Browse · Search
News/Activism
Topics · Post Article

To: dead

Here’s the thing: crawled versions of the website (e.g. archive.org/Wayback Machine cached pages) were not necessarily scrubbed. They may never have existed, depending on how the page was coded to begin with. You can tell every bot that will listen (archive.org being one of many that do) not to crawl your domain or certain pages, using robots.txt and the robots meta tag. I tried explaining this, with references, to another freeper a while back, but ran into a wall of denial. If you would like, I can try to find that discussion so you can see the citations I made with all the pertinent info.

Here are two citations off the cuff:
http://www.javascriptkit.com/howto/robots2.shtml

http://www.thesitewizard.com/archive/robotstxt.shtml
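
To make the robots.txt point concrete, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt contents, the example.com URLs, and the helper name can_crawl are all made up for illustration, and "ia_archiver" is assumed here to be the user-agent token the archive's crawler has historically answered to. It only shows what a well-behaved bot would conclude from such a file, not what any particular site actually served.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (an assumption, not taken from any real site).
# "ia_archiver" is assumed to be the token honored by the archive's crawler.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant bot with this user agent may fetch the URL."""
    parser = RobotFileParser()
    # parse() accepts the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ia_archiver", "SomeOtherBot"):
        allowed = can_crawl(ROBOTS_TXT, agent, "http://example.com/index.html")
        print(agent, "may crawl index.html:", allowed)
```

With a file like this, a compliant archive crawler would never store any page of the domain, while other bots would skip only the /private/ area. Bots that ignore robots.txt are not affected at all.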


77 posted on 10/02/2012 2:33:48 PM PDT by jurroppi1


To: jurroppi1

There was no robots.txt file associated with this page - I checked. It is caching the index page and all related subpages, and I was able to locate caches of all the linked pages. They existed, but the domain was crawled again, for some reason, by all three of the primary search engines on 10/2. All three search engines showing up on the same day when there are billions of pages to crawl?!
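
A missing robots.txt file covers only one of the two mechanisms mentioned earlier in the thread; the robots meta tag lives in each page's own HTML. A rough sketch of how one might check a saved copy of a page for it follows, using Python's standard html.parser. The sample HTML and the names RobotsMetaParser and robots_meta_directives are hypothetical, and nothing here says what the page in question actually contained.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when an attribute has no value.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.add(directive)


def robots_meta_directives(html: str) -> set:
    """Return the set of robots meta directives found in the HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


if __name__ == "__main__":
    # Hypothetical page source, for illustration only.
    sample = '<html><head><meta name="robots" content="noindex, noarchive"></head></html>'
    print(sorted(robots_meta_directives(sample)))
```

An empty result means the page carries no robots meta tag, so neither mechanism would explain a missing cache for that page.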


86 posted on 10/02/2012 3:13:41 PM PDT by RobertClark (Be prepared, be polite, be professional and have a plan to kill everyone you meet.)


FreeRepublic, LLC, PO BOX 9771, FRESNO, CA 93794
FreeRepublic.com is powered by software copyright 2000-2008 John Robinson