The Internet Archive: Include Every Site on the Wayback Machine, Regardless of Robots.txt
Michael R.
started this petition to
The Internet Archive
As you might know, robots.txt is a bossy, simple text file that tells crawlers what they can and cannot do on a site. But those few lines of text, like bad language, can do a lot of damage... in the case of archiving websites, that is.
You see, archive.org is being slightly gullible here. They are letting a simple text file control what does and does not go on the Wayback Machine, and it's ridiculous, because this policy is preventing some awesome web pages from reaching their database. Even worse, robots.txt files can change over time. The least obvious case is when a site shuts down, some spammer takes over the defunct domain, and a robots.txt file appears that the original webmaster never intended to add. I was once able to browse Walmart's catalog on archive.org, but not anymore, because they updated their robots.txt a few months later.
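To show how little it takes, here is a rough Python sketch using the standard library's urllib.robotparser. The domain example.com, the helper name is_allowed, and both robots.txt bodies are made up for illustration, and ia_archiver is assumed to be the user agent name the archive's crawler is matched against. This is not the Internet Archive's actual code, just a sketch of how any rule-following crawler would read the file.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt while the original webmaster ran the site:
# an empty Disallow line means nothing is blocked.
BEFORE = "User-agent: *\nDisallow:\n"

# Hypothetical robots.txt after someone else takes over the domain:
# a single slash blocks the whole site for every crawler.
AFTER = "User-agent: *\nDisallow: /\n"

AGENT = "ia_archiver"  # assumed crawler name, see the note above
PAGE = "http://example.com/catalog/"  # made-up URL

print("before:", is_allowed(BEFORE, AGENT, PAGE))
print("after: ", is_allowed(AFTER, AGENT, PAGE))
```

One extra character on the Disallow line is the only difference between the two files, yet it flips the answer for every page on the site.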
Some other site owners might simply not care, and block their website from being crawled (e.g. Nintendo of Europe). What about the amazing, old history that lies behind that site URL? Can I really not see it, for reasons you won't tell? C'mon, site owners, it's just the history of your site on an archiving service. It would really disappoint a lot of people not to see that history because of your paranoia/"legal reasons".
Take a look at nintendo.com -- they have no robots.txt file, and their site is running perfectly well. Ditch robots.txt! It's redundant! Even the ArchiveTeam, run by the well-known and beloved Jason Scott, dislikes it.
For more information about why robots.txt sucks, read this article:
http://archiveteam.org/index.php?title=Robots.txt
In the meantime, keep voting! :D
P.S. Sorry for rushing this petition. If anyone wants to make it better, feel free to do so. :)