The Internet Archive: Include Every Site on the Wayback Machine, Regardless of Robots.txt

This petition has been created by Michael R. and may not represent the views of the Avaaz community.
Michael R. started this petition to The Internet Archive
As you might know, robots.txt is a bossy little plain-text file that tells crawlers what they can and cannot do on a site. But that text, like bad language, can do a lot of damage... in the case of archiving websites, that is.
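
For anyone who has never looked inside one, here is a minimal sketch of how a crawler might consult robots.txt, using Python's standard urllib.robotparser module. The file contents, the example.com URLs, and the helper name may_fetch are made up for illustration; ia_archiver is used here as the user-agent name that robots.txt files have traditionally used to address the Wayback Machine's crawler. This is only an assumption-laden sketch, not how archive.org actually implements its checks.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ia_archiver", "SomeOtherBot"):
        allowed = may_fetch(ROBOTS_TXT, agent, "https://example.com/page.html")
        print(agent, "allowed" if allowed else "blocked")
```

In this sketch, two lines of text are enough to shut one named crawler out of the whole site while every other bot keeps access to everything outside /private/.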

You see, archive.org is being slightly gullible here. They are letting a simple text file control what does and does not go on the Wayback Machine, and it's ridiculous, because this policy is preventing some awesome web pages from reaching their database. Even worse, robots.txt files can change over time. This is hardest to notice when a site shuts down and some spammer takes over the defunct domain and adds a robots.txt file, even though the original webmaster never intended to add one. One time I was able to browse Walmart's catalog on archive.org, but not anymore, because they updated their robots.txt a few months later.

Some other site owners might simply not care and block their website from being crawled (e.g. Nintendo of Europe). What about the amazing old history that lies behind that site URL? Can I really not see it, for a reason you won't tell? C'mon, site owners, it's just the history of your site on an archiving service. It would really disappoint a lot of people to not see that history because of your paranoia/"legal reasons".

Take a look at nintendo.com: they have no robots.txt file, and their site is running more than healthily. Ditch robots.txt! It's redundant! Even the ArchiveTeam, run by the well-known and beloved Jason Scott, dislikes it.

For more information about why robots.txt sucks, read this article:

http://archiveteam.org/index.php?title=Robots.txt

In the meantime, keep voting! :D


P.S. Sorry for rushing this petition. If anyone wants to make it better, feel free to do so. :)

