robots.txt as an Insight Into Web Administration Wars
robots.txt, or the Robot Exclusion Protocol, is one of the oldest protocols on the web. It's a plain-text file, usually served at the top level of a domain, containing a list of rules that politely tell web crawlers what they are and are not allowed to do. This simple file is a great insight into the kinds of struggles web administrators face in maintaining their websites.
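For reference, a minimal robots.txt might look something like this (the bot name and paths here are made up for illustration):
# Rules for one specific crawler
User-agent: ExampleBot
Disallow: /
# Rules for everyone else
User-agent: *
Disallow: /private/
Allow: /public/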
Wikipedia
Wikipedia, being one of the most visited sites on the web, has to deal with a large number of web crawlers drooling at the mouth to index their content. Let's take a look at their robots.txt:
➜ ~ curl https://en.wikipedia.org/robots.txt |& less
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#
# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# http://mj12bot.com/
User-agent: MJ12bot
Disallow: /
# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /
# Wikipedia work bots:
User-agent: IsraBot
Disallow:
User-agent: Orthogaffe
Disallow:
# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /
User-agent: DOC
Disallow: /
User-agent: Zao
Disallow: /
[...]
The Wikipedia devs have quite clearly explained their reasoning for disallowing certain crawlers. In this case, we see that the last few crawlers (UbiCrawler, DOC, Zao) are kind enough to obey but aren't feeding any search engines. Others are known to be trouble:
# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /
User-agent: Zealbot
Disallow: /
User-agent: MSIECrawler
Disallow: /
User-agent: SiteSnagger
Disallow: /
User-agent: WebStripper
Disallow: /
They also seem to have had issues with wget users trying to recursively fetch all pages:
#
# Sorry, wget in its recursive mode is a frequent problem.
# Please read the man page and use it properly; there is a
# --wait option you can use to set the delay between hits,
# for instance.
#
User-agent: wget
Disallow: /
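The --wait option they point to is real, and wget also offers --random-wait and --limit-rate for throttling. A more considerate recursive fetch might look something like this (the target URL and values are purely illustrative, and against Wikipedia wget would still stop at the User-agent: wget stanza above):
wget --recursive --level=1 --wait=1 --random-wait --limit-rate=200k https://example.org/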
Let's experiment with this by downloading the Wikipedia article for robots.txt:
➜ test_recursive_wget wget --recursive --level=0 https://en.wikipedia.org/wiki/Robots.txt
--2024-01-09 14:05:55-- https://en.wikipedia.org/wiki/Robots.txt
Resolving en.wikipedia.org (en.wikipedia.org)... 208.80.154.224
Connecting to en.wikipedia.org (en.wikipedia.org)|208.80.154.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 136396 (133K) [text/html]
Saving to: ‘en.wikipedia.org/wiki/Robots.txt’
en.wikipedia.org/wiki/Robots.txt 100%[============================================================================================================================================>] 133.20K --.-KB/s in 0.08s
2024-01-09 14:05:55 (1.70 MB/s) - ‘en.wikipedia.org/wiki/Robots.txt’ saved [136396/136396]
Loading robots.txt; please ignore errors.
--2024-01-09 14:05:55-- https://en.wikipedia.org/robots.txt
Reusing existing connection to en.wikipedia.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 27524 (27K) [text/plain]
Saving to: ‘en.wikipedia.org/robots.txt’
en.wikipedia.org/robots.txt 100%[============================================================================================================================================>] 26.88K --.-KB/s in 0s
2024-01-09 14:05:55 (328 MB/s) - ‘en.wikipedia.org/robots.txt’ saved [27524/27524]
FINISHED --2024-01-09 14:05:55--
Total wall clock time: 0.3s
Downloaded: 2 files, 160K in 0.08s (2.04 MB/s)
We observe wget downloading the en.wikipedia.org/robots.txt file to check whether it's allowed to perform the recursive download. The man page states:
Wget can follow links in HTML, XHTML, and CSS pages, to create local versions of remote web sites, fully recreating the directory structure of the original site. This is sometimes referred to as "recursive downloading." While doing that, Wget respects the Robot Exclusion Standard (/robots.txt).
So it makes sense that it refuses to download any further links.
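We can confirm which stanza is responsible directly from the shell; the command below should print the User-agent: wget / Disallow: / pair quoted earlier (the -A 1 just includes the line following the match):
curl -s https://en.wikipedia.org/robots.txt | grep -i -A 1 "user-agent: wget"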
Wikipedia has also struggled with clients that don't respect robots.txt at all, but they add a rule anyway in an act of futile desperation. WebReaper is simply permabanned:
# A capture bot, downloads gazillions of pages with no public benefit
# http://www.webreaper.net/
User-agent: WebReaper
Disallow: /
Wikipedia states that if your bot is well-behaved, they'll let it through, but only for static pages:
#
# Friendly, low-speed bots are welcome viewing article pages, but not
# dynamically-generated pages please.
#
# Inktomi's "Slurp" can read a minimum delay between hits; if your
# bot supports such a thing using the 'Crawl-delay' or another
# instruction, please let us know.
#
# There is a special exception for API mobileview to allow dynamic
# mobile web & app views to load section content.
# These views aren't HTTP-cached but use parser cache aggressively
# and don't expose special: pages etc.
#
# Another exception is for REST API documentation, located at
# /api/rest_v1/?doc.
#
User-agent: *
Allow: /w/api.php?action=mobileview&
Allow: /w/load.php?
Allow: /api/rest_v1/?doc
Disallow: /w/
Disallow: /api/
Disallow: /trap/
Disallow: /wiki/Special:
Disallow: /wiki/Spezial:
Disallow: /wiki/Spesial:
Disallow: /wiki/Special%3A
Disallow: /wiki/Spezial%3A
Disallow: /wiki/Spesial%3A
Amazon
Amazon's case is interesting in that their rule matching is less specific, mostly targeting all User-agent values at once. The notable entry is the one blocking OpenAI's GPTBot from training on their data.
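A rule shutting OpenAI's crawler out takes this general shape (a schematic reconstruction, not a verbatim copy of Amazon's file):
User-agent: GPTBot
Disallow: /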
Microsoft
Microsoft simply blocks all access everywhere, with no exceptions.
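Blocking everything only takes a blanket wildcard rule, something along these lines (illustrative of the pattern, not a verbatim copy of Microsoft's file):
User-agent: *
Disallow: /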
News Organizations
TheVerge.com is another website that blocks GPTBot.
OpenAI has a very brief documentation page that shows how to block GPTBot. It's no surprise that many websites would opt out of its crawler, but the interesting thing to note is that as ChatGPT evolves to ingest more and more data, we can expect many more websites to ask GPTBot not to index their content. This has a real possibility of making the input quality of future ChatGPT models far worse than that of the initial iterations, especially if information juggernauts like Wikipedia decide to restrict access to their contents. It's also likely that, with more news organizations asking not to be crawled, ChatGPT will become less able to disseminate news-oriented information.
Some other organizations that block GPTBot are:
- CNN
- MSNBC
- PBS
- NPR
- The Washington Post
- Bloomberg
In fact, most of the major news organizations I investigated have blocked GPTBot, which indicates a strong desire among private organizations not to have their content employed in the training of for-profit LLMs.
Reddit
Reddit has quite an amusing rule:
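At the time of writing, the entry in question read roughly as follows:
User-Agent: bender
Disallow: /my_shiny_metal_ass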
This, of course, is a reference to the most beloved refrain of Futurama's Bender, often uttered in anger.
Effectiveness
It's worth noting, and probably obvious to every reader here, that robots.txt is not an effective means of blocking web crawlers from scouring your website. It's akin to posting a "No Trespassing" sign on your property: the sign itself won't stop bad actors from trespassing, and crawlers that don't adhere to ethical principles can simply decide not to respect the web admin's wishes. It's entirely plausible that black-market LLMs might one day become more effective than well-known industry-standard LLMs precisely because they don't need to adhere to annoying things like ethics and laws.
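To underline the point, wget itself ships with a switch that turns off robots.txt handling entirely, and spoofing the User-agent string is one more flag away (reusing the earlier Wikipedia URL purely as an example):
# -e robots=off tells wget not to honor robots.txt;
# --user-agent makes it introduce itself as an ordinary browser
wget --recursive --level=1 -e robots=off --user-agent="Mozilla/5.0" https://en.wikipedia.org/wiki/Robots.txt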