Telling AI to go away (but politely)

One thing we’ve noticed over the past few months is a slow and steady increase in traffic to sites we host from services which most people would label as “AI”. This isn’t a blog post about the pros and cons of AI though; instead this is a post about telling those services – as politely as possible – to go away.
There are many reasons why we might want to do this. Content licensing might forbid re-use of data in certain ways, some content might have more serious implications if AI misinterprets it, or in some cases we might just plain disagree with people using whatever content they feel like to train their language models. Whatever the reason, sometimes we need to say “not today” to the robots.
Unfortunately, the world of AI doesn’t seem to have moved as far as getting models to read the licence for the content they’re crawling and figure out whether they’re allowed to use it. There’s probably a research paper in that for someone, but that’s a job for another day. Instead we’re left with the question of “how can we programmatically tell an AI crawler how it’s allowed to use the content on this page?”
What’s the problem?
The default stance of most services which grab content from the internet – search engines and AI crawlers, for example – is that things are fair game unless explicitly told otherwise. Whether this is right or wrong is a minefield of content rights, copyright law and philosophy, but over the years a set of standards has emerged for more mature technologies like search engines. These can be used to describe exactly what can and can’t be done with online content – can a crawler look at a page in the first place, for example, and if it can, should it then be able to index it?
There are also plenty of well-established standards for providing hints to search engines about what’s on a page, how data is structured and how it relates to other pieces of information, and there are standard ways of embedding licence and ownership details in things like images. All of these can be (and mostly are) used by well-behaved systems to make informed decisions about how to use content they find online.
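As one example, schema.org structured data offers a well-trodden way to attach licence details to an image (the URLs below are purely illustrative):

<!-- illustrative sketch; URLs are placeholders -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "ImageObject",
  "contentUrl": "https://example.com/photo.jpg",
  "license": "https://creativecommons.org/licenses/by-nc/4.0/"
}
</script>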
When it comes to artificial intelligence – as with all new technologies starting to find their stride – defining the rules for how it should be used lags behind its adoption. We took a quick look into the current options.
What can we do about it?
We’ve been looking at ways we can steer crawlers for AI services into doing what we want them to when it comes to consuming content on sites we run. Here are the most interesting and useful bits of what we found:
Use robots.txt
If you’ve done any kind of search engine tinkering or SEO in the past you’re probably familiar with robots.txt. The full specification is a bit boring, but the short version is it tells web crawlers how they should behave when faced with a site. It also allows us to tell specific crawlers what they can and can’t do, which means we can instruct GPTBot, for example, that it’s not allowed to crawl anything.
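For instance, a minimal robots.txt that tells GPTBot it isn’t welcome anywhere on the site looks like this:

User-agent: GPTBot
Disallow: /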
There are some problems with this approach though, namely that there’s no way of saying “all AI crawlers”. Instead you need to list each one you care about individually, and the default stance of “go for it” means that if you miss one it will carry on crawling content. People have tried to compile lists, but by their very nature they’re going to become out of date. Without some other standard this is just going to be an ongoing race between new models arriving with their own new crawlers (especially as cheaper compute time and roll-your-own models become more commonplace) and people updating their lists.
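Blocking even a handful of the crawlers that were publicly documented at the time of writing means repeating the same stanza for each one – and this list is illustrative, not exhaustive:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /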
It gets even messier when we consider that AI is able to – in effect – conduct its own searches and ‘look’ at web pages to try to answer questions. OpenAI lists three crawlers (at the time of writing), variously used for training its model, returning search results it thinks are useful, and extracting information from a page to answer a user query. Deciding where you’re happy to draw the line for the various capabilities of each service (assuming they’re distinguished at all) is just another thing to keep track of.
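Assuming the crawler names in OpenAI’s documentation at the time of writing (GPTBot for training, OAI-SearchBot for search results, ChatGPT-User for answering user queries; treat these names as an assumption to verify), a policy of “no training, but search and answers are fine” might look like:

# Assumed OpenAI user-agent names; check against their current documentation
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /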
The other big problem with this approach is that while robots.txt can technically express preferences on a per-page basis, doing so for a large site rapidly becomes unwieldy. Expressing preferences for training, indexing and querying across multiple pages and multiple crawlers is a combinatorial explosion, and could quickly lead to enormous files that services simply start to ignore as too big to process in a timely manner.
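As a sketch of how this grows (the paths here are hypothetical), every crawler needs its own block and every block needs a line per content area:

User-agent: GPTBot
Disallow: /articles/
Allow: /articles/press-releases/

User-agent: CCBot
Disallow: /articles/
Disallow: /guides/

# ...and another block like this for every crawler we know about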
Use the <meta name="robots"> tag
On individual pages, we can use the “robots” meta tag to indicate to search engine crawlers that they shouldn’t be indexing a page, as well as slightly more complex concepts such as “you can index this, but don’t display text snippets”. When it comes to AI there isn’t an accepted standard for indicating if content can be used or not, but there is an emerging one:
<meta name="robots" content="noai,noimageai">
This originally came from DeviantArt, but it seems to have growing acceptance among content-sharing sites as a de facto standard for indicating this kind of preference. Whether AI crawlers care about it, on the other hand, is very much unknown.
This also lacks the ability to express the difference between the use of content for training and the use of content for answering queries, meaning it can be a bit of a blunt instrument.
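For completeness, multiple robots meta tags can sit side by side in a page’s head, so the conventional directives and the emerging noai ones can be kept separate (values as discussed above):

<!-- sketch: combining standard and emerging directives -->
<head>
<meta name="robots" content="nosnippet">
<meta name="robots" content="noai,noimageai">
</head>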
Use domain-specific options
It turns out that there’s a standard mechanism (insofar as it’s included as a .well-known file) – ostensibly for news sites – for expressing trust relationships between services and highlighting things like responsible disclosure policies. This includes a datatrainingallowed parameter which can be used to tell crawlers, at site level, that we don’t want them using content for training.
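Assuming this is the trust.txt convention (our best guess at the spec in question), opting out is a single line in a file served from the site’s .well-known directory:

# /.well-known/trust.txt – format assumed from the trust.txt proposal
datatrainingallowed=no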
Unfortunately, as with robots.txt, there’s no mechanism here for expressing nuance. It’s an all-or-nothing switch between “use our whole site for training purposes” and “don’t train on any content on our site”. Domain-specific options also suffer from being narrow in scope by design, making them poorly suited to sharing your intent with crawlers operating outside that domain.
What’s the conclusion?
Ultimately, at the moment, there isn’t a good way to tell AI crawlers what they can and can’t do with content: only a set of suggestions with poor flexibility and even poorer implementation from crawlers, where it exists at all.
The most robust method with the most support seems to be the use of robots.txt, but new crawlers can appear at any time and start slurping content before anyone adds them to a list. And as services become more complex, with more nuanced abilities, robots.txt is badly suited to expressing different policies page by page.
Although noai directives give us the flexibility to specify exactly how we want AI to interact with individual pieces of content, support is thin on the ground. As far as we can tell, big players like OpenAI simply ignore them, and even where they are honoured they don’t let us express concepts like “don’t use this for training, but you can use it to try to answer questions”.
In the end, though, web standards are built from the ground up by the people who use them. We’d welcome a much more rigorous discussion around how content providers of all shapes and sizes can tell AI crawlers what they can and can’t do, but in the meantime using what’s available now helps to identify gaps in what we can express and puts pressure on those AI crawlers to be good internet citizens.