What Can We Do about LLMs on Our Independent Websites?
The following post appeared in my Mastodon timeline this morning:
As of today, MacStories is disallowing all known AI crawlers, including ChatGPT and Applebot. If you want to read our site, you should use a web browser or RSS.
✌️
Federico Viticci, June 13, 2023
This pronouncement could be, indeed, is being made by countless others who run independent websites. Questions and anxiety abound, but we are nowhere close to fashioning responses adequate to the current moment. A visceral sense of violation, of wrong, is hitting many of us, who have long since had enough of Silicon Valley, Madison Avenue, and all the rest treating the traces we leave online as a natural resource to extract value from. Yesterday, the above-quoted observer on Mastodon reported, “I was grossed out when I read that Applebot scraped ‘the open web’ to train their AI model…”
My most immediate question about his no-AIs-on-MacStories statement was how? How are you working to keep LLMs from feasting on your site? Take the robots.txt file: Are there settings that allow search engines to index pages while instructing those same search engines that their LLM activities are not welcome? How do you cover all known LLMs and related scraping activities in a meaningful way? Aren’t these crawlers and their user-agent strings multiplying and mutating in ways that exceed our individual tracking abilities, not to mention the limits of what a robots.txt file can hold?
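For my part, the closest I have come is a set of robots.txt rules keyed to the user-agent tokens that some AI operators document, such as GPTBot (OpenAI) and CCBot (Common Crawl). Here is a minimal sketch, with the caveats that the list of tokens keeps changing and that compliance is entirely voluntary:

# Refuse OpenAI’s training crawler (token per OpenAI’s documentation)
User-agent: GPTBot
Disallow: /

# Refuse Common Crawl, whose corpus feeds many LLM training sets
User-agent: CCBot
Disallow: /

# Ordinary search indexing remains welcome
User-agent: *
Allow: /

A crawler that ignores robots.txt, or that scrapes under an undisclosed user agent, is untouched by any of this, which is rather the point of my question.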
Alternatively, have you found a way to make copyright and licensing statements that these tools will understand and honor? I’m trying meta tags, but I think only the first of the two below is standard:
<meta name="author" content="Mark R. Stoneman">
<meta name="license" content="CC BY-NC-SA 4.0">
I’m also using semantic tags to ensure my copyright and CC BY-NC-SA 4.0 license statements in the footer are easily discoverable, but it all feels so provisional, so early-2000s-wild-west again. Only worse.
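For the curious, that footer markup amounts to something like the following sketch. The rel="license" link type is registered in the HTML spec, though nothing obliges a crawler to parse it, let alone honor it:

<footer>
  <p><small>
    © Mark R. Stoneman. This work is licensed under
    <a rel="license" href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC BY-NC-SA 4.0</a>.
  </small></p>
</footer>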
Moreover, the LLM challenge reaches beyond simple illegal copying, downloading, storage, and retrieval. These LLMs are being trained on our words, trained, it almost seems, by lurking in our conversations and learning. Their value proposition is that they will build something new on the basis of what already exists. Although this feels purely extractive (like Elsevier’s contribution to academic publishing), how could a direct link between any one person’s output and that of an LLM ever be established?
Perhaps we are mixing feelings and concepts in ways that do not necessarily apply to the current situation. To begin with, there is that sense of personal violation that many of us have long felt with respect to the trackers that Google and other marketing entities use without our permission. The European Union’s GDPR is meant to address that problem for its citizenry, but the phenomenon of LLMs training on our language does not necessarily amount to an infringement of our data privacy—not if they are “learning” how to “listen” and “talk”, but not retaining data on our personal habits for later use. At the same time, this LLM scraping feels even creepier and more extractive than what advertising trackers were doing in the pre–LLM era.
I’m trying to imagine how we can regain a sense of control over the things we publish on the internet. We need practical web standards and tools that give us choices that the big LLM players are willing to respect. We need metaphors that reach beyond copyright and privacy, when necessary, but that don’t lead us to anthropomorphize the LLMs and their “learning”, thereby making the problem of so-called AI even worse. And we need jurisprudence that grapples with the sense of violation for which privacy law and copyright alone seem inadequate. Meanwhile, companies that use our linguistic existences to sell language-based LLM services back to us would do well to listen to their critics and adjust their activities accordingly. Wrapping everything up in pretty privacy policies and new-feature hype will not be enough for them to thrive, although their deep pockets will help buy the legal frameworks they want, if we’re not paying attention.