For more than 20 years, Kit Loffstadt has written fan fiction exploring alternate universes for “Star Wars” heroes and “Buffy the Vampire Slayer” villains, sharing her stories free online.
But in May, Ms. Loffstadt stopped posting her creations after she learned that a data company had copied her stories and fed them into the artificial intelligence technology underlying ChatGPT, the viral chatbot. Dismayed, she hid her writing behind a locked account.
Ms. Loffstadt also helped organize an act of rebellion last month against AI systems. Along with dozens of other fan fiction writers, she published a flood of irreverent stories online to overwhelm and confuse the data-collection services that feed writers’ work into AI technology.
“We each have to do whatever we can to show them the output of our creativity is not for machines to harvest as they like,” said Ms. Loffstadt, a 42-year-old voice actor from South Yorkshire in Britain.
Fan fiction writers are just one group now staging revolts against AI systems as a fever over the technology has gripped Silicon Valley and the world. In recent months, social media companies such as Reddit and Twitter; news organizations, including The New York Times and NBC News; and creators such as the author Paul Tremblay and the actress Sarah Silverman have all taken a position against AI sucking up their data without permission.
Their protests have taken different forms. Writers and artists are locking their files to protect their work or are boycotting certain websites that publish AI-generated content, while companies like Reddit want to charge for access to their data. At least 10 lawsuits have been filed this year against AI companies, accusing them of training their systems on artists’ creative work without consent. This past week, Ms. Silverman and the authors Christopher Golden and Richard Kadrey sued OpenAI, the maker of ChatGPT, and others over AI’s use of their work.
At the heart of the rebellions is a newfound understanding that online information — stories, artwork, news articles, message board posts and photos — may have significant untapped value.
The new wave of AI — known as “generative AI” for the text, images and other content it generates — is built atop complex systems such as large language models, which are capable of producing humanlike prose. These models are trained on hoards of all kinds of data so they can answer people’s questions, mimic writing styles or churn out comedy and poetry.
That has set off a hunt by tech companies for even more data to feed their AI systems. Google, Meta and OpenAI have essentially used information from all over the internet, including large databases of fan fiction, troves of news articles and collections of books, much of which was available free online. In tech industry parlance, this was known as “scraping” the internet.
OpenAI’s GPT-3, an AI system released in 2020, spans 500 billion “tokens,” each representing parts of words found mostly online. Some AI models span more than one trillion tokens.
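A rough sense of what a "token" is can be sketched in a few lines of Python. This toy splitter is a hypothetical illustration, not OpenAI's actual byte-pair-encoding tokenizer: it merely separates words from punctuation, whereas real tokenizers also break rarer words into several sub-word pieces.

```python
import re

def toy_tokenize(text):
    # Crude stand-in for a real tokenizer: grab runs of word
    # characters, or single punctuation marks, as "tokens."
    # Production models use byte-pair encoding, which can split
    # an uncommon word like "postapocalyptic" into smaller units.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("Fan fiction isn't for machines to harvest.")
print(tokens)
# Note how even the apostrophe in "isn't" becomes its own token,
# so the sentence yields more tokens than it has words.
print(len(tokens))
```

Counting a training corpus in tokens rather than words is why the totals run to hundreds of billions: every scrap of punctuation and every word fragment adds to the tally.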
The practice of scraping the internet is longstanding and was largely disclosed by the companies and nonprofit organizations that did it. But it was not well understood or seen as particularly problematic by the companies that owned the data. That changed after ChatGPT debuted in November and the public learned more about the underlying AI models that powered the chatbots.
“What’s happening here is a fundamental realignment of the value of data,” said Brandon Duderstadt, the founder and chief executive of Nomic, an AI company. “Previously, the thought was that you got value from data by making it open to everyone and running ads. Now, the thought is that you lock up your data, because you can extract much more value when you use it as an input to your AI.”
The data protests may have little effect in the long run. Deep-pocketed tech giants like Google and Microsoft already sit on mountains of proprietary information and have the resources to license more. But as the era of easy-to-scrape content comes to a close, smaller AI upstarts and nonprofits that had hoped to compete with the big firms might not be able to obtain enough content to train their systems.
In a statement, OpenAI said ChatGPT was trained on “licensed content, publicly available content and content created by human AI trainers.” It added, “We respect the rights of creators and authors, and look forward to continuing to work with them to protect their interests.”
Google said in a statement that it was involved in talks on how publishers could manage their content in the future. "We believe everyone benefits from a vibrant content ecosystem," the company said. Microsoft did not respond to a request for comment.
The data revolts erupted last year after ChatGPT became a worldwide phenomenon. In November, a group of programmers filed a proposed class action lawsuit against Microsoft and OpenAI, claiming the companies had violated their copyright after their code was used to train an AI-powered programming assistant.
In January, Getty Images, which provides stock photos and videos, sued Stability AI, an AI company that creates images out of text descriptions, claiming the start-up had used copyrighted photos to train its systems.
Then in June, Clarkson, a law firm in Los Angeles, filed a 151-page proposed class action suit against OpenAI and Microsoft, describing how OpenAI had gathered data from minors and arguing that web scraping violated copyright law and constituted "theft." On Tuesday, the firm filed a similar suit against Google.
“The data rebellion that we’re seeing across the country is society’s way of pushing back against this idea that Big Tech is simply entitled to take any and all information from any source whatever, and make it their own,” said Ryan Clarkson, the founder of Clarkson.
Eric Goldman, a professor at Santa Clara University School of Law, said the lawsuit’s arguments were expansive and unlikely to be accepted by the court. But the wave of litigation is just beginning, he said, with a “second and third wave” coming that would define AI’s future.
Larger companies are also pushing back against AI scrapers. In April, Reddit said it wanted to charge for access to its application programming interface, or API, the method through which third parties can download and analyze the social network's vast database of person-to-person conversations.
Steve Huffman, Reddit’s chief executive, said at the time that his company didn’t “need to give all of that value to some of the largest companies in the world for free.”
That same month, Stack Overflow, a question-and-answer site for computer programmers, said it would also ask AI companies to pay for data. The site has nearly 60 million questions and answers. Its move was earlier reported by Wired.
News organizations are also resisting AI systems. In an internal memo about the use of generative AI in June, The Times said AI companies should “respect our intellectual property.” A Times spokesperson declined to elaborate.
For individual artists and writers, fighting back against AI systems has meant rethinking where they publish.
Nicholas Kole, 35, an illustrator in Vancouver, British Columbia, was alarmed by how his distinct art style could be replicated by an AI system and suspected the technology had scraped his work. He plans to keep posting his creations to Instagram, Twitter and other social media sites to attract clients, but he has stopped publishing on sites like ArtStation that post AI-generated content alongside human-generated content.
"It just feels like wanton theft from me and other artists," Mr. Kole said. "It puts a pit of existential dread in my stomach."
At Archive of Our Own, a fan fiction database with more than 11 million stories, writers have increasingly pressured the site to ban data-scraping and AI-generated stories.
In May, when some Twitter accounts shared examples of ChatGPT mimicking the style of popular fan fiction posted on Archive of Our Own, dozens of writers rose up in arms. They blocked their stories and wrote subversive content to mislead the AI scrapers. They also pushed Archive of Our Own’s leaders to stop allowing AI-generated content.
Betsy Rosenblatt, who provides legal advice to Archive of Our Own and is a professor at the University of Tulsa College of Law, said the site had a policy of "maximum inclusivity" and did not want to be in the position of discerning which stories were written with AI.
For Ms. Loffstadt, the fan fiction writer, the fight against AI came as she was writing a story about “Horizon Zero Dawn,” a video game where humans fight AI-powered robots in a postapocalyptic world. In the game, she said, some of the robots were good and others were bad.
But in the real world, she said, “thanks to hubris and corporate greed, they are being twisted to do bad things.”