The increasing scrutiny of data sets used in artificial intelligence (AI) has led to a lack of transparency among AI’s major players. Companies like Meta, which initially disclosed the data sets used to train its ChatGPT competitor Llama, are now secretive about their data sources. The threat of lawsuits over copyrighted material in training sets is a powerful deterrent to disclosure, but the resulting opacity makes it difficult for writers to identify potential copyright infringement.
At present, AI companies can choose whether to disclose or withhold information about their training sets. Without that information, individuals find it nearly impossible to prove their data was used or to request its removal. The European Parliament has passed a draft AI law that emphasizes increased data transparency, but those rules are not yet in force, and other regions lag significantly behind in adopting similar measures.
The battle over data sets like Books3 highlights the ongoing debate about AI’s role in our society. Copyright law aims to strike a balance between creators’ rights and the collective right to access information. In the age of AI, the fight over data sets raises the question of what that balance should look like.
Those who advocate for data transparency argue that if companies like OpenAI have access to certain data sets, the public should also have access to them. Restricting access to data sets like Books3 might hinder innovation in the industry, limiting the entry of smaller companies and researchers while allowing established players to maintain their dominance.
Pam Samuelson, a copyright lawyer at the Berkeley Center for Law and Technology, suggests that cracking down on the use of data sets like Books3 might only benefit large corporations that have already utilized them. She points out that regulations cannot be applied retroactively and that stricter rules in the EU or US could create a phenomenon called “innovation arbitrage.” This refers to AI entrepreneurs flocking to countries with more lenient regulations, such as Israel and Japan.
The central issue underlying this debate is whether we should accept generative AI training on copyrighted material as an inevitability. Stephen King, after discovering that his work is included in Books3, stated that he would not forbid the teaching of his stories to computers. He compared trying to stop generative AI training to King Canute attempting to halt the tide or a Luddite destroying a steam loom to prevent industrial progress.
Idealists like Butterick and Hedrup refuse to give up the fight to regain control for creators. They advocate for a shift towards an opt-in model for generative AI training, where only works in the public domain or explicitly offered for such purposes are used in data sets. Eryk Salvaggio, an emerging technology researcher, supports this approach and believes that data sets should not be scraped from the web without permission.
While a comprehensive solution to this complex issue remains uncertain, intermediary efforts are under way. Startups like Spawning are persuading generative AI groups to respect creators’ wishes and keep their work out of data sets. Spawning’s search engine, “Have I Been Trained?,” currently lets people check whether their visual work has been used in AI training data sets, and the company plans to add support for video, audio, and text in the near future. Spawning also offers an API that helps companies honor opt-outs. Stability AI has already implemented the tool, and Spawning’s CEO, Jordan Meyer, hopes that larger companies like OpenAI and Meta will join the cause. Shawn Presser, who wants to help creative types protect their work, has also expressed support for individuals’ right to opt out of data sets.
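To make the idea of honoring opt-outs more concrete, here is a minimal sketch of how a training pipeline might consult an opt-out registry before a work ever enters a data set. The registry URL, the is_opted_out helper, and the response format are hypothetical stand-ins used purely for illustration; they are not Spawning’s actual API.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical opt-out registry endpoint; a stand-in for whatever
# service a training pipeline actually queries (NOT Spawning's real API).
OPT_OUT_REGISTRY = "https://registry.example.com/optouts/check"

def is_opted_out(work_url: str) -> bool:
    """Return True if the creator of this work has opted out of AI training.

    Fails closed: if the registry is unreachable or returns something
    unexpected, the work is treated as opted out.
    """
    query = f"{OPT_OUT_REGISTRY}?url={urllib.parse.quote(work_url, safe='')}"
    try:
        with urllib.request.urlopen(query, timeout=5) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
        return bool(payload.get("opted_out", True))
    except Exception:
        return True

def filter_training_set(candidate_urls):
    """Keep only works whose creators have not opted out."""
    return [url for url in candidate_urls if not is_opted_out(url)]

if __name__ == "__main__":
    candidates = [
        "https://example.org/novel-excerpt.txt",
        "https://example.org/blog-post.html",
    ]
    print(filter_training_set(candidates))
```

A real integration would batch lookups and cache results, but the design choice worth noting is that the filter fails closed: if the registry cannot be reached, a work is excluded from the data set rather than included by default.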
In conclusion, increased scrutiny of the data sets used in AI has led to a lack of transparency among major players in the industry. The battle over these data sets highlights the ongoing debate surrounding AI’s role and the balance between creators’ rights and the public’s right to access information. Efforts to promote data transparency and respect creators’ wishes are under way, but a comprehensive solution will require collaboration between AI companies, legislators, and copyright experts.