Introduction
Large language models (LLMs) rely on vast corpora of text for training. When a portion of that corpus includes manuscripts - texts that may be under copyright protection - the question arises whether the training process infringes the rights of the copyright holders. Because a model rarely reproduces text verbatim, training is seldom a straightforward act of copying, yet it sits uncomfortably close to one. Legal scholars, technologists, and publishers have examined the boundaries of permissible use, the scope of fair use, and the responsibilities of model developers. This article surveys the key legal concepts, policy debates, and industry practices that shape the discourse on training data intersecting manuscripts.
Background
Historical Context of Copyright and Technology
Copyright law has historically been designed to incentivize creation by granting creators exclusive rights over the use of their works. As technology advanced - from mechanical printing to digital reproduction - legislators introduced provisions that adapt to new media. The 1976 Copyright Act, the 1998 Digital Millennium Copyright Act (DMCA), and the 1998 Sonny Bono Copyright Term Extension Act (sometimes called the "Mickey Mouse" law) each responded to technological shifts. In the realm of artificial intelligence, however, the legal framework remains less precise.
Emergence of LLMs and Training Data Demands
The 2010s saw a surge in data-driven machine learning. In 2017, the transformer architecture introduced by Vaswani et al. enabled models to capture long-range dependencies in text. Subsequent large-scale training efforts, such as OpenAI's GPT series, Google's BERT and LaMDA, and Meta's LLaMA, required tens or hundreds of billions of tokens. The data sources ranged from publicly available web pages to academic articles, books, and user-generated content. Many of these texts remained under copyright at the time of collection.
Legal Uncertainty in the Early 2020s
The 1991 United States Supreme Court case Feist Publications, Inc. v. Rural Telephone Service Co. reaffirmed that originality, rather than mere compilation, is required for copyright protection. Decided long before large-scale machine learning, however, it offers no direct guidance on model training. Instead, courts have relied on the fair use doctrine, which weighs the purpose, nature, amount, and effect of the use. The lack of definitive precedent has encouraged a patchwork of industry policies and judicial interpretations.
Key Concepts
Copyright and Derivative Works
A derivative work is a new creation that incorporates, transforms, or adapts an existing copyrighted text. The training of an LLM that uses copyrighted text could be viewed as creating a derivative work if the model’s internal representations are substantially similar to the original. However, the transformation is not straightforward; the model learns statistical patterns rather than producing a direct copy.
Fair Use Doctrine
Fair use is a statutory exception that allows limited use of copyrighted material without permission. The four factors traditionally applied are:
- Purpose and character of the use. Transformative uses - those that add new meaning or value - are favored.
- Nature of the copyrighted work. Copying from factual works is more readily excused than copying from highly creative works.
- Amount and substantiality. Using smaller portions may be more defensible, though not determinative.
- Effect on the market. Uses that do not harm the market for the original are more likely to be fair.
Whether training an LLM satisfies these criteria is debated. Many scholars argue that the process is transformative, but others highlight potential market harm, especially if the model is used to generate text that competes with the original.
Licensing and Permission Models
Some organizations obtain licenses to use copyrighted text for training. The U.S. Copyright Office opened a study of artificial intelligence and copyright in 2023 and has published guidance on the registration of works containing AI-generated material. In the European Union, the 2019 Copyright Directive (Directive 2019/790) introduced text-and-data-mining exceptions (Articles 3 and 4); because Article 4 lets rights holders reserve their rights, it has pushed developers toward explicit licensing for opted-out content.
Data Ownership and Moral Rights
Beyond statutory rights, authors may hold moral rights, particularly in jurisdictions such as France, Germany, and Spain. Moral rights include the right to attribution and the right to protect the integrity of the work. These rights can affect the acceptability of training data that distorts or misrepresents a text.
Model Outputs and Copyrightability
One core question is whether the text generated by an LLM is itself subject to copyright. Under U.S. law, a purely random or algorithmic creation lacking human authorship is not protectable. Yet, if a model’s output is strongly derivative of a specific manuscript, a court could consider it infringing. The line between inspiration and copying remains contested.
Training Data Acquisition
Sources of Manuscripts
Manuscripts used for training can come from various sources:
- Public domain works. Texts whose copyrights have expired; in the United States, works first published more than 95 years ago (before 1929, as of 2024) are in the public domain.
- Open-access repositories. Platforms like arXiv and PubMed Central provide free access to scientific articles.
- Copyrighted works available under license. Publishers or authors may grant permission to use their works in exchange for licensing fees.
- Web-scraped content. Many corpora are compiled by crawling websites without explicit permission, raising potential infringement concerns.
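These provenance categories can be tracked per document. Below is a minimal sketch of such a record in Python; the field names and category values are illustrative assumptions, not an established schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProvenanceRecord:
    """Per-document provenance entry for a training corpus (illustrative schema)."""
    doc_id: str
    source: str           # e.g. "public_domain", "open_access", "licensed", "web_scrape"
    license: str          # e.g. "public-domain", "CC-BY-4.0", "unknown"
    rights_cleared: bool  # has use for training been verified?

def needs_review(record: ProvenanceRecord) -> bool:
    """Flag documents whose rights status is unresolved before training."""
    return record.source == "web_scrape" and not record.rights_cleared

print(needs_review(ProvenanceRecord("doc-001", "web_scrape", "unknown", False)))  # → True
```

A record like this makes the web-scraped category, the riskiest of the four, mechanically auditable rather than a matter of institutional memory.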
Consent and Ethical Considerations
Even if a text is publicly available, its author may object to its use in training. Ethical frameworks, such as those proposed by the Association for Computing Machinery (ACM), emphasize respecting the wishes of creators. Some research groups have established "Data Use Agreements" that outline the terms under which manuscripts can be processed.
Technological Measures for Data Filtering
Large-scale training pipelines employ filters to remove disallowed content. Techniques include keyword detection, metadata analysis, and author-provided blacklists. However, automated filtering cannot guarantee that all copyrighted manuscripts are excluded, and it can inadvertently remove relevant public domain texts.
Legal Risks and Liability
Potential Claims by Copyright Holders
Copyright holders could pursue claims for:
- Direct infringement: if the model outputs text that is substantially similar to the original.
- Vicarious infringement: if the developer has the right and ability to control the infringing activity and derives a financial benefit from it.
- Contributory infringement: if the developer knowingly facilitates infringing uses of the model by others.
Defenses Available to Model Developers
Defenses typically hinge on fair use and on the nature of the training process. Courts may consider whether the training was performed with the intent to create a derivative product or whether it was merely a data-processing step. The absence of a direct market impact may bolster the fair use defense.
Recent Jurisprudence
Litigation is actively testing these questions. In the United States, Authors Guild v. Google (2d Cir. 2015) held that scanning millions of books to build a searchable index was transformative fair use, a precedent frequently invoked in the AI training context. More recent suits, including The New York Times Co. v. Microsoft and OpenAI (filed 2023) and the Authors Guild's class action against OpenAI, squarely raise the fair use status of model training. In the UK, Getty Images' claim against Stability AI raises parallel questions about infringement both in training and in model outputs.
Fair Use Doctrine in Context
Transformative Analysis of LLM Training
Transformative use requires that the new work adds new expression or meaning. Training an LLM to generate text can be viewed as transformative because the model abstracts linguistic patterns rather than reproducing specific sentences. However, critics argue that if the model can reproduce a paragraph from a copyrighted manuscript verbatim, the transformation is insufficient.
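The verbatim-reproduction concern can be made measurable with an n-gram overlap test between a model output and a reference manuscript. The sketch below uses whitespace tokenization and a default span of 8 tokens; both are illustrative engineering choices, not legal standards:

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """All contiguous n-token spans of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(output: str, source: str, n: int = 8) -> bool:
    """True if any n-token span of the output appears verbatim in the source."""
    return bool(ngrams(output.split(), n) & ngrams(source.split(), n))

source = "it was the best of times it was the worst of times"
print(verbatim_overlap("he said it was the best of times it was fine", source, n=6))  # → True
print(verbatim_overlap("a completely unrelated sentence", source, n=3))               # → False
```

A test like this only detects exact copying; paraphrase that is "substantially similar" in the legal sense would require fuzzier comparisons.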
Market Impact Considerations
One challenge is quantifying the effect of LLM outputs on the market for the original works. If the model is used to produce summaries or alternative editions, it may compete with existing products. Conversely, if it only aids in creative tasks unrelated to the original, the market harm may be negligible.
Amount and Substantiality of Data Used
While no specific threshold is legally defined, courts consider the proportion of each work used. Ingesting full texts, even when they make up only a small fraction of the overall training corpus, can weigh against fair use on this factor, whereas using only fragments may be more defensible.
Licensing Models for Training Data
Exclusive vs. Non-Exclusive Licenses
Exclusive licenses grant a single party the sole right to use a work for training, while non-exclusive licenses allow multiple developers to access the same text. Non-exclusive licenses are more common in the AI industry, reducing the risk of monopolistic practices.
Creative Commons and Open Licenses
Works released under Creative Commons (CC) licenses can be used in training, provided the license terms are respected. For example, a CC‑BY license permits use as long as attribution is given. Many datasets draw on Common Crawl, a nonprofit web archive whose contents mix public domain, openly licensed, and fully copyrighted text.
Industry Consortiums and Data Sharing Agreements
Consortia like the Partnership on AI and the European AI Alliance have explored standardized licensing frameworks. These initiatives aim to streamline access to diverse corpora while protecting authors' rights. Such consortia often establish "data trusts" that hold rights and manage permissions centrally.
Case Studies
OpenAI’s GPT‑4 Training Corpus
OpenAI has disclosed that GPT‑4 was trained on a mixture of publicly available data and data licensed from third parties, along with data created by human trainers. The organization describes proprietary filtering processes intended to limit the inclusion of certain content, but the details are not public, and the approach has been scrutinized by copyright holders and legal scholars.
Google’s LaMDA and the Google Books Dataset
Google's LaMDA was trained primarily on dialogue data and public web documents, and Google's broader language-model efforts have reportedly also drawn on digitized books. The Google Books corpus contains scanned copies of both public domain and copyrighted works; Google has stated that it applies a combination of license agreements and automated filtering to mitigate infringement risks.
Meta’s LLaMA and the European Copyright Directive
Meta's LLaMA model was trained using data from Common Crawl, Wikipedia, and other open sources. In Europe, the 2019 Copyright Directive's text-and-data-mining exception (Article 4) allows rights holders to opt out, and the 2024 AI Act obliges providers of general-purpose models to adopt a copyright-compliance policy and publish a summary of their training content. Meta has indicated that it is aligning its data practices with these requirements.
Best Practices for Model Developers
Data Audit and Documentation
Maintaining a detailed inventory of data sources, including licensing status and the extent of use, facilitates compliance and risk assessment. Transparency can also improve trust with stakeholders.
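An inventory like this lends itself to simple automated summaries that surface compliance gaps; a sketch follows, where the row format is an illustrative assumption:

```python
from collections import Counter

# Illustrative inventory rows: (doc_id, source, license_status).
inventory = [
    ("d1", "arxiv", "open-access"),
    ("d2", "web", "unknown"),
    ("d3", "publisher", "licensed"),
    ("d4", "web", "unknown"),
]

def audit_summary(rows: list[tuple[str, str, str]]) -> Counter:
    """Count documents by license status; 'unknown' entries need review."""
    return Counter(status for _, _, status in rows)

print(audit_summary(inventory)["unknown"])  # → 2
```

Even a count this crude tells a compliance team where to focus: every "unknown" is an unassessed legal risk.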
Implementing Robust Filtering Mechanisms
Techniques such as token-level filtering, language-model-based similarity detection, and author-provided blacklists help reduce the inclusion of copyrighted text. Developers should continuously update filters to reflect changes in licensing or new public domain status.
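Similarity detection against a set of known copyrighted references is often approximated with shingle overlap. The sketch below uses Jaccard similarity over word shingles; the shingle size and threshold are illustrative tuning choices, not established values:

```python
def shingles(text: str, k: int = 5) -> set[tuple[str, ...]]:
    """All contiguous k-word windows of the text, lowercased."""
    toks = text.lower().split()
    if len(toks) < k:
        return {tuple(toks)}
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def too_similar(candidate: str, reference: str, threshold: float = 0.5) -> bool:
    """Flag a candidate whose shingle overlap with a copyrighted
    reference exceeds the (illustrative) threshold."""
    return jaccard(shingles(candidate), shingles(reference)) >= threshold

text = "in the beginning the narrator recalls the long winter of the village"
print(too_similar(text, text))                                        # → True
print(too_similar("an unrelated report about river ecology", text))   # → False
```

At corpus scale, exact set intersection is too slow, which is why production systems typically replace it with MinHash or similar sketching techniques.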
Seeking Explicit Licenses When Possible
Where feasible, obtaining explicit licenses from authors or publishers can reduce ambiguity. Even short-term licenses can provide a legal basis for training, provided the terms are respected.
Monitoring Legal Developments
The legal landscape for AI training is evolving. Developers should monitor court decisions, legislative proposals, and policy guidance from bodies such as the U.S. Copyright Office and the European Commission.
Future Directions
Potential Legislative Reforms
Some proposals aim to clarify the status of AI training data. In the United States, for example, the Generative AI Copyright Disclosure Act introduced in 2024 would require developers to disclose the copyrighted works used in training. European lawmakers have addressed related transparency questions through the AI Act and the broader Digital Single Market strategy.
Technological Innovations in Data Curation
Advances in natural language processing can enable more accurate identification of copyrighted material. Techniques like copyright-aware embeddings and content provenance tracking are emerging.
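Content provenance tracking often starts with a cryptographic fingerprint per document, so later audits can match exact copies regardless of filename or URL. A minimal sketch, with illustrative record fields:

```python
import hashlib

def provenance_entry(text: str, source_url: str) -> dict:
    """Fingerprint a document's content for later provenance audits."""
    return {
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "source_url": source_url,  # illustrative metadata field
    }

entry = provenance_entry("Call me Ishmael.", "https://example.org/moby")
print(len(entry["sha256"]))  # → 64 (hex characters)
```

Exact hashing only catches byte-identical copies; near-duplicate and paraphrase detection is what the embedding-based approaches mentioned above aim to provide.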
Collaboration Between Creators and AI Developers
Platforms that allow authors to monetize the use of their works in AI training could incentivize compliance. Models of revenue sharing and credit attribution are being explored in pilot projects.
International Harmonization of Copyright Rules
Given the global nature of data collection, aligning copyright regimes across jurisdictions could reduce the complexity of compliance. International agreements under the World Intellectual Property Organization (WIPO) could play a role.