Is AI copying or creating? Study suggests copyright should focus on data dependence


CO-EDP, VisionRI | Updated: 18-02-2026 10:40 IST | Created: 18-02-2026 10:40 IST
Representative Image. Credit: ChatGPT

A new economic study argues that existing copyright law is ill-equipped to handle the rise of generative artificial intelligence. The authors contend that the legal test courts rely on today fails to capture the true nature of creative dependence in the AI era.

The study, titled Creative Ownership in the Age of AI and released as an arXiv working paper, lays out a new framework for determining when AI-generated output should count as infringement. Instead of asking whether an output is substantially similar to a protected work, the authors propose a more demanding standard: whether the output could have been generated without that work being present in the training data.

From similarity to dependence: A new test for infringement

Current copyright law typically focuses on resemblance. Courts ask whether a new work is substantially similar to protected expression. This approach draws a firm line between protected expression and unprotected ideas or styles. Under this framework, copying an author's tone or an artist's aesthetic does not usually qualify as infringement unless protected elements are reproduced.

Generative AI systems complicate that distinction. Large language models can mimic the cadence of a novelist or the structure of a genre without lifting any identifiable passages. Image models can recreate visual styles without copying specific works. In most cases, such outputs would not meet the substantial similarity threshold.

The authors contend that this resemblance-based test misses a deeper issue: whether a specific work was essential to producing a given output. They introduce a counterfactual criterion: an AI-generated output infringes on an existing work if it could not have been generated without that work in the training corpus.

Under this approach, the question becomes causal rather than visual. If removing a work from the training data would make the output impossible to generate, then the output depends critically on that work. That dependence, the authors argue, forms a stronger basis for infringement than surface-level similarity.

This standard cuts in both directions. An output might look nothing like any single work but still depend critically on one of them. Conversely, an output might resemble an author's style yet remain lawful if there are alternative generative paths that do not rely on that author's work.

The paper positions this dependence-based test not as a wholesale replacement for existing law, but as a complementary criterion that could coexist with training-stage rules and output-stage similarity tests. In effect, it adds a third layer of analysis focused on generative necessity.

Modeling AI generation as a structured creative system

To formalize their proposal, the authors develop a mathematical model of creative production. In their framework, each creation is represented as a point in a multidimensional space. A corpus of works forms a set of such points. A generative AI system is modeled as a mapping that transforms a corpus into the set of outputs it can produce.

The authors define a generator with three core properties. First, preservation: any work in the corpus remains generable. Second, monotonicity: expanding the corpus cannot reduce the set of generable outputs. Third, idempotence: once the generator has produced its full range of outputs, running it again adds nothing new.
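These three properties can be checked mechanically on a toy generator. The sketch below is not the paper's formalism; it is a minimal illustration (all names hypothetical) that models works as frozensets of motifs and the generator as closure under pairwise combination, which happens to satisfy all three axioms:

```python
from itertools import combinations

def generate(corpus):
    """Toy generator: closure of a corpus of 'works' (frozensets of motifs)
    under pairwise combination (set union). The closure is finite, so the
    loop terminates."""
    out = set(corpus)
    while True:
        new = {a | b for a, b in combinations(out, 2)} - out
        if not new:
            return out
        out |= new

A, B, C = frozenset({"a"}), frozenset({"b"}), frozenset({"c"})
corpus = {A, B}

# Preservation: every work in the corpus remains generable.
assert corpus <= generate(corpus)

# Monotonicity: expanding the corpus never shrinks the generable set.
assert generate(corpus) <= generate(corpus | {C})

# Idempotence: regenerating from the full output adds nothing new.
assert generate(generate(corpus)) == generate(corpus)
```

Any closure operator behaves this way, which is why the union-closure stand-in is a convenient test bed even though real generative models are far messier.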

Within this structure, they define two central sets. The permissible set consists of outputs that remain generable even if any single work is removed from the corpus. The violation set consists of outputs that depend critically on at least one specific work.

The paper then derives several structural properties. The permissible set expands as the corpus grows. If more works are added, the space of non-infringing outputs does not shrink. The permissible set is also stable under further generation, meaning that combining permissible outputs cannot suddenly create a violation.
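The leave-one-out definition of the two sets can be made concrete with the same toy union-closure generator (a hypothetical stand-in, not the paper's model). An output is permissible only if it survives the removal of every single work:

```python
from itertools import combinations

def generate(corpus):
    """Toy generator: closure of works (frozensets of motifs) under union."""
    out = set(corpus)
    while True:
        new = {a | b for a, b in combinations(out, 2)} - out
        if not new:
            return out
        out |= new

def split_outputs(corpus):
    """Leave-one-out test: an output is permissible if it stays generable
    after removing any single work; otherwise it depends critically on at
    least one work and falls in the violation set."""
    full = generate(corpus)
    permissible = {x for x in full
                   if all(x in generate(corpus - {w}) for w in corpus)}
    return permissible, full - permissible

A, B = frozenset({"a"}), frozenset({"b"})
AB = frozenset({"a", "b"})

perm, viol = split_outputs({A, B, AB})
# AB has two generative paths (it sits in the corpus, and combining A with B
# recreates it), so it is permissible; A and B depend only on themselves.
assert perm == {AB} and viol == {A, B}

# Growing the corpus never shrinks the permissible set.
perm2, _ = split_outputs({A, B, AB, frozenset({"c"})})
assert perm <= perm2
```

The AB example also shows the "cuts in both directions" point from earlier: an output identical to a corpus work can still be lawful under this test, because an alternative generative path exists.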

The authors identify conditions under which the permissible set is guaranteed to be nonempty. In certain geometric settings, if the corpus contains enough distinct works relative to the dimensionality of the creative space, there will always be at least some outputs that do not rely critically on any one input.

The model also captures how adding new works changes the landscape. If a newly added work was already permissible, the permissible set remains unchanged. If the new work was previously a violation, its addition expands the permissible set by creating an additional generative path. If the new work is genuinely novel and lies outside the existing generative frontier, the effect can be ambiguous.

The authors extend their framework to collections of protected works. A portfolio, a catalog, or a coalition of creators can be treated as a protected group. In such cases, the permissible set shrinks relative to the case of individual claims. The paper shows that violations can be superadditive: the combined claim of a group can exceed the sum of individual claims. This finding has implications for collective bargaining and licensing negotiations.
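Superadditivity is easy to reproduce in a toy setting. The sketch below (hypothetical, not drawn from the paper) uses integer "works" and a generator that closes the corpus under addition: an output can be independent of each work taken alone, yet critically dependent on the group:

```python
def generate(corpus, cap=20):
    """Toy generator: closure of integer 'works' under addition,
    truncated at `cap` so the closure stays finite."""
    out = set(corpus)
    while True:
        new = {a + b for a in out for b in out if a + b <= cap} - out
        if not new:
            return out
        out |= new

def depends_on(output, corpus, removed):
    """Counterfactual dependence: True if `output` is no longer generable
    once the `removed` works are deleted from the corpus."""
    return output not in generate(corpus - removed)

corpus = {2, 3}

# Individually, 12 depends on neither work: 12 = 3+3+3+3 = 2+2+2+2+2+2.
assert not depends_on(12, corpus, {2})
assert not depends_on(12, corpus, {3})

# But 12 depends on the group {2, 3}: remove both and nothing is generable,
# so the group's violation set strictly exceeds the union of individual ones.
assert depends_on(12, corpus, {2, 3})
```

This is the leverage point for collective claims: a catalog holder can assert dependence over outputs that no single work in the catalog could claim alone.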

The study also addresses practical implementation. Directly testing counterfactual generability would require retraining a model without a specific work and observing whether the output persists. While this brute-force method is often infeasible, emerging research in machine unlearning and influence functions offers potential tools for estimating the marginal contribution of individual data points.
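The brute-force baseline that unlearning and influence-function methods try to approximate can be shown on a deliberately tiny model. The sketch below is not the paper's procedure; it refits a pure-Python least-squares line with each data point held out and measures how far a prediction moves, which is the exact leave-one-out quantity that influence functions estimate cheaply:

```python
def fit(points):
    """Ordinary least squares for y = a*x + b, in pure Python."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = (sum((x - mx) * (y - my) for x, y in points)
         / sum((x - mx) ** 2 for x, _ in points))
    return a, my - a * mx

def predict(model, x):
    a, b = model
    return a * x + b

def loo_influence(points, query_x):
    """Brute-force counterfactual test: refit with each point removed and
    report how much the prediction at `query_x` shifts."""
    base = predict(fit(points), query_x)
    return [abs(predict(fit(points[:i] + points[i + 1:]), query_x) - base)
            for i in range(len(points))]

data = [(0, 0.1), (1, 1.0), (2, 2.1), (3, 2.9), (10, 30.0)]  # last point is an outlier
influence = loo_influence(data, query_x=5)

# The outlier at x=10 moves the prediction far more than any in-line point,
# marking it as the point the output depends on most.
assert influence[4] == max(influence)
```

For a model with billions of parameters, each refit is a full retraining run, which is exactly why the paper points to unlearning and influence-function research as the practical route.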

Innovation patterns determine the future of copyright protection

Next, the paper examines what happens as creative corpora grow large. In an era where AI models are trained on billions of works, the authors ask whether infringement risk will fade over time or remain a persistent concern.

To answer this, they analyze the ratio of permissible outputs to total generable outputs as the corpus expands. The result hinges on the statistical nature of creative innovation.

If new works follow a light-tailed distribution, where extreme breakthroughs are rare and innovation is gradual, the permissible ratio converges to one as the corpus grows. In plain terms, nearly all generable outputs eventually become non-infringing. As more works accumulate, no single work remains essential. Each output can be reached through multiple generative paths.

This scenario describes mature creative domains characterized by incremental innovation. In such markets, the marginal contribution of any individual work diminishes as the corpus expands. Under the authors' framework, copyright regulation would have diminishing bite in the long run.

On the other hand, if creative production follows a heavy-tailed distribution, where rare but transformative breakthroughs occur, the story changes. In heavy-tailed domains, extreme outliers continue to define the frontier of the creative space. Even as the corpus grows, some outputs remain critically dependent on these frontier works. The permissible ratio does not converge to one.

The authors illustrate this with the example of a Pareto distribution, a classic model for superstar effects. In such settings, a new work can dramatically exceed all previous works in scale or impact. Outputs near that frontier depend on the breakthrough work, and infringement risk persists indefinitely.
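The contrast between the two regimes can be simulated with a deliberately crude frontier model (an assumption of this sketch, not the paper's setup): the generator reaches any value up to the corpus maximum, so outputs that survive removal of the frontier work are those below the second-largest value, and second_max / max proxies the permissible share:

```python
import random

random.seed(0)

def permissible_ratio(draws):
    """Toy frontier model: second-largest over largest value as a proxy for
    the share of generable outputs that survive removing the frontier work."""
    second, largest = sorted(draws)[-2:]
    return second / largest

def avg_ratio(sampler, n, trials=300):
    return sum(permissible_ratio([sampler() for _ in range(n)])
               for _ in range(trials)) / trials

light_small = avg_ratio(lambda: random.expovariate(1.0), n=100)
light_big   = avg_ratio(lambda: random.expovariate(1.0), n=2000)
heavy_big   = avg_ratio(lambda: random.paretovariate(1.0), n=2000)

# Light tails (exponential): the permissible share climbs toward 1 as the
# corpus grows. Heavy tails (Pareto, alpha = 1): a superstar work keeps
# dominating the frontier, and the share stays well below 1.
assert light_small < light_big and light_big > 0.8
assert heavy_big < 0.6
```

Under the exponential draws the runner-up nearly catches the frontier as the corpus grows; under Pareto draws the gap between the best work and the rest persists at every corpus size, mirroring the paper's superstar-effect story.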

In fields like avant-garde art or high-concept literary fiction, where breakthrough innovation plays a central role, individual creators may retain meaningful leverage under a dependence-based standard. In more formulaic genres, where new works mostly refine existing patterns, the legal significance of any single work may erode over time.

The study also suggests that creators might strategically position themselves at the frontier of the creative space to preserve essentiality. If creators anticipate how AI systems use training data, they may aim to produce works that are hard to substitute. Such strategic behavior could thicken the tails of the distribution and sustain a larger violation set.

The authors also acknowledge the tradeoff between permissiveness and incentives. If nearly all outputs become non-infringing in large corpora, creators' ex ante incentives may weaken. On the other hand, a large and persistent violation set could limit the social gains from generative AI. The framework provides a foundation for analyzing this balance in future research.

  • FIRST PUBLISHED IN:
  • Devdiscourse