copyright Archives - Digital Content Next (Fri, 06 Feb 2026 23:05:05 +0000)

More reach, less power: copyright in digital markets today
https://digitalcontentnext.org/blog/2026/02/10/more-reach-less-power-copyright-in-digital-markets-today/
Tue, 10 Feb 2026 12:24:00 +0000

The post More reach, less power: copyright in digital markets today appeared first on Digital Content Next.

Digital distribution dramatically expanded reach for creators and publishers, but it also restructured who controls value. Visibility, pricing, audience data, and monetization now sit largely inside platform ecosystems that the creators of content do not own. As a result, growth in access has coincided with declining bargaining power. Publishers reach more people than ever, yet depend on intermediaries that set the terms, capture a disproportionate share of revenue, and restrict direct audience relationships. 

This structural shift sits at the center of today’s sustainability debate across media and culture. It is important to recognize that copyright remains an economic issue, not just a legacy legal one. As platforms centralize distribution and monetization, copyright is one of the few mechanisms that still anchors creator claims to value inside digital markets. Without it, bargaining power erodes further—not because creators fail to innovate, but because market design favors intermediaries over originators.

Axel Springer’s new research-based assessment, Cultural Funding and Financing, examines this imbalance across traditional and digital markets. Rather than treating copyright as a purely legal safeguard, the analysis positions it within broader systems of value creation and capture. It shows how revenue flows, incentives, and negotiating power have shifted as culture has moved online. The findings clarify why access alone is insufficient, and why sustainable digital markets depend on mechanisms that reconnect scale with compensation. 

Digitalization lowers distribution costs and removes many physical barriers to entry. Creators reach global audiences faster and with fewer upfront investments. At the same time, platforms concentrate decision making around visibility, pricing, and revenue sharing. Creators gain reach but lose leverage. 

Most creators and publishers must accept standardized terms to participate in platform ecosystems. Algorithms shape discovery and prioritize engagement metrics. Data access remains limited. Revenue flows follow platform incentives rather than creator value. This pattern reflects market structure, not individual performance or strategy. 

Public debate often frames copyright as a legal or moral issue. This research, however, treats copyright as an economic instrument that shapes incentives, bargaining power, and market outcomes. Copyright systems exist to encourage creation while preserving public access to knowledge and culture.

Digital markets complicate that balance. Reproduction costs approach zero. Content circulates instantly across borders. Platforms mediate nearly every use of creative work. Even so, copyright remains one of the few tools that anchors creator claims within digital markets. Without it, bargaining power erodes further. 

Platform power and value capture 

The research places integrated platforms at the center of the digitalization paradox. These platforms combine content distribution, discovery, advertising, data collection, and payment functions within a single ecosystem. Social, search, and aggregation platforms increasingly act as gatekeepers between creators and audiences, shaping how content circulates and how revenue flows. 

For premium content companies, platform power shapes how value moves through the market. Referral traffic, subscription conversion, licensing opportunities, and brand visibility depend on external rules. Copyright operates inside those constraints, not outside them. Strong rights matter, but market structure determines how those rights function in practice. 

Open access delivers clear public benefits. Audiences gain convenience, choice, and scale. Yet access without compensation undermines long-term production. Journalism and cultural work require sustained investment, not one-time distribution gains. The research highlights this trade-off directly. Sustainable digital markets require mechanisms that reward creation while preserving access. That balance sits at the center of policy debates around copyright, competition, and platform regulation. Each choice reshapes incentives across the ecosystem.

Publishers operate inside this paradox every day. Digital publishing depends on platforms for reach and discovery. It also depends on reliable revenue to support original reporting and production. Understanding value capture clarifies why audience growth alone does not ensure sustainability. It also explains why licensing, attribution, and usage terms remain core business concerns. Copyright debates connect directly to publisher strategy, not just legal theory. 

Markets evolve within policy frameworks. Regulation influences bargaining power, revenue distribution, and long-term viability. The research avoids simple solutions and instead maps trade-offs between openness, incentives, and sustainability. Copyright reform represents one lever among many. Its impact depends on enforcement, market design, and complementary competition policy. For publishers and creators, these choices determine whether digital markets reward creation or merely distribute it.

Digitalization does not inherently require dependence on intermediaries. Yet under current market structures, distribution, monetization, and data have become increasingly concentrated within platforms. That concentration shapes bargaining dynamics, revenue flows, and the ability of creators and publishers to translate reach into sustainable economics.  

Copyright alone cannot resolve these imbalances, but it remains a necessary economic foundation. Without enforceable rights, creators have fewer mechanisms to assert value in markets where platforms coordinate access, pricing, and usage at scale. Sustainable digital markets therefore depend on how copyright operates alongside competition policy and platform governance and on whether these systems preserve incentives for continued investment in original work. 

AI’s “learning” looks a lot like copying
https://digitalcontentnext.org/blog/2026/01/20/ais-learning-looks-a-lot-like-copying/
Tue, 20 Jan 2026 12:27:00 +0000

The post AI’s “learning” looks a lot like copying  appeared first on Digital Content Next.

Remembering the story of a book does not mean memorizing it. Retelling its plot or themes does not violate copyright. Writing it out word for word and distributing copies would present a very different issue. This distinction is an important one when evaluating how AI systems handle copyrighted material. 

Large language models often describe their outputs as summaries or transformations. In practice, some systems can generate long passages of original text on request. This ability raises a fundamental question: Do language models merely understand works, or can they reproduce them verbatim? 

Reconstructing exact text 

A new academic study tests whether modern AI systems can reconstruct copyrighted texts word for word. The researchers focused on verbatim reproduction rather than summaries or paraphrases. They asked whether models could output long, ordered passages from books seen during training. 

The study examines live, public versions of widely used AI services. These include systems from OpenAI, Google, Anthropic, and xAI. By testing production systems, the research reflects how real users interact with these tools.  

Claims and safeguards 

Copyright concerns center on what a system produces. Many AI companies argue that their models do not retain or reproduce training material. They attribute outputs to learned patterns instead of stored text. 

AI companies also describe safeguards meant to prevent verbatim reproduction. These safeguards include refusal rules that block continued generation of copyrighted text, filters that detect long book-length outputs, and training techniques designed to reduce memorization. Companies also cite monitoring systems that limit repeated or extended extraction attempts. 

The researchers test these claims by prompting models with book openings and measuring how much exact text appears in the output. Long, ordered matches indicate memorization rather than coincidence. 
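To make the distinction concrete, here is a minimal, illustrative sketch (not the study’s actual methodology) of how verbatim overlap can be scored: find the longest ordered run of words shared by the source passage and the model’s output. A long run signals memorization; a short one is consistent with paraphrase.

```python
from difflib import SequenceMatcher

def longest_verbatim_run(source: str, output: str) -> int:
    """Length, in words, of the longest contiguous word sequence
    appearing in both the source text and the model output."""
    src_words = source.lower().split()
    out_words = output.lower().split()
    matcher = SequenceMatcher(None, src_words, out_words, autojunk=False)
    match = matcher.find_longest_match(0, len(src_words), 0, len(out_words))
    return match.size

passage = "it was the best of times it was the worst of times"

# A paraphrase shares vocabulary but not ordered text.
print(longest_verbatim_run(passage, "the novel opens by contrasting good times and bad times"))  # → 1

# A verbatim continuation reproduces the passage in sequence.
print(longest_verbatim_run(passage, passage))  # → 12
```

Real evaluations normalize punctuation and measure runs across thousands of words, but the principle is the same: long, ordered matches cannot plausibly arise from “learned patterns” alone.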

AI has faulty guardrails

The study finds significant variation across AI systems. Some models stop after generating short excerpts. Others produce long, ordered passages that closely track the original book text. 

In several tests, models generate thousands of consecutive words that match the source material in sequence. These outputs go beyond brief quotes or common phrases. They reflect extended reproduction rather than incidental similarity. 

The researchers also observe differences in how safeguards perform. In some cases, standard prompts lead to long verbatim output. In other cases, carefully structured prompts weaken refusal behavior and allow further continuation. 

No single model consistently blocks longform extraction across all books and prompt strategies. Existing protections limit reproduction in some scenarios but fail in others. 

When content control slips 

The study replaces abstract claims with documented outcomes. It shows how content can move from training data to verbatim output under real world conditions. Reduced control over reproduction can undermine pricing power and long-term content strategy. 

For publishers, the findings raise structural questions about control and value. If AI systems can reproduce full books, they weaken publisher authority over how and where content reaches readers. That erosion affects licensing negotiations, platform leverage, and content value creation. 

The issue extends beyond substitution risk. Reproducible content weakens confidence in safeguards and complicates rights management strategies. It also challenges assurances that current protections reliably prevent leakage at scale. 

Chamber of Progress asks for blank check for big tech
https://digitalcontentnext.org/blog/2025/11/06/chamber-of-progress-asks-for-blank-check-for-big-tech/
Thu, 06 Nov 2025 12:32:00 +0000

The post Chamber of Progress asks for blank check for big tech appeared first on Digital Content Next.

The Chamber of Progress has asked the Trump Administration to intervene to ensure that all AI training is considered “fair use.” You read that right: They’re asking the White House to declare that every use of copyrighted material to train artificial intelligence systems is lawful, no matter the circumstances.

It’s a radical proposal that would reward the largest technology platforms at the expense, if not demise, of publishers and creators. In the pursuit of unbridled profit, the Chamber is seeking to overturn more than two centuries of copyright law that has served our country well. U.S. copyright protections have long struck a balance between creators’ rights and technological progress, ensuring that those who invest in producing art, journalism, music, film, and literature can be fairly compensated while still allowing for reasonable uses that advance learning and innovation.

Declaring all AI training “fair use” would blow up that balance. It would amount to a government-granted blank check for Silicon Valley’s biggest players to strip-mine the creative economy. And while some of the Chamber’s backers may cheer that result, it’s important to note that not all technology companies share that view. Microsoft, for example, reportedly told publishers just last month, “You deserve to be paid on the quality of your IP.” (Note that Microsoft apparently chooses not to be associated with The Chamber of Progress.)

So, let’s take a moment to refresh our collective memory about what fair use actually is, as well as what it is not.

How fair use works

The concept of fair use is baked into U.S. copyright law. It provides limited exceptions for certain uses of copyrighted works without permission for purposes like criticism, commentary, news reporting, teaching, or scholarship. But whether a use is “fair” depends on a careful, case-by-case balancing test.

The law identifies four factors:

  1. The purpose and character of the use, including whether the use is commercial or nonprofit and whether it’s transformative, which means it adds new meaning, message, or purpose.
  2. The nature of the copyrighted work, which recognizes that creative works enjoy stronger protection than purely factual material.
  3. The amount and substantiality of the portion used, meaning both how much is taken and how significant that material is to the original work.
  4. The effect of the use on the potential market for the original, often referred to as the most critical factor. It is key for publishers because it asks whether the new use substitutes for or diminishes the market value of the original work, and whether the new use would hinder emerging markets for the original work (e.g., licensing).

Each factor must be weighed. Fair use was designed to be flexible, not absolute, and it should be wielded like a surgical tool, not a sledgehammer.

What the courts are saying

Courts are still working through the major questions of how copyright law should apply. But, in the two most recent cases, judges ruled, for different reasons, that AI models were likely developed unlawfully. In Bartz v. Anthropic, Judge Alsup held that training AI systems using lawfully acquired books could be “spectacularly transformative,” comparing it to “training schoolchildren to write.” It’s worth noting that this case concerned books and not content like news articles where the potential for substitution is much greater. But even in that decision, he drew a bright line against using pirated or illegally obtained material, saying that would not qualify as fair use.

In the same district in Kadrey v. Meta, Judge Chhabria took a very different view barely 24 hours later. While he ultimately ruled for Meta, it was only because the plaintiffs couldn’t yet show actual market harm. Importantly, the court rejected Alsup’s “schoolchildren” analogy, calling it “inapt,” and acknowledged that generative AI poses a qualitatively different threat to human authorship, particularly because it can flood the market with AI-generated substitutes for real creative work. His decision suggested that proving tangible market harm is key to overcoming the fair use defense.

Together, these early cases show that the courts are highly skeptical of AI companies’ legal claims and that fair use in the AI era is anything but a settled question.

Stronger cases ahead

The cases now moving through the courts could reshape the entire landscape. The New York Times v. OpenAI is poised to be the most consequential yet. The Times alleges that OpenAI violated its terms of use, copied and reproduced its journalism without permission, and even regurgitated near-verbatim passages from Times’ stories in its outputs.  The judge largely denied a partial motion to dismiss in March 2025.

Similar suits from Disney, NBC Universal, Warner Bros. Discovery, and others allege that AI systems like Midjourney and Minimax have infringed on copyrighted characters and images, using them as raw material to generate new (and often derivative) outputs. These cases go beyond questions of data ingestion and look squarely at what the machines produce. When AI outputs contain or imitate protected creative expression, or produce outputs that can substitute for the original works, the argument that “training” is obviously a fair use becomes untenable.

That’s what makes these lawsuits so strong: they don’t rely on abstract theories about future market harm. They show the receipts by offering specific examples of copyrighted material appearing in AI-generated outputs or showing that outputs are otherwise substitutive, clear evidence that these tools are not merely “learning” but supplanting protected works.

Why the Chamber of Progress is panicking

Which brings us back to the Chamber of Progress and their remarkable plea for a government blank check. If the law were really on their side, they wouldn’t need the President to intervene. The truth is, they’re nervous. And they should be.

The Chamber represents the largest AI and tech firms in America, companies valued in the trillions of dollars, and those companies want to maintain margins and multiples no matter the cost to other historically and highly valuable segments of our economy. If courts continue to recognize that AI training and outputs can infringe on copyrighted works, Big Tech will have to negotiate more licenses and continue paying creators. And, by the way, more licensing agreements could actually prove helpful to AI systems by ensuring their products have reliable access to accurate, fact-checked content. However, no matter how The Chamber tries to spin it, that’s not “anti-innovation.” That’s accountability.

The Chamber’s proposed outcome would obliterate that accountability, retroactively blessing a decade of mass data scraping and granting legal immunity to the industry for whatever it does next. It’s an act of desperation masquerading as policy. It’s impunity masquerading as progress.

A final word

For two centuries, copyright law has powered one of the most dynamic creative economies in the world. It protects authors, journalists, musicians, filmmakers, and artists while still allowing room for innovation. The Chamber of Progress’s proposal would dismantle that legacy overnight, transforming fair use from a balanced doctrine into a blanket permission slip.

As these cases move forward, the courts are doing their job: weighing evidence, applying the law, and adapting old principles to new technology. That’s how progress is supposed to work in a democracy governed by the rule of law.

The Chamber may sense the writing on the wall. The creative industries are organized, the evidence is mounting, and the courts are increasingly skeptical of AI’s “just-learning” defense. That’s why they’re now seeking the Administration’s help to tilt the landscape in their favor.

Throughout history, new technology has tested the limits of copyright, from photocopiers to radio, television, and the internet. But the courts have a long track record of determining how emerging tools fit within existing law. Innovation and creativity thrive together only when both are respected. Protecting the rights of those who produce original work ensures that progress benefits everyone. And that’s more than fair enough.

Time’s up for platform privilege
https://digitalcontentnext.org/blog/2025/06/26/times-up-for-platform-privilege/
Thu, 26 Jun 2025 11:36:00 +0000

The post Time’s up for platform privilege appeared first on Digital Content Next.

Just as a leopard doesn’t change its spots, Google and Meta haven’t changed their ways. Despite mounting legal threats and public backlash, both big tech platforms continue to behave as if rules don’t apply to them.

New evidence has emerged to underscore that Google’s original unofficial motto of “Don’t be evil” was never really their true North Star. Instead, it was a smokescreen for big tech’s naked ambitions. Meta’s early motto—“Move fast and break things”—may have been more honest, but the honesty makes it even more damning. As it turns out, the broken things weren’t just outdated norms or sluggish competitors. They were the foundations of fair competition, user privacy, democratic discourse, and now, copyright law. The damage isn’t merely collateral; it is strategic.

Big tech’s anticompetitive behavior enters its AI era

Now we’re seeing a similar pattern unfold with generative AI. In Kadrey v. Meta, evidence unsealed early this year suggests Meta execs, including Mark Zuckerberg, chose to pirate copyrighted content to train its LLaMA AI model. It was revealed that Meta initially explored licensing but opted instead to download pirated content via BitTorrent from LibGen under the reasoning that doing things the legal way would take too much time.

Worse, the company allegedly stripped copyright management info from the files to cover its tracks. Clearly, they’re following the motto of moving fast and breaking things. This time around, they seem intent on breaking copyright law. Given Meta’s long track record, I’m not sure what is most surprising: the planning of such a sophisticated heist or the ham-handed cover up. Either way, they graciously documented it all in email.

Meanwhile, over in Mountain View, Google has once again leveraged its search dominance to take traffic and revenue from publishers. In May, Google launched AI Mode, which scrapes and summarizes publishers’ original content to give users the answer without needing to click through, stripping away the publisher’s incentive to create that content in the first place.

In a bit of stunning bravado, Google rolled out AI Mode just 48 hours before closing arguments in the remedies phase of the Google Search trial, where the evidence clearly shows that Google abused its market power in search to maintain its significant advantages in crawling, clicks, and query data, all of which are paramount in the AI era. Google claims publishers can opt out. However, they can only do so by removing themselves from search entirely, which is no choice at all when it involves a company that handles more than 95% of mobile queries. Google’s unauthorized use of copyrighted content to create a substitutive product has, to no one’s surprise, led to a massive downturn in traffic to publisher sites.

Simultaneously, Google announced that Gemini will soon be on by default for consumers, collecting data about their activities. This is an oft-used strategy by Google: tune the defaults to maximum data collection, knowing full well that consumers won’t know or take the time to shut them off.

The courts push back

However, despite big tech’s brazen and predictable pattern of brutish behavior, the legal system may be starting to catch up with the platforms’ anticompetitive tactics. Google has been found guilty of violating antitrust law in both the search and ad tech markets. And at least in the search case, the Court has been very focused on ensuring AI is a competitive marketplace rather than the fruit of more Google abuses. In addition, we’re starting to get additional clarity on how copyright law applies in this new digital age of AI.

In Thomson Reuters v. Ross Intelligence, U.S. District Court Judge Stephanos Bibas ruled that Ross infringed copyright by using Westlaw’s headnotes to train an AI competitor, despite Ross’ claims of fair use. Initially, Ross reached out to Thomson Reuters to license the content but ultimately opted to acquire the Westlaw content from a third party, LegalEase (which sounds eerily similar to Kadrey v. Meta).

Judge Bibas rejected all of Ross’ defenses, stating that innocent infringement, copyright misuse, merger defense, scenes à faire defense, and fair use did not apply. On fair use, Judge Bibas eloquently analyzed the four established factors: the use’s purpose and character; the copyrighted work’s nature; how much of the work was used and how substantial a part it was relative to the copyrighted work’s whole; and how Ross’s use affected the copyrighted work’s value or potential market.

On the fourth factor, Judge Bibas found that Ross “meant to compete with Westlaw by developing a market substitute.” He wrote that this factor is “undoubtedly the single most important element of fair use.” That seems like an important ruling in light of the way Google’s AI Mode trains on and serves as a substitute for publisher’s original content.

In April, U.S. District Court Judge Sidney Stein rejected OpenAI and Microsoft’s motion to dismiss, thereby allowing all of the copyright and trademark dilution claims from The New York Times’ suit to proceed. While the bar is admittedly lower for a motion to dismiss, Judge Stein noted “that plaintiffs have plausibly alleged the existence of third-party end-user infringement and that defendants knew or had reason to know of that infringement.”

Then, in May, the U.S. Copyright Office released a report on AI training and fair use. It concluded that using massive troves of copyrighted content to generate commercial AI outputs likely fails fair use, especially when done through illegal means. The report also notes that “effective licensing options can ensure that innovation continues to advance without undermining intellectual property rights.” The Copyright Office rightly recognized that creative works are not mere “data” to be harvested, but expressions of human authorship protected by the Constitution and enshrined in U.S. copyright law.

From slogans to standards

So, what does this mean? For one, courts are rejecting the Silicon Valley myth that fair use lets AI companies take whatever they want. Licensing isn’t just viable, it’s required. Congress should pay attention.

Although there will inevitably be bumps along the road as fair use analysis is unique to each case, these rulings act as a compass to where things are headed. They send important signals to big tech companies with a history of anticompetitive behavior: don’t be evil or you may be held liable. The old playbook—take first, ask questions never—isn’t going to work in this new AI era. It’s time for a better North Star: accountability, transparency, and fair competition.

When it comes to AI negotiations, publishers are stronger together
https://digitalcontentnext.org/blog/2025/04/24/when-it-comes-to-ai-negotiations-publishers-are-stronger-together/
Thu, 24 Apr 2025 12:42:00 +0000

The post When it comes to AI negotiations, publishers are stronger together appeared first on Digital Content Next.

By now, we are all painfully familiar with the way AI systems are reshaping how audiences discover and consume information—often at publishers’ expense. These powerful technologies reuse publishers’ content, usually without permission or fair compensation, placing growing pressure on publisher revenue and content control. Premium content creators of all sizes face identical risks as AI companies increasingly set the rules.

However, media companies are far from powerless. By taking collective action, the media industry can assert control over how our content is used and ensure our voices are central to shaping AI policies. Several practical pathways exist, including regulatory advocacy, strategic litigation, licensing agreements, and technological measures. The key is that we must work together. 

Regulation/policy: defining the rules for AI

Enforceable regulation represents a clear line of defense against unauthorized use of content. Currently, ambiguity around “fair use” allows AI companies significant leeway. OpenAI’s CEO, Sam Altman, recently acknowledged this plainly, admitting that restrictions on AI scraping copyrighted material would threaten his company’s existence.

Altman’s candid admission underscores exactly why publishers must engage policymakers immediately. President Donald Trump’s recent executive order, “Removing Barriers to American Leadership in Artificial Intelligence,” explicitly seeks to minimize regulations that might hinder AI companies from pursuing their current path. OpenAI and Google have seized this opportunity to advocate aggressively for fewer copyright restrictions, claiming tight regulations threaten American AI dominance in a geopolitical race with China. Help will not come at the federal level anytime soon. 

Several state legislatures are actively addressing AI’s impact on copyright, notably California’s AI Copyright Transparency Act (AB 412) and New York’s Artificial Intelligence Training Data Transparency Act (S6955), both of which mandate transparency from AI developers about copyrighted materials used in training models. These initiatives indicate state-level momentum and promise to set precedents that other states will follow. That said, the most immediate forum for action is likely in the courts. 

Strategic litigation: establishing precedent

Legal action is the most promising line of defense and has already proven effective. Recent cases, notably Thomson Reuters v. ROSS Intelligence, represent critical opportunities to establish binding precedents around copyright and AI that can level the playing field.

In February 2025, the U.S. District Court for Delaware ruled decisively in favor of Thomson Reuters, determining that ROSS’s unauthorized use of copyrighted content to train its competing AI product was not protected by fair use. This is a big win for every publisher because it clarifies what has historically been a vague and uncertain doctrine. Making it stick will require a broader chorus of legal wins, but it’s a start. 

Recognizing these stakes, publishers are increasingly acting together in the courts. One example is a joint lawsuit from 14 major media organizations—including Condé Nast, Forbes, and The Atlantic—against AI startup Cohere. Similarly, litigation initiated by The New York Times against OpenAI and Microsoft has been consolidated with cases from the Daily News and the Center for Investigative Reporting, forming the beginnings of a unified front of defense.

The outcomes of these collective efforts matter profoundly. While individual settlements might resolve immediate conflicts, only definitive court rulings can deliver lasting protections. Publishers at every scale share a vital common interest in supporting cases that reinforce strong, enforceable copyright standards for everyone. Everyone. 

AI licensing negotiations: balancing opportunity and equity

Licensing agreements offer publishers another critical tool to monetize their content and control AI usage. These deals can deliver revenue and clearly define permissible AI applications, and we’ve seen a string of them recently. Yet licensing strategies carry risks: agreements negotiated by major publishers could inadvertently create a market divided between haves and have-nots. It’s also unclear whether any of these deals will have long-term value, as the damage done to publishers will likely be far greater than the relatively small payments they provide.

Smaller publishers risk marginalization if AI licensing standards and terms are set exclusively by larger publishers. Collective approaches that define fair, equitable standards can help ensure licensing agreements work for the entire publishing ecosystem rather than fragmenting it.

Technological barriers: limitations of blocking AI crawlers

Technological measures, such as blocking AI crawlers from publisher sites, are another avenue. It’s worth pursuing, but we should not look at this as a long-term strategy. AI companies regularly evolve their technologies, circumventing technical barriers almost as quickly as they emerge.

While publishers can (and should) employ these measures strategically, lasting protection depends more heavily on clear regulatory policies, decisive court precedents, and equitable licensing agreements.

Making collaboration count

General calls for industry collaboration frequently fall short, offering little beyond vague ideals. Yet the AI challenge distinctly highlights how all publishers, regardless of size, share identical interests. Whether an independent blogger, a small-town newspaper, or a global publisher, AI-driven content reuse affects everyone similarly. AI does not care how big you are. 

We’ve already observed direct negative impacts on publisher traffic from AI-powered overview summaries in search results. These early signs are merely the beginning. The entire digital landscape—search behaviors, traffic patterns, and monetization structures—is changing fundamentally with AI, and fast. 

Publishers need support to run a sustainable business. This has compelled Raptive to advocate on the AI issue precisely because we recognize it is existential to the viability of independent publishing—and the power of strength in numbers. We have invited publishers with whom we work to sign a new agreement that lets us represent their interests in conversations with tech platforms around AI negotiations.

All premium content creators—those supplying the original, authentic content powering the internet—share a truly common interest. Now is the moment to advocate for it; we’ll be stronger if we do it together.

The post When it comes to AI negotiations, publishers are stronger together appeared first on Digital Content Next.

Copyright and AI: a win win https://digitalcontentnext.org/blog/2025/03/20/copyright-and-ai-a-win-win/ Thu, 20 Mar 2025 11:18:00 +0000

In terms of public policy debates, Artificial Intelligence continues to be the belle of the ball, with nearly every major government courting the industry to locate its investments and jobs within their jurisdictions. Europe, China, Korea, and the U.S. (among others) have laid out competing tax and government spending plans to entice and encourage AI companies. Against this backdrop of AI frenzy, President Donald Trump, via the Office of Science and Technology Policy, has solicited input on the formation of an “AI Action Plan” in order to “define the priority policy actions needed to sustain and enhance America’s AI dominance.”

Unsurprisingly and unabashedly, tech companies advocate that the U.S. government allow their content-generating AI models to train on copyrighted material without consent or compensation. However, as DCN noted in our comments regarding the action plan, a key component to achieving the stated goal of enhancing America’s AI dominance – and the broader success of American businesses – is the robust protection and enforcement of U.S. intellectual property law including the Copyright Act.

The longstanding legal rights of copyright holders derive from the U.S. Constitution (Article I, section 8, clause 8). These rights afford creators the opportunity to monetize the results of their hard work and investment in a variety of ways, and incentivize them to reinvest in creating additional content and new, innovative ways of delivering it to consumers. As a result of these longstanding rights, American content creators, including news organizations and other publishers, contribute significantly to U.S. economic growth through employment, exports, trade surpluses, and digital services and goods.

According to a recent study, copyright-based industries accounted for 12.31% of the U.S. economy and 63.13% of the U.S. digital economy. From 2020 to 2023, these industries outpaced U.S. economic growth almost threefold. Copyright-based industries employ 56.6% of all digital-sector workers, and the annual compensation paid to core copyright workers is approximately 50% higher than the average U.S. annual wage. As for global impact, sales of select U.S. copyrighted products in overseas markets amounted to $272.6 billion, exceeding the sales of other IP industries including pharmaceuticals, agriculture, and aerospace.

Unfortunately, the manner in which many AI developers have exploited original content without consent or compensation – to build and operationalize their commercial products – has unjustifiably violated the rights of copyright holders. It has upended the existing balance which has historically sustained and promoted innovation.

AI developers use copyright-protected content not only to “teach” their models to predict and mimic language, but also as a means to create compelling outputs that carry the compounding harm of substituting for the original works on which the models were trained. This activity unfairly competes with those who invested in the creation of the original material and undermines their ability to seek a fair economic return. In fact, U.S. Senior District Judge Beryl Howell noted earlier this week, in a copyright case in which fair use was argued, that the publisher’s content is “so valuable they put a copyright on it.” Exactly.

By “reaping that which they do not sow,” AI companies cause harm to creators, publishers, and the ecosystem as a whole. It is important that this form of destructive misappropriation be deterred, whether by copyright law or other appropriate means. In the U.S., there are 39 related lawsuits and counting. The outcome of these suits will provide much-needed clarity regarding the application of existing copyright law, including the fact-specific defense of fair use, to the use of copyrighted works in developing generative AI technology.

However, one U.S. District Court recently confirmed that licensing is required for the use of copyrighted content to train an AI system. In Thomson Reuters Enter. Ctr. GmbH v. Ross Intel. Inc., the court, applying clear and recent precedent from the U.S. Supreme Court, held that the defendant’s unauthorized use of the plaintiff’s works to train the defendant’s AI system was direct infringement and did not constitute fair use. The court reaffirmed that the impact of the use on existing and potential markets is the single most important element of a fair use analysis, and that there was clearly a potential market to use the materials at issue in the case to train AI.

Lest the VC crowd be dismayed, a licensing framework is emerging as many deals have been struck by publishers, record labels, motion picture industries, and others. OpenAI, Google, and Perplexity have all made efforts to pay for the right to use protected content to power their models and tools. This is a clear acknowledgment that this model is not only necessary, but eminently feasible.

While publishers’ rights are coming into clearer focus in the U.S., AI companies are beginning to feel a shared pain, as evidenced recently by DeepSeek’s R1 model. OpenAI accused the company of IP theft, claiming that DeepSeek may have used OpenAI’s IP and violated its terms of service to develop its AI model.

“We know PRC (China) based companies – and others – are constantly trying to distill the models of leading US AI companies,” OpenAI said in a statement to Bloomberg. “As the leading builder of AI, we engage in countermeasures to protect our IP, including a careful process for which frontier capabilities to include in released models, and believe as we go forward that it is critically important that we are working closely with the US government to best protect the most capable models from efforts by adversaries and competitors to take US technology.”

A rising tide can lift all boats. Maintaining existing copyright protections is the only way to ensure a robust, free market in which creators are incentivized to make high-quality works and AI companies are incentivized to license them. Importantly, in this robust market, AI companies would continue to have access to quality content, which is critical for training and outputs. The American values of IP protection have been a cornerstone of our country’s innovative spirit and competitive edge over foreign adversaries. Protecting IP is a matter of preserving the core principles that distinguish American businesses in the global market. Throughout U.S. history, copyright and innovation have gone hand in hand, and there is no reason to deviate from that successful combination as we build the next chapter.


Read DCN’s Comments on the AI Action Plan, which were filed with the Office of Science and Technology Policy on March 15, 2025.

The post Copyright and AI: a win win appeared first on Digital Content Next.

The AI reckoning for publishers and platforms https://digitalcontentnext.org/blog/2025/02/27/the-ai-reckoning-for-publishers-and-platforms/ Thu, 27 Feb 2025 12:12:00 +0000

The publishing industry has been of two minds on AI’s rapid advancements – optimistic and cautious – sometimes within the same company walls. Business development teams explore much-needed new revenue opportunities while legal teams work to protect their art and existing rights. However, two major legal developments, the Thomson Reuters v. Ross Intelligence ruling and shocking new revelations in Kadrey v. Meta, expose the fault lines in AI’s unchecked expansion and set the stage for publishers to negotiate fair value for their investments. 

One case confirms that publishers have a right to license their content for AI training, and that tech advocates’ tortured analysis of fair use doesn’t override rights enshrined in the U.S. Constitution or require publishers to opt in to retain them. The other case suggests that Meta may have knowingly pirated books in its high-stakes race to keep up with OpenAI, and that Meta’s notorious growth-at-all-costs playbook is more exposed than ever. 

AI companies can no longer operate in a legal gray zone, scraping content as if laws don’t apply to them. Courts, lawmakers, researchers and the public are taking notice. For publishers, the priority is clear: AI must respect copyright from the beginning including for training purposes, and the media industry must ensure it plays an active role in shaping AI’s future rather than being exploited by it. 

Thomson Reuters v. Ross: A win for AI licensing, a loss for those who intentionally avoid it 

In a landmark decision, a federal judge ruled this month in favor of Thomson Reuters against Ross Intelligence, a startup that trained its AI model, without rights or permission, on Thomson Reuters’ Westlaw legal database. 

Judge Stephanos Bibas’ ruling in the Delaware district court is notable because he explicitly recognized the emerging market for licensing AI training data. This undercuts the argument that AI developers can freely use copyrighted works under the “fair use” factors. And, consistent with the analysis of DCN’s policy team, it highlights the importance of the fourth factor of fair use, which publishers have been demonstrating with the signing of each new licensing deal.  

For publishers, this is a crucial precedent for two reasons: 

  • AI training is not automatically fair use. Content owners have the right to be paid when their work is being used to train AI.   
  • A market for AI licensing is forming – this is the fourth factor. Publishers should define and monetize it before platforms dictate the terms.   

This decision marks a turning point, ensuring that AI development doesn’t come at the expense of the people and companies producing high-quality content.  Sam Altman of OpenAI, and other leadership across the powerful AI industry, have attempted to invent a “right to learn” for their machines. That’s an absurd argument on its face but regularly repeated in high-profile interviews, as if the technocrats might will it into reality. 

Kadrey v. Meta: Pirated Books, torrenting, and a familiar playbook 

While the Reuters ruling validates AI licensing, Kadrey v. Meta reveals how some AI developers have worked to avoid it. 

Recently unsealed court documents suggest that Meta employees knowingly pirated books to train its LLaMA AI models, including its first commercial version (LLaMA 2). Significantly, Meta’s fair use analysis shifted from “research” to making bank – a lot of it. 

The unsealed evidence demonstrates this knowing, strategic shift: 

  • Meta employees downloaded pirated book datasets from LibGen, a massive repository of pirated works, with employees even using torrenting technology to pull them down.
  • They may have “seeded” and distributed this pirated content to others – a potential criminal violation that employees themselves seemed to recognize, with one asking, “What is the probability of getting arrested for using torrents in the USA?”
  • Meta worried that licensing even one book would weaken its fair use argument, so it didn’t license any at all.
  • Some employees explicitly avoided normal approval processes to keep leadership from having to formally sign off.
  • Some documents suggest Mark Zuckerberg himself may have been aware of these tactics, with references to escalations to “MZ.”
  • Meta appears to have stopped using this material ahead of LLaMA 3, possibly signaling awareness that its actions were legally indefensible.

Making matters worse, Meta’s case is being overseen by Judge Vincent Chhabria in the Northern District of California. This is the same judge who sanctioned Facebook’s lawyers in its massive privacy litigation, which led to record-breaking settlements approaching $6 billion with the FTC, SEC, and private plaintiffs. In that case, Facebook was accused of stalling, misleading regulators, and withholding evidence related to its user data practices. In other words, Judge Chhabria knows Meta’s playbook: delay, deny, deflect.   

Now, Meta faces a crime-fraud doctrine claim, meaning that currently sealed legal advice could be unsealed if it was given in furtherance of a crime. If proven, this would not be a simple copyright dispute; it could lead to criminal liability and further regulatory scrutiny. The court has ordered Meta to unseal more documents this week. 

Move fast, break things… again: Meta’s AI strategy mirrors its past scandals 

The Kadrey case’s revelations closely resemble Meta’s past data controversies, particularly those lumped into the basket of Cambridge Analytica. Details of the cover-up of that scandal are still emerging today. Unfortunately, they were mostly overlooked by a tech press corps that has been tuned out of these issues for far too long.  

For years, Facebook pursued a strategy of aggressive data harvesting to accelerate its growth in mobile, where it risked being supplanted by new platforms. The company:   

  1. Scraped vast amounts of publisher and user data without clear consent.
  2. Shared this data widely with developers in exchange for reciprocal access to their user data – fueling Facebook’s mobile market share grab.
  3. Ultimately settled with regulators for billions after repeated privacy violations.

Now, in Kadrey v. Meta, history appears to be repeating itself. Internal documents show that Meta feared OpenAI and needed to accelerate its AI development, and thus felt pressured to take outsized risks. Meta’s approach to AI training follows a similar pattern:   

  1. Acquire the best data – legally or not.
  2. Use it to gain an edge over AI competitors.
  3. Deal with legal and regulatory fallout later, if necessary.

Recently unsealed documents even expose a documented mitigation strategy:

  1. Remove data clearly marked as pirated – but only if “pirated” appeared in the filename, even as engineers were permitted to strip copyright information out of the actual content.
  2. Don’t let anyone know which datasets they’re using (including illegal ones).
  3. Do whatever possible to suppress prompts that spit out IP violations.

Key takeaways for publishers and media companies 

The Thomson Reuters and Kadrey cases demonstrate both the risks and the opportunities for publishers in the AI era. Courts are starting to push back on AI’s unlicensed use of copyrighted content. But it’s up to the publishing industry to define what comes next.   

Here are the big issues we must address: 

  1. AI models need high-quality data, and publishers must ensure they’re compensated for it. The Reuters ruling proves that a growing licensing market for AI exists.
  2. Litigation is working. The unsealed evidence in the Kadrey case suggests that even AI giants like Meta know they’ve crossed legal lines. Facebook isn’t dumb; evidence from peer companies may be even more damaging. A pluralistic press needs to shine a light on these wrongs, as national security isn’t an excuse for AI companies to break copyright law.
  3. Publishers must be proactive in shaping AI policy. Big Tech will push its own narrative. Meta and Google pay front groups like Chamber of Progress to stretch the meaning of fair use both in the U.S. and across the pond. Media companies must work together to establish AI licensing frameworks and legal protections and to reinforce existing copyright law.
  4. Regulatory scrutiny on AI will intensify. If Meta is found to have used pirated data, it will accelerate AI regulations. This will likely not be confined to copyright but could extend across tech policy, as it did in 2018, when one scandal exposed larger problems and Facebook was dragged before parliaments around the globe.

The future of AI depends on trust, ethics and media leadership 

The past year has shown that AI is both a disruptor and an opportunity. The Reuters ruling confirmed publishers can and should demand licensing deals. The Meta revelations prove why that’s so necessary.   

AI is reshaping media, but it must be built ethically. The publishing industry has both the legal and ethical high ground. And media companies must use it to define the next phase of AI’s evolution. The future of AI isn’t just about innovation. It’s about who controls the data and the IP – and whether the people who create it are respected or exploited. 

The post The AI reckoning for publishers and platforms    appeared first on Digital Content Next.

The free speech era: tech policy in the Trump administration https://digitalcontentnext.org/blog/2025/02/10/the-free-speech-era-tech-policy-in-the-trump-administration/ Mon, 10 Feb 2025 13:59:03 +0000

As frigid temperatures drove the Presidential inauguration indoors for the first time in 40 years, President Donald Trump once again took office surrounded by a who’s who of U.S. technology leaders. Even amidst the pomp and circumstance of this quadrennial event, few elements were as discussed and scrutinized as the presence of Big Tech’s most prominent leaders.

And yet, their presence and prominence should be of little surprise, given the level of scrutiny companies such as Meta, Google, Apple, and Amazon are facing from governments across the world, and how much they stand to gain or lose from developing close relationships with political leaders.

These companies, likely because of this scrutiny and the opportunities granted by a change in administration, are in the midst of significant policy restructurings aimed at better positioning themselves for the political realities of a second Trump administration. Most notably, Meta CEO Mark Zuckerberg announced days before the inauguration that the company would “return” to a “fundamental commitment to free expression” by ending its third-party fact-checking program and moving toward a community notes model.

Free speech will undoubtedly be the cornerstone of technology policy in the second Trump Administration, and companies are certainly wise to publicly align themselves with this ideal. For media executives and content creators, this means ensuring that advocacy narratives both address the administration’s free speech concerns and counter the free speech arguments promoted by Big Tech.

Executive action

On January 23, President Trump signed an executive order stating that the country must “develop AI systems that are free from ideological bias or engineered social agendas.” The President also stated that this executive order “clears a path for the United States to act decisively to retain leadership in AI, rooted in free speech and human flourishing.”

Although these actions are clearly focused on affirming the administration’s ideological stances and distancing it from its predecessor, the Biden administration was the first to use AI policy to advance broader policy positions. As the American public gained awareness of the promise and peril of AI in 2022, then-President Joe Biden unveiled an AI “Bill of Rights,” which included protection from discrimination and bias as one of its key principles. Such language carried over to the now-revoked Biden-era AI executive order.

Given the prominence of AI in the national discourse, it is logical that President Biden, and now President Trump, would utilize AI policy to assert ideological stances. While AI policy during the Biden administration focused on combating AI-enabled discrimination against protected classes, the Trump administration is focused on protecting free speech and combating censorship.

Policy and power players

In the same way that Meta and Google have embraced policies that adjust to political realities, AI companies have begun to do the same. On January 13, OpenAI published its “Economic Blueprint.” The policy document is meant to present a framework that champions the “individual freedoms at the heart of the American innovation ecosystem.”

One of these freedoms, according to OpenAI, is that of “ensuring that AI has the ability to learn from universal, publicly available information, just like humans do.” This purported freedom should raise alarm bells for publishers and content creators. As the media industry continues to fight for the protection of its intellectual property, AI companies have set out to create a “fair use” argument for their past and future transgressions. Their focus on “individual freedoms” is a shrewd approach, and one that is aimed directly at the interests of the administration and its Congressional allies.

Fair use and fair licensing deals

2025 will be a decisive year for AI policy, one that will see the most significant movement yet toward substantial industry regulation. If publishers and content creators are to successfully compete with the policy narratives of major tech companies and AI companies, then they too must embrace a narrative that highlights the threats to free speech and freedom posed by unregulated technologies that can misrepresent or censor political or personal viewpoints.

For example, if publishers and content creators can successfully argue that fair licensing deals will allow AI companies to license unbiased or ideologically diverse content libraries, and thus promote diverse viewpoints and avoid ideological bias, the policies they are advocating for will become more politically salient and will have a higher likelihood of capturing the attention of policymakers and the general public. This new administration is eager to make its mark on AI policy. Publishers and content creators must speak to the free speech priorities that now dominate political discourse, or risk being drowned out by more powerful voices as much-anticipated legislation is finally set in motion.

The post The free speech era: tech policy in the Trump administration appeared first on Digital Content Next.

Publishers need a new robots.txt for the AI era https://digitalcontentnext.org/blog/2024/10/24/publishers-need-a-new-robots-txt-for-the-ai-era/ Thu, 24 Oct 2024 11:04:00 +0000

While in some ways the web has evolved organically, it also functions within accepted structures and guidelines that have allowed websites to operate smoothly and to enable discovery online. One such protocol is robots.txt, which emerged in the mid-1990s to give webmasters some control over which web spiders could visit their sites. A robots.txt file is a plain text document placed in the root directory of a website. It contains instructions for search engine bots on which pages to crawl and which to ignore. Significantly, compliance with its directives is voluntary. Google has long followed and endorsed this voluntary approach. And no publisher has dared to exclude Google, considering its 90%+ share of the search market.
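The voluntary nature of the protocol is visible in how it is consumed: a well-behaved crawler fetches the file, parses it, and checks each URL before requesting it, but nothing enforces the answer. Here is a minimal sketch using Python’s standard-library `urllib.robotparser`; the rules and bot names are purely illustrative:

```python
from urllib import robotparser

# An illustrative robots.txt: disallow one AI crawler, allow everyone else.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A compliant bot checks before fetching; nothing technically forces it to.
print(rp.can_fetch("GPTBot", "https://example.com/article"))    # False
print(rp.can_fetch("SearchBot", "https://example.com/article")) # True
```

The check is purely advisory: a crawler that simply skips the `can_fetch` call faces no technical barrier, which is exactly the enforcement gap at issue.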

Today, a variety of companies use bots to crawl and scrape content from websites. Historically, content has been scraped for relatively benign purposes such as non-commercial research and search indexing, which promises the benefit of driving audiences to a site. In recent years, however, previously benign and new crawlers have begun scraping content for commercial purposes such as training Large Language Models (LLMs), use in Generative Artificial Intelligence (GAI) tools, and inclusion in retrieval augmented generation outputs (aka “grounding”).

Under current internet standards such as the robots.txt protocol, publishers can only block or allow crawlers by domain. Publishers are not able to communicate case-by-case (company, bot and purpose) exceptions in accordance with their terms of use in a machine-readable format. And again: compliance with the protocol is entirely voluntary. The Internet Architecture Board (IAB) held a workshop in September on whether and how to update the robots.txt protocol and it appears the Internet Engineering Task Force (IETF), which is responsible for the protocol, plans to convene more discussions on how best to move forward.

A significant problem is that scraping happens without notification to or consent from the content owners. It often breaches a website’s terms of use, sometimes in blatant violation of applicable laws. OpenAI and Google recognized this imbalance when they each developed differing controls (utilizing the robots.txt framework) for publishers to opt out of having their content used for certain purposes.

Predictably, however, these controls don’t fully empower publishers. For example, Google will allow a publisher to opt out of training for their AI services. However, if a publisher wants to prevent their work from being used in Generative AI Search—which allows Google to redeploy and monetize the content—they have to opt out of search entirely. It would be immensely useful to have an updated robots.txt protocol to provide more granular controls for publishers in light of the massive scraping operations of AI companies.
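In practice, the opt-out controls published so far are expressed as ordinary robots.txt user-agent groups. For example, a publisher who wanted to block OpenAI’s training crawler outright and opt out of Google’s AI training (via the Google-Extended token, which governs training use but not search indexing) could publish something like this illustrative file:

```text
# Block OpenAI’s training crawler from the entire site
User-agent: GPTBot
Disallow: /

# Opt out of Google AI training; Googlebot (search) is unaffected
User-agent: Google-Extended
Disallow: /
```

Note that this remains an all-or-nothing, per-bot signal: there is no standard way to tell a single crawler that serves both search and generative AI “index me, but do not train on me.”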

While big tech companies tout the benefits of AI, much of the content crawled and scraped by bots is protected under copyright law, or other laws which are intended to enable publishers and other businesses to protect their investments against misappropriation and theft.

Copyright holders have the exclusive right to reproduce, distribute and monetize their copyrighted works as they see fit for a defined period. These protections incentivize the creative industries by allowing them to reap the fruits of their labors and enable them to reinvest into new content creation. The benefits to our society are nearly impossible to quantify as the varied kinds of copyrighted material enrich our lives daily: music, literature, film and television, visual art, journalism, and other original works provide inspiration, education, and personal and societal transformation. The Founding Fathers included copyright in the Constitution (Article I, section 8, clause 8) because they recognized the value of incentivizing the creation of original works.

In addition to copyright, publishers also rely on contractual protections contained in their terms of service which govern how the content on their websites may be accessed and exploited. Additionally, regulation against unlawful competition is designed to protect against the misappropriation of content for purposes of creating competing products and services. This is to deter free riding and prevent dilution of incentives to invest in new content. The proper application of and respect for these laws is part of the basic framework underlying the thriving internet economy.

The value of copyrighted works must be protected

The primary revenues for publishers are advertising, licensing, and, increasingly, subscriptions. Publishers make their copyrighted content available to consumers through a wide range of means, including on websites and apps that are supported by various methods for monetization such as metered paywalls. It is important to note that even if content is available online and not behind a subscription wall, that does not extinguish its copyrighted status. In other words: It is not free for the taking.

That said, there are many cases where a copyright holder may choose to allow the use of their original work for commercial or non-commercial purposes. In these cases, potential licensees contact the copyright holder to seek a negotiated agreement, which may define the extent to which the content may be used and any protections for the copyright holder’s brand.

Unfortunately, AI developers, in large part, do not respect the framework of laws and rules described above. They seek to challenge and reshape these laws in a manner that would be exceptionally harmful for digital publishers, by bolstering their position that content made publicly available should be free for the taking – in this case, to build and operationalize AI models, tools and services.

Publishers are embracing the benefits of AI innovation. They are partnering with developers and third parties, for both commercial and non-commercial purposes, to provide access and licenses for the use of their content in a manner that is mutually beneficial. However, incentives are lacking to encourage AI developers to seek permission and access/licensing solutions. Publishers need a practical tool to signal to bots at scale whether they wish to permit crawling and scraping for the purposes of AI exploitation.

Next steps and the future of robots.txt

The IETF should update the robots.txt protocol to create more specific technical measures that help publishers convey the purposes for which their content may or may not be used, including limitations on the scraping and use of their content for generative AI (GAI) purposes. This should not be viewed as in any way reducing the existing legal obligations of third parties to seek permission directly from copyright holders. Still, it would be useful for publishers to be able to signal publicly, in a machine-readable format, which uses are permitted: for example, scraping for search purposes is permitted, whereas scraping to train LLMs or for other commercial GAI purposes is not.
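Pending any formal protocol extension, publishers today can only approximate this kind of purpose-based signaling with per-crawler rules, relying on user agent tokens that crawler operators have chosen to publish (for example, Google's Google-Extended token governs AI training uses separately from Googlebot's search indexing, and GPTBot is OpenAI's model-training crawler). A minimal sketch of a policy that permits search crawling while declining AI training might look like:

```txt
# Permit search indexing
User-agent: Googlebot
Allow: /

# Decline use of content for Google AI training
# (Google-Extended is Google's published AI-training control token)
User-agent: Google-Extended
Disallow: /

# Decline crawling by OpenAI's model-training crawler
User-agent: GPTBot
Disallow: /

# Default for all other crawlers
User-agent: *
Allow: /
```

Such signals are purely advisory: they bind only crawlers that choose to identify themselves and honor them, which is precisely the gap a standardized, purpose-based protocol would need to close.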

Of course, a publisher’s terms of use should always remain legally binding and trump any machine-readable signals. Furthermore, these measures should not be treated as creating an “opt out” system for scraping. A publisher’s decision not to employ these signals is not permission (either explicit or implicit) to scrape websites or use content in violation of the terms of use or applicable laws. And any ambiguity must be construed in favor of the rights holders.

In order to achieve a solution in a timely and efficient manner, the focus should be on a means to clearly and specifically signal permission for, or prohibitions against, crawling and scraping for the purposes of AI exploitation. Others may seek to layer licensing solutions on top of this, which should be left to the market. In addition, there must be transparency for bots that crawl and scrape for purposes of AI exploitation. Any solution should not depend on AI developers voluntarily announcing the identities of their bots, nor should it tolerate bots that obscure their identity or the purpose of their activity.

And, critically, search and AI scraping must not be commingled. The protocol should not be allowed to be used in a manner that requires publishers to accept crawling and scraping for AI exploitation as a condition of being indexed for search.

Let’s not repeat the mistakes of the past by allowing big tech companies to leverage their dominance in one market to dominate an emerging market like AI. Original content is important to our future, and we should build out web standards that carry forward our longstanding respect for copyright in the AI age.

The post Publishers need a new robots.txt for the AI era appeared first on Digital Content Next.

The impact of media companies opting out of open-web AI training
https://digitalcontentnext.org/blog/2024/07/31/the-impact-of-media-companies-opting-out-of-open-web-ai-training/
Wed, 31 Jul 2024 18:33:59 +0000

The internet is seen by some as a vast repository of information readily available for training open and closed AI systems. However, this “data commons” raises significant ethical and legal concerns regarding data consent, attribution, and copyright, particularly for media companies. These concerns are growing due to the fear that AI systems may use the media’s content for training without consent, exacerbating conflicts over intellectual property rights.

A new study, Consent in Crisis: The Rapid Decline of the AI Data Commons, investigates these issues by examining how AI developers use web data and how data access and usage protocols shift over time. This research involves a comprehensive audit of web sources used in major AI training datasets, including C4, RefinedWeb, and Dolma.

The research also evaluates the practices of AI developers, such as Google, OpenAI, Anthropic, Cohere, and Meta, as well as non-profit archival organizations such as Common Crawl and the Internet Archive. By focusing on dynamic web domains and tracking changes over time, this study assesses the evolving landscape of data usage and its implications for media companies.

The research provides strong empirical evidence of a misalignment between the web data collected for AI training and the uses to which AI systems are actually put. The analysis also tracks major shifts in how consent preferences are signaled and reveals the limitations of current tools.

Increased restrictions on AI data

  • From April 2023 to April 2024, a growing number of websites began blocking AI bots from collecting their data. Websites do this by adding specific instructions to their robots.txt files and restrictions to their terms of service.
  • Impact: About 25% of the most critical data sources and 5% of all data used in some major AI datasets (C4, RefinedWeb, and Dolma) are now off-limits to AI.
  • OpenAI’s bots, which collect data for AI training, are blocked more often than other companies’ bots. The rules governing what these bots may and may not do are often unclear and inconsistent.
  • Impact: This inconsistency makes adhering to data usage preferences difficult and indicates ineffective management tools.
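The blocking behavior described above can be evaluated programmatically. The sketch below uses Python's standard urllib.robotparser to test a robots.txt policy of the kind the study catalogues; the policy text and URL are illustrative, while GPTBot is OpenAI's published training user agent token:

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt that blocks an AI-training crawler
# site-wide while leaving all other crawlers unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Search-style crawlers fall under the wildcard rule and may fetch pages.
print(parser.can_fetch("Googlebot", "https://example.com/article"))  # True

# The AI-training crawler is disallowed everywhere.
print(parser.can_fetch("GPTBot", "https://example.com/article"))     # False
```

Note that this check only interprets the signal: nothing in the protocol itself forces a crawler to run it or to honor the result, which is why the study pairs robots.txt analysis with terms-of-service analysis.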

Divergence in the web data quality

  • The web domains most heavily represented in AI training data are news sites, forums, and encyclopedias, along with academic and government content. These domains contain diverse content, such as images, videos, and audio. Many of these sites monetize via ads and paywalls, and they frequently restrict in their terms of service how their content may be used. In contrast, other web domains consist of personal/organizational websites, blogs, and e-commerce sites with less monetization and fewer restrictions.
  • Impact: The increasing restrictions on popular, content-rich websites mean that AI models must rely more heavily on open or user-generated content. They thus miss out on the highest-quality and most up-to-date information, potentially affecting their performance and accuracy.

Mismatch between web data and AI usage

  • The study finds a significant disconnect between the web data collected for training AI and the actual tasks AI systems perform in the real world.
  • Impact: This misalignment could lead to problems with AI systems’ performance and data collection. It may also lead to legal issues related to copyright.

AI economic fears may reshape internet data

  • Using internet content for AI training, which was not its original intent, shifts the incentives for content creation. With the increasing use of paywalls and ads, small-scale content providers might opt out or move to walled platforms to protect their data. Without better control mechanisms for website owners, the open web is likely to shrink further, with more content locked behind paywalls or logins to prevent unauthorized use.
  • Impact: This trend could significantly reduce the availability of high-quality information for AI training.

The media’s choice to opt out of AI training

While the internet has served as a critical resource for AI development, the use of content created by others, including the media, often at great expense and without consent, presents significant ethical and legal challenges. As more media companies choose to exclude their content from AI training, the datasets become less representative and more outdated. The decline in data quality reduces the relevance and accuracy of the resulting AI models. Improved data governance and transparency are therefore essential to preserve open access to content online. Such a framework would also support the ethical use of web content for AI training, which in turn should improve the quality of the training data.
