Congress Wants Tech Companies to Pay Up for AI Training Data

Do AI companies need to pay for the training data that powers their generative AI systems? The question is hotly contested in Silicon Valley and in a wave of lawsuits filed against tech behemoths like Meta, Google, and OpenAI. In Washington, DC, though, there seems to be a growing consensus that the tech giants need to cough up.

Today, at a Senate hearing on AI’s impact on journalism, lawmakers from both sides of the aisle agreed that OpenAI and others should pay media outlets for using their work in AI projects. “It’s not only morally right,” said Richard Blumenthal, the Democrat who chairs the Judiciary Subcommittee on Privacy, Technology, and the Law that held the hearing. “It’s legally required.”

Josh Hawley, a Republican working with Blumenthal on AI legislation, agreed. “It shouldn’t be that just because the biggest companies in the world want to gobble up your data, they should be able to do it,” he said.

Media industry leaders at the hearing today described how AI companies were imperiling their industry by using their work without compensation. Curtis LeGeyt, CEO of the National Association of Broadcasters, Danielle Coffey, CEO of the News Media Alliance, and Roger Lynch, CEO of Condé Nast, all spoke in favor of mandatory licensing. (WIRED is owned by Condé Nast.)

Coffey claimed that AI companies “eviscerate the quality content they feed upon,” and Lynch characterized training data scraped without permission as “stolen goods.” Coffey and Lynch also both said that they believe AI companies are infringing on copyright under current law. They urged lawmakers to clarify that using journalistic content without first brokering licensing agreements is not protected by fair use, a legal doctrine that permits the unlicensed use of copyrighted material under certain conditions.

Common Ground

Senate hearings can be adversarial, but the mood today was largely congenial. The lawmakers and media industry insiders often applauded each other’s statements. “If Congress could clarify that the use of our content, or other publisher content, for the training and output of AI models is not fair use, then the free market will take care of the rest,” Lynch said at one point. “That seems eminently reasonable to me,” Hawley replied.

Journalism professor Jeff Jarvis was the hearing’s only discordant voice. He asserted that training on data obtained without payment is, indeed, fair use, and spoke against compulsory licensing, arguing that it would damage the information ecosystem rather than safeguard it. “I must say that I am offended to see publishers lobby for protectionist legislation, trading on the political capital earned through journalism,” he said, jabbing at his fellow speakers. (Jarvis was also the target of the hearing’s only truly contentious line of questioning, from Republican Marsha Blackburn, who needled Jarvis about whether AI is biased against conservatives and recited an AI-generated poem praising President Biden as evidence.)

Outside of the committee room, there is less agreement that mandatory licensing is necessary. OpenAI and other AI companies have argued that it’s not viable to license all training data, and some independent AI experts agree.

“What would that even look like?” asks Sarah Kreps, who directs the Tech Policy Institute at Cornell University. “Requiring licensing data will be impractical, favor the big firms like OpenAI and Microsoft that have the resources to pay for these licenses, and create enormous costs for startup AI firms that could diversify the marketplace and guard against hegemonic domination and potential antitrust behavior of the big firms.”

Even within circles that favor some form of licensing for AI training data, there’s some dissent about whether it should be legally compulsory rather than simply encouraged as an industry norm. “As a high-quality and up-to-date source of information, news media is a valuable source of data for AI companies. My opinion is that they should pay to license it and that it is in their interest to do so,” Northwestern computational journalism professor Nick Diakopoulos says. “But I do not think a mandatory licensing regime is tenable.”

It remains to be seen exactly how lawmakers plan to fulfill requests, like Lynch’s, to clarify existing copyright law. But there are already several attempts to pass legislation that would create guardrails around data licensing, including the Journalism Competition and Preservation Act, a bill authorizing news outlets to collectively negotiate licensing arrangements, and Blumenthal and Hawley’s Bipartisan Framework on AI Legislation, which calls for a licensing regime overseen by an independent body.

As today’s hearing made clear, though, Congress is already highly critical of AI’s potential to amplify the power of the tech industry and its potentially deleterious impacts on journalism. The way Blumenthal described Big Tech’s impact on the local media ecosystem captured this pugilistic tone: “It is literally eating away at the lifeblood of our democracy.”