Canadian News v. OpenAI: The suit that could change LLM regulations
Last week, at a Toronto courthouse, OpenAI attempted to seal "commercially sensitive" documents ahead of the trial brought against it by a group of Canadian media organisations. This was the first time the tech giant had appeared in court since the news media group launched legal action late last year.
On November 29, 2024, a coalition of companies, including the owners of the Toronto Star, Metroland, the National Post, The Globe and Mail, The Canadian Press, and the Canadian Broadcasting Corporation, filed suit against OpenAI and its subsidiaries.
The legal action centres on OpenAI's alleged data scraping: using copyrighted material to teach the models behind ChatGPT how to write. The news organisations argue that these actions infringe their copyright, breach their terms of use, and "unjustly enriched" OpenAI at the news media outlets' expense.
"Journalism is in the public interest. OpenAI using other companies' journalism for their own commercial gain is not. It's illegal," stated a joint release from the news organisations back in November.
The plaintiffs say it is not only the use of their material that breaks the law, but also the way the company uses that material in training to advance its products, such as ChatGPT.
Allan Oziel, a partner at Oziel Law, specialises in technology law, including AI licensing agreements and addenda. He says tech companies that procure or provide services with generative-AI functionality are struggling to address ownership of output, partly because of the uncertainty around ownership of copyrighted content that has been ingested into LLMs.
Questions arise, like: 'What is the output substantially based on? Is the output of a prompt just the result of an analysis? Is it primarily based on the proprietary data, the models, or the algorithms of the provider?'
Oziel says ownership of output is a big topic among tech companies as they determine whether they can use clients' information to improve their product offerings. Many free products use client-side data to train their models.
According to OpenAI's policy page, "When you use our services for individuals such as ChatGPT, Codex, and Sora, we may use your content to train our models." Users can, however, opt out of having their data used for training.
As more organisations sue LLM developers over the copyrighted content used in training, copyright considerations are becoming increasingly important, though no precedent has yet been set in Canadian law.
Oziel says the best way to protect an LLM is for it only to ingest licensed content or content that is clearly under the "fair dealing" exception to copyright in Canada.
This isn't the first legal battle the news media has brought against OpenAI. In the US, The New York Times filed a suit against OpenAI in late 2023. The United States District Court for the Southern District of New York deemed it a relevant matter, allowing the case to proceed toward a full trial, which is projected to extend until mid-2026.
Oziel says the outcome of these cases will give lawyers and companies guidance on how to use information lawfully.
"In other words, are you allowed to use copyright material on the basis of an exemption to copyright infringement under the various copyright legislations? ...That's the crux of what they're looking at."
But if the court deems it unlawful to use these materials to train LLMs, how will developers obtain the data they need for training? Oziel thinks a likely outcome is that LLM developers will have to pay licensing fees for data. News sites with large amounts of content would then become much more valuable, garnering licence fees whenever an LLM developer wishes to ingest their data for training.
"I do think it will become its own field, where data is king, and then the processors of the data, like the LLM model, become immediately less valuable because it's based on the data."
If the plaintiffs win, there is also a risk of retroactive suits once a precedent is set, subject to applicable limitation periods in law.
"LLMs are going to be waiting with bated breath, so to speak, because, if there's a precedent that is set, it opens the floodgates entirely for all data owners to then determine whether or not their data was ingested into a large language model and seek recourse," says Oziel.
The judicial challenge hearing, which may determine whether the case proceeds to trial or is dismissed, is scheduled for September.