Court filings show Meta staffers discussed using copyrighted content for AI training

MT HANNACH

For years, Meta employees have internally discussed using copyrighted works obtained through legally questionable means to train the company’s AI models, according to court documents unsealed on Thursday.

The documents were submitted by the plaintiffs in the case Kadrey v. Meta, one of many AI copyright disputes slowly winding through the U.S. court system. The defendant, Meta, claims that training models on IP-protected works, particularly books, is “fair use.” The plaintiffs, who include authors Sarah Silverman and Ta-Nehisi Coates, disagree.

Previous materials submitted in the suit alleged that Meta CEO Mark Zuckerberg gave Meta’s AI team the OK to train on copyrighted content and that Meta halted AI training data licensing talks with book publishers. But the new filings, most of which show portions of internal work chats between Meta employees, paint the clearest picture yet of how Meta may have come to use copyrighted data to train its models, including models in the company’s Llama family.

In one chat, Meta employees, including Melanie Kambadur, a senior manager on Meta’s model research team, discussed training models on works they knew might be legally fraught.

“[M]y opinion would be (in the line of ‘ask forgiveness, not for permission’): we try to acquire the books and escalate it to execs so they make the call,” wrote Xavier Martinet, a Meta research engineer, in a chat dated February 2023, according to the filings. “[T]his is why they set up this gen ai org for [sic]: so we can be less risk averse.”

Martinet floated the idea of buying e-books at retail prices to build a training set rather than cutting licensing deals with individual book publishers. After another staff member pointed out that using copyrighted materials could be grounds for a legal challenge, Martinet doubled down, arguing that “a gazillion” startups were probably already using pirated books for training.

“I mean, worst case: we found out it is finally ok, while a gazillion start up [sic] just pirated tons of books on bittorrent,” wrote Martinet, according to the filings. “[M]y 2 cents again: trying to have deals with publishers directly takes a long time…”

In the same chat, Kambadur, who noted that Meta was in talks with the document-hosting platform Scribd “and others” for licenses, cautioned that while using “publicly available data” for model training would require approvals, Meta’s lawyers were being “less conservative” than they had been in the past with such approvals.

“Yes, we definitely need to get licenses or approvals on publicly available data,” said Kambadur, according to the filings. “[D]ifference now is we have more money, more lawyers, more bizdev help, ability to fast track/escalate for speed, and lawyers are being a bit less conservative on approvals.”

Talks of Libgen

In another work chat relayed in the filings, Kambadur discusses possibly using Libgen, a “links aggregator” that provides access to copyrighted works from publishers, as an alternative to data sources that Meta might license.

Libgen has been sued several times, ordered to shut down, and fined tens of millions of dollars for copyright infringement. One of Kambadur’s colleagues replied with a screenshot of a Google search result for Libgen containing the snippet “No, Libgen is not legal.”

Some decision-makers within Meta appear to have been under the impression that not using Libgen for model training could seriously hurt Meta’s competitiveness in the AI race, according to the filings.

In an email addressed to Meta AI VP Joelle Pineau, Sony Theakanath, director of product management at Meta, called Libgen “essential to meet SOTA numbers across all categories,” referring to matching the best state-of-the-art (SOTA) AI models across benchmark categories.

Theakanath also outlined “mitigations” in the email intended to reduce Meta’s legal exposure, including removing Libgen data “clearly marked as pirated/stolen” and simply not publicly citing its use. “We would not disclose use of Libgen datasets used to train,” as Theakanath put it.

In practice, these mitigations entailed combing through Libgen files for words like “stolen” or “pirated,” according to the filings.

In a work chat, Kambadur mentioned that Meta’s AI team also tuned models to “avoid IP risks,” that is, configured the models to refuse to answer prompts such as “reproduce the first three pages of ‘Harry Potter and the Sorcerer’s Stone’” or “tell me which books you were trained on.”

The filings contain other revelations, implying that Meta may have scraped Reddit data for some type of model training, possibly by mimicking the behavior of a third-party app called Lens. Notably, Reddit said in April 2023 that it planned to begin charging AI companies to access data for model training.

In a chat dated March 2024, Chaya Nayak, director of product management at Meta’s generative AI org, said Meta’s leadership was considering “overriding” past decisions on training sets, including a decision not to use Quora content or licensed books and scientific articles, to ensure that the company’s models had sufficient training data.

Nayak suggested that Meta’s first-party training datasets (Facebook and Instagram posts, text transcribed from videos on Meta platforms, and certain Meta for Business messages) were simply not enough. “[W]e need more data,” she wrote.

The plaintiffs in Kadrey v. Meta have amended their complaint several times since the case was filed in the U.S. District Court for the Northern District of California, San Francisco Division, in 2023. The latest alleges that Meta, among other offenses, cross-referenced certain pirated books with copyrighted books available for license to determine whether it made sense to pursue a licensing agreement with a publisher.

In a sign of how high Meta considers the legal stakes to be, the company has added two Supreme Court litigators from the law firm Paul Weiss to its defense team on the case.

Meta did not immediately respond to a request for comment.
