The New York Times Company v. OpenAI Update Court Filing, retrieved on February 26, 2024, is part of HackerNoon's Legal PDF Series. You can jump to any part in this filing here.
II. BACKGROUND
A. OpenAI's Pioneering Research
OpenAI was founded in 2015 to "advance digital intelligence in the way that is most likely to benefit humanity as a whole." Compl. ¶ 56. It entered the field of "natural language processing" (NLP), which includes the development of statistical tools called "language models."[15] These models can "predict[] words that are likely to follow a given string of text" based on statistics derived from a body of text, much like a weather model can predict rain using statistics derived from historical weather data. Compl. ¶ 75. By 2015, research had already unlocked "substantial progress" on "tasks such as reading comprehension" and "question answering."[16]
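To make the filing's "next word prediction" analogy concrete, the following is a minimal, illustrative sketch, not drawn from the filing or from OpenAI's code, of a purely statistical next-word predictor: it counts which words follow which in a small training text and treats those counts as probabilities. The toy corpus, function names, and top-3 cutoff are assumptions for illustration; modern models such as GPT-2 learn such statistics with neural networks rather than raw counts.

```python
# Illustrative sketch only: a tiny bigram "language model" that predicts the
# next word from counts observed in a training corpus. Real models like GPT-2
# use neural networks, not raw counts; this just shows the statistical idea.
from collections import Counter, defaultdict

def train_bigram_counts(corpus: str) -> dict:
    """Count how often each word follows each preceding word."""
    words = corpus.lower().split()
    follow_counts = defaultdict(Counter)
    for prev_word, next_word in zip(words, words[1:]):
        follow_counts[prev_word][next_word] += 1
    return follow_counts

def predict_next(follow_counts: dict, prev_word: str) -> list:
    """Return up to three likely next words with estimated probabilities."""
    counts = follow_counts.get(prev_word.lower())
    if not counts:
        return []  # the word never appeared in the "training" text
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(3)]

if __name__ == "__main__":
    # A toy "body of text"; a real model would train on vastly more data.
    corpus = ("it is likely to rain today . "
              "it is likely to snow tomorrow . "
              "it is unlikely to rain tomorrow .")
    model = train_bigram_counts(corpus)
    print(predict_next(model, "likely"))  # [('to', 1.0)]
    print(predict_next(model, "to"))      # [('rain', 0.67), ('snow', 0.33)]
```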
Those early models, however, were "brittle" and "narrow."[17] Researchers built them by "manually creat[ing] and label[ling]" datasets to "demonstrat[e] correct behavior" (like sets of English-to-French text translations) and using that data to "train a system to imitate [that] behavior[]." GPT-2 Paper at 1, 3. The resulting models, while impressive, could only carry out the specific tasks demonstrated by the training data. Id.; GPT-3 Paper at 3 ("need for task-specific datasets" was "a major limitation"). "To be broadly useful" to ordinary people, language models needed the ability to "seamlessly mix together or switch between many tasks and skills" without being specifically trained to carry out each task. GPT-3 Paper at 4. In other words, the models needed to be "competent generalists," not "narrow experts." GPT-2 Paper at 1.
OpenAI's researchers set out to solve that complex, scientific problem. In 2019, they posited that the way to build more capable, generalist models was to use "as large and diverse a dataset as possible [] to collect natural language demonstrations of tasks in as varied of domains and contexts as possible." GPT-2 Paper at 3. The hypothesis was that "[]training at a large enough scale [might] offer a 'natural' broad distribution of tasks implicitly contained in predicting the text itself." GPT-3 Paper at 40. So instead of training its models "on a single domain of text," OpenAI chose to use a richer and more diverse source: the Internet. GPT-2 Paper at 3.
OpenAI's researchers identified text from webpages whose URLs had been publicly shared on a social media platform. Id. This became a dataset called "WebText," which OpenAI used to train a model called "GPT-2." Id.; see also Compl. ¶ 85. WebText contained a wide array of text from internet forums, restaurant reviews, recipe websites, blogs, shopping websites, dictionaries, medical websites, how-to pages, and more.[18] The dataset was so diverse that even though Times content represented only a tiny fraction of the data, the "NYTimes" was one of the "top 15 domains by volume" in the collection. See GPT-2 Model Card. This happened not because OpenAI believed Times articles are more "valu[able]" than other content, contra Compl. ¶ 2 (suggesting OpenAI intentionally "gave Times content particular emphasis"), but because of the frequency with which certain social media users shared links to the Times's content, see GPT-2 Paper at 3.
The results of this sophisticated research were impressive. The GPT-2 model proved able to answer trivia questions and perform higher-function tasks like "resolv[ing] ambiguities in text." GPT-2 Paper at 6–7. The model even showed a "surprising" ability to translate French to English, although OpenAI had "deliberately removed non-English webpages" from the training dataset. Id. at 7. These research results were "exciting" not only because of the model's capability, but because they scientifically confirmed that the ability to "perform commonsense reasoning" increased dramatically with the size and diversity of the training data. Id. at 6 (Figure 3).
Continue Reading Here.
[15] Sébastien Bubeck, et al., Sparks of Artificial General Intelligence: Early Experiments with GPT-4 at 4, 98 (Apr. 13, 2023), https://arxiv.org/pdf/2303.12712.pdf ("Bubeck Paper"); Compl. ¶¶ 71, 91 nn.9 & 24 (citing articles). By "refer[ring] [to these documents] in [its] complaint," the Times incorporated them by reference. DiFolco v. MSNBC Cable L.L.C., 622 F.3d 104, 111–12 (2d Cir. 2010).
[16] OpenAI, Language Models are Few-Shot Learners at 3 (July 22, 2020), https://arxiv.org/pdf/2005.14165.pdf ("GPT-3 Paper"); see also Compl. ¶¶ 86, 90 & nn.18, 22 (citing and quoting this paper).
[17] OpenAI, Language Models are Unsupervised Multitask Learners at 1 (Feb. 14, 2019), https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf ("GPT-2 Paper"); see also Compl. ¶ 85 n.15 (citing and quoting this paper).
[18] See OpenAI, GPT-2 Model Card, GitHub, https://github.com/openai/gpt-2/blob/master/model_card.md (last updated Nov. 2019) ("GPT-2 Model Card"); see also Compl. ¶ 85 nn.14, 16, 17 (citing and quoting this source).
About HackerNoon Legal PDF Series: We bring you the most important technical and insightful public domain court case filings.
This court case, retrieved on February 26, 2024, from fingfx.thomsonreuters.com, is part of the public domain. The court-created documents are works of the federal government, and under copyright law, are automatically placed in the public domain and may be shared without legal restriction.