OpenAI launches a default opt-in crawler to scrape the Internet, while FTC pursues an obscure consumer deception investigation
Last week, OpenAI (maker of ChatGPT) officially announced its web crawler: a piece of software that scrapes content from websites across the internet, which is then used for AI model training.
The existence of the crawler is not surprising; several legitimate web crawlers exist today, including Google's crawler, which indexes the entire internet.
However, this is the first time OpenAI explicitly announced its existence and also provided a mechanism for websites to opt out of being scraped.
Note that the crawler is opt-in by default, i.e., you need to explicitly change a piece of code on your website to ask the crawler not to scrape your data. Opt-in/opt-out defaults are sticky and often determine the majority behavior, because most people don't make the effort to change defaults.
It is the same reason Apple's iOS 14 privacy changes have had a major impact on the digital advertising industry.
So, why even provide the opt-out? This is likely a preemptive move from OpenAI in response to recent lawsuits alleging that the company infringed content owners' copyright (see my deeper article on data scraping if you want to poke around more).
ChatGPT competitor Google Bard faces a similar challenge, but Google has not yet announced an equivalent solution; it did put out a request for comment on how to upgrade robots.txt to address this issue (written with some neat PR penmanship).
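For reference, OpenAI's opt-out works through this same robots.txt mechanism: you add a rule for its GPTBot user agent. A minimal sketch of what that looks like (the directory paths in the second, commented-out rule are illustrative, not taken from OpenAI's docs):

```
# robots.txt at the root of your site

# Ask OpenAI's crawler (GPTBot) not to scrape any page on this site
User-agent: GPTBot
Disallow: /

# Alternatively, allow some sections and block others (illustrative paths)
# User-agent: GPTBot
# Allow: /blog/
# Disallow: /members/
```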
In this article, we'll dive into:
- Implications of OpenAI's crawler for content owners
- FTC's current investigation into OpenAI
- Today's legal landscape that we operate in
- Why the FTC's approach of going after OpenAI is (yet another) misstep
Implications of OpenAI's Crawler for Content Owners
While the announcement provides an option for content owners to block OpenAI's crawler from scraping their data, a couple of things are not great:
- It's opt-in by default, which means OpenAI can keep scraping until sites explicitly tell it not to.
- There hasn't been a clear legal ruling one way or another about content owners' rights when their data is scraped for model training without consent (which is essentially the case for anyone forced into a default opt-in).
Today, two legal constructs determine whether it is okay for language models to take all this data without consent: Copyright and Fair Use.
Copyright provides protection to specific types of content but also has carve-outs/exceptions:
Copyright protection subsists, in accordance with this title, in original works of authorship fixed in any tangible medium of expression, now known or later developed, from which they can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device.
Works of authorship include the following categories: (1) literary works; (2) musical works, including any accompanying words; (3) dramatic works, including any accompanying music; (4) pantomimes and choreographic works; (5) pictorial, graphic, and sculptural works; (6) motion pictures and other audiovisual works; (7) sound recordings; and (8) architectural works.
(b) In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work.
For example, copyright protects most original work (e.g., if you wrote an original blog article or book on a topic) but does not protect broad ideas (e.g., you cannot claim that you were the first person to write about how AI impacts data rights, and therefore the idea belongs to you).
Another carve-out/exception from Copyright protection is Fair Use:
The fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright.
In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include (1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work.
For example, if you picked up content from a research paper and wrote a critique about it, that's okay, and you are not infringing on the content owner's copyright. It's the same situation when I link another article from this page and add quoted text from that article.
Both of these concepts were created to protect content owners' rights while also allowing the free flow of information, especially in the context of education, research, and critique.
I am not a legal expert, but based on my research/understanding of the language above, here is where this gets fuzzy when AI models scrape training content:
- AI companies typically scrape full text from a content owner's website (this is protected by Copyright), train the models to learn the "idea"/"concept"/"principle" (this is not protected by Copyright), and then the models eventually spit out different text. In this case, does the content owner receive Copyright protection or not?
- Since the trained language models are ultimately used for commercial purposes (e.g., ChatGPT Plus is a paid product), is that a violation of the content owner's Copyright (because the Fair Use exception no longer applies)?
There have been no court rulings on this yet, so it's hard to predict where it lands. My not-a-lawyer take is that the second argument is probably the easier one to make: OpenAI scraped data and used it to create a commercial product, and therefore it does not get an exception under Fair Use.
I would imagine the first one (did the model train on an "idea" or on original text?) is anyone's guess.
Note that both of those bullets need to land in content owners' favor for them to win, i.e., content owners only win if neither of the above exceptions (the "idea" exception and the Fair Use exception) applies to OpenAI.
I bring up this nuance because, in the (non-exhaustive) spectrum of AI risks, from content owners' rights to amplified fraud to jobs being automated to AGI and the destruction of humanity, the most pressing near-term issue is content owners' rights, as evidenced by the flurry of lawsuits and the impact on content platforms (e.g., the StackOverflow story).
While regulators like the FTC can ponder the really long-term problems and come up with hypothetical/creative ways to address those risks, their real short-term potential lies in tackling risks that will impact us on the 5-10 year horizon. Like copyright infringement.
Which brings us to what the FTC is doing about it.
FTC's Current Investigation Into OpenAI
In mid-July, the FTC announced that it is investigating OpenAI. What makes it interesting (and frustrating) is what the FTC is investigating them for.
The maker of ChatGPT is being investigated to evaluate whether the company broke any consumer protection laws by putting personal reputation and data at risk.
Doesn't make sense? You're not alone. Let's lay out some more background on how this came to be.
The FTC's most vocal stance on AI regulation came out in April: "There is no AI exemption to the laws on the books, and the FTC will vigorously enforce the law to combat unfair or deceptive practices or unfair methods of competition."
Then came a couple of defamation-related issues: Radio host Mark Walters sued OpenAI after ChatGPT accused him of defrauding a non-profit, and a law professor was falsely accused by ChatGPT of sexual harassment.
Both these scenarios suck for the people involved, and I empathize with that. However, it's a known fact that language models (like GPT) and products built on top of them (like ChatGPT) "hallucinate" and are often incorrect.
The first half of the FTC's premise for the investigation is that ChatGPT hallucinates and therefore creates reputational harm.
In a heated Congressional hearing, one representative (rightfully) asks the FTC why they are going after defamation and libel, which are typically handled by state laws. FTC Chairperson Lina Khan gives a convoluted argument:
Khan responded that libel and defamation aren't a focus of FTC enforcement, but that misuse of people's private information in AI training could be a form of fraud or deception under the FTC Act.
"We're focused on, 'Is there substantial injury to people?' Injury can look like all sorts of things," Khan said.
To tie up the full argument: the FTC is saying that ChatGPT's hallucinations produce incorrect information (including defamation), which could then be a form of consumer deception.
Additionally, sensitive private user information could have been used or leaked (based on one bug that OpenAI quickly fixed).
As part of the investigation, the FTC has asked OpenAI for a long list of things: from details about how its models are trained, to what data sources it uses, to how it positions its products to customers, to situations where model releases have been paused because of identified risks.
The question is: is this the best approach for the FTC to regulate what is arguably going to be one of the largest AI companies, especially given the current legal landscape?
Today's Legal Landscape That We Operate In
To critique the FTC's strategy with OpenAI, it's useful to understand the legal landscape we operate in today. We won't go into too much detail, but let's do this briefly with the history of antitrust as an example:
- In the late 1800s, massive conglomerates ("trusts") came into existence, and the balance of public-private power shifted to these companies.
- In response, the Sherman Act of 1890 was passed to add checks on private power and preserve competition; this law was used to litigate and break up "trusts" that were engaged in anti-competitive practices (predatory pricing, cartel deals, distribution monopolies).
- Around the 1960s, judges faced a lot of backlash for ruling based on the spirit of the law instead of the letter of the law; for example, interpreting the Sherman Act to determine whether a set of companies "unreasonably restrained trade" involved subjectivity, and judges were accused of judicial activism.
- To introduce objectivity, the Chicago School pioneered the consumer welfare standard: "courts should be guided exclusively by consumer welfare" (e.g., a monopoly blatantly increasing prices is wrong, but for other activities, the burden of proof is on regulators to show consumer harm).
- This continues to be the standard today and is one of the reasons the FTC and DOJ have a difficult job taking down big tech; for example, the FTC cannot argue that Google is increasing prices, since most of its products are free, even if Google is engaged in other anti-competitive practices.
The takeaway from this is that we continue to operate in a landscape where cases are litigated heavily on the "letter of the law" and not the "spirit of the law." This, along with the composition of the US Supreme Court today, has resulted in fairly conservative interpretations of the law.
What this means for the FTC is that it needs to embrace the reality of this landscape and figure out a way to win cases. The operating model of the FTC and DOJ (rightfully so) is to go after a handful of big cases and lay down harsh enforcement so that the long tail of companies thinks twice before breaking laws.
To make that happen, the FTC needs to win big on a few issues, and it needs a winning strategy within the constraints of the current legal landscape.
Why the FTC's Approach of Going After OpenAI Is (Yet Another) Misstep
The FTC has had a streak of losses against Big Tech, and I would argue that the losses can all be attributed to a failed "we hate everything big tech", hammer-not-scalpel strategy of taking on these companies.
For example, the FTC took a brute-force approach to stopping the $69B Microsoft-Activision acquisition and lost (pretty badly, I'd say). The FTC argued that Microsoft acquiring Activision would kill competition in the gaming market.
The judge wrote a fairly blunt ruling throwing out all of the FTC's arguments; here's one of the judge's comments:
There are no internal documents, emails, or chats contradicting Microsoft's stated intent not to make Call of Duty exclusive to Xbox consoles. Despite the completion of extensive discovery in the FTC administrative proceeding, including production of nearly 1 million documents and 30 depositions, the FTC has not identified a single document which contradicts Microsoft's publicly-stated commitment to make Call of Duty available on PlayStation (and Nintendo Switch).
Another brute-force case was the FTC's attempt to block Meta's acquisition of the VR company Within, which it also lost. Why did the FTC pursue this? It wanted to test the waters and see if there was an appetite to block acquisitions before a particular market becomes large, and given the current legal landscape, the case was unsurprisingly thrown out.
The problem with the FTC's investigation of OpenAI is similar:
- They are going after what is, in my opinion, a pretty trivial issue and a known limitation of language models (hallucinations); they should instead be focusing on actual AI issues that matter on the 5-10 year horizon, like copyright.
- Despite multiple "creative" legal approaches being thrown out in the current legal landscape, they are attempting another creative argument: hallucination → defamation → consumer deception.
The generous interpretation of their actions is that they want to set a precedent for their "AI is not exempt from existing laws" stance and that this wild goose chase gets them a large amount of self-reported data from OpenAI (the FTC issued 20 pages of asks).
However, given their track record of repeatedly pursuing brute-force, "anything big tech does is anti-competitive" approaches, combined with creative arguments that courts keep dismissing, I believe the FTC has not earned the benefit of the doubt in this case.
Conclusion
I absolutely think OpenAI should be regulated. Not because their LLMs hallucinate (of course they do), but because they are blatantly using creators' content without permission. Not because it will change the past, but because it will help set up content owners for a healthy future where their copyrights cannot be so blatantly infringed upon.
But the FTC is repeating its missteps with the hammer-not-scalpel approach. There is clear precedent for successes against big tech with a scalpel approach, the most notable being the UK's Competition and Markets Authority (CMA).
The two big cases the CMA won against Google focused on specific anti-competitive mechanisms: stopping Google from giving preferential treatment to its own products in the AdTech stack, and allowing other payment providers for in-app payments.
If the FTC continues on its current path, its streak of losses is going to embolden tech companies to continue doing whatever they want because they know they can win in court. It's time the FTC reflected on its failures, learned from other regulators' successes, and course-corrected.
If you liked this piece, consider subscribing to my weekly newsletter. Every week, I publish one deep-dive analysis on a current tech topic/product strategy in the form of a 10-minute read.
Best, Viggy.