Reddit has filed a lawsuit in the U.S. District Court for the Southern District of New York against Perplexity AI and three data-scraping companies—Oxylabs, AWMProxy, and SerpApi—accusing them of unlawfully harvesting Reddit’s user-generated content to train AI models without authorization. The complaint alleges that these entities circumvented Reddit’s anti-scraping measures, including using Google search results as a conduit for data extraction. Reddit claims that Perplexity’s use of its content increased significantly after a cease-and-desist letter was issued in May 2024, suggesting a deliberate disregard for the platform’s data protection efforts (New York Post).
In response, Perplexity has denied the allegations, asserting that it does not train its AI models on Reddit data. The company maintains that it only summarizes Reddit posts with proper citations and does not engage in data scraping. Perplexity further contends that Reddit’s lawsuit is an attempt to exert pressure in ongoing negotiations with other AI companies, such as Google and OpenAI, over data licensing agreements (Search Engine Journal).
This legal dispute highlights the broader tensions between AI companies seeking access to vast datasets for model training and content platforms aiming to protect their intellectual property and user data. The outcome of this case could have significant implications for data usage policies and licensing practices in the AI industry.
Reddit has filed a lawsuit against Perplexity AI and three data-scraping firms, accusing them of unlawful, industrial-scale scraping of user-generated content for commercial gain. The complaint alleges that the defendants violated copyright and deliberately bypassed technological protections.
The lawsuit is a major legal challenge to the AI industry’s data acquisition practices and the concept of “fair use” for publicly available web content.
Case Study: Reddit’s Allegations Against Perplexity
The lawsuit, filed in a New York federal court, targets Perplexity AI along with three data-scraping intermediaries: Oxylabs UAB, AWMProxy, and SerpApi.
Key Claims by Reddit
- “Data Laundering”: Reddit accuses the firms of fueling an “industrial-scale ‘data laundering’ economy.” It claims the scrapers bypass its anti-scraping measures by disguising their bots as regular users and, crucially, scraping Reddit content indirectly through Google Search results.
- Circumvention Tactics: Reddit asserts that Perplexity chose to acquire data through these unauthorized channels rather than enter into a licensing agreement. Reddit points out that a lawful route existed: it has already signed paid licensing deals for its data with major AI players such as Google and OpenAI.
- “Smoking Gun” Evidence: Reddit claims it set a “trap” by creating a test post that only Google’s search crawler could reach and that was otherwise inaccessible on the site. According to the lawsuit, Perplexity’s answer engine surfaced the content of this hidden post within hours, which Reddit argues proves that the defendants were scraping its content from Google’s search results.
Perplexity’s Public Response and Defense
Perplexity has publicly and firmly denied the allegations, framing the lawsuit as an attempt by Reddit to restrict access to public knowledge.
Perplexity’s Main Arguments
- Application vs. Model Training: Perplexity argues that as an “application-layer company,” it does not train its foundation AI models on Reddit content. It claims its use of Reddit content is limited to summarizing discussions and citing the Reddit threads, a practice it defends as essential for users to verify the accuracy of the AI-generated answers.
- Open Internet Principle: The company asserts it “will always fight vigorously for users’ rights to freely and fairly access public knowledge” and “will not tolerate threats against openness and the public interest.” It frames Reddit’s lawsuit as “strong-arm tactics” meant to “extort” payment even for lawful data access.
- Refusal to Pay for Summarization: Perplexity publicly stated that Reddit insisted it pay for content licensing even after the AI company explained it does not train models on the data. Perplexity refused, claiming, “We won’t be extorted.”
Comments & Broader Implications
The lawsuit is a significant event in the ongoing legal and economic battle over AI training data and could set a key legal precedent.
Legal & Economic Comments
- Defining “Fair Use”: The case will test the boundaries of the “fair use” doctrine in the context of generative AI. AI companies often argue that using billions of works to train a model is a “transformative” use that is necessary for innovation. Content creators, conversely, argue that the large-scale, commercial use of their work without compensation constitutes massive theft that damages the emerging market for AI data licensing.
- Monetizing Human Conversation: Reddit views its user-generated content as a high-value asset, as other content owners such as The New York Times view theirs, and these companies are now monetizing that content through licensing deals. The lawsuit reinforces the shift away from an “open web” culture, where crawling was largely symbiotic, toward a “gated ecosystem” where platforms demand compensation for data consumed by commercial AI models.
- Liability of Intermediaries: A critical aspect of the case is the inclusion of the data-scraping firms. By targeting the entire “data laundering” pipeline, Reddit seeks to establish legal liability for the intermediaries that facilitate the unauthorized acquisition of content.
- Economic Cost of Licensing: Opponents of restrictive data access warn that “pay-per-crawl” requirements and expensive licensing deals favor large tech giants that can afford the costs, potentially creating a barrier to entry for smaller AI startups like Perplexity and stifling innovation in the long term.
