Perplexity accused of scraping sites that barred AI scraping

Cloudflare accuses Perplexity of bypassing website no-scrape directives using stealth methods.

Cloudflare has accused AI startup Perplexity of scraping content from websites that explicitly disallowed it through their robots.txt files. According to Cloudflare's research, Perplexity hid its identity by changing its bots' user agent and their autonomous system numbers (ASNs). Cloudflare says it observed this activity across tens of thousands of domains, amounting to millions of requests per day, and that it has removed Perplexity's bots from its list of verified crawlers. Perplexity dismissed the claims, saying the crawler in question was not its own and calling Cloudflare's report a 'sales pitch.'

Cloudflare, a major internet infrastructure provider, has accused AI startup Perplexity of scraping content from websites that have explicitly said they do not want to be scraped. The accusation was made public in a detailed research report published by Cloudflare. The report describes how Perplexity allegedly works around blocks set by websites. The methods it lists include altering the bots' 'user agent', a string a client sends to identify the software making a request, such as the browser or crawler and its version, and changing their autonomous system numbers (ASNs), the numbers that identify large networks on the internet.

Cloudflare's researchers claim that Perplexity's evasive tactics are designed to obscure its identity and intentions when scraping content. This behavior was recorded across tens of thousands of domains with millions of requests made each day. Through a combination of machine learning and network signals, Cloudflare was able to track and fingerprint the scraping activities of this AI startup. Despite Cloudflare's efforts to combat such behavior by delisting Perplexity's bots from its verified list and developing new techniques for blocking them, the problem appears to persist.

Perplexity spokesperson Jesse Dwyer dismissed Cloudflare's claims as a 'sales pitch.' Dwyer said in an email that the screenshots Cloudflare provided did not show any content being accessed, and that the bot named in Cloudflare's blog post 'isn't even ours.' The dispute highlights a growing tension between companies that manage internet infrastructure and those that develop AI technology over the ethics of data use and intellectual property rights.

AI companies have frequently scraped data from various sources without explicit permission, raising ethical concerns. Websites have tried to push back using the robots.txt file, a web standard that tells crawlers, including those run by search engines and AI companies, which pages they may or may not access. Compliance is voluntary, however, and according to Reuters these attempts have yielded mixed results, with many AI companies finding ways to bypass the restrictions.
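To make the mechanism concrete, here is a minimal sketch in Python of how a well-behaved crawler might consult robots.txt before fetching a page, using the standard library's urllib.robotparser. The rules, the bot name ExampleAIBot, and the example.com URLs are hypothetical and chosen only for illustration; they are not taken from Cloudflare's report or from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one AI crawler is barred from the whole site,
# everyone else is only barred from /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"
    # The crawler that identifies itself honestly is refused.
    print(is_allowed("ExampleAIBot", page))
    # The same request under a generic browser-style name is permitted.
    print(is_allowed("Mozilla/5.0", page))
```

The check depends entirely on the name the crawler declares for itself, and nothing in robots.txt enforces the answer. That is why a changed user agent, as alleged in this case, is enough to sidestep a rule written for a specific bot.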

Cloudflare, which has taken a firm stance against unauthorized AI scraping, recently announced a marketplace that lets website owners charge AI scrapers for visiting their sites. The move is part of Cloudflare's broader strategy to protect publishers' business models from disruption by AI. Cloudflare has also released tools that help website owners prevent AI bots from scraping their content for training purposes.

Sources: Cloudflare, TechCrunch, Reuters