AI companies are reportedly still scraping websites despite protocols meant to block them

AI companies are ignoring the robots.txt protocol to scrape websites and use their content for training, Reuters reports.

Multiple AI companies, including Perplexity, OpenAI, and Anthropic, are reportedly bypassing robots.txt files, the protocol that tells web crawlers which pages to avoid, despite claiming to comply with it. The issue has raised concerns about the ethical use of publishers' content and the need for new regulations.

AI companies such as Perplexity, OpenAI, and Anthropic are reportedly bypassing robots.txt files, which tell web crawlers which pages they should not access. Despite previously claiming to respect the protocol, these companies have continued to scrape website content to train their models.
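For context, robots.txt is a plain-text file served at a site's root that lists which crawlers (identified by user-agent) may fetch which paths; honoring it is entirely voluntary. As a minimal sketch, the Python snippet below uses the standard library's urllib.robotparser to evaluate a hypothetical rule set; the crawler names, paths, and URLs are illustrative, not taken from any specific publisher's file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: blocks a hypothetical AI crawler site-wide,
# and blocks all other crawlers only from /private/.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# A compliant crawler calls can_fetch() before requesting a page;
# nothing in the protocol itself prevents a crawler from ignoring the answer.
print(parser.can_fetch("GPTBot", "https://example.com/articles/story"))        # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/articles/story"))  # True
print(parser.can_fetch("SomeOtherBot", "https://example.com/private/page"))    # False
```

The key point the example illustrates is that robots.txt is purely advisory: compliance depends on the crawler choosing to check and respect these rules.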

The practice came to light in a letter from TollBit, a startup that brokers content-licensing deals between publishers and AI firms, which warned that various AI agents are ignoring the robots.txt protocol. Wired separately reported that Perplexity's tools produced results closely paraphrased from its articles and sometimes generated inaccurate summaries, calling the ethical practices of these AI companies into question.

Perplexity CEO Aravind Srinivas defended the company, stating that the Robots Exclusion Protocol is not a legal framework. He suggested that a new relationship needs to be established between publishers and AI companies, though he admitted the company's use of third-party web crawlers may have contributed to the issues.