AI companies are reportedly still scraping websites despite protocols meant to block them

AI companies are ignoring the robots.txt protocol to scrape websites and use their content for training, Reuters reports.

Multiple AI companies, including Perplexity, OpenAI, and Anthropic, are reportedly bypassing robots.txt files, the protocol that tells web crawlers which pages to avoid, despite claiming to comply with it. The issue has raised concerns about the ethical use of publishers' content and the need for new regulations.

AI companies such as Perplexity, OpenAI, and Anthropic are reportedly bypassing robots.txt files, which tell web crawlers which pages they should not access. Despite previously claiming to respect the protocol, these companies have continued to scrape website content to train their models.
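For context, robots.txt is a plain-text file served at a site's root that lists which crawlers (identified by user-agent) may fetch which paths; honoring it is entirely voluntary. As a minimal sketch, the Python snippet below uses the standard library's urllib.robotparser to evaluate a hypothetical rule set; the crawler names, paths, and URLs are illustrative, not taken from any specific publisher's file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: blocks a hypothetical AI crawler site-wide,
# and blocks all other crawlers only from /private/.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# A compliant crawler calls can_fetch() before requesting a page;
# nothing in the protocol itself prevents a crawler from ignoring the answer.
print(parser.can_fetch("GPTBot", "https://example.com/articles/story"))        # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/articles/story"))  # True
print(parser.can_fetch("SomeOtherBot", "https://example.com/private/page"))    # False
```

The key point the example illustrates is that robots.txt is purely advisory: compliance depends on the crawler choosing to check and respect these rules.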

The practice came to light in a letter from TollBit, a startup that brokers content-licensing deals between publishers and AI firms, which warned that various AI agents are ignoring the robots.txt protocol. Wired separately reported that Perplexity's tools produced results closely paraphrased from its articles and sometimes generated inaccurate summaries, calling the ethical practices of these AI companies into question.

Perplexity CEO Aravind Srinivas defended the company, stating that the Robots Exclusion Protocol is not a legal framework. He suggested that a new relationship needs to be established between publishers and AI companies, though he admitted the company's use of third-party web crawlers may have contributed to the issues.