Amazon investigating Perplexity AI after accusations it scrapes websites without consent

Amazon Web Services has opened an investigation into whether Perplexity AI is violating its rules by operating a web crawler that ignores the Robots Exclusion Protocol, the robots.txt convention that websites use to tell crawlers which pages they may not access. Wired discovered a virtual machine, hosted on AWS and operated by Perplexity, that ignored robots.txt files and scraped content from Condé Nast properties, The Guardian, Forbes, and The New York Times over the past three months.
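For readers unfamiliar with the protocol, the following is a minimal sketch of the check a compliant crawler would be expected to run before fetching a page. It uses Python's standard `urllib.robotparser` module. The sample rules, the `ExampleBot` and `OtherBot` user-agent names, the `example.com` URLs, and the `may_fetch` helper are all hypothetical and chosen only for illustration; they do not describe Perplexity's crawler or any publisher's actual robots.txt file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: one named bot is blocked entirely,
# and every other crawler is blocked only from /private/.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the rules from memory; a real crawler would first download
    # the site's /robots.txt file.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleBot", "OtherBot"):
        for path in ("/article", "/private/report"):
            url = "https://example.com" + path
            print(agent, path, may_fetch(SAMPLE_ROBOTS_TXT, agent, url))
```

The protocol is voluntary: nothing in robots.txt technically prevents a fetch, so a crawler that skips this check can still download the page, which is why compliance is a matter of policy and terms of service rather than enforcement by the file itself.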

To verify this activity, Wired queried Perplexity's chatbot and found that it returned paraphrased versions of its content with minimal attribution. Responding to Wired's allegations, Perplexity spokesperson Sara Platnick and CEO Aravind Srinivas both denied breaching the protocol, yet acknowledged that the company uses third-party crawlers and that robots.txt is bypassed when a user includes a specific URL in a query.

Amazon emphasizes that its customers must adhere to robots.txt instructions and to the AWS Terms of Service, which prohibit illegal activity. While Perplexity maintains that it complies with these rules, its acknowledged use of third-party crawlers and its bypassing of robots.txt for user-supplied URLs leave significant questions unanswered. The case is part of broader scrutiny of how AI companies gather data to train large language models, and it suggests further regulatory challenges ahead.