Amazon Web Services investigates whether Perplexity uses 'web scraping' to train its AI

Amazon Web Services (AWS) has announced that it has begun an investigation into the operation of Perplexity – which uses its servers – to determine whether this company uses the ‘web scraping’ technique to train its Artificial Intelligence (AI) models.

Also known as data scraping, it is a process by which content is collected from web pages using ‘software’ that extracts the HTML code of these sites, filters the information and stores it, and which is often compared to an automated copy and paste.
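As a rough illustration of the extract-and-filter step described above, the following minimal sketch uses only Python's standard library to pull the visible text out of an HTML string. The sample page, the `TextExtractor` class and the `extract_text` helper are hypothetical names chosen for this example; it does not represent how Perplexity or any other company actually collects data.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of script and style tags."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []      # extracted text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> list[str]:
    """Filter an HTML document down to its non-empty text fragments."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical page used only for this example.
SAMPLE_HTML = """<html><head><title>Example</title><style>p {color: red;}</style></head>
<body><h1>Headline</h1><p>First paragraph.</p><script>var x = 1;</script></body></html>"""

print(extract_text(SAMPLE_HTML))
```

A real scraper would first download the page over HTTP and then store the extracted fragments; both steps are left out here to keep the sketch self-contained.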

Developer Robb Knight and Wired have recently discovered that AI search startup Perplexity has violated the so-called Robots Exclusion Protocol for certain websites and used this technique to train its AI models.

This protocol is a web standard that consists of placing a plain text file (robots.txt) on a domain to indicate which pages robots and automated crawlers should not access, as explained by Wired.
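To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before requesting a page, using Python's standard `urllib.robotparser` module. The rules, the `ExampleBot` user agent and the example.com URLs are assumptions made up for this illustration. Compliance is voluntary: the file only states the site's wishes, and nothing in it technically blocks a crawler that chooses to ignore it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named bot entirely and keeps
# every other bot out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


rules = build_rules(ROBOTS_TXT)

# A compliant crawler checks each URL before fetching it.
for agent, url in [
    ("ExampleBot", "https://example.com/articles/1"),
    ("OtherBot", "https://example.com/articles/1"),
    ("OtherBot", "https://example.com/private/notes"),
]:
    print(agent, url, rules.can_fetch(agent, url))
```

In this sketch only the second request is allowed: `ExampleBot` is excluded from the whole site, and `OtherBot` falls under the wildcard rule that closes off `/private/`.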

Following these allegations, Amazon Web Services has launched an investigation to determine whether Perplexity, which uses AWS to train its AI, is violating the rules and running ‘web scraping’ on websites that tried to prevent it.

This was confirmed to Wired by an AWS spokesperson, who noted that its terms prohibit its customers from using its services for any illegal activity and that they are responsible for complying with its conditions “and all applicable laws.”

The ‘startup’, for its part, has said that Perplexity “respects robots.txt” and that the services it controls “do not crawl in any way that violates AWS’s terms of service,” in the words of spokesperson Sara Platnick.

However, the company explained that its bot will ignore the robots.txt file when a user enters a URL in their query, a “rare” use case. “When a user enters a specific URL, it does not trigger a crawling behavior” but rather “the agent acts on behalf of the user to retrieve the URL. It works the same as if the user were to go to a page, copy the text of the article and then paste it into the system,” it said.

In this regard, Wired has stressed that the spokesperson’s description confirms the findings of its own investigation: the ‘chatbot’ ignores robots.txt in certain cases to collect information without authorization.

By Editor
