ByteDance has been extracting data from the Internet for months with a ‘bot’ called Bytespider, and it does so faster than the ‘bots’ of other leading companies in the large language model (LLM) market.
Large language models need enormous amounts of training data, which can only be found on the Internet, where several ‘bots’ already operate to ‘scrape’, or extract, information from websites.
Firms such as Google, Meta, Amazon, OpenAI and Anthropic use their own ‘bots’, but they are not the only ones. ByteDance also has its own, called Bytespider, which appeared sometime in April, as Kasada and Dark Visitors, two firms that specialize in this type of automation, confirmed to Fortune.
What sets Bytespider apart is that it has become very aggressive in its data collection in a short time, as Kasada’s reports show. According to the firm’s CEO, Sam Crowther, it extracts data 25 times faster than GPTbot (OpenAI) and 300 times faster than ClaudeBot (Anthropic).
In addition, the ByteDance ‘bot’ does not respect robots.txt, the set of directives that media publishers can place on their website to tell bots not to extract data. GPTbot and ClaudeBot do not respect it either.
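As a rough illustration of how robots.txt works, the sketch below uses Python's standard `urllib.robotparser` to check whether a given bot may fetch a page. The robots.txt contents, the `example.com` URL and the helper name `is_allowed` are hypothetical, and the sketch assumes the bot identifies itself with the user-agent token `Bytespider`. The rules are only a request: a well-behaved crawler runs a check like this before fetching, and nothing technically stops a bot from skipping it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve (illustrative rules only).
ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/story"  # placeholder URL
    for bot in ("Bytespider", "GPTBot"):
        print(bot, "allowed:", is_allowed(bot, page))
```

With these example rules, the publisher asks Bytespider to stay away from the whole site while leaving it open to every other bot.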
Behind this massive data extraction appears to be ByteDance’s development of a new LLM, a source familiar with the matter told Fortune. According to another source, it would be used for TikTok’s search function.