Half of Top News Sites Blocked OpenAI’s Crawlers in 2023

At the end of 2023, nearly one-half (48%) of the top news websites, based on reach, across 10 countries blocked OpenAI’s crawlers, while nearly one-quarter (24%) blocked Google’s AI crawler, according to a study by the Reuters Institute.

The Reuters Institute analyzed the robots.txt files of the 15 online news sources with the widest reach, including titles like The New York Times, BuzzFeed News, The Wall Street Journal, The Washington Post, CNN and NPR, across countries including Germany, India, Spain, the U.K. and the U.S.
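
For readers curious how a robots.txt check of this kind might work, here is a minimal sketch using Python's standard-library parser. It is not the Reuters Institute's method, which the article does not describe. The sample file, the `blocked_agents` helper and the example.com URL are hypothetical, and the `Google-Extended` token for Google's AI crawler is an assumption that does not come from the study.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that turns away two AI crawlers but allows everyone else.
# "GPTBot" is OpenAI's crawler; "Google-Extended" is assumed here to be the token
# for Google's AI crawler.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return {agent: True/False}, where True means the agent may not fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in agents}


result = blocked_agents(SAMPLE_ROBOTS_TXT, ["GPTBot", "Google-Extended", "Googlebot"])
print(result)
```

In this sample, the two AI crawlers are reported as blocked while the search crawler Googlebot is not. A check like this reads only the site's homepage rule; a publisher could block some paths and allow others, so a fuller survey would need to decide which URLs to test.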

In the absence of clear regulatory frameworks governing generative artificial intelligence’s use of copyrighted material, many large publishers have taken matters into their own hands, taking AI firms to court, updating terms of service, blocking crawlers or making deals to protect premium content, data and revenues.

The study grouped outlets into three categories: legacy print publications, television and radio broadcasters and digital-born outlets.

Over one-half (57%) of the websites of legacy print publications, such as The New York Times, blocked OpenAI’s crawlers by the end of 2023, compared with 48% of television and radio broadcasters and 31% of digital-born outlets.

Similarly, 32% of print outlets blocked Google’s AI crawler, while 19% of broadcasters and 17% of digital-born outlets did the same.

“The Reuters study highlights a fundamental challenge for generative AI: its dependence on authentic content generated by real people who see it as a threat to their livelihoods,” said Gartner VP distinguished analyst Andrew Frank.

Meanwhile, a recent study by Cornell University found that when new AI models are trained on data derived from prior models rather than human input, they tend to undergo “model collapse,” or degenerate, leading to increased errors and misinformation in the generated output.

“This suggests that large language model developers need to find ways to compensate people who create or report true content, not just for the sake of society, but also for their own commercial interests,” said Frank.

Website crawlers are deployed for many reasons. Crawlers like Google’s Googlebot index publisher websites for the tech giant’s search results. Meanwhile, OpenAI’s crawler, GPTBot, collects data across the internet to train the large language models behind products such as ChatGPT. That data helps AI tools generate accurate, up-to-date output, and news publishers are uniquely positioned to provide it: LLMs overweight premium publishers’ content by a factor of between 5 and 100. AI-powered tools are also emerging as alternatives to traditional search engines.
