A comprehensive analysis of websites archived by the Internet Archive reveals that AI-generated text has become pervasive online, making web content more uniform and artificially positive.
By mid-2025, roughly 35% of newly published websites were partially or entirely AI-generated, according to researchers from Imperial College London, Stanford University, and the Internet Archive. Before the launch of ChatGPT in late 2022, that figure was essentially zero.
The team analyzed a representative sample of English-language websites from the Wayback Machine, spanning 33 months from August 2022 to May 2025. To detect AI text, they used the Pangram v3 detector, which performed best in their robustness tests.
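Sampling archived snapshots like this can be done through the Wayback Machine's public CDX index API. The study does not describe its exact sampling pipeline; the sketch below only illustrates how such a query might be assembled (the URL pattern, date bounds, and limit are hypothetical examples):

```python
from urllib.parse import urlencode

# Base endpoint of the Wayback Machine's public CDX index API.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_cdx_query(url_pattern: str, start: str, end: str, limit: int = 100) -> str:
    """Build a CDX query URL listing snapshots of `url_pattern`
    captured between `start` and `end` (YYYYMMDD timestamps)."""
    params = {
        "url": url_pattern,          # e.g. "example.com/*"
        "from": start,               # earliest capture date
        "to": end,                   # latest capture date
        "output": "json",            # JSON rows instead of plain text
        "filter": "statuscode:200",  # keep only successful captures
        "limit": limit,
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

# Query covering the study's window, August 2022 through May 2025:
query = build_cdx_query("example.com/*", "20220801", "20250531")
print(query)
```

Fetching that URL returns one row per archived capture, from which page text can be downloaded and run through a detector.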
Among six common hypotheses about AI's effect on the web, only two were statistically supported: semantic contraction and a positivity shift. AI-generated texts were 33% more semantically similar to each other than human-written content, suggesting that language models gravitate toward the average of their training data, potentially narrowing the range of online discourse. Additionally, AI texts scored 107% higher on positive sentiment than human-written content, attributed to language models' tendency toward sycophancy and overoptimism. The researchers warn that a flood of sanitized, relentlessly cheerful prose could sideline human dissent.
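The semantic-contraction finding rests on measuring how similar a set of texts are to one another. The paper's actual method is not detailed here; the toy sketch below conveys the idea using bag-of-words cosine similarity (a real analysis would use sentence embeddings, and the sample texts are invented):

```python
import math
from collections import Counter
from itertools import combinations

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def mean_pairwise_similarity(texts: list[str]) -> float:
    """Average cosine similarity over all unordered pairs of texts.
    Higher values mean the corpus is semantically more uniform."""
    vecs = [Counter(t.lower().split()) for t in texts]
    pairs = list(combinations(vecs, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)

# Invented examples: varied human prose vs. formulaic AI-style openers.
human = ["the storm wrecked the pier",
         "quarterly profits dipped sharply",
         "a recipe for sour rye bread"]
ai = ["this article explores key insights",
      "this post explores key takeaways",
      "this guide explores key ideas"]

print(mean_pairwise_similarity(human), mean_pairwise_similarity(ai))
```

A corpus of formulaic texts scores markedly higher on this metric than a varied one, which is the kind of contraction the 33% figure quantifies.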
“Rather than forcing models to be perfectly compliant and agreeable, allowing them to have a more distinct personality or 'friction' might help them act as a creative partner rather than a replacement for human voice,” said co-author Jonas Dolezal, an AI researcher at Stanford.
However, the study found no evidence to support four other hypotheses: a disappearance of individual writing styles, a decline in external links, a drop in information density, or an increase in factual errors. The test for that last hypothesis, dubbed "truth decay," was methodologically limited but showed no correlation between AI content and the share of refuted claims. The researchers note that this does not rule out a rise in unverifiable claims, which are harder to detect.
The study also included a survey of 853 U.S. adults, which found that public perception often diverges from the data. For instance, 83% believed that individual writing styles are vanishing, but the analysis did not confirm this. People who rarely use AI were more likely to believe in negative effects than regular users.
The researchers warn that the high prevalence of AI content raises the risk of “model collapse,” where AI models degrade by training on their own outputs. They recommend cryptographic provenance standards like C2PA and tweaks to search algorithms to promote semantic diversity.
Co-author Maty Bohacek noted that the team is working with the Internet Archive to turn the analysis into a continuous monitoring tool. The study has limitations: it only covered English-language texts, relied on a single AI detector, and drew data solely from the Internet Archive.