Remember the internet a decade ago? It was like a global library connected to a vibrant marketplace – a place created by the people, for the people. Today, although on the surface it looks similar, its fundamental nature is undergoing a rapid transformation.
It increasingly resembles a giant, automated server farm, where machines carry on lively conversations mainly among themselves. Behind the scenes of the AI revolution we see every day, a quiet but powerful transformation is taking place: the internet is becoming the fuel and training ground for artificial intelligence.
This process generates an unimaginable amount of traffic, the true purpose and scale of which most of us are completely unaware.
The main driver of this change is a self-perpetuating mechanism in which AI bots index the web on a massive scale in order to train new, even more powerful models. This cycle, reminiscent of the mythological Ouroboros, a snake eating its own tail, is generating huge and often hidden costs, straining global infrastructure and raising fundamental questions about the future of digital information. It is time to look at who is really behind this wave of automation and what its non-obvious consequences are.
The new ruler of the internet – the AI bot
The numbers don’t lie. Recent data shows that almost 40% of all internet traffic is the work of bots. This is no longer a fringe phenomenon; it is a powerful force shaping the digital landscape. More importantly, the change is being driven by AI-powered bots, which already account for 80% of all bot activity.
These bots, not simple spam scripts, are the real rulers of the web today.
So who is pulling the strings? The answer may be surprising. While the obvious suspects seem to be Google or OpenAI, the creator of ChatGPT, another giant has come to the fore in the race for mass data acquisition.
It is Meta that generates more than half of the large-scale indexing traffic, ahead of Google and OpenAI combined. This information sheds light on the scale of the ‘data harvesting’ that major technology corporations are doing.
It is no longer just about improving a search engine or an AI assistant; it is a global sourcing operation that is powering an entire technological revolution.
The great ‘data harvest’: what is it all for?
The purpose of this mass activity is simple: data acquisition (scraping) to train large language models (LLMs). One can compare the internet to an inexhaustible mine of raw materials, and AI bots to fully automated mining machines.
They work 24/7, digging through billions of pages, articles, comments and code snippets. The more varied the raw material they collect, the more powerful, ‘intelligent’ and versatile the resulting AI model can be.
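To make the mining-machine metaphor concrete, here is a minimal, purely illustrative sketch of what one crawler worker does: fetch a page, strip the markup, keep the text for a corpus and note the links to visit next. The URL is a placeholder, and the use of the third-party requests and beautifulsoup4 libraries is an assumption for the example; real crawlers are distributed systems of vastly greater complexity.

```python
# Illustrative sketch of a single crawler step: fetch a page, extract its
# visible text, and collect the links it points to. The start URL and the
# choice of libraries are assumptions made for this example only.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl_once(url: str) -> tuple[str, list[str]]:
    """Return the page's plain text and the absolute links it contains."""
    response = requests.get(
        url, timeout=10, headers={"User-Agent": "example-research-bot"}
    )
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    text = soup.get_text(separator=" ", strip=True)  # raw material for a training corpus
    links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
    return text, links


if __name__ == "__main__":
    page_text, outgoing_links = crawl_once("https://example.com/")
    print(f"collected {len(page_text)} characters, found {len(outgoing_links)} links")
```

Multiply this loop by thousands of parallel workers and billions of URLs, and you arrive at the traffic figures described above.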
In this way, the paradox of ‘eating its own tail’ is being realised. Machines are creating web traffic to teach other machines, which in the near future will start generating vast amounts of content themselves. This content will in turn become fodder for the next generation of AI.
It is a closed loop in which the role of the human being as creator and recipient of information is becoming less and less central.
Unintended consequences, real costs
This revolution is not happening without cost. The first casualty is infrastructure. Even ‘legitimate’, non-malicious bot traffic can completely clog up servers when it arrives at massive scale. Its access pattern, thousands of requests fired off in a short period of time, produces effects almost identical to a DDoS (Distributed Denial of Service) attack, slowing services down or paralysing them entirely.
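To illustrate why that access pattern is the problem, the sketch below shows the kind of per-client token-bucket throttle operators place in front of their services: each client gets a sustained rate and a small burst allowance, so a crawler firing thousands of requests in a tight loop is rejected long before it can saturate the backend. The class, the parameters and the values are invented for this example, not taken from any specific product.

```python
# Illustrative token-bucket rate limiter: each client may burst up to
# 'capacity' requests, then is held to 'refill_rate' requests per second.
# All names and numbers here are assumptions chosen for the example.
import time
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    capacity: float = 20.0        # burst allowance per client
    refill_rate: float = 2.0      # sustained requests per second
    tokens: float = 20.0
    last_refill: float = field(default_factory=time.monotonic)

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False              # such a request would get an HTTP 429 response


buckets: dict[str, TokenBucket] = {}


def is_allowed(client_id: str) -> bool:
    """Look up (or create) the client's bucket and charge it one request."""
    return buckets.setdefault(client_id, TokenBucket()).allow()


# A crawler sending 1,000 requests in a tight loop only gets about 20 through:
print(sum(is_allowed("hungry-crawler") for _ in range(1000)))
```

Defences of this kind keep a site online, but they do nothing about the cost of the traffic that still gets through.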
From a business perspective, this means the emergence of a kind of ‘hidden AI tax’. Every company that maintains a website is unknowingly bearing the cost of this global training. It pays for extra bandwidth and more server processing power to handle the army of bots collecting data.
It is a form of forced subsidy for the technology’s development, paid by everyone with a presence on the web, often without their knowledge. There is also the thorny issue of intellectual property: mass scraping is, in effect, the automated copying of content on an unimaginable scale, and it is already fuelling numerous legal disputes around the world.
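To make the ‘hidden AI tax’ tangible, here is a deliberately rough back-of-the-envelope calculation. Every figure in it is an assumption chosen for illustration, not a measurement of any real site or provider.

```python
# Back-of-the-envelope estimate of crawler bandwidth cost.
# Every figure below is an assumption for illustration only.
bot_requests_per_month = 3_000_000   # assumed crawler hits on a mid-sized site
avg_response_kb = 120                # assumed average response size served to bots
egress_cost_per_gb_usd = 0.09        # assumed cloud egress price

transferred_gb = bot_requests_per_month * avg_response_kb / 1_000_000
monthly_cost_usd = transferred_gb * egress_cost_per_gb_usd

print(f"{transferred_gb:.0f} GB of bot traffic ≈ ${monthly_cost_usd:.2f}/month in bandwidth alone")
# -> 360 GB of bot traffic ≈ $32.40/month, before compute, caching or CDN fees
```

Small per-request numbers, multiplied by the scale of modern crawling, add up to a bill that nobody budgeted for.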
The echo effect and a future skewed by data
Perhaps the most serious long-term consequence, however, is informational and cultural in nature. It is telling that as much as 90% of all AI training traffic is concentrated in North America.
This means that language models that aspire to become global tools learn from data filtered through the prism of a single cultural, linguistic and economic background.
Thus, unintentionally, we are building a global artificial intelligence with a very strong ‘American accent’. Its understanding of the world, cultural subtleties and even value systems will inevitably be biased.
This raises a fundamental question about the neutrality of a technology that will shape our future. Will AI be able to understand the problems and contexts of people in Central Europe, Southeast Asia or Africa with the same precision as those of Silicon Valley?
We are creating a powerful tool that can perpetuate existing information inequalities, locking us into a global filter bubble.
The internet as we knew it is becoming a thing of the past. From a medium created by humans for humans, it is becoming primarily an infrastructure for the development of artificial intelligence. This change brings with it the promise of extraordinary progress, but also risks that we are only just beginning to understand.
The biggest challenge may come when AI-generated content dominates the web to such an extent that future generations of models learn mainly from it. Are we then facing digital ‘degeneration’, an era of information recycling and a loss of connection to real human experience?
Understanding this phenomenon today is a key task not only for engineers, but for anyone who wants to consciously navigate the digital world. We must learn to manage this new automated reality before it can fully manage us.