AI hacking AI. Why can companies lose this arms race?

Until two years ago, ‘hacking’ artificial intelligence was mainly associated with internet trivia. Users would outdo each other in coming up with funny commands (“Play the role of the evil twin…”) to force a chatbot to swear or express a controversial opinion. Today, from a business security perspective, those days seem like prehistory.

We are entering an era where the role of hackers is being taken over by… other AI systems. We are no longer dealing with humans manually typing in trick questions, but with automated attack engineering, where modern reasoning models (reasoning models) deliberately and logically undermine the protection mechanisms of company systems. For IT integrators and security officers, this is a wake-up call: classic protection methods such as firewalls or static rules are becoming useless against intelligent, multi-step manipulation.

A technological arms race is beginning, in which the attacker thinks faster than the defender.

A new quality of threat: Automated Jailbreaking

For a long time, attacks on language models (LLMs) relied on simple social engineering tricks called jailbreaking. However, what used to require human creativity is now automated.

The biggest challenge is modern models capable of so-called reasoning. These systems do not just generate text, but can plan intermediate steps, draw conclusions and correct their actions in real time. If the first attempt to bypass security fails, the attacking AI model analyses the denial, changes its strategy and tries another path – until it succeeds.

In practice, this means that attackers can use their own AI models to carry out thousands of iterations of ‘conversations’ with the victim’s system in a matter of minutes. The aim is to find a gap in the security policy to extract data or inject malicious code (prompt injection). What previously required manual expertise becomes available as a ready-made, automated attack tool.

Knock-on effects in autonomous systems

The problem becomes critical when AI stops being just a chatbot and becomes part of business processes. Companies are increasingly integrating models with customer databases, API systems or workflow engines.

In such an environment, a successful jailbreak is not just an ‘ugly response’ from the model. It is a real risk of triggering a domino effect. Imagine an autonomous AI agent with permissions to edit records in a CRM system. If successfully manipulated by a multi-stage attack, it could not only expose sensitive data, but also perform unauthorised business operations.

The threat, moreover, is not just coming from outside. The growing network of AI systems increases the risk of internal abuse. Employees – intentionally or by mistake – can use reasoning models to bypass corporate locks to ‘make their job easier’, unknowingly exposing the organisation to data leakage. The more autonomous the model and the broader its powers, the greater the potential for harm in the event of a breach.

Why does classic Pentesting not work?

For the IT industry, this is a turning point, requiring a change in mentality. Classic IT security measures are based on determinism: identical inputs always produce the same result. This makes traditional vulnerability scanners and penetration tests work effectively.

However, AI systems are inherently probabilistic – they can answer the same question in a hundred different ways. Traditional security tools are blind here. Therefore, standard network and application penetration testing must give way to a new discipline: AI Red-Teaming.

AI Red-Teaming is not checking for open ports, but analysing the ‘logic’ and behaviour of the model. It involves simulating targeted attacks (such as model theft, data extraction or the aforementioned prompt injection) to see how the system will behave under boundary conditions. Crucially, due to the unpredictability of AI, these tests cannot be a one-off ‘audit before deployment’. They need to become an ongoing process, with specialised systems constantly trying to ‘crack’ our defences to detect weaknesses before the criminals do.

Defence: Architecture instead of ‘patches’

As models are vulnerable to manipulation and attacks become more sophisticated, how do we defend ourselves? The answer lies in the *Secure-by-Design* approach.

We cannot trust the model itself to ‘be polite’. Security principles must be anchored in the architecture surrounding the AI, not in the algorithm itself. The key elements of such a strategy are:

1 External Guardrails: Security mechanisms must be outside the model. Independent filters should check both what goes into the model and what comes out of it. Even if the jailbreak succeeds and the model wants to reveal the credit card number, an external validator should block that response.

2 Minimising rights: AI systems should only be given access rights that are absolutely necessary to perform the task.

3 Contextual control: access to data must depend on the context – who is asking, for what purpose and whether they have the authority to do so.

A race that has no finish line

In the short term, the risk remains high. Attack methods, driven by increasingly better reasoning models and the availability of open-source tools, are maturing faster than defence standards. It is a classic arms race. On the one hand, we have advances in ‘constitutional AI’ and ever-better filters, on the other, increasingly creative autonomous attacks.

For technology companies and integrators, the conclusion is one: security AI is not a product that can be bought and installed. It is a process. It requires the construction of secure runtime environments, the implementation of continuous monitoring and, perhaps most importantly, humility towards a technology that can surprise even its creators. The future will belong to those who understand that when faced with an intelligent attack, the only effective defence is an equally intelligent security architecture.