Teaching AI to hack may be the only way to secure the internet


CO-EDP, VisionRI | Updated: 06-02-2026 09:46 IST | Created: 06-02-2026 09:46 IST

Defensive cybersecurity models built on scarce human expertise are breaking down as AI agents take over tasks once constrained by skill and labor. A new research paper finds that the attacker–defender balance has already shifted and warns that cautious, safety-first AI policies may be increasing exposure rather than reducing it. The authors argue that anticipating AI-enabled attacks now requires a deliberate and controversial step: allowing AI systems to conduct controlled hacking in order to strengthen defenses.

The preprint, "To Defend Against Cyber Attacks, We Must Teach AI Agents to Hack," released on arXiv, challenges the prevailing defensive mindset in both the cybersecurity and AI safety communities. It warns that refusing to develop offensive AI capabilities will not prevent misuse but will instead guarantee a strategic disadvantage, as attackers exploit the same tools without restraint.

AI agents upend the economics of cyber attacks

For more than a decade, cybersecurity strategy has been shaped by an implicit economic assumption: sophisticated cyber attacks are costly and therefore limited in scale. Building tailored exploits required deep expertise, manual effort, and time, forcing attackers to concentrate on high-value targets or rely on blunt, automated techniques against large populations. Defenders structured their systems around this constraint, assuming most assets would remain unattractive due to limited attacker resources.

AI agents eliminate this constraint. By automating vulnerability discovery, exploit construction, and post-exploitation analysis, AI agents dramatically reduce the marginal cost of attacks. Success no longer depends on flawless execution. Instead, attackers can tolerate high failure rates as long as a small fraction of attempts succeed across thousands of targets. This shift makes it economically viable to attack the long tail of systems that were previously ignored due to low individual value or high manual cost.
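To make the economic shift concrete, here is a minimal back-of-the-envelope sketch in Python. All the figures (target counts, success rates, values, and per-attempt costs) are invented for illustration only and do not come from the paper; the point is simply how a near-zero marginal cost per attempt makes the long tail of low-value targets profitable to attack.

# Hypothetical model of attack economics. The numbers are made up; they only
# illustrate how a lower marginal cost per attempt changes which targets
# become "worth" attacking at scale.

def expected_profit(targets, success_rate, value_per_success, cost_per_attempt):
    """Expected attacker profit from spraying `targets` attempts."""
    return targets * (success_rate * value_per_success - cost_per_attempt)

# Human-driven attacks: expensive per attempt, so only high-value targets pay off.
print(expected_profit(targets=10, success_rate=0.5,
                      value_per_success=50_000, cost_per_attempt=20_000))   # 50,000

# The low-value "long tail" under the same manual cost: a guaranteed loss.
print(expected_profit(targets=10_000, success_rate=0.02,
                      value_per_success=500, cost_per_attempt=20_000))      # heavily negative

# The same long tail when an AI agent drives the marginal cost toward zero:
# a 2% success rate across 10,000 targets suddenly becomes profitable.
print(expected_profit(targets=10_000, success_rate=0.02,
                      value_per_success=500, cost_per_attempt=1))           # 90,000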

Unlike traditional automated tools that rely on known signatures or predefined rules, AI agents exhibit adaptive behavior. They can reason about unfamiliar codebases, infer system structure, and adjust strategies in response to partial success or resistance. The authors emphasize that this capability mirrors human strategic agency but operates continuously and at machine speed. As a result, AI agents can scale both the breadth and depth of attacks, combining mass automation with victim-specific customization.

The study further warns that current defensive mechanisms are poorly suited to this threat model. Static signatures, rule-based detection, and log monitoring are optimized for predictable attacks. AI agents, by contrast, can vary their methods, operate within legitimate system functionality, and blend malicious actions into normal workflows. The convergence of adaptability and scale, the authors argue, creates the conditions for what they describe as superhuman cyber attacks.
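The gap between signature matching and behavior-level analysis can be pictured with a toy Python sketch. The "signatures," event names, and "normal" workflow transitions below are placeholders invented for this example, not real detection logic from the paper or from any product.

# Toy illustration of why exact-match, signature-style detection struggles
# against adaptive behavior, while sequence-level checks can still flag
# unusual combinations of otherwise-legitimate actions.

KNOWN_BAD_SIGNATURES = {"tool_v1_default_user_agent", "exploit_kit_a_payload_hash"}

def signature_detector(event):
    """Flags an event only if it exactly matches a known indicator."""
    return event in KNOWN_BAD_SIGNATURES

def behavioural_detector(event_sequence, allowed_transitions):
    """Flags a sequence whose step-to-step transitions fall outside a model
    of normal workflows, regardless of what each individual step is called."""
    return any((a, b) not in allowed_transitions
               for a, b in zip(event_sequence, event_sequence[1:]))

# An agent that renames its tooling and varies payloads produces no signature hits...
events = ["tool_v2_random_user_agent", "custom_payload_hash_1234"]
print(any(signature_detector(e) for e in events))                        # False

# ...but an unusual sequence of legitimate-looking actions can still stand out.
normal = {("login", "read_mail"), ("login", "open_ticket")}
print(behavioural_detector(["login", "export_all_mailboxes"], normal))   # True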

Why current AI safety defenses fall short

The paper explains why widely promoted AI safety measures are insufficient against determined adversaries. Data governance practices such as filtering training corpora and removing explicit exploit examples are designed to prevent models from memorizing harmful content. The authors argue that this approach misunderstands the source of risk. Modern AI systems generate attacks not by recalling exploit code, but by reasoning from first principles about software behavior.

Safety alignment techniques, including fine-tuning and reinforcement learning to discourage harmful outputs, are also criticized as fragile. The authors note that alignment can be bypassed through jailbreaking, objective distortion, or simple fine-tuning once attackers control a model. Moreover, harmful behavior may emerge across long action sequences even when individual steps appear benign, making single-turn safeguards ineffective in agentic workflows.

Output guardrails and access controls face similar limitations. Guardrails assume malicious intent will surface clearly in prompts or responses, an assumption that breaks down when AI agents distribute harmful actions across multiple steps and tools. Access restrictions, meanwhile, lose effectiveness as open-weight models proliferate and computing resources become cheaper. Once a model is locally deployed, defenders lose visibility and control.

Taken together, these failures reveal a deeper problem, according to the authors. Most AI safety strategies focus on restricting model behavior rather than understanding how adversaries will use AI systems in real-world attack scenarios. This model-centric view, they argue, leaves defenders reactive and blind to emerging exploitation patterns.

Offensive AI as essential defensive infrastructure

Cybersecurity, the study argues, must return to a principle long recognized in traditional security practice: effective defense requires understanding offense. Penetration testing and red teaming have historically allowed defenders to exploit their own systems to identify weaknesses before attackers do. The authors argue that AI agents require a similar shift, but at machine scale.

Rather than avoiding offensive capability, the paper proposes developing AI agents capable of hacking within strictly controlled environments. These agents would be used to model attacker behavior, discover vulnerabilities, and simulate realistic attack trajectories across diverse systems. The goal is not deployment in the wild, but predictive intelligence that allows defenders to anticipate how attacks will unfold at scale.

To make this approach viable and responsible, the authors outline three major actions. First, they call for comprehensive benchmarks that cover the full cyber attack lifecycle, from reconnaissance and exploitation to post-exploitation and remediation. Existing benchmarks, they argue, are fragmented and fail to capture the dynamic, multi-step nature of real attacks.
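As a rough illustration of what a lifecycle-spanning benchmark entry could look like, here is a hypothetical Python schema. The phase names follow the lifecycle stages listed above, but the fields, identifiers, and environment reference are assumptions made for this sketch, not a format specified in the paper.

# Hypothetical benchmark task schema covering the full attack lifecycle.
from dataclasses import dataclass

PHASES = ["reconnaissance", "exploitation", "post_exploitation", "remediation"]

@dataclass
class BenchmarkTask:
    task_id: str
    environment: str        # e.g. a reference to a containerised cyber range
    phases_covered: list    # which lifecycle phases the task exercises
    success_criteria: dict  # machine-checkable goals per phase
    max_steps: int = 200    # bound on agent interactions per episode

    def is_full_lifecycle(self) -> bool:
        return set(PHASES) <= set(self.phases_covered)

task = BenchmarkTask(
    task_id="range-web-001",
    environment="audited-range://web-stack-v3",   # hypothetical identifier
    phases_covered=PHASES,
    success_criteria={"remediation": "patched service passes regression suite"},
)
print(task.is_full_lifecycle())   # True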

Second, the paper urges a transition from workflow-based tools to trained, adaptive agents. Current systems rely heavily on fixed pipelines and human-designed scaffolding, limiting their ability to generalize. Trained agents operating in realistic cyber environments could learn strategies directly from interaction, improving their capacity to discover in-the-wild vulnerabilities.
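A minimal sketch of the kind of interactive training environment this transition implies is shown below, in Python with made-up observations, actions, and rewards. It is not the paper's environment design; it only illustrates learning from interaction inside a closed range rather than following a fixed, human-scripted pipeline.

# Toy episodic environment: the agent probes a simulated system inside a
# closed range and is rewarded for *reporting* a confirmed weakness to the
# defender, not for exploiting it.
import random

class CyberRangeEnv:
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.actions = ["scan", "enumerate", "probe_service", "report_finding"]

    def reset(self):
        self.steps = 0
        self.found_weakness = False
        return "initial_observation"

    def step(self, action):
        self.steps += 1
        if action == "probe_service" and self.rng.random() < 0.3:
            self.found_weakness = True
        reward = 1.0 if (action == "report_finding" and self.found_weakness) else 0.0
        done = reward > 0 or self.steps >= 50
        return f"observation_after_{action}", reward, done

# A random policy interacting with the environment; a trained agent would
# replace this loop with learned strategy selection.
env = CyberRangeEnv()
obs, done = env.reset(), False
while not done:
    obs, reward, done = env.step(random.choice(env.actions))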

Third, the authors propose a governance framework that restricts offensive AI agents to audited cyber ranges. Offensive capabilities would be versioned, evaluated, and gated based on measured competence. Findings from offensive agents would be distilled into defensive-only systems that focus on detection, analysis, and remediation, ensuring that the most dangerous capabilities are never released.
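One way to picture the proposed gating is as a simple policy check over versioned releases. The thresholds, field names, and decision labels below are illustrative assumptions rather than the paper's governance specification.

# Hypothetical gating check: offensive capability is versioned, evaluated, and
# only cleared for use inside audited ranges when competence and containment
# criteria are both met. Findings feed defensive-only tooling; the offensive
# capability itself is never distributed.
from dataclasses import dataclass

@dataclass
class AgentRelease:
    version: str
    benchmark_score: float   # measured competence on lifecycle benchmarks
    range_only: bool         # technically restricted to the audited cyber range
    audit_passed: bool

def gate(release: AgentRelease, min_score: float = 0.7) -> str:
    """Decide what an offensive agent version may be used for."""
    if not (release.range_only and release.audit_passed):
        return "blocked"
    if release.benchmark_score < min_score:
        return "evaluation-only"
    return "approved-for-range-use"

print(gate(AgentRelease("0.3.1", benchmark_score=0.82,
                        range_only=True, audit_passed=True)))   # approved-for-range-use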

The paper acknowledges ethical concerns around developing offensive AI, including the risk of leakage and misuse. However, the authors argue that refusing to study these capabilities does not eliminate the threat. Instead, it leaves defenders unprepared for adversaries who will pursue the same capabilities independently. In their view, the choice is not whether offensive AI will exist, but whether it will be mastered responsibly or encountered only after damage occurs.

First published in: Devdiscourse
