AI Red Teamer course review banner

Hack The Box Academy: AI Red Teamer

Course Review

Overall Course Review

The AI Red Teamer Job Role Path is Hack The Box Academy's comprehensive training program for cybersecurity professionals looking to specialize in offensive security testing against AI systems. Developed in collaboration with Google and aligned with Google's Secure AI Framework (SAIF), this path bridges the gap between traditional red teaming and the emerging discipline of AI security.

The course spans 12 modules covering the full spectrum of AI attack surfaces: from foundational AI/ML concepts through prompt injection, model privacy attacks, adversarial evasion techniques, supply chain risks, and deployment-level vulnerabilities. It combines theoretical grounding with hands-on exercises, giving learners practical experience manipulating model behaviors and developing AI-specific red teaming strategies.

This path fundamentally changed how I view machine learning systems. The early modules build intuition around how AI works, while the later modules, particularly the evasion trilogy and data attacks, dive deep into the mathematics of adversarial machine learning. Expect to hit walls with the math. I found ChatGPT invaluable for decoding unfamiliar mathematical notation and building the intuition needed to really understand these attacks. The struggle is worth it, by the end you'll understand not just how to break AI systems, but why the attacks work at a fundamental level and how to defend against them.

What sets this path apart is its dual focus on offense and defense. Throughout the modules, attack techniques are paired with mitigations, and the final modules on AI Privacy and AI Defense ensure you can secure the systems you've learned to compromise. The integration of Google's SAIF and OWASP Top 10 frameworks early on provides mental scaffolding that pays dividends as you progress through increasingly technical content.

The path drives home that AI systems are vulnerable at every layer, model, data, application, and system, and that traditional web vulnerabilities like XSS, SQLi, and command injection compound with AI-specific risks in ways that are often overlooked. You'll develop mathematical intuition around norms, gradients, and optimization that's essential for understanding adversarial ML at a fundamental level. While jailbreaks and prompt injection are the entry point that most people think of, data poisoning and evasion attacks are where the deep exploitation happens. You'll also learn that defensive measures like adversarial training, DP-SGD, and guardrails come with real tradeoffs against model utility, there's no free lunch in AI security.

If you're serious about AI security, this course is the real deal.

***

Module Reviews

Module 1: Fundamentals of AI

Created by PandaSt0rm, this foundational module clocks in at around 8 hours across 24 sections of pure theory, no hands-on exercises, no labs, just a deep dive into the concepts that underpin AI systems. It serves as both an approachable introduction and a reference you'll come back to throughout the path. The content spans five major domains:

Supervised Learning: Linear regression, logistic regression, decision trees, naive bayes, and support vector machines
Unsupervised Learning: K-means clustering, principal component analysis (PCA), and anomaly detection
Reinforcement Learning: Q-learning and SARSA algorithms for decision-making agents
Deep Learning: Perceptrons, multi-layer perceptrons, convolutional neural networks (CNNs), and recurrent neural networks (RNNs)
Generative AI: Large language models and diffusion models

The module touches on mathematical foundations but doesn't make them the focus. That said, having some background in statistics, linear algebra, and calculus will make the content click faster. If you're coming in cold on the math, expect to do some supplementary reading.

Rated Medium difficulty at Tier 0, I'd say that's spot on. No multivariable calculus or actual coding here, but the topics are certainly more advanced than your typical intro material. Overall an extremely solid survey of AI and machine learning and a wonderful place to start the learning journey.

Module 2: Applications of AI in InfoSec

Created by PandaSt0rm with vautia, this 25-section module spans roughly 8 hours and is where theory meets practice. You'll build a complete AI development environment from scratch using Miniconda for package management and JupyterLab for interactive experimentation, then work through the entire model development lifecycle from raw data to trained classifiers.

The module covers:

Environment Setup: Miniconda, JupyterLab, and dependency management
Python Libraries: Scikit-learn and PyTorch for model training and evaluation
Datasets: Structure, loading, inspection, and identifying potential issues
Data Preprocessing: Cleaning, imputing missing values, encoding categorical features, handling skewed distributions
Data Transformation: One-hot encoding, data splitting, preparing data for modeling
Spam Classification: Naive Bayes with text-to-numerical feature extraction
Network Anomaly Detection: Random forests on the NSL-KDD dataset
Malware Classification: Converting malware to images and classifying with ResNet50

The module recommends setting up your own environment rather than using the provided VM for better training performance. A machine with at least 4GB RAM and a reasonably modern multi-core CPU will serve you well here. GPU is optional but helpful.

Rated Medium and costed at Tier 0, I'd say that's spot on. The module guides you through each step in detail without leaving you stranded. What I thoroughly enjoyed was how applicable everything felt, by the end I had working proof-of-concept models for real-world cybersecurity use cases: spam detection, network anomaly detection, and malware classification. That's not just theory, that's something you can actually build on.

Module 3: Introduction to Red Teaming AI

Created by vautia, this 11-section module runs about 4 hours and shifts gears from building AI systems to breaking them. It's the bridge between the foundational modules and the attack-focused content ahead, giving you the threat landscape overview before diving into specific exploitation techniques.

The module covers three core frameworks:

ML OWASP Top 10: Security vulnerabilities specific to machine learning systems
LLM OWASP Top 10: Vulnerabilities in large language model deployments
Google's Secure AI Framework (SAIF): Industry-standard approach to AI security

From there, it breaks down attack surfaces by component:

Model Components: Manipulating the model itself
Data Components: Attacking the data pipeline
Application Components: Exploiting the application layer
System Components: Targeting the underlying infrastructure

Prerequisites are the previous two modules (Fundamentals of AI and Applications of AI in InfoSec), which makes sense - you need to understand how these systems work before you can break them.

Rated Medium and costed at Tier I, this is an excellent introduction to secure AI and AI red teaming. What I particularly enjoyed was the early introduction of the Google SAIF and OWASP Top 10 frameworks, having these in the back of your mind as you dive deeper into the technical topics gives you a mental scaffold to hang everything on.

Module 4: Prompt Injection Attacks

Created by vautia, this 12-section module runs about 8 hours and marks your entry into Tier II offensive content. It's a comprehensive deep dive into one of the most prominent attack vectors against large language models: prompt injection.

The module covers four main areas:

Direct Prompt Injection: Manipulating LLM behavior through user-controlled input
Indirect Prompt Injection: Attacks via external data sources the LLM processes
Jailbreaking: Techniques to bypass safety guardrails and content filters
Mitigations: Both traditional and LLM-based defensive approaches

You'll start with prompt engineering fundamentals, move through reconnaissance techniques, then work through increasingly sophisticated injection and jailbreak methods. The "Tools of the Trade" section introduces automation and tooling for prompt injection testing.

Prerequisites build on everything prior: Fundamentals of AI, Applications of AI in InfoSec, and Introduction to Red Teaming AI.

Rated Medium and costed at Tier II, I thoroughly enjoyed this module as it introduces the kind of thinking required to perform adversarial AI testing. The different jailbreaking techniques provide a lot of real-world value, and honestly, when people think of AI pentesting, jailbreaks are what they tend to think of first. This module delivers on expectations.

Module 5: LLM Output Attacks

Created by vautia, this 14-section module runs about 8 hours and flips the script from input manipulation to output exploitation. Where prompt injection focuses on what goes into an LLM, this module examines what comes out and how that output can compromise systems.

The module covers traditional web vulnerabilities in an LLM context:

Cross-Site Scripting (XSS) in LLM applications
SQL Injection through LLM-generated queries
Command Injection via LLM outputs
Function Calling attacks against LLM tool use
Exfiltration attacks to leak prompt contents
Security issues from LLM hallucinations

It also covers abuse attacks, the weaponization side of LLMs: hate speech campaigns, misinformation generation, and the detection and mitigation of these attacks. The module wraps up with safeguard case studies and legislative regulation, grounding the technical content in real-world policy.

Prerequisites are notably heavier here, requiring both the AI path modules (Fundamentals, Applications, Intro to Red Teaming, Prompt Injection) plus traditional web security knowledge (XSS, SQL Injection Fundamentals, Command Injections).

Rated Medium and costed at Tier II, this module is highly relevant to real-world work. It introduces the layer of traditional security vulnerabilities and demonstrates how they compound with AI technologies. While not overly technically deep, it continues introducing learners to the attack surface of AI systems and how it's exploited. The combination of classic web vulns with LLM contexts is exactly what you'll encounter in production AI applications. As with other modules in the path, it pairs offensive techniques with their corresponding defenses, reinforcing that red teaming is ultimately about improving security posture, not just finding holes.

Module 6: AI Data Attacks

Created by PandaSt0rm, this 25-section module is estimated at 3 days and marks the first Hard-rated content in the path. Where previous modules focused on prompt-level and output-level attacks, this one goes deeper into the data pipeline itself, targeting the foundation that AI systems are built on.

The module covers several sophisticated attack categories:

Data Poisoning: Label flipping and targeted label attacks to corrupt training data
Feature Attacks: Clean label attacks that poison data without changing labels
Trojan Attacks: Embedding backdoors into AI models that activate on specific triggers
Tensor Steganography: Hiding malicious payloads within model artifacts
Model Artifact Exploitation: Attacking the saved model files themselves (pickle vulnerabilities)

The hands-on content is substantial, walking through baseline model creation, attack implementation, and evaluation for each technique. You'll work with logistic regression, CNN architectures, and custom attack tooling.

Prerequisites include the core AI path modules plus strong Python skills and Jupyter Notebook familiarity. HTB highly recommends using your own machine for the practicals rather than the provided VM, as the training workloads benefit from local compute.

Rated Hard and costed at Tier II, I ran into the math like it was a brick wall. Luckily I was able to use ChatGPT to decode the mathematical symbols I didn't know and began developing the mathematical intuition required to really understand and exploit AI systems. The Hard rating is truly justified, this will be the first real roadblock in the course pathway for many learners. If you've been coasting on the Medium modules, expect to slow down here.

Module 7: Attacking AI - Application and System

Created by vautia, this 14-section module runs about 8 hours and zooms out from model and data attacks to examine the application and system layers of AI deployments. After the math-heavy data attacks module, this returns to Medium difficulty while covering equally critical attack surfaces.

The module covers application and system component vulnerabilities:

Model Reverse Engineering: Extracting model architecture and parameters
Denial of ML Service: Resource exhaustion and availability attacks
Insecure Integrated Components: Third-party library and dependency risks
Rogue Actions: Unauthorized model behaviors
Excessive Data Handling & Insecure Storage: Data exposure through poor practices
Model Deployment Tampering: Attacking the deployment pipeline
Vulnerable Framework Code: Exploiting ML framework vulnerabilities

A significant portion focuses on the Model Context Protocol (MCP), the AI orchestration protocol introduced in 2024. You'll get a practical introduction to how MCP works, then dive into attacking vulnerable MCP servers and the risks of malicious MCP servers, with mitigations to round it out.

Prerequisites include the AI path modules plus SQL Injection Fundamentals, Command Injections, and Web Attacks.

Rated Medium and costed at Tier II, this module is an excellent blend of classic attack vectors (command injection, SQLi, info disclosure) with modern AI exploitation via MCP. It clearly shows how legacy flaws and AI risks combine, making this a highly valuable and insightful module. A welcome return to Medium and a little less math after the Hard data attacks module.

Module 8: AI Evasion - Foundations

Created by PandaSt0rm, this 12-section module runs about 8 hours and kicks off the three-module evasion series. It introduces inference-time evasion attacks, techniques that manipulate inputs to bypass classifiers or force targeted misclassifications at prediction time.

The module establishes the evasion threat model:

White-box vs Black-box: Understanding attacker knowledge levels and their implications
Transferability: How attacks crafted on one model can transfer to others
Feature-obfuscation: The GoodWords attack methodology

You'll build and attack a spam filter using the UCI SMS dataset, implementing GoodWords attacks in both white-box and black-box settings. The black-box content covers operating under query limits, including candidate vocabulary construction, adaptive selection strategies, and small-combination testing to minimize detection.

Prerequisites build on the full path so far including the AI Data Attacks module. Strong Python skills and Jupyter Notebook familiarity are mandatory, and HTB highly recommends using your own machine for the practicals.

Rated Medium and costed at Tier II, this is when the course starts getting really into the weeds, readying you for deeper mathematical attacks and adversarial machine learning in the modules ahead. A great module for building intuition around machine learning and how to attack the models themselves, not just the applications around them.

Module 9: AI Evasion - First-Order Attacks

Created by PandaSt0rm, this 23-section module is estimated at 2 days and returns to Hard difficulty. It takes gradient-based adversarial techniques from theory to implementation, exploiting the differentiable structure of neural networks to craft perturbations that force misclassifications.

The module covers the core first-order attack methods:

Foundations: Norm constraints (L_inf, L_2), local linearity assumptions, high-dimensional accumulation effects
FGSM (Fast Gradient Sign Method): Single gradient step adversarial examples under L_inf budget
Targeted FGSM: Directing misclassifications toward specific target classes
I-FGSM (Iterative FGSM): Multiple smaller steps with projection for improved success rates
DeepFool: Finding minimal perturbations through iterative linearization and geometric projection onto decision boundaries

Each attack includes full implementation, evaluation metrics, and visualization. The module has two skills assessments, reflecting the depth of content.

Prerequisites include all previous AI modules plus basic understanding of neural networks and gradient computation. Strong Python and Jupyter skills are mandatory, and using your own machine is highly recommended. I also highly recommend using ChatGPT to help you understand the mathematical concepts, it's a powerful tool and will help you greatly.

Rated Hard and costed at Tier II, this was another module where I ran into a brick wall with the mathematics. At first much of the content was hieroglyphics to me, however I was eventually able to understand it and it fundamentally began to change the way I view the world and systems. The mathematical intuition gained here is second to none in regards to how machine learning actually works and how it's exploitable. Worth the struggle.

Module 10: AI Evasion - Sparsity Attacks

Created by PandaSt0rm, this 28-section module is estimated at 3 days and is the most technically dense of the evasion trilogy. Where first-order attacks focus on minimizing perturbation magnitude, sparsity attacks flip the constraint: minimize how many features change, not how much they change.

The module covers the mathematical foundations and implementations:

L0 Budgets: The pseudo-norm counting modified coordinates
L1-Induced Sparsity: Using regularization to promote sparse solutions
Saliency-Based Feature Selection: Identifying the most impactful features to modify
ElasticNet Attack (EAD): Combining L1 and L2 regularization for sparse yet smooth perturbations
FISTA Optimization: Proximal gradient descent with momentum for solving non-smooth objectives
JSMA (Jacobian-based Saliency Map Attack): Explicit L0 budgets by modifying one or two features per step
Single-Pixel and Pairwise Variants: Balancing attack efficiency with modification counts

The hands-on content is extensive, covering environment setup, implementing FISTA components, loss gradients, binary search, attack execution, visualizations, and sparsity analysis for both ElasticNet and JSMA approaches.

Prerequisites include all previous AI modules through First-Order Attacks, plus understanding of neural networks, gradient computation, and optimization methods. Own machine highly recommended.

Rated Hard and costed at Tier II, this one is much like the last, very tough on the mathematics. But the intuition gained is transformative, especially around the different norms (L0, L1, L2, L_inf) and how they interact. It also gave insight into how adversarial algorithms are conceived and developed, not just how to use them. You'll leave understanding why these attacks work, not just that they work.

Module 11: AI Privacy

Created by PandaSt0rm, this 21-section module is estimated at 2 days and is notably the first Defensive-focused module in the path. It explores the privacy dimension of AI security, both the attacks that extract sensitive information from trained models and the defenses that protect against them.

The attack side covers Membership Inference Attacks (MIA):

Shadow Model Methodology: Training attack classifiers to detect membership based on prediction confidence patterns
Exploiting Overfitting: How models behave differently on training data vs unseen data, creating detectable fingerprints
Privacy Implications: If a model trained exclusively on cancer patients reveals someone as a training member, their medical status is exposed

The defense side covers differential privacy approaches:

Differential Privacy Fundamentals: The mathematical framework for quantifying privacy guarantees
DP-SGD (Differentially Private Stochastic Gradient Descent): Per-sample gradient clipping and calibrated noise injection to limit individual influence
PATE (Private Aggregation of Teacher Ensembles): Architectural separation with multiple teachers on disjoint data, noisy vote aggregation for student training
Privacy-Utility Tradeoffs: Understanding what you sacrifice in model performance to gain privacy guarantees

Prerequisites include all previous AI modules, plus solid PyTorch familiarity and understanding of neural network training and optimization. Own machine is highly recommended over Pwnbox.

Rated Medium and costed at Tier II, I found this module informative in showing how membership inference attacks work and how models can be secured through training methods. Honestly, I thought it could have been rated Hard, the math and algorithms pushed the limits of what I'd consider Medium territory. What I found particularly valuable was seeing the ML training side in more depth, understanding how different training approaches (standard vs DP-SGD vs PATE) produce models with fundamentally different security properties, and the concrete tradeoffs between privacy guarantees and model utility.

Module 12: AI Defense

Created by vautia with PandaSt0rm, this 21-section capstone module is estimated at 2 days and brings everything full circle. After spending the path learning to attack AI systems, you now learn to defend them, understanding both sides of the adversarial equation.

The module covers three main defensive approaches:

LLM Guardrails (application-layer defenses):

Character-based Validation: Input filtering and sanitization
Traditional Content-based Validation: Pattern matching and blocklists
AI-based Guardrails: Using models to detect malicious inputs
Guardrail Libraries and Services: Production-ready defensive tooling

Model-level Defenses:

Adversarial Training: Incorporating adversarial examples during training to build robust models
Adversarial Tuning: Fine-tuning existing models for safety through supervised learning on jailbreak/refusal pairs
Understanding and mitigating jailbreak attacks

The module also covers advanced prompt injection tactics like priming attacks and how to evade guardrails, giving you both the offensive and defensive perspective.

Prerequisites include the core AI modules plus the evasion modules. Note that running adversarial training and tuning code requires powerful hardware, but it's optional, you can complete the module without running it yourself.

Rated Medium and costed at Tier II, this was a wonderful capstone to the course. It covers advanced techniques while tying everything together and pushing the path over the line from educational to practically useful. This is the module that enables effective AI red teaming against real systems with safeguards in place. What I particularly appreciated was how it equips you not just to attack AI systems and find vulnerabilities, but to secure them and defend against the very attacks you've learned. That dual perspective is what separates a red teamer from just a hacker.

Final Verdict

The AI Red Teamer path delivers on its promise. At an estimated 19 days of content, it took me about 2 months to complete while balancing other commitments. That time investment was worth every hour. The path takes you from zero AI security knowledge to being capable of performing meaningful offensive assessments against AI systems, while also equipping you to recommend and implement defenses. The collaboration with Google shows in the quality and real-world relevance of the content.

This is not an easy path. The Medium-rated modules are genuinely Medium, and the Hard-rated modules (AI Data Attacks, First-Order Attacks, Sparsity Attacks) will challenge anyone without a strong mathematical background. But that challenge is precisely what makes the learning valuable. You'll emerge with skills that are genuinely rare in the security industry.

Ready to become an AI Red Teamer? [Check out the course on HTB Academy](https://academy.hackthebox.com/path/preview/ai-red-teamer)