Introduction

To you and me, that sign says 35 mph. A self-driving car might read it as 85. Researchers have shown exactly that: tiny, barely visible changes to a speed limit sign can make a Tesla's vision system misclassify it. That's an evasion attack—small perturbations that push a machine learning model to the wrong answer while looking fine to humans.
ML models are everywhere now: facial recognition, fraud detection, malware scanners, autonomous systems. They're powerful and brittle. This post walks through how adversaries trick them. We'll cover four attack families:
- White box attacks—full access to the model's internals
- Black box attacks—only the final decision is visible
- Gray box attacks—probability scores or partial outputs
- Transfer-based attacks—craft on one model, deploy on another
Same goal in each case: find inputs the model misclassifies. The tactics differ by how much the attacker can see.
Background: How Models Learn (and Where They're Fragile)
A model is a function that maps inputs to outputs—trained on data to classify images, predict values, or the like. Classifiers put things in buckets (cat vs. dog); regressors predict numbers (e.g. house price). Training adjusts the model's parameters to minimize error. The result is a decision boundary: the surface that separates one predicted class from another.

Optimization techniques like gradient descent move the parameters toward a minimum of the loss. In high dimensions, though, decision boundaries aren't solid walls: inputs often sit surprisingly close to them, and a small, carefully chosen change can push an input across. Adversarial attacks target exactly those fragile regions.
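To ground that, here's what a single gradient-descent training step looks like in PyTorch. This is a minimal sketch with a toy linear classifier and made-up data; the model shape, learning rate, and batch are placeholders, not a recommendation.

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)                            # Toy classifier: 10 features -> 2 classes
optimizer = optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(32, 10)                        # A batch of made-up inputs
labels = torch.randint(0, 2, (32,))                 # ...and labels

optimizer.zero_grad()
loss = loss_fn(model(inputs), labels)               # How wrong is the model on this batch?
loss.backward()                                     # Gradient of the loss w.r.t. the parameters
optimizer.step()                                    # Nudge the parameters downhill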


Adversarial Examples
An adversarial example is an input that looks virtually unchanged to a human but makes the model output the wrong class. Small, crafted perturbations exploit the fragile decision boundaries we just talked about.

How you attack depends on what you can observe. Two dimensions matter: how much access you have to the model, and whether you want any wrong answer or a specific one.
Levels of Model Access
Attack strategies are primarily defined by how much an attacker knows about the model:
- White-Box Attacks: The attacker has full access to the model's architecture, weights, and gradients.
- Gray-Box Attacks: The attacker has partial access, often seeing output probabilities but not internal mechanics.
- Black-Box Attacks: The attacker only knows the final decision (e.g., "yes" or "no") without insight into how it was reached.
Targeted vs. Non-Targeted Attacks
Beyond the level of access, the intent of the attack further differentiates these strategies (a short sketch of the two loss variants follows the list):
- Targeted Attacks: The attacker wants a specific incorrect output (e.g., making a "stop" sign appear as a "speed limit" sign).
- Non-Targeted Attacks: The goal is simply to cause an incorrect classification, regardless of the specific output.
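The distinction shows up directly in the loss the attacker optimizes: push the loss up for the true label (non-targeted), or push it down for a chosen label (targeted). A minimal sketch, assuming a PyTorch classifier that outputs logits and an FGSM-style single step; the function name and arguments are illustrative.

import torch
import torch.nn.functional as F

def adversarial_step(model, image, epsilon, true_label=None, target_label=None):
    """Single FGSM-style step: non-targeted (away from the true class) or targeted (toward a chosen class)."""
    image = image.clone().detach().requires_grad_(True)
    output = model(image)
    if target_label is not None:
        loss = -F.cross_entropy(output, target_label)  # Targeted: move toward the target class
    else:
        loss = F.cross_entropy(output, true_label)      # Non-targeted: move away from the true class
    loss.backward()
    return torch.clamp(image + epsilon * image.grad.sign(), 0, 1).detach()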
Detailed Exploration of Evasion Attack Techniques
White-Box Attacks
With full access to the model (architecture, weights, gradients), an attacker can craft very effective adversarial examples. The 2015 Goodfellow et al. paper is still the right place to start.
Fast Gradient Sign Method (FGSM)
FGSM uses the gradient of the loss with respect to the input: take its sign, scale by ε, and add it to the input in one step. Single-step, fast, and surprisingly effective. The classic demo is a panda image perturbed until the model labels it a gibbon with high confidence—changes almost invisible to a human.
FGSM Implementation in PyTorch
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon):
    """Generates an adversarial example using FGSM."""
    image.requires_grad = True                                 # Enable gradient tracking for the input
    output = model(image)                                      # Forward pass
    loss = F.nll_loss(output, label)                           # Compute loss
    model.zero_grad()                                          # Reset gradients
    loss.backward()                                            # Compute gradients
    perturbation = epsilon * image.grad.sign()                 # Compute adversarial perturbation
    adversarial_image = image + perturbation                   # Apply perturbation
    adversarial_image = torch.clamp(adversarial_image, 0, 1)   # Ensure valid pixel values
    return adversarial_image
Projected Gradient Descent (PGD) Attack
PGD is FGSM done iteratively: small gradient steps, projected back into an ε-ball around the original input. In practice we reach for PGD more than FGSM when we have gradients—it finds stronger adversarial examples and often bypasses defenses that only consider single-step attacks.
PGD Implementation in PyTorch
def pgd_attack(model, image, label, epsilon, alpha, num_iter):
    """Generates an adversarial example using PGD."""
    perturbed_image = image.clone().detach().requires_grad_(True)
    for _ in range(num_iter):
        output = model(perturbed_image)
        loss = F.nll_loss(output, label)
        model.zero_grad()
        loss.backward()
        perturbation = alpha * perturbed_image.grad.sign()                                # Small step in gradient direction
        perturbed_image = perturbed_image + perturbation                                  # Apply perturbation
        perturbed_image = torch.clamp(perturbed_image, image - epsilon, image + epsilon)  # Project back into the epsilon-ball
        perturbed_image = torch.clamp(perturbed_image, 0, 1)                              # Keep within valid pixel range
        perturbed_image = perturbed_image.detach().requires_grad_(True)                   # Reset for next iteration
    return perturbed_image
Carlini & Wagner (C&W) Attack
C&W frames the problem as optimization: find the smallest perturbation that forces misclassification. It uses gradient descent (e.g. Adam) on the input and a confidence term to tune how aggressive the perturbation is. Result: very small, often hard-to-detect changes. It's been used to bypass commercial malware detectors—adjust a few features of a malicious file and the classifier flips.
C&W Attack Implementation in PyTorch
import torch.optim as optim

def cw_attack(model, image, label, c=1.0, confidence=0, lr=0.01, max_iter=1000):
    """Generates an adversarial example using a simplified Carlini & Wagner (L2) attack."""
    delta = torch.zeros_like(image, requires_grad=True)         # Perturbation to optimize
    optimizer = optim.Adam([delta], lr=lr)
    for _ in range(max_iter):
        adv_image = torch.clamp(image + delta, 0, 1)
        logits = model(adv_image)                               # Assumes the model returns raw scores/logits
        true_logit = logits.gather(1, label.unsqueeze(1)).squeeze(1)
        best_other = logits.scatter(1, label.unsqueeze(1), float('-inf')).max(dim=1).values
        # Margin term: positive until some wrong class beats the true class by `confidence`
        misclassify = torch.clamp(true_logit - best_other + confidence, min=0).sum()
        loss = (delta ** 2).sum() + c * misclassify             # Minimize perturbation while ensuring misclassification
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                        # Update the perturbation
    return torch.clamp(image + delta, 0, 1).detach()
Jacobian-Based Saliency Map Attack (JSMA)
JSMA perturbs only the most influential pixels. It builds a saliency map from the Jacobian, then flips a small set of pixels until the model's prediction changes. Fewer changes than a full gradient attack, and you can target a specific wrong class.
JSMA Implementation in PyTorch
import torch

def jsma_attack(model, image, target_label, num_features, theta=0.1):
    """Generates an adversarial example using a simplified Jacobian-based Saliency Map Attack (JSMA)."""
    perturbed_image = image.clone().detach().requires_grad_(True)
    for _ in range(num_features):
        output = model(perturbed_image)
        loss = F.nll_loss(output, target_label)        # Loss toward the target class
        model.zero_grad()
        if perturbed_image.grad is not None:
            perturbed_image.grad.zero_()               # Don't let gradients accumulate across iterations
        loss.backward()
        saliency = perturbed_image.grad.abs()          # Saliency map: which pixels influence the output most
        max_saliency = torch.argmax(saliency)
        with torch.no_grad():                          # Nudge the most influential pixel toward the target class
            grad_flat = perturbed_image.grad.view(-1)
            perturbed_image.view(-1)[max_saliency] -= theta * grad_flat[max_saliency].sign()
            perturbed_image.clamp_(0, 1)
    return perturbed_image.detach()
Defenses that only consider single-step attacks (e.g. FGSM) are insufficient; PGD and C&W routinely get through. If an attacker has gradient access, assume they can produce adversarial examples—adversarial training and certified defenses are the direction that actually holds up.
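For a sense of what adversarial training looks like in practice, here's a minimal sketch of one training epoch that reuses the pgd_attack function above. The optimizer, data loader, and hyperparameters are placeholders, and the model is assumed to return log-probabilities as in the earlier examples.

def adversarial_training_epoch(model, train_loader, optimizer, epsilon=0.03, alpha=0.01, num_iter=10):
    """One epoch of PGD adversarial training: fit the model on adversarial examples instead of clean ones."""
    model.train()
    for images, labels in train_loader:
        adv_images = pgd_attack(model, images, labels, epsilon, alpha, num_iter).detach()  # Craft attacks on the fly
        optimizer.zero_grad()
        loss = F.nll_loss(model(adv_images), labels)   # Same loss as normal training, but on the perturbed batch
        loss.backward()
        optimizer.step()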
Gray-Box Attacks
When you don't have gradients but you do have output probabilities (or soft labels), you can still approximate a direction to perturb. ZOO does it with finite differences: probe the model with small input changes, estimate the gradient from the output shifts, then update. NES uses a different idea—sample random perturbations, keep the ones that hurt the true class, and iterate. Square Attack restricts changes to small square patches in the image, which cuts down the number of queries.

All three have been shown to work against real systems: cloud vision APIs (e.g. Google Cloud Vision), fraud detectors (transactions tweaked until they're classified as legitimate), and image classifiers hardened with JPEG compression or smoothing—localized or query-efficient attacks can still get through.
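To make the finite-difference idea concrete, here's a minimal ZOO-style sketch that estimates a gradient purely from probability queries. It assumes the model returns scores you can softmax, that label is a plain integer index into a single-image batch, and that the step size and query budget are illustrative.

import torch

def estimate_gradient(model, image, label, sigma=1e-3, num_coords=128):
    """ZOO-style finite-difference gradient estimate: no access to the model's real gradients."""
    grad_estimate = torch.zeros_like(image)
    flat_grad = grad_estimate.view(-1)
    coords = torch.randperm(image.numel())[:num_coords]          # Probe a random subset of input coordinates
    with torch.no_grad():
        for idx in coords:
            nudge = torch.zeros_like(image).view(-1)
            nudge[idx] = sigma
            nudge = nudge.view_as(image)
            p_plus = model(image + nudge).softmax(dim=1)[0, label]   # True-class probability after +sigma
            p_minus = model(image - nudge).softmax(dim=1)[0, label]  # ...and after -sigma
            flat_grad[idx] = (p_plus - p_minus) / (2 * sigma)        # Central difference
    return grad_estimate

# One attack step with the estimated gradient: move away from the true class
# adversarial = torch.clamp(image + 0.01 * estimate_gradient(model, image, label).sign(), 0, 1)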
Black-Box Attacks
When the only thing you get from the model is the final label—no probabilities, no gradients—you're in the black-box setting. Query-based and transfer-based methods still work.
The HopSkipJump Attack
HopSkipJump is built for label-only access. You start from a misclassified input and nudge it along the decision boundary with minimal queries, refining until you have a small adversarial perturbation. No gradient estimates, just boundary probing.
HopSkipJump in PyTorch
import torch

def hopskipjump_attack(model, image, target_label, num_iter=50, step_size=0.01):
    """Simplified black-box attack using decision-boundary exploration (label access only)."""
    perturbed_image = image.clone()
    with torch.no_grad():                                        # No gradients needed: label queries only
        for _ in range(num_iter):
            # Generate a random perturbation direction
            direction = torch.randn_like(image).sign()
            # Test the perturbation using only the model's predicted label
            test_image = torch.clamp(perturbed_image + step_size * direction, 0, 1)
            if model(test_image).argmax() == target_label:
                perturbed_image = test_image                     # Keep the adversarial example
    return perturbed_image.detach()
HopSkipJump and similar ideas have been used against cloud ML APIs (e.g. Google Cloud Vision) and fraud detection—repeated queries with small changes until the model flips.
The Boundary Attack
Boundary Attack starts from a wrong classification (e.g. random noise the model labels as something) and walks it toward the clean input while keeping the label wrong. You end up with a minimal adversarial example. No gradients or scores required.
Boundary Attack in PyTorch
import torch

def boundary_attack(model, image, true_label, step_size=0.01, num_iter=100):
    """Simplified Boundary Attack: start from a misclassified point and walk toward the clean input."""
    perturbed_image = torch.clamp(torch.randn_like(image), 0, 1)  # Start with random noise the model misclassifies
    with torch.no_grad():
        for _ in range(num_iter):
            candidate = perturbed_image + step_size * (image - perturbed_image)  # Step toward the original image
            if model(candidate).argmax() != true_label:
                perturbed_image = candidate                       # Still misclassified: keep the closer point
            else:
                break                                             # The step would cross back to the correct class; stop
    return perturbed_image.detach()
The famous Tesla case fits here: researchers put small stickers on a 35 mph sign and got the car's vision system to read 85 mph. Fully black-box—no access to the model. Physical adversarial examples are a real risk for autonomous vehicles and similar systems.
The One-Pixel Attack
Change one pixel (or a handful). A genetic-algorithm-style search finds which pixel and what value cause misclassification. Surprisingly effective and very hard to spot. It's been used to break image classifiers, fool facial recognition, and defeat CAPTCHAs.
One-Pixel Attack in PyTorch
import torch
import random

def one_pixel_attack(model, image, label, num_iterations=100):
    """One-Pixel Attack (random-search sketch): modifies a single pixel to fool the classifier."""
    with torch.no_grad():
        for _ in range(num_iterations):
            perturbed_image = image.clone()                            # Fresh copy so only one pixel differs
            x = random.randint(0, image.shape[2] - 1)
            y = random.randint(0, image.shape[3] - 1)
            perturbed_image[:, :, x, y] = torch.rand(image.shape[1])   # Random value for each channel of that pixel
            if model(perturbed_image).argmax() != label:
                return perturbed_image                                 # Stop as soon as misclassification occurs
    return image.clone()                                               # No single-pixel change fooled the model
Transfer-Based Attacks
Adversarial examples often transfer: an input that fools one model frequently fools another. So you can train a substitute model (on data labeled by the target, or on a similar dataset), craft adversarial examples on the substitute with white-box methods, and fire them at the real target without ever touching its internals.
The Substitute Model Attack
Papernot et al. (2017): query the target for labels, train a surrogate on those labels, then run FGSM or similar on the surrogate. The resulting adversarial examples often transfer to the target with high success rates (e.g. 80%+).
Code Example: Training a Substitute Model in PyTorch
import torch
import torch.nn as nn
import torch.optim as optim

def train_substitute_model(target_model, train_loader, num_epochs=5):
    """Trains a substitute model by querying the target model for labels."""
    substitute_model = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(16 * 32 * 32, 10)  # Assuming CIFAR-10 (3x32x32 inputs, 10 classes)
    )
    optimizer = optim.Adam(substitute_model.parameters(), lr=0.001)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(num_epochs):
        for images, _ in train_loader:
            with torch.no_grad():
                labels = target_model(images).argmax(dim=1)  # Query the target model for labels
            optimizer.zero_grad()
            outputs = substitute_model(images)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()
    return substitute_model  # Trained to imitate the target model
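From there the attack is just a white-box method run on the substitute. A sketch using the fgsm_attack function from earlier; image, label, and epsilon are placeholders, and a single-image batch is assumed.

substitute = train_substitute_model(target_model, train_loader)
adversarial = fgsm_attack(substitute, image, label, epsilon=0.03)   # Craft on the substitute (white-box)
prediction = target_model(adversarial.detach()).argmax(dim=1)       # Deploy against the real target
print("Target fooled:", bool((prediction != label).item()))         # True if the target's label flipped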
Momentum Iterative FGSM (MI-FGSM)
Early transfer attacks overfitted to the substitute. Dong et al. (2018) added momentum to the gradient updates so the perturbations generalize better across models. Standard choice when you care about transferability.
MI-FGSM in PyTorch
import torch

def mi_fgsm_attack(model, image, label, epsilon, alpha, num_iter, decay_factor=1.0):
    """Momentum Iterative FGSM (MI-FGSM) attack for transferability."""
    perturbed_image = image.clone().detach().requires_grad_(True)
    momentum = torch.zeros_like(image)
    for _ in range(num_iter):
        output = model(perturbed_image)
        loss = torch.nn.functional.cross_entropy(output, label)
        model.zero_grad()
        loss.backward()
        grad = perturbed_image.grad.data
        momentum = decay_factor * momentum + grad / torch.norm(grad, p=1)                 # Accumulate momentum
        perturbation = alpha * momentum.sign()                                            # Compute adversarial perturbation
        perturbed_image = perturbed_image + perturbation
        perturbed_image = torch.clamp(perturbed_image, image - epsilon, image + epsilon)  # Keep within the epsilon-ball
        perturbed_image = torch.clamp(perturbed_image, 0, 1)                              # Keep within valid pixel range
        perturbed_image = perturbed_image.detach().requires_grad_(True)                   # Reset for next step
    return perturbed_image
Ensemble Attacks
Craft adversarial examples against several substitute models at once; the combined perturbation often transfers better to an unseen target because it hits shared decision boundaries. Higher success rates in practice.
Ensemble Attack in PyTorch
import torch

def ensemble_attack(models, image, label, epsilon):
    """Generates an adversarial example that transfers across multiple models."""
    perturbed_image = image.clone().detach().requires_grad_(True)
    for model in models:  # Loop through multiple substitute models
        output = model(perturbed_image)
        loss = torch.nn.functional.cross_entropy(output, label)
        model.zero_grad()
        loss.backward()   # Gradients from each model accumulate in perturbed_image.grad
    perturbation = epsilon * perturbed_image.grad.sign()  # Step along the combined gradient
    perturbed_image = perturbed_image + perturbation
    perturbed_image = torch.clamp(perturbed_image, 0, 1)  # Ensure valid pixel values
    return perturbed_image.detach()
Adversarial patches take this into the physical world: 2020 work used MI-FGSM and ensembles to craft a printed pattern on a T-shirt that makes person detectors fail, and the same patch transferred across multiple detectors. So you don't need to query the target at all—train surrogates, craft on them, and the examples often still work. Input filtering and defensive distillation usually aren't enough against good transfer attacks.
Conclusion
Assume your model is white-box to an attacker who can query it—that's the safe assumption for deployment. If they only get labels, transfer and query-based attacks still work. Defenses that help: adversarial training, certified robustness where feasible, and input preprocessing that's validated against the same attack class you care about. For a practical next step, run PGD on your own models and see how small a perturbation it takes to break them; then treat that as a baseline for what an adversary might do.
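A minimal sketch of that baseline, reusing the pgd_attack function from above; test_loader, epsilon, and the step settings are placeholders for your own setup.

correct, total = 0, 0
for images, labels in test_loader:  # Your evaluation set
    adv_images = pgd_attack(model, images, labels, epsilon=0.03, alpha=0.01, num_iter=40)
    predictions = model(adv_images.detach()).argmax(dim=1)
    correct += (predictions == labels).sum().item()
    total += labels.size(0)
print(f"Robust accuracy under PGD: {correct / total:.1%}")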