Adversarial Machine Learning Is Preventing Bad Actors From Compromising AI Models

by Praise James January 6th, 2025

Too Long; Didn't Read

Adversaries are resilient and always looking for innovative ways to tamper with ML models.

Not everyone has good intentions. As more artificial intelligence (AI) researchers try to make better machine learning (ML) models, attackers try to disrupt these systems. These attackers are called adversaries and they use deceptive data known as adversarial examples to lure ML models into failure.


Due to the increasing adoption of AI in our daily lives, it's crucial that we understand how and why adversaries carry out these attacks so that we can avoid their destructive consequences. This is why the field of Adversarial Machine Learning (AML) exists: to study, anticipate, and prepare for attacks that compromise AI systems.


In this article, you'll learn what AML is about, the different types of adversarial attacks and why security is important for building trustworthy ML models.

What is Adversarial Machine Learning?

Adversarial Machine Learning (AML) is the study of attacks against ML models to understand how and why adversaries exploit the weaknesses of these systems and stay prepared with robust defensive measures. It's like watching YouTube cooking tutorials to understand how to prepare a particular meal, and why every ingredient used is essential, to avoid making a subpar dish.


AML is important because AI has become ingrained in our lives. Self-driving cars depend on AI to navigate roads, the healthcare sector uses AI to analyze critical scans such as MRIs and CT scans, the finance sector uses it for fraud detection and loan approval, and reputable businesses such as Amazon have integrated AI into their day-to-day decisions. Our reliance on AI systems is growing rapidly, and it is easy to forget that these systems are machines still undergoing constant improvement. They are imperfect and vulnerable, so it's important that we protect their integrity now that they are part of so much of the technology we interact with.


Due to this mainstream use of AI, adversaries can cause tremendous damage by manipulating the systems to their own advantage. These attackers often use deceptive input (adversarial examples) to corrupt models and trick them into misclassification or revealing sensitive information. Imagine what could happen if an adversary accessed users' financial data or manipulated a self-driving car. Both scenarios can have devastating impacts on people.

What is an Adversarial Example?

Adversarial examples are harmful inputs which are fed into ML models to manipulate them. To the ordinary eye, these inputs are very difficult to tell apart from legitimate ones, but they contain perturbations that attack models.


Photo of adversarial examples by Author



A perturbation is a small, barely noticeable change to input data, like changes in pixel values in the case of images. These changes might seem insignificant or “normal” to us humans but they can cause a model to give inaccurate outputs. In real life, sometimes we misclassify a picture of a toad as that of a frog; that is a good illustration of the aim of adversarial examples.


Adversarial examples are often used to fool classifiers, tamper with training data, or acquire sensitive information. They can fool linear regression models, decision trees, nearest neighbours, and even neural networks. The image below is an example of how an adversary can trick a model into classifying the colour green as red.

Photo of AML attack by the Author


Why does an adversarial example work? During training, ML algorithms learn the best way to separate data into different classes. It's like learning what unique features tell toads and frogs apart. The line that separates one class from another is known as the decision boundary. It's the thin line between making the right or wrong classification, as shown in the image above.


Attackers understand the fragility of this boundary, so they create adversarial examples with perturbations that will push these malicious inputs across the boundary, thus leading to misclassification. For example, if you've noted that the key difference between a frog and a toad is in their skin—a frog has smooth skin whereas a toad has bumpy skin—and someone showed you a picture of a frog with minor skin irregularities, you might classify it as a toad.
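
To make the idea of pushing an input across the boundary concrete, here is a minimal sketch using a toy linear classifier. The weights, the input, and the `epsilon` budget are made-up numbers, and the `perturb` helper is a hypothetical name; the step it takes follows the spirit of the fast gradient sign method rather than reproducing any particular attack from the literature.

```python
# Toy linear classifier: score = w . x + b; score > 0 means "frog", else "toad".
# All numbers below are invented for illustration only.

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    return "frog" if score(w, b, x) > 0 else "toad"

def sign(v):
    return (v > 0) - (v < 0)

def perturb(w, x, epsilon):
    """Shift every feature by at most epsilon in the direction that lowers the score.

    For a linear model the gradient of the score with respect to x is w,
    so stepping against sign(w) is the cheapest way to approach the boundary.
    """
    return [xi - epsilon * sign(wi) for wi, xi in zip(w, x)]

w = [1.0, -2.0, 0.5]
b = 0.1
x = [0.6, 0.2, 0.3]                   # original input, classified as "frog"
x_adv = perturb(w, x, epsilon=0.15)   # each feature moves by only 0.15

print(predict(w, b, x), "->", predict(w, b, x_adv))
```

Each feature changes only slightly, yet the combined effect is enough to flip the predicted class, which is exactly the fragility attackers exploit.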

Types of Adversarial Attacks

Photo of black-box models and white-box models by Author


Adversarial attacks depend on how much the adversary knows about a model. There are two broad categorizations of ML models: black-box models and white-box models.


  1. Black-box models: A black-box model is an AI system whose results are observable, but whose internal reasoning cannot be inspected or explained. That is, you can't tell why the model produces certain results for specific inputs. A good example is a proprietary Large Language Model (LLM) such as ChatGPT. In a black-box attack, the adversary knows little about the model: not its parameters (weights), its training data, or its architecture. So, the only thing the adversary can do is devise an attack from the model's outputs.
  2. White-box models: These models have interpretable algorithms or openly available internals that show how every input impacts the output. Thus, in a white-box attack, the adversary knows the model's intricacies and finds it easier to discover weaknesses. Some examples of white-box models are decision trees, generalized additive models (GAMs), and open-source LLMs such as BLOOM.


Considering these two categories of models above, adversaries often perform three main types of attacks:

Poisoning attacks

In this type of adversarial attack, the adversary tampers with input data or labels during model training, for either targeted or non-targeted purposes. In a targeted poisoning attack, the adversary might modify the training data so that the model ignores future suspicious activity, or plant a backdoor, without reducing the model's overall performance. For example, in our toad and frog analogy, an attacker might alter the training examples of one specific species of frog so that it is misclassified as a toad in the future. Targeted poisoning attacks have a specific goal.


In a non-targeted poisoning attack, the adversary introduces malicious data to cause output errors, introduce biases, or reduce the predictive capabilities of the AI system. Using our analogy, an attacker might randomly swap labels of frogs and toads and distort their features. The attacker doesn't care about a specific toad or frog; they just want the model to perform poorly.
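
The label-swapping idea can be sketched with a deliberately tiny model. Everything here is an assumption made for illustration: the single "bumpiness" feature, the nearest-centroid classifier, the data values, and the choice of which labels get flipped. Real poisoning attacks target far larger pipelines, but the mechanism is the same: corrupt the labels, and the learned boundary moves.

```python
def centroids(samples):
    """samples: list of (bumpiness, label). Returns the mean bumpiness per label."""
    sums, counts = {}, {}
    for value, label in samples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(model, value):
    # Pick the label whose centroid is closest to the input.
    return min(model, key=lambda label: abs(model[label] - value))

def accuracy(model, samples):
    return sum(predict(model, v) == y for v, y in samples) / len(samples)

def flip_labels(samples, indices):
    """Return a copy of samples with the labels at the given positions swapped."""
    swap = {"frog": "toad", "toad": "frog"}
    return [(v, swap[y]) if i in indices else (v, y)
            for i, (v, y) in enumerate(samples)]

clean = [(0.1, "frog"), (0.2, "frog"), (0.3, "frog"), (0.4, "frog"),
         (0.6, "toad"), (0.7, "toad"), (0.8, "toad"), (0.9, "toad")]
test_set = [(0.15, "frog"), (0.35, "frog"),
            (0.58, "toad"), (0.65, "toad"), (0.85, "toad")]

poisoned = flip_labels(clean, {4, 5})   # two toads are relabelled as frogs

clean_model = centroids(clean)
poisoned_model = centroids(poisoned)
print(accuracy(clean_model, test_set), accuracy(poisoned_model, test_set))
```

With only two flipped labels the frog centroid drifts toward the toads, and a borderline toad in the test set is now misclassified.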


A famous example of a poisoning attack happened in 2016 when Microsoft introduced Tay (Thinking About You), an AI chatbot, on Twitter (now X). Tay learned from its interactions with Twitter users, and without any filter for offensive content, it picked up on the inappropriate content that users were feeding it. Microsoft had to shut Tay down because it started producing offensive, racist, and sexist content that mirrored the tweets it was designed to learn from.

Evasion attacks

In an evasion attack, an adversary introduces carefully modified input data (adversarial examples) during model deployment to cause a misclassification in the output. The attacker doesn't tamper with the training data or the trained classifier. Instead, they wait until the inference stage (during deployment) to introduce adversarial examples that can lead to incorrect classifications. The goal of the adversary is to bypass the classifier in order to, for instance, fool spam detectors or evade a biometric verification system.



Photo of comparison between evasion attacks (a) and poisoning attacks (b) by IEEE Computer Society Digital Library


The main difference between a poisoning attack and an evasion attack is when it occurs in the ML pipeline. Poisoning attacks occur during model training, while evasion attacks occur during testing or deployment.

Model extraction or stealing attacks

AI companies such as OpenAI do not reveal their training data to the public. Also, black-box models have been trained not to reveal their parameters or technical details about their architecture. Below, I asked You.com, an AI chatbot, to share its parameters with me.

Photo by Author


Since adversaries can't access the internals or training data of these black-box models, they try to reconstruct the model from the outside in an attack known as model extraction. In a model extraction attack, the adversary queries a target black-box model with a chosen set of inputs and records the outputs it returns.


The goal of this theft is to reverse-engineer the target model's decision-making process: the collected input-output pairs are used to train a substitute model that matches the functionality of the target model.


If an adversary succeeds in creating a close replica of their target model, they have stolen intellectual property and can extract confidential information or avoid paying for the model's services. Using our frog and toad analogy, if your model were a comprehensive wildlife guide, the attacker would ask different questions concerning frogs and toads, and note down the model's responses. They would repeat this process until they had enough information to create a copy of your guide without needing to pay for yours, thus stealing your hard work.
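
The query-and-copy loop can be sketched in a few lines. This is a toy under stated assumptions: the target is a hypothetical one-feature classifier with a hidden threshold, the attacker may only call it and read the label, and the names `make_target` and `extract_threshold` are invented for this example. Real extraction attacks train a substitute on many recorded queries rather than running a binary search, but the principle of learning the model purely from its answers is the same.

```python
def make_target(secret_threshold):
    """Return a query-only model; the threshold stays hidden inside the closure."""
    def target(bumpiness):
        return "toad" if bumpiness >= secret_threshold else "frog"
    return target

def extract_threshold(query, low=0.0, high=1.0, budget=20):
    """Binary-search the decision boundary using only the labels the model returns.

    Assumes query(low) is "frog" and query(high) is "toad".
    """
    for _ in range(budget):
        mid = (low + high) / 2
        if query(mid) == "toad":
            high = mid
        else:
            low = mid
    return (low + high) / 2

target = make_target(0.537)             # the attacker never sees this number
stolen = extract_threshold(target)      # 20 queries to the black box
substitute = make_target(stolen)        # the attacker's own copy

probes = [i / 1000 for i in range(1001)]
agreement = sum(target(p) == substitute(p) for p in probes) / len(probes)
print(round(stolen, 4), agreement)
```

After a small number of queries the substitute agrees with the target on almost every probe, even though the attacker never saw the target's internals.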


Ultimately, even though the aims and means behind these adversarial attacks are different, the motivation is the same: to fool AI systems.

GANs vs. Adversarial Attacks

Generative Adversarial Networks (GANs) are pairs of neural networks that work with and against each other to reach their optimum level. GANs have two models: a generator and a discriminator. These two neural networks compete with each other in a zero-sum game: the generator creates new samples, such as images, and the discriminator tries to tell them apart from real ones.


A zero-sum game is a term in game theory that involves two players, where one player's gain is the other’s loss.


For GANs, this is how the game is played: the generator creates new, realistic samples from a random input. Then, the discriminator compares these fake samples with real samples and tries to predict if the generator's sample is real or fake. If the discriminator predicts that the generator's sample is fake, it gives feedback on how the generator can improve its next iteration.


This back and forth continues until the generator can create realistic fake samples that the discriminator can no longer distinguish from real samples.
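
The game described above is commonly written as a minimax objective. The notation is the standard one from the GAN literature rather than anything defined in this article: D(x) is the discriminator's estimated probability that a sample x is real, G(z) is the generator's output for a random input z, and p_data and p_z are the distributions of real samples and of random inputs.

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to make V as large as possible by scoring real samples near 1 and generated samples near 0, while the generator tries to make V as small as possible by producing samples the discriminator scores as real. One player's gain is the other's loss, which is what makes the game zero-sum.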


Photo of Generative Adversarial Networks (GANs) by Author


More specifically, the generator tries to trick the discriminator into misclassification, while the discriminator helps the generator get better at creating realistic samples. So, although they are working against each other, they are also working together to minimize error or loss.


Therefore, the key difference between GANs and adversarial attacks is that GANs are well-intended while adversarial attacks are ill-intended.

Defense Against Adversarial Attacks

Defending ML models from adversarial attacks is not a walk in the park. To quote research scientist Ian Goodfellow, during his lecture at Stanford University in 2017, “Attacking models is extremely easy, defending them is difficult.” Nonetheless, below are some defensive tactics AI researchers have proposed against adversarial attacks.

Adversarial training

Adversarial training involves training models with adversarial examples and known vulnerabilities, so they can become familiar with possible attacks and prepare accordingly. It's asking, “what would adversaries do if they wanted to attack this system?” and then proactively preparing your models to resist this attack. Sometimes, the best way to catch a thief is to think like one. When a model is proactively designed to detect adversarial examples and attacks, it can recognize threats and have robust protections against them.
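
A minimal sketch of the training loop follows, assuming a small logistic regression model and adversarial examples crafted with a fast-gradient-sign step. The dataset, learning rate, `epsilon`, and all function names are invented for illustration; production systems typically use stronger attacks and deep networks, so treat this only as a picture of the mechanics.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, epsilon):
    """Shift x by at most epsilon per feature in the direction that raises the loss.

    For logistic regression the gradient of the log loss w.r.t. x is (p - y) * w.
    """
    p = predict_proba(w, b, x)
    return [xi + epsilon * sign((p - y) * wi) for wi, xi in zip(w, x)]

def adversarial_train(data, epsilon, lr=0.1, epochs=200):
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        # Craft adversarial copies against the current model, then train on both.
        batch = data + [(fgsm(w, b, x, y, epsilon), y) for x, y in data]
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in batch:
            err = predict_proba(w, b, x) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(batch) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(batch)
    return w, b

def robust_accuracy(w, b, data, epsilon):
    """Accuracy measured on adversarially perturbed copies of the data."""
    hits = 0
    for x, y in data:
        x_adv = fgsm(w, b, x, y, epsilon)
        hits += (predict_proba(w, b, x_adv) > 0.5) == (y == 1)
    return hits / len(data)

data = [([2.0, 2.0], 1), ([3.0, 2.0], 1), ([2.0, 3.0], 1),
        ([-2.0, -2.0], 0), ([-3.0, -2.0], 0), ([-2.0, -3.0], 0)]
w, b = adversarial_train(data, epsilon=0.5)
print(robust_accuracy(w, b, data, epsilon=0.5))
```

The key design choice is that the adversarial copies are regenerated every epoch against the current weights, so the model keeps being shown the inputs that are currently hardest for it.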

Defensive distillation

In 2016, Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami released a paper titled “Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks.” Today, defensive distillation is used to protect neural networks from adversarial attacks.


This technique involves training a second neural network, known as the student, using an already trained neural network known as the teacher. (In general knowledge distillation the student is smaller than the teacher; in the defensive variant the two can share the same architecture.) The student model is trained on the teacher's soft labels (probability values for given classes) so that it learns to predict the teacher's output probabilities.


The aim is to create an efficient student model that matches the functionality of the teacher model but isn't overconfident in its predictions. This matters because adversaries often exploit a neural network's overconfidence to push the model into wrong predictions. So, by fixing this overconfidence in outputs, we can improve the robustness of the model, ensure it is better calibrated for slight domain shifts, and make it possible to flag wrong predictions based on their confidence.
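
Soft labels are usually produced with a temperature-scaled softmax, and a short sketch shows why they are less overconfident than ordinary outputs. The logits, the three class names, and the temperature of 20 are made-up values chosen for illustration, not settings taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes frog, toad, newt.
teacher_logits = [8.0, 2.0, 0.5]

hard = softmax(teacher_logits, temperature=1.0)    # nearly all mass on "frog"
soft = softmax(teacher_logits, temperature=20.0)   # same ranking, far less extreme

print([round(p, 3) for p in hard])
print([round(p, 3) for p in soft])
```

The ranking of the classes is unchanged, but the soft version keeps information about how similar the other classes are, and a student trained on it is nudged away from near-certain outputs.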


Robustness in ML is the ability of a model to perform well on both training data and unseen data. Regardless of the introduction of new, skewed data, the model still makes accurate predictions.


The downside to defensive distillation is that the student model is still loyal to the teacher. Thus, if an attack on the teacher model is successful, the attack will also impact the student.

Continuous monitoring

One way to detect adversarial attacks is to continuously monitor the AI system for adversarial perturbations, even after it has been deployed into the real world. We should consistently perform detailed analysis of the training data to catch any false inputs or potential risks in time. Also, we should watch the model's performance closely for any decline or abnormal behavior, so we can stop adversarial attacks before they have destructive effects.
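
Watching for a decline in performance can be as simple as comparing a rolling average against a known baseline. The sketch below assumes you already log a per-batch metric such as accuracy or mean prediction confidence; the `DriftMonitor` class, its thresholds, and the sample stream are all hypothetical.

```python
from collections import deque

class DriftMonitor:
    """Raise an alert when the rolling mean of a metric drops below baseline - tolerance."""

    def __init__(self, baseline, tolerance, window):
        self.baseline = baseline
        self.tolerance = tolerance
        self.values = deque(maxlen=window)   # keeps only the most recent values

    def observe(self, value):
        """Record one measurement; return True when an alert should fire."""
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False                     # not enough data yet
        mean = sum(self.values) / len(self.values)
        return mean < self.baseline - self.tolerance

monitor = DriftMonitor(baseline=0.95, tolerance=0.05, window=4)
stream = [0.96, 0.94, 0.95, 0.97, 0.80, 0.78, 0.75, 0.74]   # made-up accuracy readings
alerts = [monitor.observe(v) for v in stream]
print(alerts)
```

A rolling window avoids firing on a single bad batch while still reacting within a few measurements once the decline is sustained.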

Awareness

Anyone involved in the process of developing AI systems should be familiar with the potential risks of adversarial attacks and the importance of cross-checking and verifying training data regularly. Data scientists and AI engineers should continuously retrain their models with adversarial examples and new training data. They should also test the model for any potential weaknesses that could allow adversaries to disrupt AI systems.


Finally, these defense tactics might not be a silver bullet but they have proven useful in resisting malicious attacks so far. More importantly, they will continue to undergo refinement as ML models advance.

The Future of Adversarial Machine Learning (AML)

If we are going to build trustworthy AI systems, we must treat their security with paramount attention. Over the years, AML has gained considerable attention and seen real advancements, but there's still room for improvement.


What we will start seeing next is a better understanding of new state-of-the-art adversarial attacks and the development of stronger defense systems that cut across different models, datasets, and attack types. Also, there will be more solutions to finding possible vulnerabilities ourselves before attackers do, so that we can anticipate threats and counteract them.


As ML advances, it will be a technological race between adversaries and defenders. It's important we win the race because as Ian Goodfellow said, “Before ML became as accurate as humans, computers making a mistake was a rule not an exception. However, with the credibility ML is getting and the way it's integrated into our daily lives, a mistake is very costly.” We must enhance the robustness of these models to prevent these costly mistakes.

Final Thoughts

While it's great that new AI systems with never-seen-before capabilities are deployed daily, the same level of advancements needs to be extended to the security of these systems. Adversaries are resilient and always looking for innovative ways to tamper with ML models. Therefore, it's crucial we stay one step ahead of their malicious goals to safeguard these models that have brought us technical breakthroughs.