What Are Adversarial Examples?


Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake; they're like optical illusions for machines. In this post we'll show how adversarial examples work across different mediums, and will discuss why securing systems against them can be difficult.


We think adversarial examples are a good aspect of security to work on because they represent a concrete problem in AI safety that can be addressed in the short term, and because fixing them is difficult enough that it requires a serious research effort. (Though we'll need to explore many aspects of machine learning security to achieve our goal of building safe, widely distributed AI.)

To get an idea of what adversarial examples look like, consider this demonstration from Explaining and Harnessing Adversarial Examples: starting with an image of a panda, the attacker adds a small perturbation that has been calculated to make the image be recognized as a gibbon with high confidence.

An adversarial input, overlaid on a typical image, can cause a classifier to miscategorize a panda as a gibbon.
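The perturbation in demonstrations like this is commonly computed with the fast gradient sign method from that paper: nudge every input dimension by a small amount in the direction of the loss gradient's sign. Here is a minimal sketch on a made-up two-class logistic model (the weights and inputs are illustrative stand-ins, not the network or images from the paper):

```python
import math

# Toy logistic model: p(class 1 | x) = sigmoid(w . x + b).
# The weights below are invented for illustration only.
w = [0.5, -1.2, 0.8]
b = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x):
    # Probability the model assigns to class 1.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(x, eps):
    # Gradient of p w.r.t. x is p * (1 - p) * w; its sign tells us which
    # direction in input space pushes the score toward class 1.
    p = predict(x)
    grad = [p * (1.0 - p) * wi for wi in w]
    return [xi + eps * math.copysign(1.0, g) for xi, g in zip(x, grad)]

x = [1.0, 1.0, 1.0]
x_adv = fgsm(x, eps=0.5)
print(predict(x), predict(x_adv))  # the perturbed input scores higher for class 1
```

For images, `x` would be the pixel values and `eps` would be kept small enough that the change is imperceptible to a human, which is what makes the panda/gibbon result striking.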

The approach is quite robust; recent research has shown adversarial examples can be printed out on standard paper, photographed with a standard smartphone, and still fool systems.

Adversarial examples can be printed out on normal paper, photographed with a standard-resolution smartphone, and still cause a classifier to, in this case, label a "washer" as a "safe".

Adversarial examples have the potential to be dangerous. For example, attackers could target autonomous vehicles by using stickers or paint to create an adversarial stop sign that the vehicle would interpret as a "yield" or other sign, as discussed in Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples.

Reinforcement learning agents can also be manipulated by adversarial examples, according to new research from UC Berkeley and Pennsylvania State University, Adversarial Attacks on Neural Network Policies, and research from the University of Nevada at Reno, Vulnerability of Deep Reinforcement Learning to Policy Induction Attacks. The research shows that widely used RL algorithms, such as DQN, TRPO, and A3C, are vulnerable to adversarial inputs. These can degrade performance even when the perturbations are too subtle to be perceived by a human, causing an agent to move a Pong paddle down when it should go up, or interfering with its ability to spot enemies in Seaquest.

If you want to experiment with breaking your own models, you can use cleverhans, an open-source library developed jointly by Ian Goodfellow and Nicolas Papernot to test your AI's vulnerability to adversarial examples.

Adversarial examples give us some traction on AI safety

When we think about the study of AI safety, we usually think about some of the most difficult problems in that field: how can we ensure that sophisticated reinforcement learning agents that are significantly more intelligent than human beings behave in ways their designers intended?

Adversarial examples show us that even simple modern algorithms, for both supervised and reinforcement learning, can already behave in surprising ways that we do not intend.

Attempted defenses against adversarial examples

Traditional techniques for making machine learning models more robust, such as weight decay and dropout, generally do not provide a practical defense against adversarial examples. So far, only two methods have provided a significant defense.

Adversarial training: This is a brute-force solution where we simply generate a lot of adversarial examples and explicitly train the model not to be fooled by each of them. An open-source implementation of adversarial training is available in the cleverhans library, and its use is illustrated in the following tutorial.
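The idea can be sketched in a few lines: at each training step, perturb the input adversarially first, then take the gradient step on the perturbed point. This toy version uses a one-dimensional logistic model and made-up Gaussian data, not the cleverhans implementation:

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up 1-D data: class 0 clustered near -1, class 1 near +1.
data = [(random.gauss(-1, 0.3), 0) for _ in range(50)] + \
       [(random.gauss(+1, 0.3), 1) for _ in range(50)]

w, b, lr, eps = 0.0, 0.0, 0.1, 0.2

for _ in range(200):
    for x, y in data:
        # FGSM-style step: for logistic loss, the gradient of the loss
        # w.r.t. x has the sign of w * (p - y), so move x that way.
        p = sigmoid(w * x + b)
        x_adv = x + eps * math.copysign(1.0, w * (p - y)) if w != 0 else x
        # Train on the adversarial point so the model learns to resist it.
        p_adv = sigmoid(w * x_adv + b)
        g = p_adv - y
        w -= lr * g * x_adv
        b -= lr * g

acc = sum((sigmoid(w * x + b) > 0.5) == (y == 1) for x, y in data) / len(data)
print(acc)
```

The trained model ends up correct not just on the clean points but on an eps-ball around each of them, which is the property adversarial training is after.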

Defensive distillation: This is a strategy where we train the model to output probabilities of different classes, rather than hard decisions about which class to output. The probabilities are supplied by an earlier model, trained on the same task using hard class labels. This creates a model whose surface is smoothed in the directions an adversary will typically try to exploit, making it difficult for them to discover adversarial input tweaks that lead to incorrect categorization. (Distillation was originally introduced in Distilling the Knowledge in a Neural Network as a technique for model compression, where a small model is trained to imitate a large one in order to obtain computational savings.)

Yet even these specialized algorithms can easily be broken by giving more computational firepower to the attacker.

A failed defense: “gradient masking”

To give an example of how a simple defense can fail, let's consider why a technique called "gradient masking" does not work.

"Gradient masking" is a term introduced in Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples to describe an entire category of failed defense methods that work by trying to deny the attacker access to a useful gradient.


Most adversarial example construction techniques use the gradient of the model to make an attack. In other words, they look at a picture of an airplane, test which direction in picture space makes the probability of the "cat" class increase, and then give a little push (in other words, they perturb the input) in that direction. The new, modified image is mis-recognized as a cat.

But what if there were no gradient, so that an infinitesimal modification to the image caused no change in the output of the model? This seems to provide some defense because the attacker does not know which way to "push" the image.

We can easily imagine some very trivial ways to get rid of the gradient. For example, most image classification models can be run in two modes: one where they output just the identity of the most likely class, and one where they output probabilities. If the model's output is "99.9% airplane, 0.1% cat", then a tiny change to the input gives a tiny change to the output, and the gradient tells us which changes will increase the probability of the "cat" class. If we run the model in a mode where the output is just "airplane", then a tiny change to the input will not change the output at all, and the gradient tells us nothing.
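The difference between the two modes can be shown numerically with a finite-difference probe. This toy "model" maps a single made-up input coordinate to class probabilities; the point is only that the argmax output is piecewise constant while the probability output is not:

```python
import math

def probs(x):
    # Toy one-parameter "model": probability of "cat" rises with x.
    p = 1.0 / (1.0 + math.exp(-x))
    return {"airplane": 1.0 - p, "cat": p}

def top_class(x):
    # "Most likely class" mode: the defender exposes only the argmax label.
    ps = probs(x)
    return max(ps, key=ps.get)

x, h = 2.0, 1e-3

# Probability mode leaks a useful gradient: a tiny nudge changes the output.
grad_est = (probs(x + h)["cat"] - probs(x - h)["cat"]) / (2 * h)

# Label mode is piecewise constant: the same tiny nudge changes nothing.
label_changed = top_class(x + h) != top_class(x - h)

print(grad_est, label_changed)
```

`grad_est` is a positive number the attacker can follow uphill, while the label-mode probe returns no information at all, which is exactly the masking effect described above.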

Let's run a thought experiment to see how well we could defend our model against adversarial examples by running it in "most likely class" mode instead of "probability" mode. The attacker no longer knows where to go to find inputs that will be classified as cats, so we might have some defense. Unfortunately, every image that was classified as a cat before is still classified as a cat now. If the attacker can guess which points are adversarial examples, those points will still be misclassified. We haven't made the model more robust; we have just given the attacker fewer clues to figure out where the holes in the model's defense are.

Even more unfortunately, it turns out that the attacker has a very good strategy for guessing where the holes in the defense are. The attacker can train their own model, a smooth model that has a gradient, make adversarial examples for their model, and then deploy those adversarial examples against our non-smooth model. Very often, our model will misclassify these examples too. In the end, our thought experiment reveals that hiding the gradient didn't get us anywhere.

The defense strategies that perform gradient masking typically result in a model that is very smooth in specific directions and neighborhoods of training points, which makes it harder for the adversary to find gradients indicating good candidate directions to perturb the input in a way that damages the model. However, the adversary can train a substitute model: a copy that imitates the defended model by observing the labels that the defended model assigns to inputs chosen carefully by the adversary.
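The substitute-model loop can be sketched end to end on a one-dimensional problem. The "defended" model below is a hypothetical stand-in that only ever returns a hard label (its gradient is masked); the attacker fits a smooth logistic substitute to its answers and then attacks the substitute's gradient:

```python
import math
import random

random.seed(1)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# "Defended" black-box model: returns only a hard label, never a gradient.
# Its internal parameters are invented stand-ins for a real network.
W_TRUE, B_TRUE = 2.0, -0.5

def defended_label(x):
    return int(sigmoid(W_TRUE * x + B_TRUE) > 0.5)

# Step 1: the attacker queries the black box and trains a smooth substitute.
xs = [random.uniform(-3, 3) for _ in range(200)]
w, b, lr = 0.0, 0.0, 0.1
for _ in range(100):
    for x in xs:
        y = defended_label(x)
        p = sigmoid(w * x + b)
        w -= lr * (p - y) * x
        b -= lr * (p - y)

# Step 2: FGSM against the substitute's gradient, starting from a point
# the defended model labels as class 1.
x0, eps = 0.6, 0.8
x_adv = x0 - eps * math.copysign(1.0, w)  # push toward class 0

print(defended_label(x0), defended_label(x_adv))  # the attack transfers
```

The defended model never exposed a gradient, yet the example crafted on the substitute still flips its label, which is the transfer phenomenon the following paragraphs describe.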


A procedure for performing such a model extraction attack was introduced in the black-box attacks paper. The adversary can then use the substitute model's gradients to find adversarial examples that are misclassified by the defended model as well. In the figure above, reproduced from the discussion of gradient masking in Towards the Science of Security and Privacy in Machine Learning, we illustrate this attack strategy with a one-dimensional ML problem. The gradient masking phenomenon would be exacerbated for higher-dimensionality problems, but is harder to depict.

We find that both adversarial training and defensive distillation accidentally perform a kind of gradient masking. Neither algorithm was explicitly designed to perform gradient masking, but gradient masking is apparently a defense that machine learning algorithms can invent relatively easily when they are trained to defend themselves and not given specific instructions about how to do so. If we transfer adversarial examples from one model to a second model that was trained with either adversarial training or defensive distillation, the attack often succeeds, even when a direct attack on the second model would fail. This suggests that both training techniques do more to flatten out the model and remove the gradient than to make sure it classifies more points correctly.

Why is it hard to defend against adversarial examples?

Adversarial examples are hard to defend against because it is difficult to construct a theoretical model of the adversarial example crafting process. Adversarial examples are solutions to an optimization problem that is non-linear and non-convex for many ML models, including neural networks. Because we don't have good theoretical tools for describing the solutions to these complicated optimization problems, it is very hard to make any kind of theoretical argument that a defense will rule out a set of adversarial examples.
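One common way to write down the crafting process described above (a standard formulation, though not the only one) is as a constrained loss maximization:

```latex
% Find the perturbation \delta, bounded in infinity norm by \epsilon,
% that maximizes the model's loss L on input x with true label y:
\max_{\|\delta\|_\infty \le \epsilon} \; L\big(f(x + \delta),\, y\big)
```

When `f` is a neural network, this objective is non-linear and non-convex in the perturbation, which is precisely why characterizing its solutions, and hence certifying a defense against all of them, is so difficult.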

Adversarial examples are also hard to defend against because they require machine learning models to produce good outputs for every possible input. Most of the time, machine learning models work very well, but only on a very small fraction of all the possible inputs they might encounter.

Every strategy we have tested so far fails because it is not adaptive: it may block one kind of attack, but it leaves another vulnerability open to an attacker who knows about the defense being used. Designing a defense that can protect against a powerful, adaptive attacker is an important research area.


Adversarial examples show that many modern machine learning algorithms can be broken in surprising ways. These failures of machine learning demonstrate that even simple algorithms can behave very differently from what their designers intend. We encourage machine learning researchers to get involved and design methods for preventing adversarial examples, in order to close this gap between what designers intend and how algorithms behave.


For more information

To learn more about machine learning security, follow Ian and Nicolas's machine learning security blog.