Evasion

Attacks against classification models that construct special inputs, known as adversarial examples (AEs), which appear natural to a human but are misclassified by the model.

While other implementations of AE attacks exist, this one is meant to be as easy, accessible, informative, and modular as training a model. In fact, this implementation uses fastai’s Learner class and inherits its functionality, such as the progress bar, the losses table, and even early stopping and lr scheduling.

API


source

PGDCallback

 PGDCallback (epsilon=0.3, rand_init=True)

Implements Projected Gradient Descent (PGD) by bounding some \(l_p\) norm of the perturbation
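Schematically, each PGD iteration takes a steepest-ascent step on the loss (or a descent step toward the target labels in the targeted case) and then projects the accumulated perturbation \(\delta\) back onto the \(\epsilon\)-ball. In our own notation (not the library’s), with step size \(\alpha\):

\[\delta_{t+1} = \Pi_{\|\delta\|_p \le \epsilon}\big(\delta_t + \alpha\, g\big(\nabla_{\delta}\,\mathcal{L}(f(x+\delta_t), y)\big)\big)\]

where \(g\) is the norm-dependent steepest-descent direction (PGDCallback.steepest_descent) and \(\Pi\) is the projection onto the ball (PGDCallback.project_pert).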


source

PGDCallback.rand_init

 PGDCallback.rand_init (shape)

Initialize a random perturbation in the \(\epsilon\)-ball
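For intuition, here is a minimal sketch of such an initialization for the \(l_\infty\) case in plain PyTorch (illustrative only, not the library’s code):

import torch

def rand_init_linf(shape, epsilon=0.3):
    # Draw each coordinate uniformly from [-epsilon, epsilon],
    # i.e. sample a random point in the l_inf epsilon-ball
    return torch.empty(shape).uniform_(-epsilon, epsilon)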


source

PGDCallback.steepest_descent

 PGDCallback.steepest_descent ()

Edit the perturbation’s gradient to implement steepest descent
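For the \(l_\infty\) norm, steepest descent replaces the gradient with its elementwise sign, so every coordinate moves by the full step size. A minimal sketch in plain PyTorch (illustrative only, not the library’s code):

def steepest_descent_linf(grad):
    # l_inf steepest-descent direction: the elementwise sign of the gradient
    return grad.sign()

The \(l_2\) case normalizes the gradient instead; see the sketch under L2PGD below.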


source

PGDCallback.project_pert

 PGDCallback.project_pert ()

Project the perturbation onto the \(\epsilon\)-ball
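For the \(l_\infty\) norm, this projection is an elementwise clamp; a minimal sketch (illustrative only, not the library’s code):

def project_linf(pert, epsilon=0.3):
    # Project onto the l_inf epsilon-ball by clamping every coordinate
    return pert.clamp(-epsilon, epsilon)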

In order to demonstrate the attacks, let’s first set up training data and an accurate classifier:

from similarity_learning.all import *
mnist = MNIST()
classifier = MLP(10)
learn = Learner(mnist.dls(), classifier, metrics=accuracy)
learn.fit_one_cycle(1)
epoch train_loss valid_loss accuracy time
0 0.105063 0.093810 0.971000 00:18
sub_dsets = mnist.valid.random_sub_dsets(64)
acc = learn.validate(dl=sub_dsets.dl())[1]
test(acc, .9, ge)

For reference, here is what the original inputs look like:

learn.show_results(shuffle=False, dl=sub_dsets.dl())

This is enough for an untargeted attack, where we only require the classifier to misclassify the AEs (the attack maximizes the classifier’s loss, which is why the train_loss values in the untargeted runs below are negative). In a targeted attack, we additionally require each AE to be classified as a specific class. To demonstrate that, we’ll construct a version of the data with random labels:

item2target = {item: str(random.choice(range(10))) for item in sub_dsets.items}
random_targets = TfmdLists(sub_dsets.items, [item2target.__getitem__, Categorize()])
random_targets_dsets = Datasets(tls=[sub_dsets.tls[0], random_targets])
random_targets_dsets.dl().show_batch()

Since a targeted attack places an additional requirement on the perturbation, we should use a bigger epsilon and more iterations.

\(l_\infty\) Norm


source

LinfPGD

 LinfPGD (epsilon=0.3, rand_init=True)

Implements PGD by bounding the \(l_\infty\) norm

Untargeted

attack = InputOptimizer(classifier, LinfPGD(epsilon=.15), n_epochs=10, epoch_size=20)
perturbed_dsets = attack.perturb(sub_dsets)
epoch train_loss time
0 -3.317830 00:00
1 -6.035948 00:00
2 -7.208374 00:00
3 -7.782593 00:00
4 -8.100239 00:00
5 -8.288338 00:00
6 -8.405439 00:00
7 -8.480438 00:00
8 -8.529491 00:00
9 -8.561683 00:00
acc = learn.validate(dl=TfmdDL(perturbed_dsets))[1]
test(acc, .1, le)
learn.show_results(shuffle=False, dl=TfmdDL(perturbed_dsets))

Targeted

attack = InputOptimizer(classifier, LinfPGD(epsilon=.2), targeted=True, n_epochs=10, epoch_size=30)
perturbed_dsets = attack.perturb(random_targets_dsets)
epoch train_loss time
0 2.587283 00:00
1 1.211362 00:00
2 0.713616 00:00
3 0.493275 00:00
4 0.380555 00:00
5 0.316781 00:00
6 0.281744 00:00
7 0.262587 00:00
8 0.252041 00:00
9 0.246094 00:00
acc = learn.validate(dl=TfmdDL(perturbed_dsets))[1]
test(acc, .9, ge)
learn.show_results(shuffle=False, dl=TfmdDL(perturbed_dsets))

\(l_2\) Norm


source

L2PGD

 L2PGD (epsilon=0.3, rand_init=True)

Implements PGD by bounding the \(l_2\) norm
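For the \(l_2\) ball, steepest descent rescales the gradient to unit \(l_2\) norm, and the projection shrinks the perturbation back onto the ball whenever its norm exceeds \(\epsilon\). A per-example sketch in plain PyTorch (illustrative only, not the library’s code):

import torch

def steepest_descent_l2(grad, eps=1e-12):
    # l_2 steepest-descent direction: the gradient scaled to unit l_2 norm,
    # computed per example (the batch dimension is preserved)
    norms = grad.flatten(1).norm(dim=1).clamp_min(eps)
    return grad / norms.view(-1, *([1] * (grad.ndim - 1)))

def project_l2(pert, epsilon=15.0, eps=1e-12):
    # Project onto the l_2 epsilon-ball: rescale perturbations whose norm
    # exceeds epsilon, leave smaller ones untouched
    norms = pert.flatten(1).norm(dim=1).clamp_min(eps)
    scale = (epsilon / norms).clamp(max=1.0)
    return pert * scale.view(-1, *([1] * (pert.ndim - 1)))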

Untargeted

Note that the \(l_2\) norm can be up to \(\sqrt{d}\) bigger than the \(l_\infty\) norm, where \(d\) is the dimension, so we need to use a bigger epsilon to obtain similar results:
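Concretely, assuming the usual 28×28 MNIST inputs:

import math

d = 28 * 28          # dimension of a flattened 28x28 image
print(math.sqrt(d))  # 28.0 -- the maximum ratio between the l_2 and l_inf norms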

attack = InputOptimizer(classifier, L2PGD(epsilon=15), n_epochs=10)
perturbed_dsets = attack.perturb(sub_dsets)
epoch train_loss time
0 -3.854619 00:00
1 -4.788039 00:00
2 -5.098700 00:00
3 -5.251333 00:00
4 -5.340628 00:00
5 -5.398513 00:00
6 -5.437977 00:00
7 -5.466318 00:00
8 -5.487265 00:00
9 -5.503058 00:00
acc = learn.validate(dl=TfmdDL(perturbed_dsets))[1]
test(acc, .1, le)
learn.show_results(shuffle=False, dl=TfmdDL(perturbed_dsets))

Targeted

attack = InputOptimizer(classifier, L2PGD(epsilon=25), targeted=True, n_epochs=10, epoch_size=20)
perturbed_dsets = attack.perturb(random_targets_dsets)
epoch train_loss time
0 1.122446 00:00
1 0.638857 00:00
2 0.481712 00:00
3 0.408478 00:00
4 0.367309 00:00
5 0.339710 00:00
6 0.323739 00:00
7 0.314110 00:00
8 0.276853 00:00
9 0.240247 00:00
acc = learn.validate(dl=TfmdDL(perturbed_dsets))[1]
test(acc, .9, ge)
learn.show_results(shuffle=False, dl=TfmdDL(perturbed_dsets))