
Keras implementation? #14

Open
michetonu opened this issue Jul 13, 2017 · 50 comments

Comments

@michetonu

Hey,

First of all, thanks a lot for this. I was wondering whether there is an easy way to make the gradient flipping work in Keras. Someone has done it for the Theano backend, but not for TensorFlow. Would it be feasible to combine the two?

Thanks!

@pumpikano
Owner

pumpikano commented Jul 13, 2017

I see two options, but I'm not knowledgeable enough about Keras to judge how easy they would be.

First, you could define a GradientReversal layer in Keras - this PR attempted that against the Keras 1.0 API but was never finished. I believe this is feasible and would be best in terms of usability for other Keras users.

Second, I believe that Keras 2.0 and TF can interoperate relatively seamlessly, so you could use Keras to define most of your model and use TF for the gradient reversal portion. Here is a simple example of mixing the two. Again, I haven't used this myself, but it seems like a feasible option.
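For reference, the reversal itself can be implemented as a gradient override in TF 1.x graph mode - a minimal sketch in the spirit of this repo's flip_gradient.py (the class name and the l scaling factor are illustrative):

import tensorflow as tf

class FlipGradientBuilder(object):
    # Identity on the forward pass; multiplies the gradient by -l on backprop.
    def __init__(self):
        self.num_calls = 0

    def __call__(self, x, l=1.0):
        grad_name = 'FlipGradient%d' % self.num_calls

        @tf.RegisterGradient(grad_name)
        def _flip_gradients(op, grad):
            return [tf.negative(grad) * l]

        g = tf.get_default_graph()
        with g.gradient_override_map({'Identity': grad_name}):
            y = tf.identity(x)
        self.num_calls += 1
        return y

flip_gradient = FlipGradientBuilder()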

@kilickaya

Adding to pumpikano's answer, you can implement the gradient reversal layer by 'maximizing' the objective of interest rather than minimizing it. This can be done by minimizing the negative of the cost function in TensorFlow, and should be just as easy in Keras. So your objective may look like this:

min (object_cost - domain_cost)

where you favor good object recognition (object_cost) and hurt your model's ability to differentiate images from two different domains (domain_cost).

This removes the need to explicitly implement a Gradient Reversal Layer, as it achieves the same objective.

@michetonu
Author

@pumpikano Thanks, that looks feasible - I'll take a look and keep you updated.

@kilickaya That's true! Should I then basically just define that as a custom loss function for the domain adaptation part? Would that produce the same output as the paper?

@kilickaya

In the paper, the Gradient Reversal Layer is used to go in the reverse direction of the gradients. This is achievable in two ways: you can either reverse the sign of the gradients and minimize the same objective, or you can reverse the loss function, as I mentioned above. Implementation-wise they are slightly different, but they serve the same purpose (they optimize the same objective).

Either the authors did not realize while writing the paper that the idea is this simple to implement, or they wanted to make it look more complex than necessary.

In their later manuscript, they also state that this can be implemented by maximization.

I don't know about Keras, but you can implement it in TensorFlow like this (assuming you use the Adam optimizer):

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(
    object_cost - domain_cost, var_list=model_vars)

@pumpikano
Owner

Just to clarify, you will need two alternating optimizations with this approach:

min_D(domain_cost)
min_P(object_cost - domain_cost)

where D is the domain prediction subnetwork, P is the class prediction subnetwork, and min_X means "minimize with respect to the parameters of X". This is a classic GAN setup.

It is worth noting that it also gives you the freedom to allow the domain cost to differ between the two optimizations. For instance, many GAN implementations actually minimize the discriminator objective with flipped labels rather than maximizing it with correct labels. This amounts to a different objective and usually works better in practice in GANs.
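A hedged TF 1.x sketch of this alternating scheme (domain_vars/predictor_vars, the placeholders x/y/d, and the batch generator are all illustrative names, not this repo's code):

import tensorflow as tf

# Two optimizers over disjoint variable sets; D is updated first each batch.
d_step = tf.train.AdamOptimizer(1e-4).minimize(
    domain_cost, var_list=domain_vars)                   # min_D(domain_cost)
p_step = tf.train.AdamOptimizer(1e-4).minimize(
    object_cost - domain_cost, var_list=predictor_vars)  # min_P(object_cost - domain_cost)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for x_batch, y_batch, dom_batch in batches:          # illustrative generator
        sess.run(d_step, feed_dict={x: x_batch, d: dom_batch})
        sess.run(p_step, feed_dict={x: x_batch, y: y_batch, d: dom_batch})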

@michetonu
Author

Could I just define a custom loss such as:

def custom_categorical_crossentropy(y_true, y_pred):
    return -K.categorical_crossentropy(y_pred, y_true)

?

@pumpikano what exactly do you mean by 'alternating'? I would just use one cost function in the label branch and the other one in the domain subbranch, correct?

@pumpikano
Owner

I just mean that you will typically take a step with one loss function and then a step with the other for each batch, usually updating the discriminator first.

@michetonu
Author

@pumpikano So I managed to implement the layer for the TF backend by expanding on the link you provided; you can find it here: https://github.com/michetonu/gradient_inversion_keras_tf/blob/master/flipGradientTF.py

Regarding the loss function steps, Keras works with multi-output models, and I think the loss functions are just additive. Correct me if I'm wrong, but I think I can just create one model (it seems to work). The way I'm doing it is to feed both distributions as input but set the target samples' loss weights to zero in the classifier, so that they aren't considered in the backprop update; this also removes the need to alternate inputs. I'll happily share my code once I'm sure it works properly :)

@pumpikano
Owner

Cool! Yeah that seems like a reasonable approach.

@ghost

ghost commented Sep 17, 2017

@michetonu Did you get your code working properly? Would love a Keras DANN implementation example. (Especially how you set the target samples' loss weights to zero in the classifier.)

@michetonu
Author

@Wojova yes! Here is the gradient inversion layer: https://github.com/michetonu/gradient_reversal_keras_tf

For the sample weights you just need to create your input batches by alternating samples from the source and target domain, and pass a sample_weights array of 1s and 0s accordingly.
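A minimal sketch of that masking trick, assuming a compiled two-output Keras model `model` (label predictor first, domain classifier second); the `interleave` helper and the dummy target labels are illustrative, not part of the linked code:

import numpy as np

n = len(x_source)                                # source and target halves are the same size
x_batch = interleave(x_source, x_target)         # illustrative helper: alternate samples 1:1
y_labels = interleave(y_source, y_target_dummy)  # target labels are placeholders, never used
y_domain = np.tile([0.0, 1.0], n)                # 0 = source, 1 = target

label_weights = np.tile([1.0, 0.0], n)  # zero out target samples in the label loss
domain_weights = np.ones(2 * n)         # every sample counts for the domain loss
model.train_on_batch(x_batch, [y_labels, y_domain],
                     sample_weight=[label_weights, domain_weights])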

@ghost

ghost commented Sep 17, 2017

Thanks!

@erlendd

erlendd commented Sep 29, 2017

Responding to some of the previous comments here: I don't think that simply reversing the loss function (on the domain part of the network) is a replacement for the gradient reversal layer. The reason is that simply maximizing the domain loss won't necessarily lead to domain-invariant features in the shared feature layer, as the weights of the domain-classifier layers can produce a bad loss even if the shared feature layer isn't domain-invariant.

@michetonu
Author

@erlendd if you read the original paper by Ganin et al. (and the few other papers implementing the same domain-adversarial training approach), that's exactly what the gradient inversion layer does: it does nothing on the forward pass, and it multiplies the gradient by a negative constant during backprop.

@erlendd

erlendd commented Sep 29, 2017

@michetonu yeah I know, I was replying to @kilickaya 's comment above.

Am I correct in saying that the implementation of DANN here uses the target labels during training? I.e. it isn't unsupervised?

@michetonu
Author

@erlendd it is unsupervised if you set the target's loss_weights to 0. That way they will still contribute to the accuracy (or whatever metric you choose) shown by Keras, but they won't contribute to the loss of the first classifier, which is what you want. Basically the labels are there but don't matter. You can check that this works by randomizing the target's labels during training and getting the same performance (using the real labels at test time).

@erlendd

erlendd commented Sep 29, 2017

@michetonu sorry I was referring to the TF code here: https://github.com/pumpikano/tf-dann/blob/master/Blobs-DANN.ipynb. I don't see a loss_weights variable.

@michetonu
Author

@erlendd ah sorry, the trick is in how they create the batches and run the network. They alternate the branches rather than doing forward and backward passes simultaneously and summing the losses. In the top branch only samples from the source domain are used for training, while the bottom branch receives both. The code is not super clear at first sight (it took me a while as well), but that's how their generator works.

@erlendd

erlendd commented Sep 30, 2017

Thanks, I get how it works now, but isn't there an issue with the batches not being fully shuffled? For example the first half of each batch is always the "source" set and the second half is the "target" set. I think the only way to get around this would be to use a Boolean mask, but I'm unsure.

@michetonu
Author

@erlendd just shuffle the domains separately before you create the batches! It shouldn't matter if they are split half-and-half once they go through the net, right? But in any case, you could even shuffle the batches one by one.

@erlendd

erlendd commented Oct 1, 2017

@michetonu well the domains are already shuffled inside the batch generator code, so that's not an issue. The issue I think is that the training batch is constructed such that the first half is from the source domain and the second half is from the target domain, so with respect to the domain classifier the data has not been shuffled. Probably it won't matter if you use a smaller batch size, but if you used a much larger batch size there could be an issue, I guess.

Also I don't think it's possible to simply shuffle the training batches, as the TensorFlow model here assumes that the first half of the batch is from the source domain and the second half is from the target domain. That's why I was suggesting that it could possibly be fixed using a mask in place of tf.cond in the code.

@michetonu
Author

@erlendd you are right that they are already shuffled! I wasn't looking at the code. It does not matter whether one half always comes first - the model does not "remember" the labels of previous samples, so the order of training samples within each batch doesn't really matter IMHO.

@erlendd

erlendd commented Oct 1, 2017

@michetonu I could be wrong, but I believe it does matter if the first half of the samples always comes from one distribution and the second half from another - otherwise why would there be any need to shuffle in a simpler neural network model? I'm currently rewriting it using a Boolean mask, so I will check whether this makes any difference.

@michetonu
Author

@erlendd my understanding was that shuffling prevents having the same train-test split every time, but I might be wrong!

@erlendd

erlendd commented Oct 1, 2017

@michetonu you're right - order of training samples within a batch doesn't matter.

@pumpikano
Owner

The reason that the order of examples within a batch does not matter is that the loss of the batch is the sum (really, a normalized sum, i.e. mean) of the losses of the examples. The gradient of a sum is the sum of the gradients of the terms, and addition is commutative, so order doesn't matter.

Of course, the batches need to be fair samples in order for the gradient of each minibatch loss to approximate the gradient of the full training-set loss. An easy algorithm for getting unbiased minibatches and ensuring that all examples are used is to shuffle the training examples and take minibatches sequentially until all examples are used, then repeat. Order matters in the sense that the shuffle is the reason this algorithm creates correctly sampled minibatches. Hope this helps!
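A small sketch of that algorithm (shuffle once per pass, then slice sequentially):

import numpy as np

def minibatches(X, y, batch_size):
    n = len(X)
    while True:
        order = np.random.permutation(n)      # fresh shuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            yield X[idx], y[idx]              # every example used exactly once per epoch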

@michetonu
Author

Great explanation!

@qianyizhang

@pumpikano I'm a bit confused after erlendd's comment... Just to be clear, you don't necessarily need the gradient reversal layer as long as you reverse the domain_cost in the final objective (total_cost), correct?

@erlendd

erlendd commented Oct 17, 2017

@qianyizhang reversing domain_cost at the final objective does do the same thing as a GRL. If you just maximise the domain cost you won't know if it's because the shared feature layer has invariance or if the final layer of the domain classifier is bad at separating the domains.

@pumpikano
Owner

pumpikano commented Oct 17, 2017

@erlendd I think you meant to say "does not do the same thing as a GRL"?

In any case, looking at the diagram from the paper might be helpful. Descent on the parameters of G_d (the domain classifier) minimizes the domain classification loss, whereas descent on the parameters of G_f (the feature extractor) maximizes it, because the gradients were reversed. If you removed the GRL and inverted the domain cost, descent on the parameters of G_d and G_f would both maximize the domain cost.
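In code, the distinction is about which variables each update touches - a hedged TF 1.x sketch, with feature_vars/domain_vars as illustrative collections for G_f and G_d:

import tensorflow as tf

opt = tf.train.MomentumOptimizer(0.01, 0.9)

# The DANN objective written without a GRL: two ops over disjoint var_lists.
d_step = opt.minimize(domain_cost, var_list=domain_vars)    # G_d descends on the domain loss
f_step = opt.minimize(-domain_cost, var_list=feature_vars)  # G_f ascends on it

# By contrast, minimizing -domain_cost over *all* variables would make
# G_d maximize its own classification loss, which is not the DANN setup.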

@qianyizhang

@erlendd
@pumpikano
I see now. It's tricky to grasp that the feature extractor and the domain classifier are optimizing opposite objectives. Thanks for answering :-)

@tmullen93

@michetonu

I see how setting the sample_weights to zero for the target will prevent the classifier from updating, but doesn't it also keep the domain classifier from updating? Can you not use different sample_weights for different outputs?

@michetonu
Author

@tmullen93 if your model has two outputs you can define sample_weights as a 2D array such as [[1,1,1...0,0,0], None], one for each output.

@tmullen93

@michetonu Ahh that makes sense thank you!

What do you mean by making your batch alternate through source and target? Does this mean making a generator and forcing it to build an evenly split batch? Could this be why my loss becomes NaN when I take your sample-weight suggestion but don't use a generator to build even batches?

I should also note that I'm trying to do multivariate regression instead of classification.

@michetonu
Author

@tmullen93 just create mini-batches manually (with a generator or otherwise) in which target and source observations are alternated 50/50 (make sure to do the same for the labels, and be careful if you re-shuffle), and then set the training batch_size to a multiple of your mini-batch size. For example, if you alternate target and source observations one by one, any even batch size will keep the same proportion of target/source samples.
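A hedged sketch of such a generator (the names and the dummy target labels are illustrative):

import numpy as np

def mixed_batches(Xs, ys, Xt, yt_dummy, batch_size):
    half = batch_size // 2
    while True:
        si = np.random.choice(len(Xs), half, replace=False)
        ti = np.random.choice(len(Xt), half, replace=False)
        x = np.empty((batch_size,) + Xs.shape[1:], dtype=Xs.dtype)
        y = np.empty((batch_size,) + ys.shape[1:], dtype=ys.dtype)
        x[0::2], x[1::2] = Xs[si], Xt[ti]    # even slots: source, odd slots: target
        y[0::2], y[1::2] = ys[si], yt_dummy[ti]
        weights = np.tile([1.0, 0.0], half)  # mask target samples in the label loss
        yield x, y, weights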

@GlastonburyC

What can I do if I have multiple domains I'd like to account for? How would I set sample_weights? Arbitrarily choose a source domain and set everything else to one?

@erlendd

erlendd commented Jan 22, 2018

@GlastonburyC I tried this method on data with about 100+ domains, and a single target domain. Trying to use many source domains at once with this method doesn't really work (the network can't learn very well). You might be able to use an unsupervised method (e.g. dAE, TCA) to bring the source domains closer together, but I guess it's still an open question.

@GlastonburyC

GlastonburyC commented Jan 22, 2018 via email

@erlendd

erlendd commented Jan 22, 2018

@GlastonburyC you mean if you wanted to append a 'weight' column to your data, where that weight might be different for each domain?

I don't have the code in front of me, but I did it using pandas. If the training covariates are stored in a DataFrame called df, you can add a new column like this:

df['name_of_col'] = values

If you want to make it conditional on the domain, you can use this:

mask = df['domain'] == 'domain1'
df.loc[mask, 'name_of_col'] = values

@GlastonburyC

@michetonu I'm performing image segmentation, so I've bolted on a domain classifier with a gradient reversal layer.

If my batch is 50/50 'source' and 'target' images and I don't set any sample_weights, is the gradient reversal layer sitting between the domain classifier and my segmentation network equivalent to forcing my segmentation network to learn domain-invariant features? I don't necessarily have a 'source' and a 'target'; I just want the segmentation network to work on a test set that may come from a different distribution than my input.

@michetonu
Author

michetonu commented Jan 26, 2018

@GlastonburyC if you don't 'hide' the target samples from the classifier, you are no longer performing unsupervised domain adaptation. It will probably still help, but I think there are better ways to do this if you have labels for your 'target' dataset.

@GlastonburyC

GlastonburyC commented Jan 26, 2018

My thought process is that you sometimes don't know your test data's target domain. Therefore, if you negate the gradient of the domain classifier for both source and target samples (where source and target are merely two distributions from a possible population of distributions), the classifier network is forced to learn domain-invariant features? Is that logic correct?

Would appreciate it if you pointed me to a paper on a better method! :) Cheers for the help.

@puntopasta

Thanks for this contribution. Previously we implemented this in a very cumbersome way, with two different models that had tied encoder weights but separate loss functions; this is far more elegant and less error-prone.

However, I'm not sure what hp_lambda stands for. What are we supposed to pass there?
Mustafa

@GlastonburyC

GlastonburyC commented Feb 7, 2018 via email

@puntopasta

Right, thanks for the explanation (and quick response).

I suspected something like this, since it is multiplied with the gradient; good to know for sure. Going to try it out now :)

@puntopasta

puntopasta commented Feb 7, 2018

By the way, as an alternative to passing custom sample weights for the iterative training of the two objectives, you could probably just compile two models:

i = Input(shape=input_shape)  # Keras Input layer (`input` is a Python builtin)
e = encoder(i)
c = classifier(e)
a = adversarial(e)  # this includes the gradient reversal

classifier_trainer = Model(i, c)
classifier_trainer.compile(optimizer='adam', loss='categorical_crossentropy')
adversarial_trainer = Model(i, a)
adversarial_trainer.compile(optimizer='adam', loss='binary_crossentropy')

A simple minimax training loop would then look something like this:

for b in range(num_batches):
    x, y_class, y_domain = get_batch(b)  # `class` is a reserved word in Python
    classifier_trainer.train_on_batch(x, y_class)
    adversarial_trainer.train_on_batch(x, y_domain)

Or am I overlooking something?

EDIT: seems to be working. The convenience is that when you're done you can just take the classifier_trainer model and use it as the "final model" for evaluation.

@erlendd

erlendd commented Mar 23, 2018

I was just wondering how one would modify the code to do the multiple-source variant of this, i.e. following this paper: https://arxiv.org/abs/1705.09684. It isn't clearly explained how the batches are constructed with multiple source domains.

@anamaqueda

anamaqueda commented Mar 6, 2019

@michetonu I have followed the same strategy as you regarding the network architecture: one model with two outputs.

> Regarding the loss function steps, Keras works with multi-output models, and I think the loss functions are just additive. Correct me if I'm wrong, but I think I can just create one model (it seems to work). The way I'm doing it is to feed both distributions as input but set the target samples' loss weights to zero in the classifier, so that they aren't considered in the backprop update; this also removes the need to alternate inputs. I'll happily share my code once I'm sure it works properly :)

Since I am working with the fit_generator function, I have implemented the following custom loss instead of using sample weights:

def custom_categorical_crossentropy(y_true, y_pred):
    # Keep only the source samples, assumed to occupy the first
    # FLAGS.batch_size entries of each combined batch.
    source_true = y_true[:FLAGS.batch_size]
    source_pred = y_pred[:FLAGS.batch_size]
    loss = K.categorical_crossentropy(source_true, source_pred)
    return K.mean(loss)

That makes sense to me because I create my mini-batches manually, alternating 50-50 source and target samples. However, I am not getting domain adaptation, even when I change hyperparameters. That's why I wonder whether this custom loss implementation is correct.

I'd really appreciate any help. Thanks!

@michetonu
Author

@amn-gti-upm Sorry for the late reply!
How are you using this loss? It makes sense to me as the loss for the classifier part of the network, but you also need to add the inverted loss of the domain classifier. Are you doing that? If so, can you share the code?

@anamaqueda

@michetonu As I pointed out in my previous comment, I used one model with two outputs: the label predictor (LP) and the domain classifier (DC). I used the custom_categorical_crossentropy loss for the former, and binary cross-entropy loss for the latter, like this:

model.compile(loss=[custom_categorical_crossentropy, 'binary_crossentropy'],
                  optimizer=SGD(momentum=0.9))

These two losses are additive. To maximize the domain classifier loss, i.e. the binary cross-entropy, I used your Gradient Reversal Layer (GRL) implementation. I finally got domain adaptation with this implementation, with the lambda parameter fixed at 1.
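For readers landing here, a hedged end-to-end sketch of the setup the thread converges on (Keras 2 functional API; GradientReversal stands in for michetonu's layer, and input_dim, n_classes, and the layer sizes are illustrative):

from keras.layers import Input, Dense
from keras.models import Model
from keras.optimizers import SGD

inp = Input(shape=(input_dim,))
feat = Dense(128, activation='relu')(inp)                  # feature extractor G_f
label_out = Dense(n_classes, activation='softmax')(feat)   # label predictor
flipped = GradientReversal(1.0)(feat)                      # GRL with lambda fixed at 1
domain_out = Dense(1, activation='sigmoid')(flipped)       # domain classifier

model = Model(inp, [label_out, domain_out])
model.compile(loss=[custom_categorical_crossentropy, 'binary_crossentropy'],
              optimizer=SGD(momentum=0.9))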
