
Fix ChannelShuffle #418

Closed
wants to merge 4 commits into from

Conversation

@innat (Contributor) commented May 7, 2022

Closes #259

@innat marked this pull request as ready for review May 7, 2022 22:22
@bhack (Contributor) commented May 8, 2022

Was this done for ShuffleNet?

@innat (Contributor, Author) commented May 8, 2022

I didn't test with the actual ShuffleNet model, but I tested with this model and it works fine so far. So I think it should also work in general.

@bhack (Contributor) commented May 8, 2022

What I meant is that ShuffleNet does not have this randomization requirement, so we probably need to do something different.

I don't think we care about gradients in KPLs, because we already have many other ops without gradients in KPLs, so I don't think we are tracking this.
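
For reference, the channel shuffle from the ShuffleNet paper is deterministic: a reshape/transpose/reshape with no randomness at all. A minimal TF sketch (illustrative only, not code from this PR):

```python
import tensorflow as tf

def shufflenet_channel_shuffle(x, groups):
    # x: (batch, height, width, channels), channels divisible by `groups`.
    # Fully deterministic: no seed, no randomness.
    b, h, w, c = tf.unstack(tf.shape(x))
    x = tf.reshape(x, [b, h, w, groups, c // groups])
    x = tf.transpose(x, [0, 1, 2, 4, 3])  # swap the group and sub-channel axes
    return tf.reshape(x, [b, h, w, c])
```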

@innat (Contributor, Author) commented May 8, 2022

> What I meant is that ShuffleNet does not have this randomization requirement, so we probably need to do something different.

Like this?

```python
tf.random.set_seed(self.seed)
tf.random.uniform(shape=[self.groups], seed=self.seed)
```

@bhack (Contributor) commented May 8, 2022

@innat (Contributor, Author) commented May 8, 2022

Yes. I found that earlier but couldn't adopt it, because initially this layer was proposed as an input channel shuffle and randomness was a must.

However, now I think we can use tf.random.set_seed to make it static, i.e. a one-time permutation for each layer instance.
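
A minimal sketch of what that would do (assuming eager execution): re-setting the global seed before a seeded op replays the same sequence, so the resulting values, and any permutation derived from them, are effectively static.

```python
import tensorflow as tf

# With the global and op-level seeds both fixed, re-seeding before the
# draw reproduces the same values on every call.
tf.random.set_seed(101)
first = tf.random.uniform(shape=[3], seed=101)

tf.random.set_seed(101)
second = tf.random.uniform(shape=[3], seed=101)

print(bool(tf.reduce_all(first == second)))  # True
```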

@bhack (Contributor) commented May 8, 2022

Let's see if we could unify the two use cases.

But it isn't clear to me whether we want to reuse the KPL namespace in differentiable layers...

@bhack (Contributor) commented May 10, 2022

I think we could separate Shuffle into the ShuffleNet and KPL layers, as they have two very different behaviors.

Can we keep the original impl for the KPL?

@LukeWood (Contributor):

> Let's see if we could unify the two use cases.
>
> But it isn't clear to me whether we want to reuse the KPL namespace in differentiable layers...

I think it is fine. Layers are imported from keras_cv.layers.*; where the code lives is less important. ShuffleNet is a rare case.

@LukeWood (Contributor):

> I think we could separate Shuffle into the ShuffleNet and KPL layers, as they have two very different behaviors.
>
> Can we keep the original impl for the KPL?

Thanks for the note. Can we clarify this point? How does the behavior change? This may be going over my head here.

@bhack (Contributor) commented May 11, 2022

> > I think we could separate Shuffle into the ShuffleNet and KPL layers, as they have two very different behaviors.
> >
> > Can we keep the original impl for the KPL?
>
> Thanks for the note. Can we clarify this point? How does the behavior change? This may be going over my head here.

What are we trying to fix with this PR? The missing gradient? Avoiding the vmap fallback? XLA coverage? Something else?

@LukeWood (Contributor) commented May 11, 2022

> > > I think we could separate Shuffle into the ShuffleNet and KPL layers, as they have two very different behaviors.
> > >
> > > Can we keep the original impl for the KPL?
> >
> > Thanks for the note. Can we clarify this point? How does the behavior change? This may be going over my head here.
>
> What are we trying to fix with this PR? The missing gradient? Avoiding the vmap fallback? XLA coverage? Something else?

I imagine the missing gradient, but that is a separate discussion imo.

What behavior mismatches here? Functionally the layer does the same thing, right? Or do I misunderstand?

@innat (Contributor, Author) commented May 15, 2022

@bhack

> I think we could separate Shuffle into the ShuffleNet and KPL layers, as they have two very different behaviors.

The operation will have a subtle difference if we maintain two separate implementations, and perhaps the naming will too.

Also, the current implementation uses the transpose operation twice; I think the new PR is lighter. What do you think?
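
For comparison, a gather-based version along those lines (just a sketch of the approach, not the exact diff) permutes the channel groups with a single reshape round-trip and keeps a gradient path to the input:

```python
import tensorflow as tf

def random_channel_shuffle(x, groups=3, seed=None):
    # x: (batch, h, w, c) with c divisible by `groups`. Using
    # argsort + gather instead of tf.random.shuffle keeps the op
    # differentiable with respect to x.
    b, h, w, c = tf.unstack(tf.shape(x))
    x = tf.reshape(x, [b, h, w, groups, c // groups])
    perm = tf.argsort(tf.random.uniform(shape=[groups], seed=seed))
    x = tf.gather(x, perm, axis=3)  # shuffle the group axis
    return tf.reshape(x, [b, h, w, c])
```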

@innat (Contributor, Author) commented May 15, 2022

> > What I meant is that ShuffleNet does not have this randomization requirement, so we probably need to do something different.
>
> Like this?
>
> ```python
> tf.random.set_seed(self.seed)
> tf.random.uniform(shape=[self.groups], seed=self.seed)
> ```

@LukeWood Is it OK to use the above approach?

@LukeWood (Contributor):

> @bhack
>
> > I think we could separate Shuffle into the ShuffleNet and KPL layers, as they have two very different behaviors.
>
> The operation will have a subtle difference if we maintain two separate implementations, and perhaps the naming will too.
>
> Also, the current implementation uses the transpose operation twice; I think the new PR is lighter. What do you think?

Can we clarify what the differences are? I am not convinced we need two implementations unless the differences are significant.

@bhack (Contributor) commented May 15, 2022

> Can we clarify what the differences are?

Two points:

  • I've linked above a TF implementation that was quite popular at the time, and if I remember correctly from the paper (but I need to check it again), channel shuffle has no random component in ShuffleNet.
  • Correct me if I am wrong: from an API point of view, since our KPL layers are preprocessing layers, they are not guaranteed to be differentiable (see the probe sketched below). So it could be a little strange to use a layer from the KPL namespace in a "learnable" section of the network graph.
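
As a quick probe of that second point (a sketch, not code from this repo), taping through the op used by the current implementation reproduces the error from the linked issue:

```python
import tensorflow as tf

x = tf.Variable(tf.random.uniform([2, 8, 8, 6]))
with tf.GradientTape() as tape:
    y = tf.random.shuffle(x)
# Raises: LookupError: gradient registry has no entry for 'RandomShuffle'
grads = tape.gradient(y, x)
```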

@innat requested review from LukeWood May 15, 2022 19:29
@innat (Contributor, Author) commented May 15, 2022

@bhack

> Correct me if I am wrong: from an API point of view, since our KPL layers are preprocessing layers, they are not guaranteed to be differentiable. So it could be a little strange to use a layer from the KPL namespace in a "learnable" section of the network graph.

I think it's fine to consider it a special case. The static channel shuffle operation is offered officially in PyTorch, though not as a random layer for image augmentation; the albumentations library does offer it that way. Whether it is treated as a differentiable layer or a preprocessing layer, in both cases it has to subclass the layer module. HERE was a relevant query. Also, I think the KPL layers may contain differentiable layers in the future.
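
For instance, a quick illustration of the PyTorch op mentioned above (using only its documented signature):

```python
import torch
import torch.nn as nn

# torch.nn.ChannelShuffle is static: it interleaves the channel groups
# deterministically, with no seed or randomness involved.
shuffle = nn.ChannelShuffle(groups=3)
x = torch.arange(6, dtype=torch.float32).reshape(1, 6, 1, 1)
print(shuffle(x).flatten())  # tensor([0., 2., 4., 1., 3., 5.])
```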

@bhack (Contributor) commented May 15, 2022

> I think the KPL layers may contain differentiable layers in the future.

Isn't this point partially the root cause of this PR? If we go in this direction, then every time we contribute a new component to the preprocessing folder we need to consider whether or not to use differentiable ops.

Your detection emerged from a sort of "e2e integration test": #218 (comment)

@LukeWood (Contributor):

> > I think the KPL layers may contain differentiable layers in the future.
>
> Isn't this point partially the root cause of this PR? If we go in this direction, then every time we contribute a new component to the preprocessing folder we need to consider whether or not to use differentiable ops.
>
> Your detection emerged from a sort of "e2e integration test": #218 (comment)

From what I can tell, I think ShuffleNet is a super rare one-off.

@bhack (Contributor) commented May 16, 2022

> From what I can tell, I think ShuffleNet is a super rare one-off.

Also, if we consider this a quite rare case (I am not so sure about future-proofing here), then for API consistency we need to consider that this approach requires removing the Random prefix from the name even though the layer still has the (optional) random logic.
For all the other KPLs we used the Random prefix when there was random logic.
Instead, for this one, if we use the dual scope, we need to control its randomness with an extra param.
I think we are losing a little bit of coherence.

> I think it's fine to consider it a special case. The static channel shuffle operation is offered officially in PyTorch, though not as a random layer for image augmentation.

About PyTorch, there is a more general request for a random shuffle op with an axis param:

pytorch/pytorch#71409

```python
@@ -52,6 +54,11 @@ def __init__(self, groups=3, seed=None, **kwargs):
        self.groups = groups
        self.seed = seed

        tf.random.set_seed(self.seed)
```
@innat (Contributor, Author) commented on the diff:

@LukeWood cc @bhack

This line is only here to satisfy ShuffleNet's requirements: with it, the layer produces a static shuffle operation.

```python
aug_fn = ChannelShuffle(groups=3, seed=101)
aug_fn(a, training=True)  # same permutation on every call for seed=101

aug_fn = ChannelShuffle(groups=3, seed=42)
aug_fn(a, training=True)  # a different, but equally static, permutation
```

But for a random operation, seed=None. This may also conflict a bit with other KPLs from an implementation perspective. Thoughts?

@bhack (Contributor):

This is the point from my last comment.

@LukeWood (Contributor):

> @LukeWood cc @bhack
>
> This line is only here to satisfy ShuffleNet's requirements: with it, the layer produces a static shuffle operation.
>
> ```python
> aug_fn = ChannelShuffle(groups=3, seed=101)
> aug_fn(a, training=True)
>
> aug_fn = ChannelShuffle(groups=3, seed=42)
> aug_fn(a, training=True)
> ```
>
> But for a random operation, seed=None. This may also conflict a bit with other KPLs from an implementation perspective. Thoughts?

I don't think we should be modifying the seed inside of any layer.

```python
image = tf.random.shuffle(image, seed=self.seed)
image = tf.transpose(image, perm=[1, 2, 3, 0])
```

```python
rand_indices = tf.argsort(self.rand_uniform(shape=[self.groups]))
```
@innat (Contributor, Author) commented on the diff:

@LukeWood cc @bhack
I've noticed that tf.argsort falls back to a while loop here:

```
Tensor("loop_body/argsort/TopKV2:1", shape=(3,), dtype=int32)
WARNING:tensorflow:Using a while_loop for converting TopKV2
```
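
A minimal reproduction of that fallback (my assumption about the trigger, since the warning comes from pfor's auto-vectorization):

```python
import tensorflow as tf

# TopKV2 (the op behind tf.argsort) has no pfor converter, so
# tf.vectorized_map falls back to a while_loop and logs:
#   WARNING:tensorflow:Using a while_loop for converting TopKV2
out = tf.vectorized_map(tf.argsort, tf.random.uniform([8, 3]))
```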

@bhack (Contributor) commented May 16, 2022:

> @LukeWood cc @bhack
> I've noticed that tf.argsort falls back to a while loop here:
>
> ```
> Tensor("loop_body/argsort/TopKV2:1", shape=(3,), dtype=int32)
> WARNING:tensorflow:Using a while_loop for converting TopKV2
> ```

Just another one to add to the list. But as you can see, with the master implementation we already have RandomShuffle: #291 (comment)

So: one in, one out.

@innat (Contributor, Author) commented May 16, 2022

@bhack cc @LukeWood

> Also, if we consider this a quite rare case (I am not so sure about future-proofing here), then for API consistency we need to consider that this approach requires removing the Random prefix from the name even though the layer still has the (optional) random logic.

IMO it's not required to add the Random prefix if we consider it a special case. Curious: will keras_cv always provide only random transformation layers?

I'm OK with having two separate layers if that's acceptable for keras_cv, but that would also feel a bit odd, and the naming might be confusing.

I'm more interested in unifying the two cases, i.e. the random operation as a KPL and the static operation as a model layer.

@bhack (Contributor) commented May 16, 2022

> Curious: will keras_cv always provide only random transformation layers?

I think this part of the topic was covered early on in #122 (comment). IMHO we did not get any particular outcome from that discussion.

@LukeWood (Contributor):

> > From what I can tell, I think ShuffleNet is a super rare one-off.
>
> Also, if we consider this a quite rare case (I am not so sure about future-proofing here), then for API consistency we need to consider that this approach requires removing the Random prefix from the name even though the layer still has the (optional) random logic. For all the other KPLs we used the Random prefix when there was random logic. Instead, for this one, if we use the dual scope, we need to control its randomness with an extra param. I think we are losing a little bit of coherence.
>
> > I think it's fine to consider it a special case. The static channel shuffle operation is offered officially in PyTorch, though not as a random layer for image augmentation.
>
> About PyTorch, there is a more general request for a random shuffle op with an axis param:
>
> pytorch/pytorch#71409

OK, yeah: if there is no random behavior in one case and it is fully random in the other, it should really be a different layer. I misunderstood the original paper and the meaning of ShuffleNet.

@bhack (Contributor) commented May 16, 2022

> OK, yeah: if there is no random behavior in one case and it is fully random in the other, it should really be a different layer. I misunderstood the original paper and the meaning of ShuffleNet.

That was the original point, #259 (comment), 13 days ago 😉

@LukeWood added the `awaiting review` label Jul 14, 2022
@tanzhenyu (Contributor) left a comment:

Thanks for the PR!

```python
@@ -52,6 +54,11 @@ def __init__(self, groups=3, seed=None, **kwargs):
        self.groups = groups
        self.seed = seed

        tf.random.set_seed(self.seed)
        self.rand_uniform = keras_cv.UniformFactorSampler(
            lower=0, upper=1, seed=self.seed
```
@tanzhenyu (Contributor) commented on the diff:

Can you add a unit test for this gradient fix?

@LukeWood (Contributor):

Closing for now, as this does not seem to do the same thing as RandomChannelShuffle(). Instead, it appears to shuffle channels uniformly in order to support ShuffleNet. If we ever support ShuffleNet, we can introduce a one-off layer to support that model architecture in a gradient-friendly way.

Thank you for your enthusiasm as always!

@LukeWood closed this Apr 26, 2023
freedomtan pushed a commit to freedomtan/keras-cv that referenced this pull request Jul 20, 2023
Labels: awaiting review (PRs that need to be reviewed)

Successfully merging this pull request may close these issues:

  • LookupError: gradient registry has no entry for RandomShuffle

4 participants