RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training. #237
Conversation
@byronyi If we are going to contribute to addons first, do we need an RFC here? |
Since this RFC targets SIG AddOns, adding the SIG AddOns leads. |
That's a very interesting proposal.
From a high level view (and I'm probably wrong) it looks like it proposes a new type of variable and a new type of optimizer which can update that variable. Given that this is the case I think we can implement this in addons or some other SIG package as long as there are APIs in core TF to ensure that this variable can declare itself checkpointable, be tracked by something like tf.Module / keras.Model (so you can do model.trainable_sparse_variables), and maybe be automatically watched via the gradient tape.
Can you expand the document to clarify the details of these changes to existing parts of TF as opposed to most of the content which is on the new types?
Thanks!
It requires changes to core that we should discuss now. From my point of
view the most important feature tf core can offer here is allowing
experimentation and development of this type of problem (for which there is
very high demand at least in industry) to happen without needing to involve
tf core.
Separately from that I think the design of the actual components here has
many interesting parts, and a fairly close version of these components to
what is proposed here should be in core, but I think it's more important
now that we make core properly extensible than that we debate the details
of this component.
…On Thu, Apr 30, 2020 at 10:56 AM Bairen Yi ***@***.***> wrote:
@byronyi <https://github.com/byronyi> If we are going to contribute to addons first, do we need an RFC here?
I guess the design was originally targeted to TF core.
As @alextp <https://github.com/alextp> said, if part of it still requires
changes to TF core, then we still need a (probably smaller) RFC here.
- Alex
|
The change to the existing SparseApply* kernels which removes Ref(T) from
the signature is backwards incompatible and can't be done.
Adding new kernels for the hash apply is fine, though.
I do wonder if we need the Optimizer method _apply_dense_hash or whether we
can use a separate optimizer-like class which knows about the hash
application. This has the advantage that it naturally covers the use cases
where people want different optimizers for the really sparse embedding
layers (which I think is relatively common).
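To make the "separate optimizer-like class" idea concrete, here is a rough, hypothetical sketch (none of these names exist in TF): an Adagrad-style updater that keeps both the parameters and its accumulator slots in hash tables and applies per-id updates directly, which also makes it easy to use a different optimizer for the very sparse embedding layers than for the dense weights.

```python
import tensorflow as tf

class HashAdagrad(object):
    """Hypothetical optimizer-like class that applies Adagrad-style updates
    to hash-table-backed embeddings, keyed by feature id."""

    def __init__(self, learning_rate=0.1, initial_accumulator=0.1, dim=8):
        self.lr = learning_rate
        self.dim = dim
        # The accumulator slot also lives in a hash table keyed by feature id.
        self.accum = tf.lookup.experimental.DenseHashTable(
            key_dtype=tf.int64, value_dtype=tf.float32,
            default_value=tf.fill([dim], initial_accumulator),
            empty_key=-1, deleted_key=-2)

    def apply_hash_gradients(self, table, ids, grads):
        """Updates the rows of `table` addressed by `ids`, given per-id grads."""
        acc = self.accum.lookup(ids) + tf.square(grads)        # [N, dim]
        self.accum.insert(ids, acc)
        rows = table.lookup(ids) - self.lr * grads / tf.sqrt(acc)
        table.insert(ids, rows)
```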
…On Mon, May 4, 2020 at 10:17 AM rhdong ***@***.***> wrote:
That's a very interesting proposal.
From a high level view (and I'm probably wrong) it looks like it proposes
a new type of variable and a new type of optimizer which can update that
variable. Given that this is the case I think we can implement this in
addons or some other SIG package as long as there are APIs in core TF to
ensure that this variable can declare itself checkpointable, be tracked by
something like tf.Module / keras.Model (so you can do
model.trainable_sparse_variables), and maybe be automatically watched via
the gradient tape.
Can you expand the document to clarify the details of these changes to
existing parts of TF as opposed to most of the content which is on the new
types?
Thanks!
Thanks Alex,
In fact, my initial idea was to encapsulate some kind of ResourceVariable-backed hash table; as we know, TF is not good at training anything that is not a tf.Variable. I reuse lookup.MutableHashTable because I don't want to write a new hash library in TF; in particular, lookup.XX is checkpointable and deployable on tf.distribute.Server.
Here is a comparison based on v1.15.2 that shows the range of core affected by the RFC:
https://github.com/tensorflow/tensorflow/compare/v1.15.2...rhdong:rfc?expand=1
The main changes:
1. Supporting a random initializer on lookup.MutableHashTable.Find.
2. Adaptation of four stateful optimizers (Adagrad, Adam, FTRL, Momentum). (Maybe cancelled in the new scheme.)
Thanks!
- Alex
|
Yes, you're right; this is only a temporary version. I have changed the name to _apply_dense_unstateful; XX_hash is a bad name. |
I think TensorFlow can provide a way to extend optimizers so that you can extend existing optimizers to handle your sparse weights. |
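For concreteness, here is a minimal sketch of what "extending an existing optimizer" could look like with the optimizer_v2 API that was current at the time of this discussion; the subclass and the behavior described in the hook body are illustrative assumptions, not an existing TF feature.

```python
import tensorflow as tf

class SparseAwareAdagrad(tf.keras.optimizers.Adagrad):
    """Illustrative subclass: override the sparse-apply hook so per-id updates
    could be redirected to hash-table-backed storage."""

    def _resource_apply_sparse(self, grad, var, indices, apply_state=None):
        # A real implementation would detect hash-table-backed variables here
        # and write the updated rows back into the table; this sketch simply
        # falls through to the stock Adagrad behavior.
        return super(SparseAwareAdagrad, self)._resource_apply_sparse(
            grad, var, indices, apply_state=apply_state)
```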
+1 to Yuefeng's suggestion.
Can this proposal be enhanced with a section discussing such an extension?
…On Mon, May 4, 2020 at 12:14 PM Yuefeng Zhou ***@***.***> wrote:
I think TensorFlow can provide a way to extend optimizers so that you can
extend existing optimizers to handle your sparse weights.
- Alex
|
cc @omalleyt12, who proposed the new customizable optimizer in #234. Would you mind shedding some light on this? |
@yuefengz @byronyi @alextp @smilingday @facaiy @seanpmorgan @omalleyt12 |
I think this version of the scheme is simple and natural enough for core. |
Thanks for the updates. Several issues that may be worth considering:
- These APIs might not be worth a separate TF sub-package, i.e. `tf.dynamic_embedding`. How about, e.g., `tf.nn.dynamic_embedding_lookup`?
- Try to override methods in a subclass and avoid modifying the base class where not necessary.
- `trainable_wrap` is a resource variable, but `dynamic_embedding.Variable` is not (any specific reason?). I confirmed with the author offline that `dynamic_embedding.Variable` represents the whole embedding layer, while `TrainableWrap` wraps a single value lookup from the embedding. The current naming doesn't reflect these semantics.
name='embedding_lookup',
max_norm=None):
"""Provides a dynamic version of embedding_lookup
similar to tf.nn.embedding_lookup.
I am not sure if this deserves another top-level package name. How about `tf.nn.dynamic_embedding_lookup`?
I am not sure if this deserves another top-level package name. How about `tf.nn.dynamic_embedding_lookup`?

Good idea!
###

@tf_export(v1=["dynamic_embedding.embedding_lookup_sparse"])
def embedding_lookup_sparse(params,
Same here, how about `tf.nn.dynamic_embedding_lookup_sparse`?
I'd prefer to put this in a separate repo first. We can allow it to graduate into tf core if it has a large number of users.
I'd prefer to put this in a separate repo first. We can allow it to graduate into tf core if it has a large number of users.

Maybe we can merge this with `tf.nn.embedding_lookup`, since they have the same input arguments.
|
##### Runtime random initialization

Since sparse training does not allocate memory when the training loop starts, sparse weights cannot be initialized statically the way we do for `tf.Variable`.
AFAIK the value of `tf.Variable` is not initialized statically. Perhaps you mean the shape of the variable has to be static?
I will check the code.
What I mean is that all values in `tf.Variable` are determined before training starts, while the values in hash tables are unknown because the memory has not been allocated yet.
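Here is a small sketch of the "runtime random initialization" semantics being discussed, assuming a `tf.lookup.experimental.MutableHashTable` whose default value is a NaN vector used as a missing-key sentinel (the helper name is made up):

```python
import tensorflow as tf

def find_or_random_init(table, ids, dim, stddev=0.05):
    """Looks up `ids`; missing rows get freshly drawn random values, which are
    inserted back so that later steps see the same embeddings."""
    values = table.lookup(ids)                       # [N, dim], NaN rows if missing
    missing = tf.math.is_nan(values[:, 0])           # sentinel check on first column
    init_vals = tf.random.normal([tf.size(ids), dim], stddev=stddev)
    values = tf.where(missing[:, None], init_vals, values)
    # Persist only the newly initialized rows.
    table.insert(tf.boolean_mask(ids, missing), tf.boolean_mask(values, missing))
    return values

# Example setup: 4-dimensional values, a NaN vector as the "missing" default.
dim = 4
table = tf.lookup.experimental.MutableHashTable(
    key_dtype=tf.int64, value_dtype=tf.float32,
    default_value=tf.fill([dim], float("nan")))
emb = find_or_random_init(table, tf.constant([3, 7, 9], dtype=tf.int64), dim)
```

A `FindOrInsert`-style kernel, as suggested in the next comment, would fold this lookup-initialize-insert sequence into a single op.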
* tensorflow/core/kernels/lookup_table_op.cc

```cpp
Status MutableHashTableOfTensors::Find(
```
It's an implementation detail, but consider function names like `FindOrInsert` to avoid confusion.
Thank you so much for your RFC, Haidong! Some high-level comments:
- I think this doc should focus on what changes need to be made in TF core, and perhaps briefly mention how you are going to leverage those changes.
- Your proposal should be TF2-compatible. Some proposed changes, such as those in optimizers, seem only applicable to TF1.
- I still have some confusion about the overall workflow. I recommend you implement a demo with minimal viable features and some simple unit tests in your own repo first. You can use monkey-patching to make changes to TF core.
initial_value=vals,
dtype=params.value_dtype,
trainable=params.trainable)
if max_norm is not None:
Why is `TrainableWrap` not used in this branch?
Sorry, it is a bug; I will fix it.
###

@tf_export(v1=["dynamic_embedding.embedding_lookup_sparse"])
def embedding_lookup_sparse(params,
I'd prefer to put this in a separate repo first. We can allow it to graduate into tf core if it has a large number of users.
slot_name,
op_name):
"""Helper function for creating a slot variable for stateful optimizers."""
_params_var, _params_ids = _IDS_OF_RESOURCEVARIABLE_MAPPER_.get(primary)
What is `_IDS_OF_RESOURCEVARIABLE_MAPPER_`?
It looks like your slot variable is created at every step since it depends on the `ids` (while there is no such requirement in the original optimizers). Is that correct?
What is `_IDS_OF_RESOURCEVARIABLE_MAPPER_`? It looks like your slot variable is created at every step since it depends on the `ids` (while there is no such requirement in the original optimizers). Is that correct?

I will drop `_IDS_OF_RESOURCEVARIABLE_MAPPER_` by reusing `Optimizer.slots`.
|
```python
class _DenseDynamicEmbeddingTrainableProcessor(_OptimizableVariable):
```
Does this class need to be different for different optimizers?
No need to be.
|
```python
@tf_export("dynamic_embedding.Variable")
class Variable(object):
```
We probably can use a different name instead of Variable?
We probably can use a different name instead of Variable?
Yeah, I agree with you.
""" | ||
pass | ||
|
||
def remove(self, keys, name=None): |
Could you add some details about when the remove would be triggered?
For example, when users want to control the total number of features via some custom policy.
Adding some docs would help.
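A short illustration of the kind of custom policy mentioned above; everything here is hypothetical: `var` stands for the proposed `dynamic_embedding.Variable` with its `remove(keys)` API, and `last_access` for a user-maintained lookup table recording the last step at which each id was seen.

```python
import tensorflow as tf

def evict_stale_features(var, last_access, current_step, ttl):
    """Hypothetical eviction policy: remove features not seen for `ttl` steps
    to keep the total number of features bounded."""
    keys, steps = last_access.export()            # all tracked ids and last-access steps
    stale = tf.boolean_mask(keys, current_step - steps > ttl)
    var.remove(stale)                             # proposed API on dynamic_embedding.Variable
    last_access.remove(stale)
```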
|
## Sparse Domain Isolation

### Overview of Design
What does the gradient look like? Do you need to define a custom gradient function for `embedding_lookup`?
What does the gradient look like? Do you need to define a custom gradient function for `embedding_lookup`?

There is no need to define a custom gradient function. I believe back-propagation should end after the update to `TrainableWrap`.
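A minimal sketch of why no custom gradient is needed: once the looked-up rows are materialized into a `ResourceVariable` (the role `TrainableWrap` plays here), the tape treats them like any other variable. The stand-in variable below is illustrative:

```python
import tensorflow as tf

# Stand-in for the rows read out of the hash table into a TrainableWrap.
rows = tf.Variable(tf.random.normal([4, 8]))

with tf.GradientTape() as tape:
    logits = tf.reduce_sum(rows, axis=1)
    loss = tf.reduce_mean(tf.square(logits))

grads = tape.gradient(loss, rows)   # ordinary dense gradient, no custom grad fn
# An optimizer applies `grads` to `rows`; the updated rows are then written
# back into the hash table, which is where back-propagation stops.
```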
|
```python
class TrainableWrap(ResourceVariable):
```
My understanding is that `TrainableWrap` is a container that holds (dynamic_embedding, ids) but pretends to be a `tf.Variable`. I am wondering whether it has to be a `ResourceVariable`?
My understanding is that `TrainableWrap` is a container that holds (dynamic_embedding, ids) but pretends to be a `tf.Variable`. I am wondering whether it has to be a `ResourceVariable`?

Yes, because all optimizers can train a `ResourceVariable` through `resource_apply{,_sparse}`; otherwise you have to extend them one-by-one.
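For example, any stock optimizer can already update a resource variable through the generic resource-apply path, which is the point being made here (a plain `tf.Variable` stands in for a `TrainableWrap` instance):

```python
import tensorflow as tf

wrap = tf.Variable(tf.zeros([3, 4]))                # stand-in for a TrainableWrap
opt = tf.keras.optimizers.Adam(learning_rate=1e-3)
grads = tf.ones_like(wrap)
opt.apply_gradients([(grads, wrap)])                # no per-optimizer extension needed
```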
Thanks for pinging me, @yuefengz @smilingday. The proposal is very interesting. I'm wondering if we can introduce a new kind of Variable class and reuse all existing optimizers (in tf-core or tf-addons). I'm afraid the proposal goes beyond the scope of tf-addons, so I suggest putting it in a separate repo first. @seanpmorgan Sean, what do you think? |
Sean discussed this in the SIG AddOns meetings and replied in separate email threads that tf-addons might not be a good fit. We are still exploring the right place for these contributions. |
I will provide the source code with unit test cases soon.
|
Is this RFC related to the recently proposed paper "DynamicEmbedding: Extending TensorFlow for Colossal-Scale Applications" by Google? https://arxiv.org/pdf/2004.08366.pdf |
No, this is a different scheme, proposed in an earlier paper, Distributed Equivalent Substitution Training for Large-Scale Recommender Systems (accepted by SIGIR 2020). |
…s Training. Monkey patch to core: tensorflow/community#237
…s Training. Please visit tensorflow/community#237
…s Training. Please visit tensorflow/community#237
Force-pushed from d243b14 to 6ea93a6.
@yuefengz @tanzhenyu @byronyi @alextp Hi, I just updated this RFC. This update contains some key features, including the scheme for being compatible with all |
Is it compatible with TensorFlow Serving? @rhdong |
Yes |
Hi @rhdong, I fixed some bugs (shape of TrainableWrapper) and built TF 2.4.0 based on your code. It seems the dynamic_embedding didn't get updated in the training process. Code as follows:
console:
|
@shenbaise Thank you for the feedback, I will check and fix it as soon as possible. |
Hi @shenbaise, the reason is that the commit is not compatible with Keras, especially optimizer v2. I need two days to fix it and add the UT cases; please wait a moment. Thank you! |
Hi @shenbaise, I fixed the issue and the commit is here |
FYI, @kttian wrote a prototype for a differentiable hash map, roughly the equivalent of TensorList, as part of her internship project. Here's a colab that demonstrates direct gradient updates: https://colab.sandbox.google.com/drive/1hyFmriuq4Bz61_rxg2bfdE_jXHVfX8Rr?usp=sharing#scrollTo=8HDUUBEFAesC There may be an opportunity to join efforts on a core implementation. |
This is a good job, but I think it is difficult to make the hash map trainable. |
It already is trainable (at least in the sense of trainable that I believe you're referring to). |
…s Training. Please visit tensorflow/community#237
This PR is one part of RFC:tensorflow/community#237
This PR is one part of RFC:tensorflow/community#237
…s Training. Please visit tensorflow/community#237
This PR is one part of RFC:tensorflow/community#237
This PR is one part of RFC:tensorflow/community#237
…up support full size dynamic default values. This PR is one part of RFC:tensorflow/community#237
@yuefengz Is this still in draft mode? What are the plans with this RFC? |
|
### Trainable Wrapper

In the early scheme, the `dynamic_embedding.Variable` was trained directly by optimizers, and **we had to extend all optimizers one-by-one**.
Hi @yeyinqcri, thanks for reaching out! You might be interested in this long-maintained repo, which was inspired by this RFC: https://github.com/tensorflow/recommenders-addons
Thanks for the prompt response. Yes, I am aware of TFRA and understand that TFRA uses TrainableWrapper to make the hash table trainable with the default TensorFlow runtime; however, we ran into an issue when using TFRA + PS to train a model in Keras. This is understandable given that PS is already on the deprecation path. What I am trying to test is whether it is still possible to use the old way (extending the optimizer) to make it work.
Thanks for the prompt response. Yes, I am aware of TFRA and understand that TFRA uses TrainableWrapper to make the hash table trainable with the default TensorFlow runtime; however, we ran into an issue when using TFRA + PS to train a model in Keras. This is understandable given that PS is already on the deprecation path. What I am trying to test is whether it is still possible to use the old way (extending the optimizer) to make it work.
Oh, I see. Did you submit an issue for it?
We will submit an issue for it. In the meantime, we want to explore other solutions (e.g. implementing an optimizer) to unblock our urgent needs if possible, so we would appreciate it if you could share how the old way works so we can follow it.
Sure! More details in an issue will help us assess it better.
Sparse Domain Isolation for supporting large-scale Recommender Systems.
@yuefengz @byronyi
Hi,
This is the RFC of Sparse Domain Isolation for supporting large-scale Recommender Systems.
It's still a draft; we will update it with the latest content as soon as possible, and we can improve on this basis. In order to move forward as soon as possible, I submitted it here first, but the owners are everyone who participated in the discussion in the past, and we will complete the list later.