Faster sequence #8039
Conversation
keras/utils/data_utils.py
# Global variables to be shared across processes
shared_sequence = None
manager = multiprocessing.Manager()
shared_dict = manager.dict()
Please use all caps for global variables, and use a _ prefix to indicate they are private.
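A minimal sketch of what the reviewer is asking for; the identifiers below are illustrative, not necessarily the names used in the final patch:

import multiprocessing

# Private module-level globals: all caps, with a leading underscore.
_SHARED_SEQUENCE = None
_MANAGER = multiprocessing.Manager()
_SHARED_DICT = _MANAGER.dict()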
keras/utils/data_utils.py
@@ -516,6 +530,18 @@ def get(self):
self.stop()
raise StopIteration(e)

def send_seq(self):
This method is only used by the Sequence instance. Please make it private.
keras/utils/data_utils.py
return shared_sequence[i]


def update_sequence(seq):
Make this method private and add a docstring explaining its purpose.
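A hedged sketch of what that could look like; the body is an assumption based on the snippet above, not the actual patch:

def _update_sequence(seq):
    """Update the Sequence cached in this worker process.

    Called between epochs so that changes made by on_epoch_end
    (for example, shuffled indices) become visible to the workers.
    """
    global _SHARED_SEQUENCE
    _SHARED_SEQUENCE = seq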
LGTM, thanks!
* Make Sequence faster
* Don't update if we're done
* Change according to review
This breaks on Windows. Note that there is no
Could you show the stacktrace? I don't have a Windows machine.
I did a fix: the manager is only initialized if we need it (which saves memory for everyone who does not use Sequence). I do not have access to a Windows machine; is there a quick way to see if it fixes the bug?
What happens if you use a sequence in the generator, while the previous sequence is not closed?
@Dref360 Yes, it works. Please feel free to ping me for all multiproc-related PRs for testing on Windows.
Will do. Great point, Constantin! This is not handled now; it will be done tonight.
My concern is also that the entire sequence is sent to every process at the end of each epoch. This can easily blow the memory of the machine if the sequence holds a lot of information and you have a lot of processors. For example, as someone mentioned, if you are keeping file paths for 100M files, this info will be copied to every single process. I think a better solution would be to simply restart the pool on every epoch, thus using the copy-on-write mechanism to save memory (at least on Linux, not sure about Windows). The overhead of starting a pool is small compared to large-data cases where you are copying GB of data to each process.
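A rough sketch of that alternative, assuming a fork-based start method (so it applies on Linux, not Windows); run_epoch, _get_item, and _EPOCH_SEQUENCE are hypothetical names, not Keras API:

import multiprocessing

# Set in the parent before the pool is created; forked workers see it through
# copy-on-write pages instead of receiving a pickled copy.
_EPOCH_SEQUENCE = None


def _get_item(i):
    return _EPOCH_SEQUENCE[i]


def run_epoch(sequence, workers=4):
    global _EPOCH_SEQUENCE
    _EPOCH_SEQUENCE = sequence
    # Recreating the pool every epoch re-forks the workers, so whatever
    # on_epoch_end changed is picked up without resending the data.
    with multiprocessing.Pool(workers) as pool:
        return pool.map(_get_item, range(len(sequence)))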
FYI: #8058
This reverts commit 19e1be2.
…-outputs

* master: (68 commits)
  Change default value of shuffle parameter of Sequential.fit_generator() from True to False. (keras-team#8075)
  Fix off-by-one bug in predict/evaluate progress bar (keras-team#8071)
  Revert "Faster sequence" (keras-team#8060)
  Support NCHW for conv2d. (keras-team#8021)
  Change compute_accuracy() argument order and names (keras-team#8049)
  Replace literal constant 10 with variable num_classes in example/ (keras-team#8041)
  Faster sequence (keras-team#8039)
  Improve RNN docs.
  Enable accuracy reporting during training in examples/mnist_siamese_graph.py (keras-team#7997)
  Bug fix: Models with shared layers shouldn't be considered Sequential like (keras-team#8025)
  Add 'subtract' merge layer documentation (keras-team#8038)
  Update inference in seq2seq script to be more efficient
  Remove lstm_benchmark from examples/README.md (keras-team#8024)
  Add shuffle to the Model API (keras-team#8023)
  Add seq2seq example script.
  fix travis failure (keras-team#8014)
  Improve TF backend's Switch function (keras-team#7958)
  Added support for dynamic noise_shape in Dropout (keras-team#7999)
  Make on_epoch_end optional (keras-team#8007)
  Incremental tests speed ups.
  ...
During the initial implementation of Sequence, there have been some issues when using large objects inside the Sequence, such as a list of paths. This PR solves these issues by caching the Sequence in each process instead of copying it every time we need it. Between epochs, we update the Sequence since it may be modified by on_epoch_end. I could have done it in a simpler way by directly accessing self.pool._processes, but that would mean accessing a private member.

Example: let's say that we keep a list x inside a Sequence where x = [('abc', 512)] * 100000.
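A self-contained sketch of the caching idea described above (not the actual data_utils.py code): the Sequence is sent to each worker once via the pool initializer, only the index travels per task, and the cache can be refreshed between epochs. BigSequence and the helper names are made up for illustration.

import multiprocessing

# Per-process cache: populated once per worker instead of pickling the
# Sequence along with every single task.
_CACHED_SEQUENCE = None


def _init_worker(seq):
    global _CACHED_SEQUENCE
    _CACHED_SEQUENCE = seq


def _get_index(i):
    # Only the integer index crosses the process boundary.
    return _CACHED_SEQUENCE[i]


class BigSequence(object):
    """Toy Sequence-like object holding a large list of tuples."""

    def __init__(self):
        self.x = [('abc', 512)] * 100000

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        return self.x[i]


if __name__ == '__main__':
    seq = BigSequence()
    # The Sequence is copied to each worker once, at pool start-up; between
    # epochs the pool can be re-initialized (or the cache updated) so that
    # changes from on_epoch_end become visible.
    pool = multiprocessing.Pool(2, initializer=_init_worker, initargs=(seq,))
    print(pool.map(_get_index, range(5)))
    pool.close()
    pool.join()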