Description of Problem:
We recently identified a memory leak in the training of Rasa models.
This leak was on the main branch, but was only detected when a new change made it severe enough to crash CI test workers.
After much investigation, the leak was narrowed down to a bit of TensorFlow code and fixed.
Ideally this memory leak would have been caught sooner, as it would have affected users.
The leak was apparent when either training with a high number of epochs, or training multiple times (like in the test suite).
We would like to have an automated check to test that we don't introduce another memory leak.
Overview of the Solution:
TensorFlow memory leaks can be hard to identify and fix, as they often occur while the graph is being executed, which may happen in C code, for example. This means the "leaking" variables are often not visible when inspecting all the Python objects in the interpreter. However, you can still detect a memory leak by looking at the total memory usage of the process.
We used https://pypi.org/project/memory-profiler/ to track the memory usage of a python process when training the TED model to find the leak mentioned above.
This tool tracks the total memory usage over time and writes it to a file which can be parsed or plotted.
To use this in an automated fashion we could:
- Create a test which trains a model with dummy data but a high number of epochs
- Run this python process wrapped in the profiler
- Analyse the output to see the trend of the total memory usage
We could have a threshold that if crossed fails the test, e.g. 1GB
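The analysis step above could be sketched as follows (a rough illustration, not a final design; the function name and default thresholds are hypothetical). It fails the check if either the peak memory crosses the threshold or memory grows steadily over time, which a simple least-squares slope can capture:

```python
def memory_leak_check(samples, max_peak_mib=1024.0, max_slope_mib_per_s=1.0):
    """Given (timestamp, mem_mib) samples, return True if memory usage
    looks healthy: peak stays under max_peak_mib and the least-squares
    slope of memory over time stays under max_slope_mib_per_s."""
    timestamps = [t for t, _ in samples]
    mems = [m for _, m in samples]
    peak = max(mems)
    n = len(samples)
    mean_t = sum(timestamps) / n
    mean_m = sum(mems) / n
    denom = sum((t - mean_t) ** 2 for t in timestamps)
    # A flat or single-timestamp series has no measurable growth trend.
    if denom == 0:
        slope = 0.0
    else:
        slope = sum(
            (t - mean_t) * (m - mean_m) for t, m in zip(timestamps, mems)
        ) / denom
    return peak <= max_peak_mib and slope <= max_slope_mib_per_s
```

A steadily climbing trace (a leak) fails on the slope check even if it never reaches the absolute threshold, while a high but flat trace only fails if it exceeds the peak limit.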
Definition of Done:
- The check can identify the bug that is mentioned in the description.
- The check works in the CI.
- The check works locally.