PERF-#7150: Reduce peak memory consumption #7149
Conversation
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@YarShev @dchigarev ready for review
@@ -377,6 +377,8 @@ def deploy_splitting_func(
     A list of pandas DataFrames.
     """
     dataframe = pandas.concat(list(partitions), axis=axis, copy=False)
+    # to reduce peak memory consumption
+    del partitions
Since we manually delete objects now, how does this affect performance? Did you try running any benchmark?
The del statement simply removes a reference to the object, which allows the garbage collector to free the memory earlier; it does not invoke the collector explicitly, so there is no direct impact on performance.
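For illustration, a minimal sketch (mine, not from the PR) of what del does in CPython: it only unbinds a name, and the object's memory is reclaimed by reference counting as soon as the last reference disappears, without an explicit collector call:

import sys

data = [0] * 10**6            # a large list
alias = data                  # a second reference to the same object
print(sys.getrefcount(data))  # at least 3: data, alias, and getrefcount's argument
del data                      # unbinds the name; the list stays alive via alias
print(len(alias))             # 1000000 -- still accessible
del alias                     # last reference dropped -> CPython frees the list here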
How much does this reduce peak memory consumption?
It depends on the size of the objects in partitions. Here we avoid keeping one extra copy of the entire dataframe alive until the end of the function.
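A hedged sketch (my own, not from the PR) of how one could measure the effect with tracemalloc; it builds the partitions inside the function so that, as in deploy_splitting_func, the local reference is the last one, and it assumes Python 3.9+ for tracemalloc.reset_peak():

import tracemalloc
import numpy as np
import pandas as pd

def build_and_reduce(drop_early):
    # the source partitions are created here, so this function holds
    # the only references to them
    parts = (
        pd.DataFrame(np.random.rand(10**5, 50)),
        pd.DataFrame(np.random.rand(10**5, 50)),
    )
    dataframe = pd.concat(list(parts), axis=0, copy=False)
    if drop_early:
        del parts  # free the source frames before the heavy operation
    return dataframe.abs()

tracemalloc.start()
build_and_reduce(drop_early=False)
peak_keep = tracemalloc.get_traced_memory()[1]
tracemalloc.reset_peak()
build_and_reduce(drop_early=True)
peak_del = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()
print(f"peak without del: {peak_keep / 1e6:.0f} MB, with del: {peak_del / 1e6:.0f} MB")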
Do you have any results on a benchmark we run regularly? It would be great to look at them.
No, only for a synthetic benchmark (it roughly reproduces what happens when performing an operation on axis partitions):
import numpy as np
import pandas as pd
import time
df1 = pd.DataFrame(np.random.rand(10**6, 2 * 10**2))  # ~1.6 GB
df2 = pd.DataFrame(np.random.rand(10**6, 2 * 10**2))  # ~1.6 GB
partitions = (df1, df2)
dataframe = pd.concat(list(partitions), axis=0, copy=False)
del partitions, df1, df2
result = dataframe.abs()
del dataframe
time.sleep(5)
# the variant with del saves around 4 GB of peak memory consumption
I guess we can just look at the results of the memory consumption check built into CI?
In timedf, peak memory consumption is calculated by polling system memory stats (similar to htop) from a child thread every 0.001 seconds. The measured peak is printed as max_system_memory at the end of a benchmark run in CI, so we can use it to compare memory consumption on any of our regular benchmarks.
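For reference, a rough sketch of the polling approach described above (not timedf's actual implementation; it assumes psutil is available and samples system-wide used memory):

import threading
import time
import psutil

class PeakMemoryMonitor:
    """Sample system memory usage from a background thread and track the max."""

    def __init__(self, interval=0.001):
        self.interval = interval
        self.peak = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._poll, daemon=True)

    def _poll(self):
        while not self._stop.is_set():
            # record the highest system memory usage seen so far
            self.peak = max(self.peak, psutil.virtual_memory().used)
            time.sleep(self.interval)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()

# usage:
# with PeakMemoryMonitor() as mon:
#     run_benchmark()
# print(mon.peak)  # analogous to timedf's max_system_memory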
What do these changes do?
Keeping large objects (in our case, dataframes) referenced by local variables when they are no longer needed is extremely memory-intensive. I propose to review all such places and try to free memory as early as possible, as in the sketch below.
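A hypothetical illustration of the pattern (the function and names are mine, not from the PR):

import pandas as pd

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    centered = df - df.mean()
    # df is no longer needed below; del drops this function's reference so
    # the source frame can be freed early -- but only if the caller (e.g. a
    # task scheduler handing off the argument) holds no other reference to it
    del df
    return centered / centered.std()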
- passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- signed commit with git commit -s
- tests added and passing
- docs/development/architecture.rst is up-to-date