Conversation
…h looks like we've reduced memory usage as far as possible
* Initial plan
* Add memory_profiler and pympler to CI environment dependencies

Co-authored-by: charles-turner-1 <52199577+charles-turner-1@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…low & all tests passing at last
I ran the same thing as last time, which looks the same as what you used. However, you seem to get lower RAM increments... and I can't reproduce the same values as last time... I guess I did something wrong, but can't see what.

Anyway, here's the profile with this branch:

And with

The values are not really stable; I see them moving around when I repeat the test. With this uncertainty in mind, it seems that this branch moves the increment to the search, but the overall RAM usage is similar? Maybe a slight decrease.

And with

Thanks for your work! Sorry that I can't help much right now.
Weird - those numbers are very different to mine... I'll keep digging!
…ailable - no need to initialise it multiple times
…das dataframe for
Okay, with the latest commits:
Newest changes TL;DR:

Memory usage is basically unchanged by this.
Following the merge of #771 into this, total memory usage is now down to 200MiB when using `python profile_intake_esm_pascal.py`:
Searching for variable='tasmin'...
Filename: /Users/u1166368/catalog/intake-esm/profile_intake_esm_pascal.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
6 231.1 MiB 231.1 MiB 1 @profile
7 def main():
8 231.1 MiB 0.0 MiB 1 _ = 1
9 237.9 MiB 6.8 MiB 1 cat = esm_datastore('/Users/u1166368/scratch/simulation.json')
10 237.9 MiB 0.0 MiB 1 print("\nSearching for variable='tasmin'...\n")
11 444.6 MiB 206.7 MiB 1 scat = cat.search(variable='tasmin')
12
13 # print('\n\nscat.df info:')
14 # print(scat.df.info())
15
16 # print('\n\ncat.df info:')
17 # print(cat.df.info())
18
    19                                                    # _srcs = scat.unique()

Tad frustrating that it doesn't seem like we'll be able to get down much further without deferring the creation of a pandas dataframe aggressively, but I think this is probably acceptable?

Average time per open and search: 0.0846 seconds
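The table above comes from memory_profiler's `@profile` decorator. As a rough stdlib alternative for spot-checking a single allocation step, `tracemalloc` can be used instead; the list of dicts below is just a stand-in workload for the catalog search, not intake-esm code:

```python
# Sketch: spot-check the memory increment of one step with the stdlib's
# tracemalloc instead of memory_profiler's @profile. The list of dicts is
# a stand-in for a catalog search result, not actual intake-esm code.
import tracemalloc

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()

# Stand-in workload: build a toy "search result".
rows = [{"variable": "tasmin", "path": f"file_{i}.nc"} for i in range(10_000)]

after, peak = tracemalloc.get_traced_memory()
print(f"Increment: {(after - before) / 1024:.1f} KiB (peak {peak / 1024:.1f} KiB)")
tracemalloc.stop()
```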
I've started looking at transforming the …

@aulemahal does this look like the memory consumption is down enough to work well for you guys now?
Change Summary
* Added a `_search.pl_search` function using polars (as lazily as I can), not pandas, if we don't have a `pd.DataFrame` instantiated. If we have one, we use the old `_search.search` function. `esm_datastore.search()` still triggers the creation of a pandas dataframe, but well after…
* Removed the `pl_df` attribute from `FramesModel` - it just adds memory overhead & I don't think we're benefiting from keeping it around.
* Raise a `NotImplementedError` if users try to regex search over `columns_with_iterables` (it previously wasn't implemented but didn't raise an error, see "Regular expressions when columns_with_iterables" #679). I've put some code beyond the error which I think should get that to work, but I think it would make more sense to split that out to a separate PR.

@aulemahal are you able to rerun that profiling code you posted in #753 against this branch? I've done some profiling myself (see below), but the numbers I'm getting out are surprisingly different. With that said, it looks like the memory overhead of the search is down substantially on where we were. It also looks like we're not realising the full dataframe into memory, so we should be down on previous memory usage (see allocations on line 17).
Things I haven't changed
* `search_apply_require_all_on` still needs a pandas dataframe, so search still triggers the creation of a pandas dataframe. We can probably get away without doing this with some more work.

Memory Profiling
Gives (head of this branch)
With `intake-esm=2025.2.3`:

It looks like there might be some other memory management issues we'll need to clean up too, but presently I think that the initialisation & search memory usage should now be back in the same ballpark.
Search performance
Average time per search: 0.0060 seconds
Average time per search: 0.0340 seconds

This is obviously quite a bit slower (~5x), which is not great.

Rolling together opening & searching the datastore
Average time per open and search: 0.0835 seconds
Average time per open and search: 0.4668 seconds

I suspect most of that search time increase is the instantiation of the pandas dataframe - I'm hoping to continue to defer this further, but this memory issue needs fixing first.
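Per-search timings like the ones above can be collected with a small `timeit` harness; here a plain function stands in for `cat.search(variable='tasmin')` so the sketch is self-contained:

```python
# Sketch of a timing harness like the one behind the numbers above.
# fake_search is a self-contained stand-in for cat.search(variable='tasmin').
import timeit

def fake_search():
    catalog = [{"variable": v} for v in ("tasmin", "tasmax") * 500]
    return [row for row in catalog if row["variable"] == "tasmin"]

n = 100
total = timeit.timeit(fake_search, number=n)
print(f"Average time per search: {total / n:.4f} seconds")
```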
Related issue number
#711
#753
Checklist