Issue #1696 Optimize modelsplitter #1693
Conversation
I'm probably a bit too early FWIW, but I'll comment anyway; I think I recognize the predicate variable approach I suggested in a comment. The activity count function in my draft PR has this:

```python
# Use ellipsis to reduce over ALL dimensions except label dims
dims_to_aggregate = [dim for dim in sample.dims if dim not in label_dims]
counts = sample.notnull().sum(dim=dims_to_aggregate).groupby(labels).sum()
return {label: int(n) for label, n in enumerate(counts.data)}
```

This is more robust in terms of IO overhead, since if the data is lazy, we require a single load of the data to do the groupby, whereas doing a clip/isel kind of thing makes it necessary to load for each partition created. Of course, the IO per partition is inescapable once you start writing the models, hence my spatial chunking (after reordering via a space-filling curve) prior to split & write. Even so, I think the groupby on the unpartitioned data is worthwhile since it will lead to less IO regardless of whether the data is spatially chunked.
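A self-contained illustration of that counting idea (all names and data here are illustrative, not the draft PR's actual code): one `notnull` reduction over the unpartitioned data, followed by a single `groupby` over the partition labels.

```python
import numpy as np
import xarray as xr

# Toy data with a time dimension and some no-data cells.
data = xr.DataArray(np.random.rand(3, 4, 5), dims=("time", "y", "x"))
data = data.where(data > 0.2)

# Partition labels over the spatial (label) dims only.
labels = xr.DataArray(
    np.repeat([0, 1], 10).reshape(4, 5), dims=("y", "x"), name="labels"
)
label_dims = set(labels.dims)

# Reduce over all dims except the label dims, then group by partition:
# a single pass over the (possibly lazy) data.
dims_to_aggregate = [dim for dim in data.dims if dim not in label_dims]
counts = data.notnull().sum(dim=dims_to_aggregate).groupby(labels).sum()
print({label: int(n) for label, n in enumerate(counts.data)})
```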
Hi @Huite, I tackled the problem in a slightly different way than you did. Or at least I think I'm doing the same thing, but in a different order.

As you correctly assert, one of the biggest problems with the model splitter is that it loads and unloads the same package data for each partition. This happens because of the way the current modelsplitter is designed: it makes a deepcopy of the original model, including all the packages, and then goes over each partitioned model and assesses whether the packages in that partitioned model actually belong to the domain of that model. (You of course already know this, but I'm stating it for future reference.)

If I read your code correctly, you solved this by creating a boundary polygon for each partition first. Then, when looping over the models and assessing whether a package belongs to that model, you use the polygon test. This works great because you've checked beforehand whether a package belongs to a partition or not.

I solved the issue in a different way. Basically, your approach and the original code do a double loop with the model as the outer loop and the packages as the inner loop. I turn this around. I first create all the partitioned models, but without packages. Then I start looping over the packages (outer loop). The package is loaded and stays in memory. Then I loop over the models and check if the package belongs to the model. The latter is done by comparing a known domain value of the package to the active_domain of the model partition. Doing it this way I ensure that the package data is only loaded once. A sketch of this inverted loop follows below.

With this approach the split takes between 4m30 and 5m30 for the big model that you shared. What do you think of this approach?
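A minimal sketch of that inverted loop. The helper names (`empty_partition_model`, `belongs_to`) are hypothetical; only `clip_by_grid` appears in the code discussed further down.

```python
# Build all partitioned models first, without packages.
partitioned_models = [empty_partition_model(info) for info in partition_infos]

# Outer loop: packages. Each package is loaded from disk exactly once.
for pkg_name, package in original_model.items():
    package.dataset.load()  # stays in memory for all partition checks
    # Inner loop: partitions. Compare against each partition's active domain.
    for model, info in zip(partitioned_models, partition_infos):
        if belongs_to(package, info.active_domain):
            model[pkg_name] = clip_by_grid(package, info.active_domain)
```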
Alright! The most important part to me is:

This sounds like a good approach!

I'm not entirely sure this is the case, though? The key lines are:

```python
active_package_domain = package[
    self._pkg_id_to_var_mapping[pkg_id]
].notnull()
```

Unless we force a …

```python
has_overlap = (
    active_package_domain & partition_info.active_domain.astype(bool)
).any()  # type: ignore
```

Since it's the … I think there are two minor but possibly impactful improvements: Reduce the …
JoerivanEngelen left a comment
Looks good, I have a few comments. Thanks for going through the effort of creating this clean solution and adding a separate ModelSplitter class for it. Depending on capacity we'll see whether you or I will pick up my comments.
pixi.toml (Outdated)
```toml
pytest-cases = ">=3.9.1"
pytest-cov = "*"
pytest-dotenv = "*"
pytest-profiling = "*"
```
I don't really see this being used?
imod/mf6/multimodel/modelsplitter.py (Outdated)
```python
if isinstance(package, IAgnosticPackage):
    if sliced_package["index"].size == 0:
        sliced_package = None
```
I think this can be simplified to:
```diff
-if isinstance(package, IAgnosticPackage):
-    if sliced_package["index"].size == 0:
-        sliced_package = None
+if isinstance(package, IAgnosticPackage) and sliced_package.is_empty:
+    sliced_package = None
```
In this way, you are not referring to the "index" variable, which might or might not be present. I think the check for IAgnosticPackage prevents calling the AllNoDataScheme (which .is_empty does) on large, chunked grids.
imod/mf6/multimodel/modelsplitter.py (Outdated)
```python
        return partitioned_models

    def update_packages(self) -> None:
```
I find this method name a bit generic, how about:
```diff
-    def update_packages(self) -> None:
+    def update_packages_after_split(self) -> None:
```
Perhaps CoPilot has a briefer suggestion
I resorted to "update_dependent_packages" in the end
```python
            continue

        # Slice and add the package to the partitioned model
        sliced_package = clip_by_grid(package, partition_info.active_domain)
```
I miss the `remove_expanded_auxiliary_variables_from_dataset(package)` call preceding this, which was there before. This is used for multi-species simulations.
Forget what I said, I see a split with multispecies is tested in test_split_flow_and_transport_model, so this aux expansion isn't necessary here.

FYI: We can call the …
JoerivanEngelen left a comment
One more comment
imod/mf6/multimodel/modelsplitter.py (Outdated)
```python
if "layer" in ds.dims:
    dims_to_be_removed.append("layer")
```
I don't think layer can be dropped here: what if there is no data present in layer 1, but there is in layer 2? What does xarray do when you then drop_vars on layer? I think we either have to aggregate by summing all notnull across the layer dim, or just leave the layer dim in here and let xarray broadcast. (A sketch of the summing option is below.)
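A rough sketch of the summing alternative, assuming `ds` is the package dataset; the variable name is hypothetical:

```python
# Count non-null cells across the layer dim instead of dropping it:
# a cell is considered active if it holds data in *any* layer.
active = ds["conductance"].notnull()  # hypothetical variable name
if "layer" in active.dims:
    active = active.sum("layer") > 0
```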
imod/mf6/multimodel/modelsplitter.py (Outdated)
```python
if isinstance(package, IAgnosticPackage):
    pass
# No checks are done for these packages
elif pkg_id in self._pkg_id_skip_active_domain_check:
    pass
else:
    has_overlap = (
        active_package_domain & partition_info.active_domain.astype(bool)
    ).any()  # type: ignore
```
Adding "pass" is a code smell. I refactored this to a separate method _is_package_to_skip, as the same logic is also used in _get_package_domain.
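Roughly, that refactor could look like this (a sketch based on the description above; the actual signature in the PR may differ):

```python
def _is_package_to_skip(self, package, pkg_id: str) -> bool:
    """Packages for which no active-domain overlap check is done."""
    return (
        isinstance(package, IAgnosticPackage)
        or pkg_id in self._pkg_id_skip_active_domain_check
    )
```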
I've added functionality to ignore time, and it reduced the splitting operation to 1.5 minutes.
Fixes #1698

# Description

This is part 2 of fixing the performance issues with large models. In part 1 (#1693) the modelsplitter was optimized. In this PR the focus is on writing the partitioned model.

As @Huite pointed out in #1686, the performance bottleneck had to do with the fact that the same package had to be loaded from file multiple times while only a part of the file is actually needed. After digging around for a while I discovered that this came down to how we open the dataset:

`dataset = xr.open_dataset(path, **kwargs)`

In the line above we don't specify anything chunk related. As a result, accessing the dataset loads the entire file from disk. By simply adding `chunks="auto"` this is no longer the case and a huge performance gain is achieved (a minimal sketch follows after this description).

There are some other changes related to setting chunking to auto. Some parts of the code don't expect to receive dask arrays. For instance, you can't use `.item()` on a dask array; instead I now use `.values[()]`. I was also getting some errors when the `to_netcdf` method was called on the package. All of them had something to do with wrong/unsupported datatypes. In this PR you will find that an encoding is added for float16 types, and that in some packages the `from_file` method has been updated to ensure that the loaded type is converted to a supported type.

An unrelated but performance-wise significant change has been applied to the `_get_transport_models_per_flow_model` method. This method is used to match gwf models to gwt models so that gwfgwt exchanges can be created. It was doing a full comparison of domains, which is expensive. There is also a method available that does the comparison on domain level. By switching to this method the matching algorithm becomes almost instantaneous.

**NOTE** This PR has issue #1699 as a base. The base needs to be altered to master once that PR is in.

**NOTE** This PR also improves the `dump` method.

**NOTE** Some timings:

<img width="833" height="739" alt="image" src="https://github.com/user-attachments/assets/974c841c-0413-4433-8486-1abe98dc0715" />
<img width="843" height="215" alt="image" src="https://github.com/user-attachments/assets/c7082975-af35-4143-a6f9-860557b3eb09" />
<img width="842" height="705" alt="image" src="https://github.com/user-attachments/assets/383bf1a6-f028-4cb4-aa72-48ab95e84e3d" />

- [x] Links to correct issue
- [ ] Update changelog, if changes affect users
- [x] PR title starts with ``Issue #nr``, e.g. ``Issue #737``
- [ ] Unit tests were added
- [ ] **If feature added**: Added/extended example
- [ ] **If feature added**: Added feature to API documentation
- [ ] **If pixi.lock was changed**: Ran `pixi run generate-sbom` and committed changes

---------

Co-authored-by: JoerivanEngelen <joerivanengelen@hotmail.com>
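For reference, a minimal sketch of the two dask-related changes described above; `path` and the variable name are placeholders, not the actual code:

```python
import xarray as xr

path = "model_input.nc"  # placeholder path

# Open lazily with dask chunks: slicing out a partition later only reads
# the chunks it touches, instead of pulling the whole file into memory.
dataset = xr.open_dataset(path, chunks="auto")

# Raw dask arrays have no .item(); indexing the materialized .values with
# an empty tuple extracts a 0-d value instead.
value = dataset["some_scalar"].values[()]  # hypothetical variable name
```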
Fixes #1696
Description
This PR implements some of the key insights @Huite found while optimizing the splitting of models, although there are some differences between his implementation and this PR. The optimization consists of two parts.

The first part has to do with the way models and packages are split. As @Huite pointed out, the current implementation leads to a package being loaded and unloaded several times, which incurs a huge performance penalty because the package needs to be loaded from disk each time.

This has been solved by inverting the order of looping. In the original implementation there was a loop over the models; for each partition a deepcopy was made of the original model. Afterwards we looped over each model and accessed each package in the model to determine whether it belongs to the same domain as the model. This is expensive.

This PR inverts the loop. First a shallow copy is made of all the models. Then we loop over each package of the original model and determine whether it belongs to one of the submodels. This way we only need to open a package once.
The second part concerns the way the SSM and BUY packages are updated after the splitting has happened. There was a piece of code that looped over all the flow and transport models to determine which models share the same domain (are in the same partition). This is an expensive operation because the grid data has to be loaded into memory over and over.

In this PR this has been solved by moving the update of the SSM and BUY packages to the place where that information is already known, i.e. the model splitter. This way we can avoid the expensive loading of the grid data.

Also, by moving it to the modelsplitter we can avoid forcing the load of the grid data manually. A rough sketch of the idea follows below.
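A hypothetical sketch of that idea: the splitter already knows which partition every model went to, so it can record the mapping once during the split and reuse it, without any grid loads (`model_names` and the variables are illustrative):

```python
# Record, during the split, which partition each model belongs to.
partition_of: dict[str, int] = {}
for partition_id, info in enumerate(partition_infos):
    for model_name in info.model_names:  # hypothetical attribute
        partition_of[model_name] = partition_id

# Updating SSM/BUY later needs no domain comparison at all:
shared = partition_of[flow_model_name] == partition_of[transport_model_name]
```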
With these changes, splitting the model @Huite provided me now takes around 5m30s.
Checklist
- Links to correct issue
- Update changelog, if changes affect users
- PR title starts with `Issue #nr`, e.g. `Issue #737`
- Unit tests were added
- **If feature added**: Added/extended example
- **If feature added**: Added feature to API documentation
- **If pixi.lock was changed**: Ran `pixi run generate-sbom` and committed changes