
Conversation

@erikvansebille

This PR benchmarks a "real-life" simulation on the curvilinear 1/12-degree global NEMO data as distributed by MOi. There is an option for either a surface run (where only the surface fields are used, so less I/O) or a full 3D run. Benchmarks are discussed in the comments below.

@erikvansebille

erikvansebille commented Aug 25, 2025

Now that @fluidnumerics-joe's Parcels-code/Parcels#2158 has significantly sped up the curvilinear index search, it's time to focus on the dask data loading again. Below is an analysis of the surface flow experiments.

It turns out that calling dask.compute() in the interpolator is extremely slow; see the performance graph below (all these runs were done on our Lorenz cluster).
[Figure: performance graph for the surface flow experiments (screenshot, 2025-08-26)]

As always, black is v3-JIT and grey is v3-Scipy. Red is the original v4-dev with the Morton encoding; it is really very slow...
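
To make concrete why per-sample dask.compute() calls hurt so much, here is a minimal, self-contained sketch (not the actual Parcels interpolator) comparing per-point .compute() calls against loading a timestep once and indexing it in memory. The array shape and point count are arbitrary stand-ins.

```python
import time

import dask.array as da
import numpy as np

# Stand-in for one lazily-chunked field variable with dimensions (time, y, x)
field = da.random.random((10, 2000, 2000), chunks=(1, 256, 256))
rng = np.random.default_rng(42)
points = rng.integers(0, 2000, size=(1000, 2))

# Per-point compute: every call rebuilds and executes a task graph
start = time.perf_counter()
vals_lazy = [float(field[0, j, i].compute()) for j, i in points]
print(f"per-point .compute(): {time.perf_counter() - start:.2f} s")

# Load the timestep once, then index the in-memory numpy array
start = time.perf_counter()
surface = field[0].compute()
vals_eager = [float(surface[j, i]) for j, i in points]
print(f"load once, then index: {time.perf_counter() - start:.2f} s")
```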

We can considerably speed up v4-dev by calling ds.load() on the entire Field. If we do that within the benchmarking script, we get the dashed yellow line, which gives a ~30 s loading time the first time and is blazingly fast afterwards. But of course this is a bit of a cheat, because normally we wouldn't be able to load all data into memory. So I also implemented a field._load_timesteps() function again in Parcels-code/Parcels#2161, which is the green curve. That is fairly fast, roughly a factor of 5 slower than JIT.
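
For readers following along, here is a hedged sketch of the two eager-loading strategies above, written with plain xarray; the real field._load_timesteps() lives in Parcels-code/Parcels#2161, and the file pattern below is a placeholder rather than the actual MOi layout.

```python
import xarray as xr

# Hypothetical path to the MOi NEMO surface files
ds = xr.open_mfdataset("moi_psy4/*U.nc")

# Strategy 1: load the entire Field into memory up front.
# ~30 s the first time in the benchmark above, blazingly fast afterwards,
# but only feasible when the full dataset fits in RAM.
ds_in_memory = ds.load()

# Strategy 2: eagerly load one timestep at a time, so only the timesteps
# needed for the current time-interpolation window are held in memory.
def load_timestep(ds: xr.Dataset, tindex: int) -> xr.Dataset:
    """Load a single timestep eagerly; the caller can drop earlier ones."""
    return ds.isel(time_counter=tindex).load()
```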

So on the one hand this is great news (we have a version of Parcels v4 that is within an order of magnitude of the JIT performance!), but it comes at the cost of having to load in the full fields. It must be possible to use dask more smartly to load in only those parts of the fields needed for the interpolation; the question is: how...

@erikvansebille

Here's a new version of the benchmark analysis for the 2D surface MOi flow fields.

[Figure: benchmark analysis for the 2D surface MOi flow fields (screenshot, 2025-08-29)]

In this plot, there's an extra line for a simulation where I used the chunks parameter of xr.open_mfdataset(..., chunks={"time_counter": 1, "depth": 2, "y": chunk, "x": chunk}). The cyan line shows the performance for chunk=128. As can be seen, it's slower than load_timesteps() (the green line), but much faster than v4-dev without chunking (the red line), especially for small numbers of particles.
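
For reference, the chunked open looks like the sketch below; the chunk dict is taken verbatim from the description above, while the file pattern is just a placeholder.

```python
import xarray as xr

chunk = 128  # the sweet spot found in the analysis below
ds = xr.open_mfdataset(
    "moi_psy4/*.nc",  # hypothetical path to the MOi NEMO files
    chunks={"time_counter": 1, "depth": 2, "y": chunk, "x": chunk},
)
```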

I did a further analysis of the runtime as a function of chunk, keeping the number of particles fixed at 10k:
[Figure: runtime as a function of chunk size, 10k particles (screenshot, 2025-08-29)]

From this analysis, it's clear that in this setup v4-dev is very slow for both very small and very large chunk values, and that the optimum is somewhere around chunk=128.
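
The sweep itself can be scripted along these lines; run_benchmark() is only a placeholder for the actual benchmark script in this PR, so this is scaffolding for the chunk sweep rather than the real code.

```python
import time

import xarray as xr

def run_benchmark(ds: xr.Dataset, npart: int = 10_000) -> None:
    """Placeholder: advect npart particles on the surface flow (see the PR's benchmark script)."""
    ...

for chunk in [32, 64, 128, 256, 512, 1024]:
    ds = xr.open_mfdataset(
        "moi_psy4/*.nc",  # hypothetical path
        chunks={"time_counter": 1, "depth": 2, "y": chunk, "x": chunk},
    )
    start = time.perf_counter()
    run_benchmark(ds)
    print(f"chunk={chunk}: runtime {time.perf_counter() - start:.1f} s")
```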

Finally, I also ran a simulation with the full 3-dimensional MOi data:
[Figure: benchmark for the full 3D MOi simulation (screenshot, 2025-08-29)]

What's incredible here is that v4-dev with load_timesteps() is actually faster than v3-JIT, up to 100k particles! Chunking with chunk=128 now has roughly the same performance as v3-JIT.

So the good news is that we can make v4-dev considerably faster by choosing an appropriate chunk size when opening the xarray dataset; but that puts a lot of the responsibility for performance on users, outside the control of Parcels itself. Perhaps in the coming months we can find a way to optimise the chunking automatically, under the hood?

The alternative is still to use load_timesteps() (#2161), but that has disadvantages of its own (forcing linear time interpolation; a large FieldSet memory footprint even when particles occupy only a small part of the domain; ...).

To be continued...

erikvansebille marked this pull request as ready for review October 1, 2025 11:57
erikvansebille merged commit 9e6b09e into main Oct 1, 2025