use a process pool to calculate profile_uncertainty in parallel #223
base: master
Conversation
Currently the parallel argument doesn't get set in any of the existing usage contexts, so it will use the default value of 0.
pkienzle left a comment:
New ticket: The mapping code belongs in bumps since it is not specific to reflectometry. Even better if we extend the existing mapper so that it can return arbitrary python objects instead of just the nllf, so that we can use whatever parallel pool (MPI, multiprocessing) we already have set up.
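(A minimal sketch of what such a generalized mapper could look like if built directly on multiprocessing. The names GenericMapper and eval_extended are hypothetical, not part of the bumps API; the existing bumps mappers such as MPMapper would need their own extension.)

```python
from multiprocessing import Pool

def eval_extended(point):
    """Placeholder: set the problem parameters for *point* and return an
    arbitrary picklable object (profiles, residuals, ...), not just the nllf."""
    raise NotImplementedError

class GenericMapper:
    """Map a function over points with a process pool and collect whatever
    objects the function returns."""
    def __init__(self, processes=None):
        self.pool = Pool(processes=processes)

    def map(self, fn, points):
        # fn may return any picklable object, not only a float nllf.
        return self.pool.map(fn, points)

    def stop(self):
        self.pool.close()
        self.pool.join()
```

With the spawn start method the mapper would need to be created under an `if __name__ == "__main__":` guard and the mapped function must be importable at module level.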
```python
def _worker_eval_point(point):
    return _eval_point(_shared_problem, point)
```
Please rename _shared_problem to _worker_problem. Each worker has its own version, which it needs so that setp doesn't interfere between processes. The global is ugly, but I don't see a nice way to do this.
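(As an aside, the usual way to populate such a per-worker global without sharing it between processes is through the pool's initializer. A sketch below; _eval_point mirrors the PR's helper, the rest is illustrative.)

```python
from multiprocessing import Pool

_worker_problem = None  # each worker process holds its own copy

def _worker_init(problem):
    # Runs once in every worker when the pool starts; setp on this copy
    # cannot interfere with the parent process or with other workers.
    global _worker_problem
    _worker_problem = problem

def _worker_eval_point(point):
    return _eval_point(_worker_problem, point)

# pool = Pool(processes=nproc, initializer=_worker_init, initargs=(problem,))
# results = pool.map(_worker_eval_point, points)
```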
What if we're still fitting? Do we want to use the same pool for making plots as we're currently using for fitting?
There may be some advantages with MPI, particularly when you have several nodes in the allocation. The work is distributed across the nodes so that they each have roughly the same number of points to evaluate. If the node with the root process is busy evaluating functions for plots then it will be slow compared to the other nodes, which will become idle while waiting for the root node to catch up. The worst case is if the children of the root process are tied to a single processor (I don't know enough about slurm/MPI/fork to know whether this is likely); then the whole allocation will wait while that one processor evaluates 50 or 100 points. Granted, this is what happens now, in that we don't generate plots as part of the checkpoint and update!

Ignoring the MPI case, I'm not sure there is any performance advantage either way. The same amount of work is being done, so throughput should be the same. I don't know what cost/complexity is involved in setting up the processing pool, but if it is slow we could keep the pool around for the life of the server. You're right that a plot request coming from the client while the fit thread is running will be much easier to manage with separate pools.

So keep what you've got for now, but in future we may move the complexity of using the pool to bumps so that other applications can more easily do the same sort of thing.
A single data point here. The point being that there may be some complications in using the same pool, or in trying to start a new pool, etc.
MPMapper is storing global state as class attributes, so I'm not surprised it fails if you have two of them. Brian's separate pool should work fine if you are using only a single node on TACC. Obviously both the fit and the plot will be slower as they compete with each other for resources, but this is handled by the unix scheduler.

Rather than recompute the function we could keep a running set of extended outputs. These would be available on demand from the fitness function, and we could add a method to the mapper to request them. So long as they are serializable, bumps can track them and store them in HDF. This would work for MPI as well as multiprocessing.

As a user I can imagine flipping to an old fit while a new fit is running to compare plots. Even better if I could see them side by side. This may require having a different fit problem in the process pool. Again, easy enough with a separate pool; much harder with MPI.

Hmmm... I wonder if we want to keep the plot serialization in HDF as well so that the server doesn't need to do any work when showing plots from old fits.
Using an HDF5 dataset as a "live" datastore is a little tricky - there is a SWMR mode (Single Writer, Multiple Reader) which allows concurrent access. I think for contiguous data (not chunked, not compressed) you could probably make it fast. You'd have to know the dataset size at dataset creation to allocate contiguous storage. You can then write to any section of it whenever you want, and read it at any time. If the data is chunked, I think you have to call the refresh method regularly from the consumers to get updates to the metadata (the B-tree of chunk addresses, which changes when chunks are invalidated).
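(A minimal h5py sketch of that SWMR pattern, with the dataset sized at creation time; the file and dataset names are illustrative.)

```python
import numpy as np
import h5py

# Writer: allocate the full dataset up front, then switch on SWMR mode.
with h5py.File("live_state.h5", "w", libver="latest") as f:
    dset = f.create_dataset("profiles", shape=(100, 256), dtype="f8")
    f.swmr_mode = True
    for i in range(100):
        dset[i, :] = np.random.rand(256)  # stand-in for a computed profile
        dset.flush()                       # make this row visible to readers

# Reader (normally a separate process): open with swmr=True and refresh
# to pick up metadata updates before reading.
with h5py.File("live_state.h5", "r", libver="latest", swmr=True) as f:
    dset = f["profiles"]
    dset.refresh()
    latest = dset[:]
```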
The calculated model state would be much like the MCMC state. You wouldn't need to save it as live data to the HDF, but it could be useful to save it at the end of the fit so that you can more quickly produce plots from saved fit files. Make sure that "best" is one of the samples, since that will be needed for the reflectivity and profile plots. This is out of scope for the current PR. Open a new ticket if you think this is worthwhile.
Note: one difference from MCMC state is that the state for the different points will have different sizes, so you won't be able to allocate space for them in advance. It is more like a list of strings of arbitrary length.
I was actually wondering about that - ragged arrays are hard in HDF5.
You could serialize it as a list of dicts so there is only one large JSON blob to store.
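(For example, with illustrative names: collect the ragged per-point state as a list of dicts, dump it to one JSON string, and store it as a single scalar string dataset.)

```python
import json
import h5py

# Hypothetical per-point results with ragged sizes.
point_state = [
    {"point": [1.0, 2.0], "profile": [0.1, 0.2, 0.3]},
    {"point": [1.1, 2.1], "profile": [0.1, 0.2, 0.3, 0.4, 0.5]},
]

with h5py.File("fit_state.h5", "a") as f:
    f["profile_uncertainty/state"] = json.dumps(point_state)

# Reading it back:
with h5py.File("fit_state.h5", "r") as f:
    state = json.loads(f["profile_uncertainty/state"][()])
```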
Profile uncertainty can take a long time to calculate, and this PR adds a "parallel" keyword argument to control the parallelism. A process pool is launched when parallel is not equal to 1.

Note that this change is incompatible with using bumps.calc_errors to calculate the profile uncertainty, and so requires that #222 be merged first.
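(A rough sketch of the control flow described above; the function name and the reading of parallel=0 as "use all cores" are assumptions, not necessarily the PR's exact behaviour.)

```python
from multiprocessing import Pool

def map_profile_points(eval_point, points, parallel=0):
    # parallel == 1: evaluate serially in the current process.
    # parallel != 1: launch a process pool; 0 is assumed to mean "use all
    # available cores", which is what Pool(processes=None) does.
    if parallel == 1:
        return [eval_point(p) for p in points]
    processes = parallel if parallel > 1 else None
    with Pool(processes=processes) as pool:
        return pool.map(eval_point, points)
```

Note that eval_point has to be picklable (a module-level function) for multiprocessing to map it across workers.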