Description
opened on Dec 14, 2022
Is this a new feature, an improvement, or a change to existing functionality?
Change
How would you describe the priority of this feature request
High
Please provide a clear description of problem this feature solves
Currently, when Morpheus sets the options for an MRC/SRF Executor in a pipeline, it assumes that the usable cpuset is a hardcoded range from 0 to the number of threads (processors) minus one. However, in some environments (e.g., Slurm), cpusets have been created that restrict CPU access for certain groups of processes, so that range may not be available to the user.
https://github.com/nv-morpheus/Morpheus/blob/branch-23.01/morpheus/pipeline/pipeline.py#L70
MRC does not make this assumption. Instead, it evaluates the hwloc topology of the visible CPUs and compares it with the user_cpuset that has been configured.
https://github.com/nv-morpheus/MRC/blob/branch-23.01/cpp/mrc/src/internal/system/topology.cpp#L141
If the intersection of the two sets is empty (null), MRC raises an error and the pipeline fails with a stack trace.
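The mismatch can be illustrated with a minimal Python sketch. It is not the Morpheus or MRC implementation: the helper names `assumed_cpus`, `visible_cpus`, and `check_overlap` are hypothetical, and `os.sched_getaffinity` is used here only as a stand-in for the hwloc topology query that MRC performs.

```python
import os


def assumed_cpus(num_threads):
    """CPU ids implied by a hardcoded 0..num_threads-1 range."""
    return set(range(num_threads))


def visible_cpus():
    """CPU ids this process is allowed to run on.

    Uses the Linux affinity mask when available; otherwise falls back
    to assuming every CPU is usable.
    """
    if hasattr(os, "sched_getaffinity"):
        return set(os.sched_getaffinity(0))
    return set(range(os.cpu_count() or 1))


def check_overlap(user_cpus, topo_cpus):
    """Return the shared CPU ids, or raise a descriptive error if there are none."""
    common = set(user_cpus) & set(topo_cpus)
    if not common:
        raise ValueError(
            f"requested CPUs {sorted(user_cpus)} do not overlap "
            f"visible CPUs {sorted(topo_cpus)}"
        )
    return common
```

For example, on a node where the scheduler has confined the job to CPUs 16 and 17, `check_overlap(assumed_cpus(4), {16, 17})` raises, which mirrors the situation reported in this issue.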
Describe your ideal solution
Not sure which is ideal, but some options:
- Morpheus provides an option for the user to pass in a usable cpuset (user responsibility).
- Morpheus doesn't do any cpuset configuration and instead defers to MRC to make the decision, possibly guided by a handful of configurable algorithms.
- MRC exposes an interface for the topology queries it already performs before an Executor is built, so that Morpheus can fail more gracefully and inform the user that they must choose a usable cpuset from the topology query.
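As a rough sketch of the first option, a user-supplied cpuset could be derived from the CPUs that are actually usable. The helper `format_cpuset` below is hypothetical, and the conventional Linux list format it produces (for example `16-19,24`) is an assumption; whether MRC accepts non-contiguous lists in `user_cpuset` would need to be confirmed.

```python
def format_cpuset(cpus):
    """Collapse CPU ids into a Linux-style cpuset list string such as '16-19,24'."""
    ids = sorted(set(cpus))
    if not ids:
        raise ValueError("no usable CPUs")

    ranges = []
    start = prev = ids[0]
    for cpu in ids[1:]:
        if cpu == prev + 1:
            # Still inside a contiguous run; extend it.
            prev = cpu
            continue
        # Gap found: close the current run and start a new one.
        ranges.append((start, prev))
        start = prev = cpu
    ranges.append((start, prev))

    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)
```

On Linux, the input could come from `os.sched_getaffinity(0)`, so the resulting string only names CPUs the current process is permitted to use.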
Describe any alternatives you have considered
No response
Additional context
====Registering Pipeline====
Error occurred during Pipeline.build(). Exiting.
Traceback (most recent call last):
File "/opt/conda/envs/morpheus/lib/python3.8/site-packages/morpheus/pipeline/pipeline.py", line 277, in build_and_start
self.build()
File "/opt/conda/envs/morpheus/lib/python3.8/site-packages/morpheus/pipeline/pipeline.py", line 175, in build
self._srf_executor = srf.Executor(self._exec_options)
RuntimeError: intersection between user_cpuset and topo_cpuset is null
Traceback (most recent call last):
File "/data/sdp/cybersecurity_ai/files/pass_thru/run_passthru.py", line 40, in <module>
Exception occurred in pipeline. Rethrowing
Traceback (most recent call last):
File "/opt/conda/envs/morpheus/lib/python3.8/site-packages/morpheus/pipeline/pipeline.py", line 251, in join
await self._srf_executor.join_async()
AttributeError: 'NoneType' object has no attribute 'join_async'
====Pipeline Complete====
run_pipeline()
File "/data/sdp/cybersecurity_ai/files/pass_thru/run_passthru.py", line 37, in run_pipeline
pipeline.run()
File "/opt/conda/envs/morpheus/lib/python3.8/site-packages/morpheus/pipeline/pipeline.py", line 517, in run
asyncio.run(self._do_run())
File "/opt/conda/envs/morpheus/lib/python3.8/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/envs/morpheus/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/opt/conda/envs/morpheus/lib/python3.8/site-packages/morpheus/pipeline/pipeline.py", line 495, in _do_run
await self.join()
File "/opt/conda/envs/morpheus/lib/python3.8/site-packages/morpheus/pipeline/pipeline.py", line 251, in join
await self._srf_executor.join_async()
AttributeError: 'NoneType' object has no attribute 'join_async'
Code of Conduct
- I agree to follow this project's Code of Conduct
- I have searched the open feature requests and have found no duplicates for this feature request