[FEA]: Improve pipeline cpuset logic #551

Open

Description

Is this a new feature, an improvement, or a change to existing functionality?

Change

How would you describe the priority of this feature request

High

Please provide a clear description of problem this feature solves

Currently, when Morpheus sets the options for an MRC/SRF Executor in a pipeline, it assumes the usable cpuset is a hardcoded range from 0 to the number of threads (processors) minus one. However, some environments (e.g., Slurm) create cpusets that restrict a user's processes to a specific group of CPUs, so that range may not be available.

https://github.com/nv-morpheus/Morpheus/blob/branch-23.01/morpheus/pipeline/pipeline.py#L70

MRC does not make this assumption. Instead, it evaluates the hwloc topology of the visible CPUs and compares it to the user_cpuset that has been configured.

https://github.com/nv-morpheus/MRC/blob/branch-23.01/cpp/mrc/src/internal/system/topology.cpp#L141

If the intersection of the two sets is empty, MRC raises an error and the pipeline fails with a stack trace.
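The mismatch can be reproduced outside of Morpheus with a minimal sketch. It is not the Morpheus or MRC implementation: the helper names are hypothetical, and `os.sched_getaffinity` (Linux only) stands in for MRC's hwloc topology query.

```python
import os
from typing import Set


def assumed_cpuset(num_threads: int) -> Set[int]:
    """The range the issue describes as hardcoded: 0 .. num_threads - 1."""
    return set(range(num_threads))


def visible_cpuset() -> Set[int]:
    """CPUs this process may run on (stand-in for the hwloc topology query)."""
    if hasattr(os, "sched_getaffinity"):
        return set(os.sched_getaffinity(0))
    # Fallback for platforms without sched_getaffinity: assume all CPUs.
    return set(range(os.cpu_count() or 1))


def check_cpuset(user: Set[int], topo: Set[int]) -> Set[int]:
    """Return the usable CPUs, or raise if the two sets do not overlap."""
    usable = user & topo
    if not usable:
        raise RuntimeError("intersection between user_cpuset and topo_cpuset is null")
    return usable


if __name__ == "__main__":
    # Under a restricted cpuset (e.g. CPUs 16-19 only), asking for 0-3 fails.
    print(sorted(check_cpuset(assumed_cpuset(os.cpu_count() or 1), visible_cpuset())))
```

For example, a job confined to CPUs 16-19 that requests CPUs 0-3 has no overlap, which matches the error in the traceback under "Additional context".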

Describe your ideal solution

I'm not sure which is ideal, but here are some options:

  • Morpheus provides an option for the user to pass in a usable cpuset (user responsibility).
  • Morpheus doesn't do any cpuset configuration and instead defers to MRC to make the decision, possibly guided by a handful of configurable algorithms.
  • MRC exposes an interface for the topology queries it already performs before an Executor is built, so that Morpheus can fail more gracefully and tell the user to choose a usable cpuset from the topology query.
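As a rough illustration of the first and third options, the sketch below picks a cpuset from the CPUs that are actually available and fails with a readable message when none are. The function names are hypothetical, and the `"0-3,8"` list format is an assumption about what the Executor options accept, not confirmed MRC behavior.

```python
from typing import Iterable, List


def format_cpuset(cpus: Iterable[int]) -> str:
    """Render CPU ids as a compact list string, e.g. [8, 9, 10, 16] -> '8-10,16'."""
    ordered = sorted(set(cpus))
    parts: List[str] = []
    i = 0
    while i < len(ordered):
        start = end = ordered[i]
        # Extend the current run while the next id is consecutive.
        while i + 1 < len(ordered) and ordered[i + 1] == end + 1:
            i += 1
            end = ordered[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


def choose_cpuset(available: Iterable[int], num_threads: int) -> str:
    """Pick up to num_threads CPUs from the available set, lowest ids first."""
    ordered = sorted(set(available))
    if not ordered or num_threads < 1:
        raise ValueError(
            "No usable CPUs: choose a cpuset from the visible topology "
            f"(available: {format_cpuset(ordered) or 'none'})"
        )
    return format_cpuset(ordered[:num_threads])
```

The available set could come from `os.sched_getaffinity(0)` on Linux, or from a topology interface exposed by MRC if one is added. Raising before the Executor is built gives the user an actionable message instead of a stack trace.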

Describe any alternatives you have considered

No response

Additional context

====Registering Pipeline====
Error occurred during Pipeline.build(). Exiting.
Traceback (most recent call last):
  File "/opt/conda/envs/morpheus/lib/python3.8/site-packages/morpheus/pipeline/pipeline.py", line 277, in build_and_start
    self.build()
  File "/opt/conda/envs/morpheus/lib/python3.8/site-packages/morpheus/pipeline/pipeline.py", line 175, in build
    self._srf_executor = srf.Executor(self._exec_options)
RuntimeError: intersection between user_cpuset and topo_cpuset is null
Traceback (most recent call last):
  File "/data/sdp/cybersecurity_ai/files/pass_thru/run_passthru.py", line 40, in <module>
Exception occurred in pipeline. Rethrowing
Traceback (most recent call last):
  File "/opt/conda/envs/morpheus/lib/python3.8/site-packages/morpheus/pipeline/pipeline.py", line 251, in join
    await self._srf_executor.join_async()
AttributeError: 'NoneType' object has no attribute 'join_async'
====Pipeline Complete====
    run_pipeline()
  File "/data/sdp/cybersecurity_ai/files/pass_thru/run_passthru.py", line 37, in run_pipeline
    pipeline.run()
  File "/opt/conda/envs/morpheus/lib/python3.8/site-packages/morpheus/pipeline/pipeline.py", line 517, in run
    asyncio.run(self._do_run())
  File "/opt/conda/envs/morpheus/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/envs/morpheus/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/opt/conda/envs/morpheus/lib/python3.8/site-packages/morpheus/pipeline/pipeline.py", line 495, in _do_run
    await self.join()
  File "/opt/conda/envs/morpheus/lib/python3.8/site-packages/morpheus/pipeline/pipeline.py", line 251, in join
    await self._srf_executor.join_async()
AttributeError: 'NoneType' object has no attribute 'join_async'

Code of Conduct

  • I agree to follow this project's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request