This repository was archived by the owner on Nov 17, 2023. It is now read-only.

MXNet 2.0 cu112 docker undefined symbol issue #20145

@Zha0q1

Description

The nightly Docker image public.ecr.aws/w6z5f7h2/mxnet/python:nightly_gpu_cu112_py3 has an undefined symbol issue:

>>> import mxnet
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/dist-packages/mxnet/__init__.py", line 23, in <module>
    from .context import Context, current_context, cpu, gpu, cpu_pinned
  File "/usr/local/lib/python3.7/dist-packages/mxnet/context.py", line 20, in <module>
    from .base import _LIB
  File "/usr/local/lib/python3.7/dist-packages/mxnet/base.py", line 293, in <module>
    _LIB = _load_lib()
  File "/usr/local/lib/python3.7/dist-packages/mxnet/base.py", line 284, in _load_lib
    lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_LOCAL)
  File "/usr/lib/python3.7/ctypes/__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v2

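This is most likely because the newer NVML that ships with CUDA 11.2 (cu112) adds a versioned _v2 API, and the header remaps the unversioned function names to it.
Checking the NVML header in nvidia/cuda:11.2.0-cudnn8-devel-centos7 confirms this: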

/**
 * NVML API versioning support
 */
#define NVML_API_VERSION            11
#define NVML_API_VERSION_STR        "11"
/**
 * Defining NVML_NO_UNVERSIONED_FUNC_DEFS will disable "auto upgrading" of APIs.
 * e.g. the user will have to call nvmlInit_v2 instead of nvmlInit. Enable this
 * guard if you need to support older versions of the API
 */
#ifndef NVML_NO_UNVERSIONED_FUNC_DEFS
    #define nvmlInit                                nvmlInit_v2
    #define nvmlDeviceGetPciInfo                    nvmlDeviceGetPciInfo_v3
    #define nvmlDeviceGetCount                      nvmlDeviceGetCount_v2
    #define nvmlDeviceGetHandleByIndex              nvmlDeviceGetHandleByIndex_v2
    #define nvmlDeviceGetHandleByPciBusId           nvmlDeviceGetHandleByPciBusId_v2
    #define nvmlDeviceGetNvLinkRemotePciInfo        nvmlDeviceGetNvLinkRemotePciInfo_v2
    #define nvmlDeviceRemoveGpu                     nvmlDeviceRemoveGpu_v2
    #define nvmlDeviceGetGridLicensableFeatures     nvmlDeviceGetGridLicensableFeatures_v3
    #define nvmlEventSetWait                        nvmlEventSetWait_v2
    #define nvmlDeviceGetAttributes                 nvmlDeviceGetAttributes_v2
    #define nvmlComputeInstanceGetInfo              nvmlComputeInstanceGetInfo_v2
    #define nvmlDeviceGetComputeRunningProcesses    nvmlDeviceGetComputeRunningProcesses_v2
    #define nvmlDeviceGetGraphicsRunningProcesses   nvmlDeviceGetGraphicsRunningProcesses_v2
#endif // #ifndef NVML_NO_UNVERSIONED_FUNC_DEFS
..........
..........
nvmlReturn_t DECLDIR nvmlDeviceGetComputeRunningProcesses_v2(nvmlDevice_t device, unsigned int *infoCount, nvmlProcessInfo_t *infos);

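We can probably work around this by defining NVML_NO_UNVERSIONED_FUNC_DEFS before the NVML header is included, so that the unversioned names such as nvmlDeviceGetComputeRunningProcesses are not remapped to their _v2 symbols.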
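
To check the diagnosis inside the container before changing the build, something like the following could be used. It is only a rough sketch: the helper missing_symbols is hypothetical (not part of MXNet or NVML), and the library name libnvidia-ml.so.1 is an assumption about how the driver's NVML library is installed in the image.

```python
import ctypes

# Symbol names taken from the traceback and the header excerpt above.
NVML_SYMBOLS = [
    "nvmlDeviceGetComputeRunningProcesses",
    "nvmlDeviceGetComputeRunningProcesses_v2",
]


def missing_symbols(lib_name, symbols):
    """Return the names in `symbols` that `lib_name` does not export.

    Returns None if the library itself cannot be loaded.
    (Hypothetical helper, for illustration only.)
    """
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        return None
    # ctypes resolves a symbol on attribute access and raises AttributeError
    # when it is absent, so hasattr() works as an existence check.
    return [name for name in symbols if not hasattr(lib, name)]


# Assumption: the driver's NVML library is installed as libnvidia-ml.so.1.
result = missing_symbols("libnvidia-ml.so.1", NVML_SYMBOLS)
if result is None:
    print("NVML library not found")
else:
    print("missing symbols:", result)
```

If only the _v2 name is reported missing, the NVML library visible at runtime predates the v2 API, which would be consistent with the explanation above. If the library cannot be loaded at all, the probe says nothing either way.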
