Ensure HPCX dependencies found in container #5922
Changes from commits: 7057df1, ec23e61, 8222f11, c03110c, cfaebf2, ca0962e
```diff
@@ -1102,6 +1102,16 @@ def dockerfile_prepare_container_linux(argmap, backends, enable_gpu,
     if 'onnxruntime' in backends:
         df += '''
 ENV LD_LIBRARY_PATH /opt/tritonserver/backends/onnxruntime:${LD_LIBRARY_PATH}
 '''
 
+    # Necessary for libtorch.so to find correct HPCX libraries
+    if ('pytorch' in backends):
+        if not enable_gpu:
+            df += '''
+ENV LD_LIBRARY_PATH /opt/hpcx/ucc/lib/:${LD_LIBRARY_PATH}
+'''
+        df += '''
+ENV LD_LIBRARY_PATH /opt/hpcx/ucx/lib/:${LD_LIBRARY_PATH}
+'''
+
     backend_dependencies = ""
```

Review comments on this hunk:

I was wondering whether opencv is OK with this order. I think this means that opencv would pick up the newer HPCX version.

I think so. It looks like the tests that failed in the pipeline were already failing before. I'm unable to find an executable to check for opencv_dev to make sure it's linking against the right one, but the apt repo isn't part of this variable anyway, so perhaps it's set as part of the apt install.
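In prose, the hunk prepends HPCX directories to `LD_LIBRARY_PATH` only for PyTorch builds, with an extra `ucc` entry for CPU-only images. A minimal standalone sketch of that branch logic (the helper name is illustrative, not part of `build.py`):

```python
def hpcx_env_lines(backends, enable_gpu):
    """Sketch of the Dockerfile-fragment logic in the hunk above.

    Hypothetical helper: returns the ENV lines that prepend HPCX
    directories to LD_LIBRARY_PATH for a PyTorch-enabled image.
    """
    df = ""
    if 'pytorch' in backends:
        if not enable_gpu:
            # CPU-only images must be told where libucc.so.1 lives.
            df += "ENV LD_LIBRARY_PATH /opt/hpcx/ucc/lib/:${LD_LIBRARY_PATH}\n"
        # Every PyTorch image (CPU or GPU) needs the UCX libraries on the path.
        df += "ENV LD_LIBRARY_PATH /opt/hpcx/ucx/lib/:${LD_LIBRARY_PATH}\n"
    return df
```

Note the asymmetry: the `ucc` line is guarded by `not enable_gpu`, while the `ucx` line is emitted for both flavors — this is the point debated in the review thread further down.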
```diff
@@ -1208,6 +1218,13 @@ def dockerfile_prepare_container_linux(argmap, backends, enable_gpu,
 COPY --from=min_container /usr/local/cuda-12.1/targets/{cuda_arch}-linux/lib/libnvToolsExt.so.1 /usr/local/cuda/targets/{cuda_arch}-linux/lib/.
 COPY --from=min_container /usr/local/cuda-12.1/targets/{cuda_arch}-linux/lib/libnvJitLink.so.12 /usr/local/cuda/targets/{cuda_arch}-linux/lib/.
 
+RUN mkdir -p /opt/hpcx/ucc/lib/ /opt/hpcx/ucx/lib/
+COPY --from=min_container /opt/hpcx/ucc/lib/libucc.so.1 /opt/hpcx/ucc/lib/libucc.so.1
+COPY --from=min_container /opt/hpcx/ucx/lib/libucm.so.0 /opt/hpcx/ucx/lib/libucm.so.0
+COPY --from=min_container /opt/hpcx/ucx/lib/libucp.so.0 /opt/hpcx/ucx/lib/libucp.so.0
+COPY --from=min_container /opt/hpcx/ucx/lib/libucs.so.0 /opt/hpcx/ucx/lib/libucs.so.0
+COPY --from=min_container /opt/hpcx/ucx/lib/libuct.so.0 /opt/hpcx/ucx/lib/libuct.so.0
+
 COPY --from=min_container /usr/lib/{libs_arch}-linux-gnu/libcudnn.so.8 /usr/lib/{libs_arch}-linux-gnu/libcudnn.so.8
 
 # patchelf is needed to add deps of libcublasLt.so.12 to libtorch_cuda.so
```
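Every COPY line in this hunk follows the same pattern: the source path in `min_container` equals the destination path. A hedged sketch of generating those lines from a library list (the helper and list names are hypothetical, not from `build.py`):

```python
# Hypothetical helper: build the RUN/COPY lines shown in the hunk above
# from a declarative list of (directory, soname) pairs.
HPCX_LIBS = [
    ("/opt/hpcx/ucc/lib", "libucc.so.1"),
    ("/opt/hpcx/ucx/lib", "libucm.so.0"),
    ("/opt/hpcx/ucx/lib", "libucp.so.0"),
    ("/opt/hpcx/ucx/lib", "libucs.so.0"),
    ("/opt/hpcx/ucx/lib", "libuct.so.0"),
]

def hpcx_copy_lines():
    # Create every target directory in one RUN layer, then copy each
    # library to the identical path it has in min_container.
    dirs = sorted({d for d, _ in HPCX_LIBS})
    lines = ["RUN mkdir -p " + " ".join(d + "/" for d in dirs)]
    for d, lib in HPCX_LIBS:
        path = f"{d}/{lib}"
        lines.append(f"COPY --from=min_container {path} {path}")
    return "\n".join(lines)
```

Copying only the versioned sonames (e.g. `libucc.so.1`) rather than whole directories keeps the image small while still satisfying the dynamic linker, which resolves by soname.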
Review comments on this hunk:

Can we remove this part? Otherwise looks good.

We cannot, or the CPU-only image will not be able to find libucc.so.1 (in the /opt/hpcx/ucc/lib folder). With the other version of these changes, it wasn't necessary, since the library was added to the backend directory, which was on the RPATH. Since this now copies into the /opt/hpcx/ folder instead, the CPU-only build needs to know where to find the library, or it will hit the previous failure to load.

If it's easier to read, I can combine this with the other LD_LIBRARY_PATH change. I broke it up like this, though, because I didn't want to modify LD_LIBRARY_PATH in GPU images, since we don't have to.

Why do we only include ucc but not ucx in the CPU-only build? Is that a typo?

That's why I'm a bit confused.

We include both; there's no else statement. For the CPU-only build, we need both the ucc and ucx paths on LD_LIBRARY_PATH. For the GPU build, we only need the ucx path (since ucc is found automatically).
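The resolution of this thread can be condensed into a small sketch (hypothetical helper, not part of `build.py`): both build flavors need the ucx directory on `LD_LIBRARY_PATH`, and CPU-only builds additionally need ucc, because nothing else puts libucc.so.1 on the search path there.

```python
def required_hpcx_paths(enable_gpu):
    """Which HPCX directories must be on LD_LIBRARY_PATH, per the
    discussion above. Illustrative only."""
    paths = ["/opt/hpcx/ucx/lib/"]  # needed by every PyTorch image
    if not enable_gpu:
        # CPU-only: libucc.so.1 is not found automatically, so add ucc too.
        paths.insert(0, "/opt/hpcx/ucc/lib/")
    return paths
```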