Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RHEL SBSA TF2 Backend Build #106

Merged
merged 2 commits into from
Sep 11, 2024
Merged

RHEL SBSA TF2 Backend Build #106

merged 2 commits into from
Sep 11, 2024

Conversation

fpetrini15
Copy link
Contributor

@fpetrini15 fpetrini15 commented Sep 4, 2024

Goal: Incorporate SBSA builds/tests and extended the smoke tests for x86 and SBSA to include L0_sequence_batcher.

Server changes: triton-inference-server/server#7595
PyTorch Backend changes: triton-inference-server/pytorch_backend#139
TensorRT Backend changes: triton-inference-server/tensorrt_backend#99

CMakeLists.txt Outdated
@@ -155,7 +159,7 @@ if (${TRITON_TENSORFLOW_DOCKER_BUILD})
COMMAND docker stop tensorflow_backend_deps || echo "error ignored..." || true
COMMAND docker rm tensorflow_backend_deps || echo "error ignored..." || true
COMMAND if [ "${TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS}" = "ON" ] \; then mkdir tf_backend_deps && docker run -it -d --name tensorflow_backend_deps ${TRITON_TENSORFLOW_DOCKER_IMAGE} \; fi \;
COMMAND if [ "${TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS}" = "ON" ] \; then docker exec tensorflow_backend_deps sh -c "tar -cf - $<IF:$<BOOL:${RHEL_BUILD}>,/usr/local/cuda/targets/x86_64-linux/lib/,/usr/lib/${LIBS_ARCH}-linux-gnu/>libnccl.so*" | tar --strip-components=3 -xf - -C ./tf_backend_deps \; fi
COMMAND if [ "${TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS}" = "ON" ] \; then docker exec tensorflow_backend_deps sh -c "tar -cf - $<IF:$<BOOL:${RHEL_BUILD}>,/usr/local/cuda/targets/${LIBS_ARCH}-linux/lib/,/usr/lib/${LIBS_ARCH}-linux-gnu/>libnccl.so*" | tar --strip-components=3 -xf - -C ./tf_backend_deps \; fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't find utility make this more transparent?

Suggested change
COMMAND if [ "${TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS}" = "ON" ] \; then docker exec tensorflow_backend_deps sh -c "tar -cf - $<IF:$<BOOL:${RHEL_BUILD}>,/usr/local/cuda/targets/${LIBS_ARCH}-linux/lib/,/usr/lib/${LIBS_ARCH}-linux-gnu/>libnccl.so*" | tar --strip-components=3 -xf - -C ./tf_backend_deps \; fi
COMMAND if [ "${TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS}" = "ON" ] \; then docker exec tensorflow_backend_deps sh -c "find /usr/local/cuda/targets/*-linux/lib -name libnccl.so* -exec tar -cvf archive.tar {} \; fi "

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great suggestion! Let me examine the tar commands for equivalency first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can use find then please remove the

if(${RHEL_BUILD})
      set(LIBS_ARCH "sbsa")
    else()
      set(LIBS_ARCH "aarch64")
    endif()

Copy link
Contributor Author

@fpetrini15 fpetrini15 Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I spent yesterday afternoon trying to get the right incantation for this, but have so far been unsuccessful. However, I've learned a lot more about what's going on here. From my understanding we are tar-ing each of the libnccl.so matches within the external tf container and then writing it to standard output (note the - [https://www.gnu.org/software/tar/manual/html_node/file.html]:

"tar -cf - $<IF:$<BOOL:${RHEL_BUILD}>,/usr/local/cuda/targets/${LIBS_ARCH}-linux/lib/,/usr/lib/${LIBS_ARCH}-linux-gnu/>libnccl.so*"

Which is then picked up in the subsequent command to expand the tar file in the build container stripping out the leading path ( /path/to/lib/libnccl.so --> libnccl.so):

tar --strip-components=3 -xf - -C ./tf_backend_deps

When we attempt to use the find command, nothing I've tried seems to match this behavior, however, I also think this is a really odd way to go about copying these files. I've tried a few alternate strategies such as mounting the tf_backend_deps folder and copying the libs in, which I thought would be a more straightforward approach, but that hasn't worked either.

I will continue trying to find a way to resolve this, but since this is not tested by default in our CI, I will prioritize getting the EA3 image out for now.

nv-kmcgill53
nv-kmcgill53 previously approved these changes Sep 4, 2024
Copy link
Contributor

@nv-kmcgill53 nv-kmcgill53 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Address @mc-nv's comments but I have nothing new to add.

@fpetrini15
Copy link
Contributor Author

fpetrini15 commented Sep 5, 2024

Okay, I believe I ended up finding a solution that's more straightforward. We now pass the results of the find command directly into docker cp so that we can copy the libs directly into the target folder. No messing around with filepaths. No messing around with tar-ing. Let me know what you think.

@@ -155,7 +149,7 @@ if (${TRITON_TENSORFLOW_DOCKER_BUILD})
COMMAND docker stop tensorflow_backend_deps || echo "error ignored..." || true
COMMAND docker rm tensorflow_backend_deps || echo "error ignored..." || true
COMMAND if [ "${TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS}" = "ON" ] \; then mkdir tf_backend_deps && docker run -it -d --name tensorflow_backend_deps ${TRITON_TENSORFLOW_DOCKER_IMAGE} \; fi \;
COMMAND if [ "${TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS}" = "ON" ] \; then docker exec tensorflow_backend_deps sh -c "tar -cf - $<IF:$<BOOL:${RHEL_BUILD}>,/usr/local/cuda/targets/x86_64-linux/lib/,/usr/lib/${LIBS_ARCH}-linux-gnu/>libnccl.so*" | tar --strip-components=3 -xf - -C ./tf_backend_deps \; fi
COMMAND if [ "${TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS}" = "ON" ] \; then docker exec tensorflow_backend_deps sh -c "find $<IF:$<BOOL:${RHEL_BUILD}>,/usr/local/cuda/targets/*-linux/lib/,/usr/lib/*-linux-gnu/>libnccl.so*" | xargs -I {} docker cp tensorflow_backend_deps:{} ./tf_backend_deps \; fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neat!

@fpetrini15 fpetrini15 merged commit e4fe6b8 into main Sep 11, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants