-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RHEL SBSA TF2 Backend Build #106
Conversation
CMakeLists.txt
Outdated
@@ -155,7 +159,7 @@ if (${TRITON_TENSORFLOW_DOCKER_BUILD}) | |||
COMMAND docker stop tensorflow_backend_deps || echo "error ignored..." || true | |||
COMMAND docker rm tensorflow_backend_deps || echo "error ignored..." || true | |||
COMMAND if [ "${TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS}" = "ON" ] \; then mkdir tf_backend_deps && docker run -it -d --name tensorflow_backend_deps ${TRITON_TENSORFLOW_DOCKER_IMAGE} \; fi \; | |||
COMMAND if [ "${TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS}" = "ON" ] \; then docker exec tensorflow_backend_deps sh -c "tar -cf - $<IF:$<BOOL:${RHEL_BUILD}>,/usr/local/cuda/targets/x86_64-linux/lib/,/usr/lib/${LIBS_ARCH}-linux-gnu/>libnccl.so*" | tar --strip-components=3 -xf - -C ./tf_backend_deps \; fi | |||
COMMAND if [ "${TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS}" = "ON" ] \; then docker exec tensorflow_backend_deps sh -c "tar -cf - $<IF:$<BOOL:${RHEL_BUILD}>,/usr/local/cuda/targets/${LIBS_ARCH}-linux/lib/,/usr/lib/${LIBS_ARCH}-linux-gnu/>libnccl.so*" | tar --strip-components=3 -xf - -C ./tf_backend_deps \; fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't find utility make this more transparent?
COMMAND if [ "${TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS}" = "ON" ] \; then docker exec tensorflow_backend_deps sh -c "tar -cf - $<IF:$<BOOL:${RHEL_BUILD}>,/usr/local/cuda/targets/${LIBS_ARCH}-linux/lib/,/usr/lib/${LIBS_ARCH}-linux-gnu/>libnccl.so*" | tar --strip-components=3 -xf - -C ./tf_backend_deps \; fi | |
COMMAND if [ "${TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS}" = "ON" ] \; then docker exec tensorflow_backend_deps sh -c "find /usr/local/cuda/targets/*-linux/lib -name libnccl.so* -exec tar -cvf archive.tar {} \; fi " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great suggestion! Let me examine the tar commands for equivalency first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you can use find
then please remove the
if(${RHEL_BUILD})
set(LIBS_ARCH "sbsa")
else()
set(LIBS_ARCH "aarch64")
endif()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I spent yesterday afternoon trying to get the right incantation for this, but have so far been unsuccessful. However, I've learned a lot more about what's going on here. From my understanding we are tar-ing each of the libnccl.so matches within the external tf container and then writing it to standard output (note the -
[https://www.gnu.org/software/tar/manual/html_node/file.html]:
"tar -cf - $<IF:$<BOOL:${RHEL_BUILD}>,/usr/local/cuda/targets/${LIBS_ARCH}-linux/lib/,/usr/lib/${LIBS_ARCH}-linux-gnu/>libnccl.so*"
Which is then picked up in the subsequent command to expand the tar file in the build container stripping out the leading path ( /path/to/lib/libnccl.so --> libnccl.so):
tar --strip-components=3 -xf - -C ./tf_backend_deps
When we attempt to use the find command, nothing I've tried seems to match this behavior, however, I also think this is a really odd way to go about copying these files. I've tried a few alternate strategies such as mounting the tf_backend_deps
folder and copying the libs in, which I thought would be a more straightforward approach, but that hasn't worked either.
I will continue trying to find a way to resolve this, but since this is not tested by default in our CI, I will prioritize getting the EA3 image out for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Address @mc-nv's comments but I have nothing new to add.
Okay, I believe I ended up finding a solution that's more straightforward. We now pass the results of the |
@@ -155,7 +149,7 @@ if (${TRITON_TENSORFLOW_DOCKER_BUILD}) | |||
COMMAND docker stop tensorflow_backend_deps || echo "error ignored..." || true | |||
COMMAND docker rm tensorflow_backend_deps || echo "error ignored..." || true | |||
COMMAND if [ "${TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS}" = "ON" ] \; then mkdir tf_backend_deps && docker run -it -d --name tensorflow_backend_deps ${TRITON_TENSORFLOW_DOCKER_IMAGE} \; fi \; | |||
COMMAND if [ "${TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS}" = "ON" ] \; then docker exec tensorflow_backend_deps sh -c "tar -cf - $<IF:$<BOOL:${RHEL_BUILD}>,/usr/local/cuda/targets/x86_64-linux/lib/,/usr/lib/${LIBS_ARCH}-linux-gnu/>libnccl.so*" | tar --strip-components=3 -xf - -C ./tf_backend_deps \; fi | |||
COMMAND if [ "${TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS}" = "ON" ] \; then docker exec tensorflow_backend_deps sh -c "find $<IF:$<BOOL:${RHEL_BUILD}>,/usr/local/cuda/targets/*-linux/lib/,/usr/lib/*-linux-gnu/>libnccl.so*" | xargs -I {} docker cp tensorflow_backend_deps:{} ./tf_backend_deps \; fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
neat!
Goal: Incorporate SBSA builds/tests and extended the smoke tests for x86 and SBSA to include L0_sequence_batcher.
Server changes: triton-inference-server/server#7595
PyTorch Backend changes: triton-inference-server/pytorch_backend#139
TensorRT Backend changes: triton-inference-server/tensorrt_backend#99