
Running TPU tests on linux-x86-ct6e-44-1tpu #21425

Draft: wants to merge 17 commits into master
32 changes: 32 additions & 0 deletions .github/workflows/actions.yml
@@ -11,6 +11,38 @@ permissions:
  contents: read

jobs:
  tpu_build:
    strategy:
      fail-fast: false
      matrix:
        python-version: ['3.10']
        backend: [tensorflow]
    name: Run TPU tests
    runs-on:
      # - linux-x86-ct5lp-112-4tpu
      # - linux-x86-ct5lp-112-4tpu-fvn6n-runner-6kb8n
      - linux-x86-ct6e-44-1tpu
      # - linux-x86-ct6e-44-1tpu-4khbn-runner-x4st4
      # - linux-x86-ct6e-44-1tpu-4khbn-runner-45nmc

    container:
      image: docker:latest
    env:
      PYTHON: ${{ matrix.python-version }}
      KERAS_HOME: .github/workflows/config/${{ matrix.backend }}
      KERAS_BACKEND: tensorflow
    steps:
      - uses: actions/checkout@v4

      - name: Build and run Docker image for TPU tests
        run: |
          docker build -f .github/workflows/tpu/Dockerfile -t keras-tpu-test .
          # --privileged and --net=host give the container access to the host's TPU devices.
          docker run --rm --privileged --net=host \
            -e PYTHON=${{ matrix.python-version }} \
            -e KERAS_HOME=.github/workflows/config/${{ matrix.backend }} \
            -e KERAS_BACKEND=tensorflow \
            keras-tpu-test

  build:
    strategy:
      fail-fast: false
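The `docker run` step forwards the matrix values into the container as `-e KEY=VALUE` flags. A minimal Python sketch of the same command construction follows; `docker_run_args` is a hypothetical helper, not part of the workflow. It derives `KERAS_BACKEND` from the matrix instead of hard-coding `tensorflow`, which would keep the two in sync if more backends were added to the matrix.

```python
import shlex
from typing import Dict, List, Sequence


def docker_run_args(
    image: str, env: Dict[str, str], flags: Sequence[str] = ()
) -> List[str]:
    """Build a `docker run --rm` argument list with one `-e KEY=VALUE` per variable."""
    args = ["docker", "run", "--rm", *flags]
    for key, value in env.items():
        args += ["-e", f"{key}={value}"]
    args.append(image)  # the image name must come after all options
    return args


# Values mirroring the workflow's matrix (assumed single-entry, as above).
matrix = {"python-version": "3.10", "backend": "tensorflow"}
env = {
    "PYTHON": matrix["python-version"],
    "KERAS_HOME": f".github/workflows/config/{matrix['backend']}",
    "KERAS_BACKEND": matrix["backend"],
}
cmd = docker_run_args("keras-tpu-test", env)
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would execute it and raise `CalledProcessError` if the container exits with a non-zero status.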
28 changes: 28 additions & 0 deletions .github/workflows/tpu/Dockerfile
@@ -0,0 +1,28 @@
FROM python:3.10-slim

ENV KERAS_HOME=/github/workspace/.github/workflows/config/tensorflow \
    KERAS_BACKEND=tensorflow

RUN apt-get update && apt-get install -y --no-install-recommends \
    git \
    sudo \
    && rm -rf /var/lib/apt/lists/*

# Copy the entire codebase into the container.
COPY . /github/workspace
WORKDIR /github/workspace

# Install pip, setuptools, psutil and the TPU requirements at build time.
# The repository root is the build context, so no `cd` is needed.
RUN pip install -U pip setuptools && \
    pip install -U psutil && \
    pip install -r requirements-tensorflow-tpu.txt && \
    pip uninstall -y keras keras-nightly

# TPU devices are not visible during `docker build`, so the device check and
# the tests run when the container starts instead of in a RUN step.
CMD python3 -c 'import tensorflow as tf;print(tf.__version__);print(tf.config.list_physical_devices("TPU"))' && \
    python3 -c 'import tensorflow as tf;assert len(tf.config.list_physical_devices("TPU")) > 0' && \
    pytest keras --ignore keras/src/applications \
      --ignore keras/src/layers/merging/merging_test.py \
      --cov=keras \
      --cov-config=pyproject.toml
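The two `python3 -c` one-liners that check for a TPU could be folded into one script that prints the devices and returns a non-zero status when none are found. A minimal sketch; the module name `check_tpu.py` and the functions `require_devices` and `main` are hypothetical. The device lister is passed in as a callable so the logic can be exercised without TensorFlow installed.

```python
import sys
from typing import Callable, Sequence


def require_devices(
    list_devices: Callable[[str], Sequence[object]], device_type: str = "TPU"
) -> int:
    """Print the visible devices of `device_type`; return 1 if there are none."""
    devices = list(list_devices(device_type))
    print(f"{device_type} devices: {devices}")
    if not devices:
        print(f"error: no {device_type} devices visible", file=sys.stderr)
        return 1
    return 0


def main() -> int:
    # Imported lazily so require_devices can be used without TensorFlow.
    import tensorflow as tf

    print(tf.__version__)
    return require_devices(tf.config.list_physical_devices)
```

Saved as `check_tpu.py` at the repository root, it could be invoked with `python3 -c 'import check_tpu, sys; sys.exit(check_tpu.main())'` in place of the two one-liners.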
74 changes: 74 additions & 0 deletions .github/workflows/tpu/build.sh
@@ -0,0 +1,74 @@
#!/bin/bash
set -e
set -x

cd "${KOKORO_ROOT}/"

sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1

PYTHON_BINARY="/usr/bin/python3.10"

"${PYTHON_BINARY}" -m venv venv
source venv/bin/activate
# Check the python version
python --version
python3 --version

cd "src/github/keras"
pip install -U pip setuptools
# psutil is used by the background log reader
pip install -U psutil

if [ "$KERAS_BACKEND" == "tensorflow" ]
then
  echo "TensorFlow backend detected."
  pip install -r requirements-tensorflow-tpu.txt --progress-bar off --timeout 1000
  pip uninstall -y keras keras-nightly
  echo "Check that TensorFlow uses TPU"
  python3 -c 'import tensorflow as tf;print(tf.__version__);print(tf.config.list_physical_devices("TPU"))'
  # Raise error if TPU is not detected.
  python3 -c 'import tensorflow as tf;assert len(tf.config.list_physical_devices("TPU")) > 0'

  # TODO: keras/layers/merging/merging_test.py::MergingLayersTest::test_sparse_dot_2d Fatal Python error: Aborted
  pytest keras --ignore keras/src/applications \
    --ignore keras/src/layers/merging/merging_test.py \
    --cov=keras \
    --cov-config=pyproject.toml
fi

if [ "$KERAS_BACKEND" == "jax" ]
then
  echo "JAX backend detected."
  pip install -r requirements-jax-cuda.txt --progress-bar off --timeout 1000
  pip uninstall -y keras keras-nightly
  python3 -c 'import jax;print(jax.__version__);print(jax.default_backend())'
  # Raise error if GPU is not detected.
  python3 -c 'import jax;assert jax.default_backend().lower() == "gpu"'

  # TODO: keras/layers/merging/merging_test.py::MergingLayersTest::test_sparse_dot_2d Fatal Python error: Aborted
  # TODO: keras/trainers/data_adapters/py_dataset_adapter_test.py::PyDatasetAdapterTest::test_basic_flow0 Fatal Python error: Aborted
  # keras/backend/jax/distribution_lib_test.py is configured for CPU test for now.
  pytest keras --ignore keras/src/applications \
    --ignore keras/src/layers/merging/merging_test.py \
    --ignore keras/src/trainers/data_adapters/py_dataset_adapter_test.py \
    --ignore keras/src/backend/jax/distribution_lib_test.py \
    --ignore keras/src/distribution/distribution_lib_test.py \
    --cov=keras \
    --cov-config=pyproject.toml

  pytest keras/src/distribution/distribution_lib_test.py --cov=keras --cov-config=pyproject.toml
fi

if [ "$KERAS_BACKEND" == "torch" ]
then
  echo "PyTorch backend detected."
  pip install -r requirements-torch-cuda.txt --progress-bar off --timeout 1000
  pip uninstall -y keras keras-nightly
  python3 -c 'import torch;print(torch.__version__);print(torch.cuda.is_available())'
  # Raise error if GPU is not detected.
  python3 -c 'import torch;assert torch.cuda.is_available()'

  pytest keras --ignore keras/src/applications \
    --cov=keras \
    --cov-config=pyproject.toml

fi
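`build.sh` dispatches on `KERAS_BACKEND` to pick a requirements file and a set of `--ignore` paths. The table-driven Python sketch below restates that mapping; `REQUIREMENTS`, `IGNORED`, `pip_install_args` and `pytest_args` are hypothetical names, and the extra `distribution_lib_test.py` run in the JAX branch is not modeled. Unlike the script, which silently does nothing for an unrecognized backend, the sketch raises `ValueError`.

```python
from typing import Dict, List

# Requirements file installed for each backend, as in build.sh.
REQUIREMENTS: Dict[str, str] = {
    "tensorflow": "requirements-tensorflow-tpu.txt",
    "jax": "requirements-jax-cuda.txt",
    "torch": "requirements-torch-cuda.txt",
}

# Paths passed to `pytest --ignore` for each backend, as in build.sh.
IGNORED: Dict[str, List[str]] = {
    "tensorflow": [
        "keras/src/applications",
        "keras/src/layers/merging/merging_test.py",
    ],
    "jax": [
        "keras/src/applications",
        "keras/src/layers/merging/merging_test.py",
        "keras/src/trainers/data_adapters/py_dataset_adapter_test.py",
        "keras/src/backend/jax/distribution_lib_test.py",
        "keras/src/distribution/distribution_lib_test.py",
    ],
    "torch": ["keras/src/applications"],
}


def _check_backend(backend: str) -> None:
    if backend not in REQUIREMENTS:
        raise ValueError(f"unsupported KERAS_BACKEND: {backend!r}")


def pip_install_args(backend: str) -> List[str]:
    """Return the `pip install` command build.sh runs for `backend`."""
    _check_backend(backend)
    return ["pip", "install", "-r", REQUIREMENTS[backend],
            "--progress-bar", "off", "--timeout", "1000"]


def pytest_args(backend: str) -> List[str]:
    """Return the pytest arguments build.sh uses for `backend`."""
    _check_backend(backend)
    args = ["keras"]
    for path in IGNORED[backend]:
        args += ["--ignore", path]
    return args + ["--cov=keras", "--cov-config=pyproject.toml"]
```

With `pytest` installed, `pytest.main(pytest_args(os.environ["KERAS_BACKEND"]))` would run the selected tests in-process. Keeping the per-backend differences in data rather than in three near-identical `if` blocks makes it easier to see which tests each backend skips.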
16 changes: 16 additions & 0 deletions .github/workflows/tpu/tensorflow/continuous.cfg
@@ -0,0 +1,16 @@
build_file: "keras/.github/workflows/tpu/build.sh"

action {
  define_artifacts {
    regex: "**/sponge_log.log"
    regex: "**/sponge_log.xml"
  }
}

env_vars: {
  key: "KERAS_BACKEND"
  value: "tensorflow"
}

# Set timeout to 60 mins from default 180 mins
timeout_mins: 60
16 changes: 16 additions & 0 deletions .github/workflows/tpu/tensorflow/presubmit.cfg
@@ -0,0 +1,16 @@
build_file: "keras/.github/workflows/tpu/build.sh"

action {
  define_artifacts {
    regex: "**/sponge_log.log"
    regex: "**/sponge_log.xml"
  }
}

env_vars: {
  key: "KERAS_BACKEND"
  value: "tensorflow"
}

# Set timeout to 60 mins from default 180 mins
timeout_mins: 60
14 changes: 14 additions & 0 deletions requirements-tensorflow-tpu.txt
@@ -0,0 +1,14 @@
tensorflow==2.18.0
--find-links https://storage.googleapis.com/libtpu-tf-releases/index.html
tensorflow-tpu==2.18.0

tf2onnx

# Torch cpu-only version (needed for testing).
--extra-index-url https://download.pytorch.org/whl/cpu
torch==2.6.0

# Jax cpu-only version (needed for testing).
jax[cpu]

-r requirements-common.txt
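The requirements file mixes pip options (`--find-links`, `--extra-index-url`, `-r`) with package requirements. A small sketch for listing the exact pins follows, for example to confirm that `tensorflow` and `tensorflow-tpu` stay on the same version. `split_requirements` and `pinned_versions` are hypothetical helpers, and the comment rule approximates pip's (a `#` at the start of a line or after whitespace begins a comment).

```python
import re
from typing import Dict, List, Tuple

# A `#` at the start of a line or after whitespace begins a comment.
COMMENT_RE = re.compile(r"(^|\s+)#.*$")


def split_requirements(text: str) -> Tuple[List[str], List[str]]:
    """Split requirements-file text into (pip option lines, requirement lines)."""
    options: List[str] = []
    requirements: List[str] = []
    for raw in text.splitlines():
        line = COMMENT_RE.sub("", raw).strip()
        if not line:
            continue  # blank line or comment-only line
        (options if line.startswith("-") else requirements).append(line)
    return options, requirements


def pinned_versions(requirements: List[str]) -> Dict[str, str]:
    """Map package name to version for requirements pinned with `==`."""
    pins: Dict[str, str] = {}
    for req in requirements:
        if "==" in req:
            name, version = req.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


EXAMPLE = """\
tensorflow==2.18.0
--find-links https://storage.googleapis.com/libtpu-tf-releases/index.html
tensorflow-tpu==2.18.0

tf2onnx

# Torch cpu-only version (needed for testing).
--extra-index-url https://download.pytorch.org/whl/cpu
torch==2.6.0

# Jax cpu-only version (needed for testing).
jax[cpu]

-r requirements-common.txt
"""

options, requirements = split_requirements(EXAMPLE)
pins = pinned_versions(requirements)
print(pins)
```

Unpinned entries such as `tf2onnx` and `jax[cpu]` are kept in `requirements` but omitted from `pins`.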