Adds the container packages for BERT large inference and training (PyTorch SPR) (#83)

* Add specs, docs, and quickstarts for BERT inference and training

* Add build and run scripts

* Update mount paths

* update base FROM

* Update spec to add quickstarts

* update wrapper to include run.sh

* Update path

* Update pip install -y

* Update bert installs

* Regenerate dockerfile

* Update dockerfile for bert train

* Update installs

* Doc updates

* Update dockerfile and run after testing training

* remove bert inf files from dockerfile

* Small doc updates

* Add shm-size 8G

* Fix error message

* Fix env var usages in build.sh

* Regenerate dockerfiles

* update conda activate partial

* Add build tools

* quickstart script updates

* Clarify dataset download instructions and switch CHECKPOINT_DIR to CONFIG_FILE

* Update quickstart and docs to have phase 2 use checkpoints from phase 1

* Fix script
dmsuehir authored Aug 30, 2021
1 parent 787b88b commit e5de734
Showing 32 changed files with 1,269 additions and 1 deletion.
93 changes: 93 additions & 0 deletions dockerfiles/pytorch/pytorch-spr-bert-large-inference.Dockerfile
@@ -0,0 +1,93 @@
# Copyright (c) 2020-2021 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
#
# THIS IS A GENERATED DOCKERFILE.
#
# This file was assembled from multiple pieces, whose use is documented
# throughout. Please refer to the TensorFlow dockerfiles documentation
# for more information.

ARG PYTORCH_IMAGE="model-zoo"
ARG PYTORCH_TAG="pytorch-ipex-spr"

FROM ${PYTORCH_IMAGE}:${PYTORCH_TAG} AS intel-optimized-pytorch

RUN yum --enablerepo=extras install -y epel-release && \
yum install -y \
ca-certificates \
git \
wget \
make \
cmake \
gcc-c++ \
gcc \
autoconf \
bzip2 \
tar

RUN source activate pytorch && \
pip install matplotlib Pillow pycocotools && \
pip install yacs opencv-python cityscapesscripts transformers && \
conda install -y libopenblas && \
mkdir -p /workspace/installs && \
cd /workspace/installs && \
wget https://github.com/gperftools/gperftools/releases/download/gperftools-2.7.90/gperftools-2.7.90.tar.gz && \
tar -xzf gperftools-2.7.90.tar.gz && \
cd gperftools-2.7.90 && \
./configure --prefix=$HOME/.local && \
make && \
make install && \
rm -rf /workspace/installs/

ARG PACKAGE_DIR=model_packages

ARG PACKAGE_NAME="pytorch-spr-bert-large-inference"

ARG MODEL_WORKSPACE

# ${MODEL_WORKSPACE} and below needs to be owned by root:root rather than the current UID:GID
# this allows the default user (root) to work in k8s single-node, multi-node
RUN umask 002 && mkdir -p ${MODEL_WORKSPACE} && chgrp root ${MODEL_WORKSPACE} && chmod g+s+w,o+s+r ${MODEL_WORKSPACE}

ADD --chown=0:0 ${PACKAGE_DIR}/${PACKAGE_NAME}.tar.gz ${MODEL_WORKSPACE}

RUN chown -R root ${MODEL_WORKSPACE}/${PACKAGE_NAME} && chgrp -R root ${MODEL_WORKSPACE}/${PACKAGE_NAME} && chmod -R g+s+w ${MODEL_WORKSPACE}/${PACKAGE_NAME} && find ${MODEL_WORKSPACE}/${PACKAGE_NAME} -type d | xargs chmod o+r+x

WORKDIR ${MODEL_WORKSPACE}/${PACKAGE_NAME}

ARG BERT_DIR="/workspace/pytorch-spr-bert-large-inference/models/bert"

RUN source activate pytorch && \
cd ${BERT_DIR} && \
cd bert && \
pip install -r examples/requirements.txt && \
pip install -e . && \
conda install -c conda-forge "llvm-openmp"

FROM intel-optimized-pytorch AS release
COPY --from=intel-optimized-pytorch /root/conda /root/conda
COPY --from=intel-optimized-pytorch /workspace/lib/ /workspace/lib/
COPY --from=intel-optimized-pytorch /root/.local/ /root/.local/

ENV DNNL_MAX_CPU_ISA="AVX512_CORE_AMX"

ENV PATH="~/conda/bin:${PATH}"
ENV LD_PRELOAD="/workspace/lib/jemalloc/lib/libjemalloc.so:$LD_PRELOAD"
ENV MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000"
ENV BASH_ENV=/root/.bash_profile
WORKDIR /workspace/
RUN yum install -y numactl mesa-libGL && \
yum clean all && \
echo "source activate pytorch" >> /root/.bash_profile
99 changes: 99 additions & 0 deletions dockerfiles/pytorch/pytorch-spr-bert-large-training.Dockerfile
@@ -0,0 +1,99 @@
# Copyright (c) 2020-2021 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
#
# THIS IS A GENERATED DOCKERFILE.
#
# This file was assembled from multiple pieces, whose use is documented
# throughout. Please refer to the TensorFlow dockerfiles documentation
# for more information.

ARG PYTORCH_IMAGE="model-zoo"
ARG PYTORCH_TAG="pytorch-ipex-spr"

FROM ${PYTORCH_IMAGE}:${PYTORCH_TAG} AS intel-optimized-pytorch

RUN yum --enablerepo=extras install -y epel-release && \
yum install -y \
ca-certificates \
git \
wget \
make \
cmake \
gcc-c++ \
gcc \
autoconf \
bzip2 \
tar

RUN source activate pytorch && \
pip install matplotlib Pillow pycocotools && \
pip install yacs opencv-python cityscapesscripts transformers && \
conda install -y libopenblas && \
mkdir -p /workspace/installs && \
cd /workspace/installs && \
wget https://github.com/gperftools/gperftools/releases/download/gperftools-2.7.90/gperftools-2.7.90.tar.gz && \
tar -xzf gperftools-2.7.90.tar.gz && \
cd gperftools-2.7.90 && \
./configure --prefix=$HOME/.local && \
make && \
make install && \
rm -rf /workspace/installs/

ARG PACKAGE_DIR=model_packages

ARG PACKAGE_NAME="pytorch-spr-bert-large-training"

ARG MODEL_WORKSPACE

# ${MODEL_WORKSPACE} and below needs to be owned by root:root rather than the current UID:GID
# this allows the default user (root) to work in k8s single-node, multi-node
RUN umask 002 && mkdir -p ${MODEL_WORKSPACE} && chgrp root ${MODEL_WORKSPACE} && chmod g+s+w,o+s+r ${MODEL_WORKSPACE}

ADD --chown=0:0 ${PACKAGE_DIR}/${PACKAGE_NAME}.tar.gz ${MODEL_WORKSPACE}

RUN chown -R root ${MODEL_WORKSPACE}/${PACKAGE_NAME} && chgrp -R root ${MODEL_WORKSPACE}/${PACKAGE_NAME} && chmod -R g+s+w ${MODEL_WORKSPACE}/${PACKAGE_NAME} && find ${MODEL_WORKSPACE}/${PACKAGE_NAME} -type d | xargs chmod o+r+x

WORKDIR ${MODEL_WORKSPACE}/${PACKAGE_NAME}

ARG BERT_DIR="/workspace/pytorch-spr-bert-large-training/models/bert/bert"

RUN source activate pytorch && \
cd ${BERT_DIR} && \
pip install --upgrade pip && \
pip install -r examples/requirements.txt && \
pip install -e . && \
pip install datasets accelerate tfrecord && \
conda install openblas && \
conda install faiss-cpu -c pytorch && \
pip install transformers==4.9.0

RUN cd .. && \
rm -rf ${BERT_DIR}

FROM intel-optimized-pytorch AS release
COPY --from=intel-optimized-pytorch /root/conda /root/conda
COPY --from=intel-optimized-pytorch /workspace/lib/ /workspace/lib/
COPY --from=intel-optimized-pytorch /root/.local/ /root/.local/

ENV DNNL_MAX_CPU_ISA="AVX512_CORE_AMX"

ENV PATH="~/conda/bin:${PATH}"
ENV LD_PRELOAD="/workspace/lib/jemalloc/lib/libjemalloc.so:$LD_PRELOAD"
ENV MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000"
ENV BASH_ENV=/root/.bash_profile
WORKDIR /workspace/
RUN yum install -y numactl mesa-libGL && \
yum clean all && \
echo "source activate pytorch" >> /root/.bash_profile
@@ -0,0 +1,26 @@
## Build the container

The <model name> <mode> package has scripts and a Dockerfile that are
used to build a workload container that runs the model. This container
uses the PyTorch/IPEX container as its base, so ensure that you have built
the `pytorch-ipex-spr.tar.gz` container prior to building this model container.

Use `docker images` to verify that you have the base container built. For example:
```
$ docker images | grep pytorch-ipex-spr
model-zoo pytorch-ipex-spr fecc7096a11e 40 minutes ago 8.31GB
```

To build the <model name> <mode> container, extract the package and
run the `build.sh` script.
```
# Extract the package
tar -xzf <package name>
cd <package dir>
# Build the container
./build.sh
```

After the build completes, you should have a container called
`<docker image>` that will be used to run the model.
@@ -0,0 +1,5 @@
<!-- 10. Description -->
## Description

This document has instructions for running <model name> <mode> using
Intel-optimized PyTorch.
@@ -0,0 +1,30 @@
## Run the model

Download the pretrained model from huggingface and set the `PRETRAINED_MODEL` environment
variable to point to the downloaded file.
```
wget https://cdn.huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad-pytorch_model.bin -O pytorch_model.bin
export PRETRAINED_MODEL=$(pwd)/pytorch_model.bin
```

Once you have the pretrained model and have [built the container](#build-the-container),
use the `run.sh` script from the container package to run <model name> <mode> in docker.
Set environment variables to specify the precision to run and an output directory.
By default, the `run.sh` script will run the `inference_realtime.sh` quickstart script.
To run a different script, specify the name of the script using the `SCRIPT` environment
variable.
```
# Navigate to the container package directory
cd <package dir>
# Set the required environment vars
export PRETRAINED_MODEL=<path to the downloaded model>
export PRECISION=<specify the precision to run>
export OUTPUT_DIR=<directory where log files will be written>
# Run the container with inference_realtime.sh quickstart script
./run.sh
# Use the SCRIPT env var to run a different quickstart script
SCRIPT=accuracy.sh ./run.sh
```
@@ -0,0 +1,4 @@
<!--- 80. License -->
## License

Licenses can be found in the model package, in the `licenses` directory.
@@ -0,0 +1,8 @@
<!--- 40. Quick Start Scripts -->
## Quick Start Scripts

| Script name | Description |
|-------------|-------------|
| `inference_realtime.sh` | Runs multi instance realtime inference using 4 cores per instance for the specified precision (fp32, int8 or bf16) using the [huggingface pretrained model](https://cdn.huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad-pytorch_model.bin). |
| `inference_throughput.sh` | Runs multi instance batch inference using 1 instance per socket for the specified precision (fp32, int8 or bf16) using the [huggingface pretrained model](https://cdn.huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad-pytorch_model.bin). |
| `accuracy.sh` | Measures the inference accuracy for the specified precision (fp32, int8 or bf16) using the [huggingface pretrained model](https://cdn.huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad-pytorch_model.bin). |
@@ -0,0 +1,2 @@
<!--- 0. Title -->
# PyTorch <model name> <mode>
@@ -0,0 +1,16 @@
## Model Package

The model package includes the Dockerfile and scripts needed to build and
run <model name> <mode> in a container.
```
<package dir>
├── README.md
├── build.sh
├── licenses
│   ├── LICENSE
│   └── third_party
├── model_packages
│   └── <package name>
├── <package dir>.Dockerfile
└── run.sh
```
@@ -0,0 +1,98 @@
<!--- 0. Title -->
# PyTorch BERT Large inference

<!-- 10. Description -->
## Description

This document has instructions for running BERT Large inference using
Intel-optimized PyTorch.

## Model Package

The model package includes the Dockerfile and scripts needed to build and
run BERT Large inference in a container.
```
pytorch-spr-bert-large-inference
├── README.md
├── build.sh
├── licenses
│   ├── LICENSE
│   └── third_party
├── model_packages
│   └── pytorch-spr-bert-large-inference.tar.gz
├── pytorch-spr-bert-large-inference.Dockerfile
└── run.sh
```

<!--- 40. Quick Start Scripts -->
## Quick Start Scripts

| Script name | Description |
|-------------|-------------|
| `inference_realtime.sh` | Runs multi instance realtime inference using 4 cores per instance for the specified precision (fp32, int8 or bf16) using the [huggingface pretrained model](https://cdn.huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad-pytorch_model.bin). |
| `inference_throughput.sh` | Runs multi instance batch inference using 1 instance per socket for the specified precision (fp32, int8 or bf16) using the [huggingface pretrained model](https://cdn.huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad-pytorch_model.bin). |
| `accuracy.sh` | Measures the inference accuracy for the specified precision (fp32, int8 or bf16) using the [huggingface pretrained model](https://cdn.huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad-pytorch_model.bin). |

## Build the container

The BERT Large inference package has scripts and a Dockerfile that are
used to build a workload container that runs the model. This container
uses the PyTorch/IPEX container as its base, so ensure that you have built
the `pytorch-ipex-spr.tar.gz` container prior to building this model container.

Use `docker images` to verify that you have the base container built. For example:
```
$ docker images | grep pytorch-ipex-spr
model-zoo pytorch-ipex-spr fecc7096a11e 40 minutes ago 8.31GB
```

To build the BERT Large inference container, extract the package and
run the `build.sh` script.
```
# Extract the package
tar -xzf pytorch-spr-bert-large-inference.tar.gz
cd pytorch-spr-bert-large-inference
# Build the container
./build.sh
```
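The `build.sh` script itself is not shown in this commit, so the following is only a minimal sketch of the kind of `docker build` invocation it presumably wraps. The `PYTORCH_IMAGE`, `PYTORCH_TAG`, and `PACKAGE_NAME` values match the `ARG`s declared in the Dockerfile above; the `MODEL_WORKSPACE` value and the idea of composing the command into a string (rather than running it directly) are assumptions for illustration.
```shell
# Hypothetical sketch of the docker build command behind build.sh.
# PYTORCH_IMAGE/PYTORCH_TAG/PACKAGE_NAME mirror the Dockerfile ARGs;
# MODEL_WORKSPACE=/workspace is an assumed value (it has no default).
PYTORCH_IMAGE=${PYTORCH_IMAGE:-model-zoo}
PYTORCH_TAG=${PYTORCH_TAG:-pytorch-ipex-spr}
PACKAGE_NAME=pytorch-spr-bert-large-inference
IMAGE_NAME=${IMAGE_NAME:-model-zoo:pytorch-bert-large-inference}

# Compose the command as a string so it can be inspected before running.
BUILD_CMD="docker build \
  --build-arg PYTORCH_IMAGE=${PYTORCH_IMAGE} \
  --build-arg PYTORCH_TAG=${PYTORCH_TAG} \
  --build-arg PACKAGE_NAME=${PACKAGE_NAME} \
  --build-arg MODEL_WORKSPACE=/workspace \
  -t ${IMAGE_NAME} \
  -f ${PACKAGE_NAME}.Dockerfile ."
echo "${BUILD_CMD}"
```
Because `PACKAGE_DIR` already defaults to `model_packages` in the Dockerfile, only `MODEL_WORKSPACE` strictly needs to be passed as a build arg here.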

After the build completes, you should have a container called
`model-zoo:pytorch-bert-large-inference` that will be used to run the model.

## Run the model

Download the pretrained model from huggingface and set the `PRETRAINED_MODEL` environment
variable to point to the downloaded file.
```
wget https://cdn.huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad-pytorch_model.bin -O pytorch_model.bin
export PRETRAINED_MODEL=$(pwd)/pytorch_model.bin
```

Once you have the pretrained model and have [built the container](#build-the-container),
use the `run.sh` script from the container package to run BERT Large inference in docker.
Set environment variables to specify the precision to run and an output directory.
By default, the `run.sh` script will run the `inference_realtime.sh` quickstart script.
To run a different script, specify the name of the script using the `SCRIPT` environment
variable.
```
# Navigate to the container package directory
cd pytorch-spr-bert-large-inference
# Set the required environment vars
export PRETRAINED_MODEL=<path to the downloaded model>
export PRECISION=<specify the precision to run>
export OUTPUT_DIR=<directory where log files will be written>
# Run the container with inference_realtime.sh quickstart script
./run.sh
# Use the SCRIPT env var to run a different quickstart script
SCRIPT=accuracy.sh ./run.sh
```
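As with `build.sh`, the `run.sh` script is not shown in this commit; the sketch below illustrates the kind of `docker run` command it likely wraps. The `--shm-size 8G` flag comes from the commit log above, while the container-side mount points (`/pytorch_model.bin`, `/output`) and the `quickstart/` script location are assumptions for illustration.
```shell
# Hypothetical sketch of the docker run command behind run.sh.
# --shm-size 8G is named in the commit log; the in-container paths
# and the quickstart/ directory are assumed, not confirmed.
PRETRAINED_MODEL=${PRETRAINED_MODEL:-$(pwd)/pytorch_model.bin}
PRECISION=${PRECISION:-fp32}
OUTPUT_DIR=${OUTPUT_DIR:-$(pwd)/output}
SCRIPT=${SCRIPT:-inference_realtime.sh}
IMAGE_NAME=${IMAGE_NAME:-model-zoo:pytorch-bert-large-inference}

# Compose the command as a string so it can be inspected before running.
RUN_CMD="docker run --rm \
  --shm-size 8G \
  --env PRECISION=${PRECISION} \
  --env OUTPUT_DIR=/output \
  --volume ${PRETRAINED_MODEL}:/pytorch_model.bin \
  --volume ${OUTPUT_DIR}:/output \
  ${IMAGE_NAME} \
  /bin/bash quickstart/${SCRIPT}"
echo "${RUN_CMD}"
```
Only `SCRIPT` changes when selecting a different quickstart script, e.g. `SCRIPT=accuracy.sh`, which matches how the docs above describe `run.sh`.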

<!--- 80. License -->
## License

Licenses can be found in the model package, in the `licenses` directory.
