[Doc]: Broken link in Kubernetes doc (huggingface#33879)
* add relative path in .md and redirects to conf.py

* add redirects to conf.py and update .md

* modify links in .md
saldanhad authored Oct 4, 2024
1 parent 124713c commit b6a01df
Showing 2 changed files with 11 additions and 10 deletions.
docs/source/en/_config.py (2 changes: 1 addition & 1 deletion)
@@ -11,4 +11,4 @@
"{processor_class}": "FakeProcessorClass",
"{model_class}": "FakeModelClass",
"{object_class}": "FakeObjectClass",
-}
+}
docs/source/en/perf_train_cpu_many.md (19 changes: 10 additions & 9 deletions)
@@ -138,16 +138,16 @@ Now, run the following command in node0 and **4DDP** will be enabled in node0 an
## Usage with Kubernetes

The same distributed training job from the previous section can be deployed to a Kubernetes cluster using the
-[Kubeflow PyTorchJob training operator](https://www.kubeflow.org/docs/components/training/pytorch/).
+[Kubeflow PyTorchJob training operator](https://www.kubeflow.org/docs/components/training/user-guides/pytorch).

### Setup

This example assumes that you have:
-* Access to a Kubernetes cluster with [Kubeflow installed](https://www.kubeflow.org/docs/started/installing-kubeflow/)
-* [`kubectl`](https://kubernetes.io/docs/tasks/tools/) installed and configured to access the Kubernetes cluster
-* A [Persistent Volume Claim (PVC)](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) that can be used
+* Access to a Kubernetes cluster with [Kubeflow installed](https://www.kubeflow.org/docs/started/installing-kubeflow)
+* [`kubectl`](https://kubernetes.io/docs/tasks/tools) installed and configured to access the Kubernetes cluster
+* A [Persistent Volume Claim (PVC)](https://kubernetes.io/docs/concepts/storage/persistent-volumes) that can be used
to store datasets and model files. There are multiple options for setting up the PVC including using an NFS
-[storage class](https://kubernetes.io/docs/concepts/storage/storage-classes/) or a cloud storage bucket.
+[storage class](https://kubernetes.io/docs/concepts/storage/storage-classes) or a cloud storage bucket.
* A Docker container that includes your model training script and all the dependencies needed to run the script. For
distributed CPU training jobs, this typically includes PyTorch, Transformers, Intel Extension for PyTorch, Intel
oneCCL Bindings for PyTorch, and OpenSSH to communicate between the containers.
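
For illustration, a minimal sketch of the kind of PVC the setup list above refers to; the name `transformers-pvc`, the `nfs-client` storage class, and the 50Gi size are assumed placeholders, not values from the guide:

```yaml
# Hypothetical PVC for shared datasets and model files.
# "nfs-client" is an assumed NFS-backed storage class -- use one that exists in your cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: transformers-pvc
spec:
  accessModes:
    - ReadWriteMany            # all worker pods mount the same shared volume
  storageClassName: nfs-client
  resources:
    requests:
      storage: 50Gi
```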
@@ -176,7 +176,7 @@ PyTorchJob to the cluster.

### PyTorchJob Specification File

-The [Kubeflow PyTorchJob](https://www.kubeflow.org/docs/components/training/pytorch/) is used to run the distributed
+The [Kubeflow PyTorchJob](https://www.kubeflow.org/docs/components/training/user-guides/pytorch) is used to run the distributed
training job on the cluster. The yaml file for the PyTorchJob defines parameters such as:
* The name of the PyTorchJob
* The number of replicas (workers)
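
As a rough sketch of where such parameters sit in a PyTorchJob spec (the image, replica count, and command below are placeholders, not the values from the guide's full yaml example):

```yaml
# Skeleton PyTorchJob -- illustrative only; see the guide's full yaml for real values.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: transformers-pytorchjob       # the name of the PyTorchJob
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 4                     # the number of replicas (workers)
      template:
        spec:
          containers:
            - name: pytorch
              image: your-registry/transformers-cpu-train:latest   # placeholder image
              command: ["torchrun", "/workspace/train.py"]         # placeholder command
```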
@@ -273,12 +273,13 @@ To run this example, update the yaml based on your training script and the nodes
<Tip>
-The CPU resource limits/requests in the yaml are defined in [cpu units](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu)
+The CPU resource limits/requests in the yaml are defined in
+[cpu units](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu)
where 1 CPU unit is equivalent to 1 physical CPU core or 1 virtual core (depending on whether the node is a physical
host or a VM). The amount of CPU and memory limits/requests defined in the yaml should be less than the amount of
available CPU/memory capacity on a single machine. It is usually a good idea to not use the entire machine's capacity in
order to leave some resources for the kubelet and OS. In order to get ["guaranteed"](https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#guaranteed)
-[quality of service](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/) for the worker pods,
+[quality of service](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod) for the worker pods,
set the same CPU and memory amounts for both the resource limits and requests.
</Tip>
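
To make the guaranteed-QoS point concrete, a hedged example of a worker container's `resources` stanza with identical requests and limits (the numbers are placeholders and should be sized below the node's capacity):

```yaml
# Identical requests and limits place the pod in the "Guaranteed" QoS class.
# The cpu/memory values are placeholders -- keep them below the node's total capacity.
resources:
  requests:
    cpu: "200"        # 200 CPU units = 200 physical or virtual cores
    memory: 128Gi
  limits:
    cpu: "200"
    memory: 128Gi
```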
@@ -318,4 +319,4 @@ with the job, the PyTorchJob resource can be deleted from the cluster using `kub

This guide covered running distributed PyTorch training jobs using multiple CPUs on bare metal and on a Kubernetes
cluster. Both cases utilize Intel Extension for PyTorch and Intel oneCCL Bindings for PyTorch for optimal training
-performance, and can be used as a template to run your own workload on multiple nodes.
+performance, and can be used as a template to run your own workload on multiple nodes.
