Feat: adds huggingface bloom example to torchserve (kubeflow#2466)

* Feat: add huggingface bloom example to torchserve Signed-off-by: Jagadeesh J <jagadeeshj@ideas2it.com> * fix: add pv and helper pod yaml - update readme doc - add storageclass yaml - fix config.properties Signed-off-by: Jagadeesh J <jagadeeshj@ideas2it.com> * fix: add steps for model sharding
magdalenakuhn17 · Mar 4, 2023 · 38465ea · 38465ea
1 parent aed4d5b
commit 38465ea
Show file tree

Hide file tree

Showing 6 changed files with 255 additions and 0 deletions.
diff --git a/docs/samples/v1beta1/torchserve/v1/bloom/README.md b/docs/samples/v1beta1/torchserve/v1/bloom/README.md
@@ -0,0 +1,187 @@
+# TorchServe example with Huggingface BLOOM model
+In this example we will show how to serve [Large Huggingface models with TorchServe](https://github.com/pytorch/serve/tree/master/examples/Huggingface_Largemodels)
+on KServe.
+
+## Model archive file creation
+
+Clone [pytorch/serve](https://github.com/pytorch/serve) repository,
+navigate to `examples/Huggingface_Largemodels` and follow the steps for creating the MAR file including serialized model and other dependent files.
+
+The above Torchserve example works on shard version of Huggingface models.
+
+For sharding the Huggingface models you can use the following script, and then [compress the model](https://github.com/pytorch/serve/tree/master/examples/Huggingface_Largemodels#step-2-compress-downloaded-model)
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model_name="bigscience/bloomz-7b1"
+model = AutoModelForCausalLM.from_pretrained(model_name)
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model.save_pretrained("model/"+model_name, max_shard_size="5GB")
+tokenizer.save_pretrained("model/"+model_name)
+```
+
+## Create NVMe Persistent Volume
+
+Use SSH to connect to the worker nodes and prepare the NVMe drives for Kubernetes, as follows.
+
+Run the `lsblk`  command on each worker node to lists the available disks. 
+
+```bash
+NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
+nvme1n1       259:0    0 116.4G  0 disk 
+nvme0n1       259:1    0    80G  0 disk 
+├─nvme0n1p1   259:2    0    80G  0 part /
+└─nvme0n1p128 259:3    0     1M  0 part 
+```
+
+Output of the command line that shows the result of running the lsblk command.
+
+
+```bash
+$ sudo mkfs.xfs /dev/nvme1n1
+```
+
+```bash
+$ sudo mkdir -p /mnt/data/vol1
+$ sudo chmod -R 777 /mnt     
+$ sudo mount /dev/nvme1n1 /mnt/data/vol1
+```
+
+Permanently mount the device:
+
+```bash
+$ sudo blkid /dev/nvme1n1
+```
+
+To get it to mount every time, add the following line to the /etc/fstab file:
+
+```bash
+UUID=nvme_UUID      /mnt/data/vol1   xfs    defaults,nofail    0    2
+```
+
+Clone the local provisioner repository:
+
+```bash
+$ git clone https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner.git
+```
+
+Create a StorageClass yaml file `storageclass.yaml`
+
+```yaml
+kind: StorageClass
+apiVersion: storage.k8s.io/v1
+metadata:
+  name: fast-disks
+provisioner: kubernetes.io/no-provisioner
+volumeBindingMode: WaitForFirstConsumer
+```
+
+```bash
+$ kubectl apply -f storageclass.yaml
+```
+
+Create Local Persistent Volumes for Kubernetes
+
+- Change `hostDir` to the mount path
+
+```bash
+cd sig-storage-local-static-provisioner
+helm template ./helm/provisioner > ./provisioner/deployment/kubernetes/provisioner_generated.yaml
+
+kubectl apply -f ./deployment/kubernetes/provisioner_generated.yaml
+```
+
+Output of the command line that shows the result of running the kubectl get pods command.
+
+```bash
+NAME                                         READY   STATUS    RESTARTS   AGE
+kserve-controller-manager-5c5c4d8c89-lrzbd   2/2     Running   0          4d2h
+local-nvme-pv-provisioner-vwxgt              1/1     Running   0          16m
+```
+
+```bash
+$ kubectl get pv
+
+NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                     STORAGECLASS   REASON   AGE
+local-pv-2a85b8ac                          116Gi      RWO            Delete           Bound       kserve-test/model-cache   fast-disks              4d3h
+```
+## Create PVC and Mount the Model.
+
+```bash
+kubectl apply -f pvc-pod.yaml
+```
+
+Refer: [Bloom Model Example](https://github.com/pytorch/serve/tree/master/examples/Huggingface_Largemodels)
+
+Move the config.properties and MAR file to PVC in the below structure.
+
+```bash
+|_model-store
+  |_bloom-560m.mar
+|_config
+  |_config.properties
+```
+
+```bash
+kubectl exec -it pv-pod -- mkdir /pv/config
+kubectl exec -it pv-pod -- mkdir /pv/model-store
+
+kubectl cp config.properties pv-pod:/pv/config
+kubectl cp bloom-560m.mar -it pv-pod:/pv/config
+```
+
+## Create the InferenceService
+
+Apply the CRD
+
+```bash
+kubectl apply -f bloom-560m.yaml
+```
+
+Expected Output
+
+```bash
+$inferenceservice.serving.kserve.io/torchserve-bloom-560m created
+```
+
+## Run a prediction
+
+The first step is to [determine the ingress IP and ports](https://kserve.github.io/website/0.10/get_started/first_isvc/#4-determine-the-ingress-ip-and-ports) and set `INGRESS_HOST` and `INGRESS_PORT`
+
+```bash
+MODEL_NAME=BLOOMSeqClassification
+ISVC_NAME=torchserve-bloom-560m
+SERVICE_HOSTNAME=$(kubectl get inferenceservice ${ISVC_NAME} -n <namespace> -o jsonpath='{.status.url}' | cut -d "/" -f 3)
+
+curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict -d ./sample_text.txt
+```
+
+Expected Output
+
+```bash
+*   Trying 44.239.20.204...
+* Connected to a881f5a8c676a41edbccdb0a394a80d6-2069247558.us-west-2.elb.amazonaws.com (44.239.20.204) port 80 (#0)
+> PUT /v1/models/BLOOMSeqClassification:predict HTTP/1.1
+> Host: torchserve-bloom-560m.kserve-test.example.com
+> User-Agent: curl/7.47.0
+> Accept: */*
+> Content-Length: 79
+> Expect: 100-continue
+>
+< HTTP/1.1 100 Continue
+* We are completely uploaded and fine
+< HTTP/1.1 200 OK
+< cache-control: no-cache; no-store, must-revalidate, private
+< content-length: 8
+< date: Wed, 04 Nov 2020 10:54:49 GMT
+< expires: Thu, 01 Jan 1970 00:00:00 UTC
+< pragma: no-cache
+< x-request-id: 4b54d3ac-185f-444c-b344-b8a785fdeb50
+< x-envoy-upstream-service-time: 2085
+< server: istio-envoy
+<
+* Connection #0 to host torchserve-bloom-560m.kserve-test.example.com left intact
+Accepted
+```
+
+**__Note__** For larger models use `A100 80g` GPU instances. 
diff --git a/docs/samples/v1beta1/torchserve/v1/bloom/bloom-560m.yaml b/docs/samples/v1beta1/torchserve/v1/bloom/bloom-560m.yaml
@@ -0,0 +1,14 @@
+apiVersion: serving.kserve.io/v1beta1
+kind: InferenceService
+metadata:
+  name: "torchserve-bloom-560m"
+spec:
+  predictor:
+    pytorch:
+      storageUri: pvc://model-cache
+      resources:
+          limits:
+            cpu: 4
+            memory: 16Gi
+            nvidia.com/gpu: 1
+  nodeName: ip-xxx-xxx-xxx-xxx.us-west-2.compute.internal
diff --git a/docs/samples/v1beta1/torchserve/v1/bloom/config.properties b/docs/samples/v1beta1/torchserve/v1/bloom/config.properties
@@ -0,0 +1,13 @@
+inference_address=http://0.0.0.0:8085
+management_address=http://0.0.0.0:8085
+metrics_address=http://0.0.0.0:8082
+grpc_inference_port=7070
+grpc_management_port=7071
+enable_metrics_api=true
+metrics_format=prometheus
+number_of_netty_threads=4
+job_queue_size=10
+enable_envvars_config=true
+install_py_dep_per_model=true
+model_store=/mnt/models/model-store
+model_snapshot={"name":"startup.cfg","modelCount":1,"models":{"bloom":{"1.0":{"defaultVersion":true,"marName":"bloom-560m.mar","minWorkers":1,"maxWorkers":5,"batchSize":1,"maxBatchDelay":5000,"responseTimeout":120}}}}
diff --git a/docs/samples/v1beta1/torchserve/v1/bloom/pvc-pod.yml b/docs/samples/v1beta1/torchserve/v1/bloom/pvc-pod.yml
@@ -0,0 +1,28 @@
+
+kind: PersistentVolumeClaim
+apiVersion: v1
+metadata:
+  name: model-local-claim
+spec:
+  accessModes:
+  - ReadWriteOnce
+  resources:
+    requests:
+      storage: 700Gi 
+  storageClassName: fast-disks
+---
+apiVersion: v1
+kind: Pod
+metadata:
+  name: pv-pod
+spec:
+  volumes:
+    - name: pv-storage
+      persistentVolumeClaim:
+        claimName: model-local-claim
+  containers:
+    - name: pv-container
+      image: alpine
+      volumeMounts:
+        - mountPath: "/pv"
+          name: pv-storage
diff --git a/docs/samples/v1beta1/torchserve/v1/bloom/sample_text.txt b/docs/samples/v1beta1/torchserve/v1/bloom/sample_text.txt
@@ -0,0 +1,7 @@
+{
+  "instances": [
+    {
+      "data": "My dog is cute"
+    }
+  ]
+}
diff --git a/docs/samples/v1beta1/torchserve/v1/bloom/storageclass.yaml b/docs/samples/v1beta1/torchserve/v1/bloom/storageclass.yaml
@@ -0,0 +1,6 @@
+kind: StorageClass
+apiVersion: storage.k8s.io/v1
+metadata:
+    name: fast-disks
+provisioner: kubernetes.io/no-provisioner
+volumeBindingMode: WaitForFirstConsumer