PyTorch Support (kubeflow#153)

* initial pytorch support
* modifying go crd files
* modifying go crd files
* test placeholder
* cleaning up test placeholder file
* updating with cifar10 sample
* updating cifar10 instructions
* correcting docker steps
* adding pytorch dockerfile
* removing generated data files
* correcting the sample pytorch yaml file
* adding the class file and class name parameters
* adding cifar10 input file and dockerfile
* adding gcs location for model file
* addressing review comments
* simplifying PyTorch interface
* making model class name optional
* fix the comment ordering
* removing model file
* adding default behaviour
magdalenakuhn17 · Jun 27, 2019 · 61a35a7
1 parent f5a749e
Showing 16 changed files with 3,977 additions and 0 deletions.
diff --git a/config/default/crds/serving_v1alpha1_kfservice.yaml b/config/default/crds/serving_v1alpha1_kfservice.yaml
@@ -65,6 +65,21 @@ spec:
                     0 in case of no traffic
                   format: int64
                   type: integer
+                pytorch:
+                  properties:
+                    modelClassName:
+                      description: Name of the model class for PyTorch model
+                      type: string
+                    modelUri:
+                      type: string
+                    resources:
+                      description: Defaults to requests and limits of 1CPU, 2Gb MEM.
+                      type: object
+                    runtimeVersion:
+                      type: string
+                  required:
+                  - modelUri
+                  type: object
                 serviceAccountName:
                   description: Service Account Name
                   type: string
@@ -144,6 +159,21 @@ spec:
                     0 in case of no traffic
                   format: int64
                   type: integer
+                pytorch:
+                  properties:
+                    modelClassName:
+                      description: Name of the model class for PyTorch model
+                      type: string
+                    modelUri:
+                      type: string
+                    resources:
+                      description: Defaults to requests and limits of 1CPU, 2Gb MEM.
+                      type: object
+                    runtimeVersion:
+                      type: string
+                  required:
+                  - modelUri
+                  type: object
                 serviceAccountName:
                   description: Service Account Name
                   type: string

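The `pytorch` schema added above makes `modelUri` the only required field and declares `modelClassName` and `runtimeVersion` as strings and `resources` as an object. As a rough illustration of those constraints (not KFServing code; the helper name `validate_pytorch_spec` and the bucket path are hypothetical), a client-side check might look like this:

```python
# Hypothetical helper, not part of KFServing: it only mirrors the
# `pytorch` schema shown in the CRD diff above.
PYTORCH_STRING_FIELDS = ("modelUri", "modelClassName", "runtimeVersion")
PYTORCH_REQUIRED_FIELDS = ("modelUri",)


def validate_pytorch_spec(spec):
    """Return a list of problems; an empty list means the spec passes."""
    problems = []
    # `required: [modelUri]` in the schema
    for name in PYTORCH_REQUIRED_FIELDS:
        if name not in spec:
            problems.append("missing required field: " + name)
    # `type: string` fields in the schema
    for name in PYTORCH_STRING_FIELDS:
        if name in spec and not isinstance(spec[name], str):
            problems.append(name + " must be a string")
    # `resources` is declared as `type: object`
    if "resources" in spec and not isinstance(spec["resources"], dict):
        problems.append("resources must be an object")
    return problems


if __name__ == "__main__":
    # The bucket path is a placeholder, not a real model location.
    print(validate_pytorch_spec({"modelUri": "gs://example-bucket/pytorch/cifar10"}))
    print(validate_pytorch_spec({"modelClassName": "Net"}))
```

The API server enforces the real schema when the resource is applied; a check like this would only catch mistakes earlier, before calling `kubectl apply`.
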
diff --git a/docs/samples/pytorch/README.md b/docs/samples/pytorch/README.md
@@ -0,0 +1,112 @@
+## Creating your own model and testing the PyTorch server.
+
+To test the [PyTorch](https://pytorch.org/) server, first we need to generate a simple cifar10 model using PyTorch. 
+
+```shell
+python cifar10.py
+```
+You should see output similar to this:
+
+```shell
+Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
+Failed download. Trying https -> http instead. Downloading http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
+100.0%Files already downloaded and verified
+[1,  2000] loss: 2.232
+[1,  4000] loss: 1.913
+[1,  6000] loss: 1.675
+[1,  8000] loss: 1.555
+[1, 10000] loss: 1.492
+[1, 12000] loss: 1.488
+[2,  2000] loss: 1.412
+[2,  4000] loss: 1.358
+[2,  6000] loss: 1.362
+[2,  8000] loss: 1.338
+[2, 10000] loss: 1.315
+[2, 12000] loss: 1.278
+Finished Training
+```
+
+Then, we can run the PyTorch server using the trained model and test for predictions. Models can be stored on the local filesystem, in S3-compatible object storage, or in Google Cloud Storage.
+
+Note: KFServing currently supports PyTorch models saved using the [state_dict method](https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-model-for-inference), PyTorch's recommended way of saving models for inference. The KFServing interface for PyTorch expects users to upload the model_class_file to the same location as the PyTorch model, and accepts an optional model_class_name as a runtime input. If the model class name is not specified, 'PyTorchModel' is used as the default class name. The current interface may change as it evolves to support PyTorch models saved using other methods.
+
+```shell
+python -m pytorchserver --model_dir ./ --model_name pytorchmodel --model_class_name Net
+```
+
+We can also use PyTorch's built-in support for sample datasets to run some simple predictions:
+
+```python
+import requests
+import torch
+import torchvision
+import torchvision.transforms as transforms
+
+# Apply the same preprocessing that cifar10.py uses for training
+transform = transforms.Compose([transforms.ToTensor(),
+                                transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
+testset = torchvision.datasets.CIFAR10(root='./data', train=False,
+                                       download=True, transform=transform)
+testloader = torch.utils.data.DataLoader(testset, batch_size=4,
+                                         shuffle=False, num_workers=2)
+dataiter = iter(testloader)
+# The built-in next() works across PyTorch versions
+images, labels = next(dataiter)
+# Send only the first image of the batch as a nested list
+formData = {
+    'instances': images[0:1].tolist()
+}
+res = requests.post('http://localhost:8080/models/pytorchmodel:predict', json=formData)
+print(res)
+print(res.text)
+```
+
+# Predict on a KFService using PyTorch
+
+## Setup
+1. Your ~/.kube/config should point to a cluster with [KFServing installed](https://github.com/kubeflow/kfserving/blob/master/docs/DEVELOPER_GUIDE.md#deploy-kfserving).
+2. Your cluster's Istio Ingress gateway must be network accessible.
+3. Your cluster's Istio Egress gateway must [allow Google Cloud Storage](https://knative.dev/docs/serving/outbound-network-access/).
+
+## Create the KFService
+
+Apply the KFService custom resource:
+```
+kubectl apply -f pytorch.yaml
+```
+
+Expected Output
+```
+kfservice.serving.kubeflow.org/pytorch-cifar10 created
+```
+
+## Run a prediction
+
+```
+MODEL_NAME=pytorch-cifar10
+INPUT_PATH=@./input.json
+CLUSTER_IP=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
+
+SERVICE_HOSTNAME=$(kubectl get kfservice pytorch-cifar10 -o jsonpath='{.status.url}')
+
+curl -v -H "Host: ${SERVICE_HOSTNAME}" -d $INPUT_PATH http://$CLUSTER_IP/models/$MODEL_NAME:predict
+```
+
+You should see an output similar to the one below:
+
+```
+> POST /models/pytorch-cifar10:predict HTTP/1.1
+> Host: pytorch-cifar10.default.svc.cluster.local
+> User-Agent: curl/7.54.0
+> Accept: */*
+> Content-Length: 110681
+> Content-Type: application/x-www-form-urlencoded
+> Expect: 100-continue
+> 
+< HTTP/1.1 100 Continue
+* We are completely uploaded and fine
+< HTTP/1.1 200 OK
+< content-length: 221
+< content-type: application/json; charset=UTF-8
+< date: Fri, 21 Jun 2019 04:05:39 GMT
+< server: istio-envoy
+< x-envoy-upstream-service-time: 35292
+< 
+
+{"predictions": [[-0.8955065011978149, -1.4453213214874268, 0.1515328735113144, 2.638284683227539, -1.00240159034729, 2.270702600479126, 0.22645258903503418, -0.880557119846344, 0.08783778548240662, -1.5551214218139648]]}
+```
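
The `predictions` array in the response above holds ten numbers per input image, one per CIFAR-10 class. Assuming the server returns the network's raw scores unchanged (the `Net.forward` in `cifar10.py` ends at `fc3` with no softmax), a minimal sketch for turning a row into a class label could look like the following. The helper names `softmax` and `top_class` are illustrative and not part of KFServing; the class order is copied from `cifar10.py`.

```python
import math

# Class order copied from cifar10.py in this commit.
CLASSES = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')


def softmax(scores):
    """Convert raw scores into values that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_class(scores, classes=CLASSES):
    """Return (label, softmax value) for the highest-scoring class."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return classes[best], probs[best]


if __name__ == "__main__":
    # Response body taken from the sample output above.
    response = {"predictions": [[-0.8955065011978149, -1.4453213214874268,
                                 0.1515328735113144, 2.638284683227539,
                                 -1.00240159034729, 2.270702600479126,
                                 0.22645258903503418, -0.880557119846344,
                                 0.08783778548240662, -1.5551214218139648]]}
    for row in response["predictions"]:
        label, prob = top_class(row)
        print(label, round(prob, 3))
```

For the sample response the largest score sits at index 3, which maps to `cat` in the class tuple.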
diff --git a/docs/samples/pytorch/cifar10.py b/docs/samples/pytorch/cifar10.py
@@ -0,0 +1,79 @@
+import torch
+import torchvision
+import torchvision.transforms as transforms
+import torch.nn as nn
+import torch.nn.functional as F
+import torch.optim as optim
+
+
+class Net(nn.Module):
+    def __init__(self):
+        super(Net, self).__init__()
+        self.conv1 = nn.Conv2d(3, 6, 5)
+        self.pool = nn.MaxPool2d(2, 2)
+        self.conv2 = nn.Conv2d(6, 16, 5)
+        self.fc1 = nn.Linear(16 * 5 * 5, 120)
+        self.fc2 = nn.Linear(120, 84)
+        self.fc3 = nn.Linear(84, 10)
+
+    def forward(self, x):
+        x = self.pool(F.relu(self.conv1(x)))
+        x = self.pool(F.relu(self.conv2(x)))
+        x = x.view(-1, 16 * 5 * 5)
+        x = F.relu(self.fc1(x))
+        x = F.relu(self.fc2(x))
+        x = self.fc3(x)
+        return x
+
+
+if __name__ == "__main__":
+
+    transform = transforms.Compose(
+        [transforms.ToTensor(),
+         transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
+
+    trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
+                                            download=True, transform=transform)
+    trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
+                                              shuffle=True, num_workers=2)
+
+    testset = torchvision.datasets.CIFAR10(root='./data', train=False,
+                                           download=True, transform=transform)
+    testloader = torch.utils.data.DataLoader(testset, batch_size=4,
+                                             shuffle=False, num_workers=2)
+
+    classes = ('plane', 'car', 'bird', 'cat',
+               'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
+
+    net = Net()
+
+    criterion = nn.CrossEntropyLoss()
+    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
+
+    for epoch in range(2):  # loop over the dataset multiple times
+
+        running_loss = 0.0
+        for i, data in enumerate(trainloader, 0):
+            # get the inputs; data is a list of [inputs, labels]
+            inputs, labels = data
+
+            # zero the parameter gradients
+            optimizer.zero_grad()
+
+            # forward + backward + optimize
+            outputs = net(inputs)
+            loss = criterion(outputs, labels)
+            loss.backward()
+            optimizer.step()
+
+            # print statistics
+            running_loss += loss.item()
+            if i % 2000 == 1999:    # print every 2000 mini-batches
+                print('[%d, %5d] loss: %.3f' %
+                      (epoch + 1, i + 1, running_loss / 2000))
+                running_loss = 0.0
+
+    print('Finished Training')
+
+    # Save model
+    torch.save(net.state_dict(), "model.pt")
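
The README in this commit says the server expects the model class file to sit next to the saved model and takes an optional class name that defaults to `PyTorchModel`. The sketch below shows one way such a lookup could be done with the standard library; it is an illustration under that description, not the actual `pytorchserver` implementation, and the function name `load_model_class` and its parameters are hypothetical.

```python
import importlib.util
import os

# Default class name as stated in the README of this commit.
DEFAULT_CLASS_NAME = "PyTorchModel"


def load_model_class(model_dir, class_file, class_name=None):
    """Import `class_file` from `model_dir` and return the requested class.

    Hypothetical helper: falls back to DEFAULT_CLASS_NAME when no
    class name is given.
    """
    path = os.path.join(model_dir, class_file)
    module_name = os.path.splitext(class_file)[0]
    # Load the module directly from its file path.
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    # Raises AttributeError if the class is not defined in the file.
    return getattr(module, class_name or DEFAULT_CLASS_NAME)
```

With the sample above, `load_model_class(".", "cifar10.py", "Net")` would return the `Net` class without retraining, because the training loop is guarded by `if __name__ == "__main__":`. The weights could then be restored with `model = Net()` followed by `model.load_state_dict(torch.load("model.pt"))`.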