Some questions about tf-serving on NFS #844

Closed
@gxfun

Description

Hi,
We cannot access the Internet, and Ambassador isn't working. Will these affect the use of tf-serving?
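Regarding the first point: without Internet access the worker nodes cannot pull images from gcr.io (this matches the ImagePullBackOff events further down). We assume we could work around this by pre-loading the serving image onto every node from some other machine that can reach the registry, roughly:

# on a machine that can reach gcr.io
docker pull gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec
docker save gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec -o tf-model-server-cpu.tar

# copy the tarball to iecas-30-6/7/8, then on each node
docker load -i tf-model-server-cpu.tar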

We used kubeadm 1.9.1 to set up Kubernetes.

Kubernetes
  master    iecas-30-6
  slaves    iecas-30-7, iecas-30-8

NFS
  server    iecas-30-7
  clients   iecas-30-6, iecas-30-8

Here is the information for the inception-nfs deployment and the services in the kubeflow namespace.

kubectl get deployment inception-nfs -n kubeflow

NAME            DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
inception-nfs   1         1         1            1           31m

kubectl get services -n kubeflow

NAME               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
ambassador         ClusterIP   10.99.217.131   <none>        80/TCP              32m
ambassador-admin   ClusterIP   10.103.24.16    <none>        8877/TCP            32m
inception-nfs      ClusterIP   10.105.9.96     <none>        9000/TCP,8000/TCP   32m
k8s-dashboard      ClusterIP   10.111.23.158   <none>        443/TCP             32m
tf-hub-0           ClusterIP   None            <none>        8000/TCP            32m
tf-hub-lb          ClusterIP   10.98.150.141   <none>        80/TCP              32m
tf-job-dashboard   ClusterIP   10.110.154.14   <none>        80/TCP              32m

We can see that the EXTERNAL-IP is <none> for every service.
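As far as we understand, a ClusterIP service never gets an EXTERNAL-IP, so this alone may not be an error. To test the model server from outside the cluster we assume the pod could be port-forwarded, for example:

kubectl port-forward inception-nfs-657769bbd5-w4cv2 -n kubeflow 9000:9000

(The pod name is taken from the logs command below; 9000 is the gRPC serving port from the container args.)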

kubectl logs inception-nfs-657769bbd5-w4cv2 -n kubeflow

2018-05-21 18:42:33.129402: E tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:370] FileSystemStoragePathSource encountered a file-system access error: Could not find base path /mnt/var/nfs/general/inception for servable inception-nfs    

The error is "Could not find base path /mnt/var/nfs/general/inception", but the model does exist at /var/nfs/general/inception on the NFS server (iecas-30-7):

iecas@iecas-30-7: ll /var/nfs/general/

total 16
drwxr-xr-x 4 nobody nogroup 4096 5月  22 01:17 ./
drwxr-xr-x 3 root   root    4096 3月   4  2016 ../
-rw-r--r-- 1 nobody nogroup    0 3月   4  2016 general.test
drwxr-xr-x 3 root   root    4096 5月  22 01:17 inception/
drwxr-xr-x 2 root   root    4096 3月   4  2016 pip/
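To rule out an NFS problem, we can check that the export is visible from the node that runs the pod (iecas-30-8) and can be mounted at the path the model server expects, roughly (assuming nfs-common is installed on the node):

showmount -e iecas-30-7
sudo mkdir -p /mnt/var/nfs/general
sudo mount -t nfs iecas-30-7:/var/nfs/general /mnt/var/nfs/general
ls /mnt/var/nfs/general/inception

Even if the node can mount the share, the container itself still needs its own volume mount (see the note after the pod description below).
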
kubectl describe pod  inception-nfs-657769bbd5-w4cv2 -n kubeflow

Name:           inception-nfs-657769bbd5-w4cv2
Namespace:      kubeflow
Node:           iecas-30-8/192.168.30.8
Start Time:     Tue, 22 May 2018 02:03:44 +0800
Labels:         app=inception-nfs
                pod-template-hash=2133256681
Annotations:    <none>
Status:         Running
IP:             10.244.1.14
Controlled By:  ReplicaSet/inception-nfs-657769bbd5
Containers:
  inception-nfs:
    Container ID:  docker://5acfa1a67310929575ab65e89ca482106d088c9cf3ecee4e64710b26d538c930
    Image:         gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec
    Image ID:      docker://sha256:aeb4fbd2c5a15d0714054153556e6e445a1bbb8fcbac7b289467bb328025d9db
    Port:          9000/TCP
    Args:
      /usr/bin/tensorflow_model_server
      --port=9000
      --model_name=inception-nfs
      --model_base_path=/mnt/var/nfs/general/inception
    State:          Running
      Started:      Tue, 22 May 2018 02:07:11 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     4
      memory:  4Gi
    Requests:
      cpu:        1
      memory:     1Gi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-kw2s8 (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          True 
  PodScheduled   True 
Volumes:
  default-token-kw2s8:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-kw2s8
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                 Age                From                 Message
  ----     ------                 ----               ----                 -------
  Normal   SuccessfulMountVolume  44m                kubelet, iecas-30-8  MountVolume.SetUp succeeded for volume "default-token-kw2s8"
  Warning  Failed                 44m                kubelet, iecas-30-8  Failed to pull image "gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec": rpc error: code = Unknown desc = Error response from daemon: Get https://gcr.io/v2/: dial tcp: lookup gcr.io on [::1]:53: read udp [::1]:51709->[::1]:53: read: connection refused
  Warning  Failed                 43m                kubelet, iecas-30-8  Failed to pull image "gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec": rpc error: code = Unknown desc = Error response from daemon: Get https://gcr.io/v2/: dial tcp: lookup gcr.io on [::1]:53: read udp [::1]:53334->[::1]:53: read: connection refused
  Warning  Failed                 43m                kubelet, iecas-30-8  Failed to pull image "gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec": rpc error: code = Unknown desc = Error response from daemon: Get https://gcr.io/v2/: dial tcp: lookup gcr.io on [::1]:53: read udp [::1]:34904->[::1]:53: read: connection refused
  Warning  Failed                 42m (x4 over 44m)  kubelet, iecas-30-8  Error: ErrImagePull
  Normal   Pulling                42m (x4 over 44m)  kubelet, iecas-30-8  pulling image "gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec"
  Warning  Failed                 42m                kubelet, iecas-30-8  Failed to pull image "gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec": rpc error: code = Unknown desc = Error response from daemon: Get https://gcr.io/v2/: dial tcp: lookup gcr.io on [::1]:53: read udp [::1]:60235->[::1]:53: read: connection refused
  Normal   BackOff                42m (x6 over 44m)  kubelet, iecas-30-8  Back-off pulling image "gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec"
  Warning  Failed                 42m (x6 over 44m)  kubelet, iecas-30-8  Error: ImagePullBackOff
  Normal   Scheduled              41m                default-scheduler    Successfully assigned inception-nfs-657769bbd5-w4cv2 to iecas-30-8
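Looking at the Mounts section above, the container only mounts the default service-account token, so /mnt/var/nfs/general does not exist inside the container at all; --model_base_path points at a path that only exists on the NFS clients, not in the pod. A rough sketch of the pod template spec the deployment presumably needs (an nfs volume plus a matching volumeMount; server and paths taken from this setup):

spec:
  containers:
  - name: inception-nfs
    # image, args, ports and resources unchanged
    volumeMounts:
    - name: nfs-model
      mountPath: /mnt/var/nfs/general
      readOnly: true
  volumes:
  - name: nfs-model
    nfs:
      server: iecas-30-7
      path: /var/nfs/general

With a mount like this, /mnt/var/nfs/general/inception should resolve inside the container and the FileSystemStoragePathSource error should go away.
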
iecas@iecas-30-6: kubectl edit service tf-job-dashboard -n kubeflow

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
kind: Service
metadata:
  annotations:
    getambassador.io/config: |-
      ---
      apiVersion: ambassador/v0
      kind:  Mapping
      name: tfjobs-ui-mapping
      prefix: /tfjobs/
      rewrite: /tfjobs/
      service: tf-job-dashboard.kubeflow
  creationTimestamp: 2018-05-21T18:05:41Z
  name: tf-job-dashboard
  namespace: kubeflow
  resourceVersion: "1750"
  selfLink: /api/v1/namespaces/kubeflow/services/tf-job-dashboard
  uid: 9320f5b3-5d21-11e8-9f7b-a0423f2e7641
spec:
  clusterIP: 10.110.154.14
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    name: tf-job-dashboard
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
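The getambassador.io/config annotation only takes effect once Ambassador itself is healthy, and the ambassador pods are crash-looping (see the pod list below), so the /tfjobs/ route won't work yet. As a sanity check, we assume the dashboard pod can be reached directly on its targetPort, for example:

kubectl port-forward tf-job-dashboard-7d48f6456c-hd6n8 -n kubeflow 8080:8080
# then open http://localhost:8080/tfjobs/ (path per the rewrite rule above)
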
iecas@iecas-30-6:~/Documents/kubeflow/code/my-kubeflow$ kubectl get nodes

NAME         STATUS    ROLES     AGE       VERSION
iecas-30-6   Ready     master    54m       v1.9.1
iecas-30-7   Ready     <none>    49m       v1.9.1
iecas-30-8   Ready     <none>    51m       v1.9.1

iecas@iecas-30-6:~/Documents/kubeflow/code/my-kubeflow$ kubectl get pods --all-namespaces

NAMESPACE     NAME                                   READY     STATUS             RESTARTS   AGE
kube-system   etcd-iecas-30-6                        1/1       Running            0          53m
kube-system   kube-apiserver-iecas-30-6              1/1       Running            0          53m
kube-system   kube-controller-manager-iecas-30-6     1/1       Running            0          53m
kube-system   kube-dns-6f4fd4bdf-lbg2w               3/3       Running            0          54m
kube-system   kube-flannel-ds-8nzkh                  1/1       Running            0          52m
kube-system   kube-flannel-ds-f4q5h                  1/1       Running            0          51m
kube-system   kube-flannel-ds-hg449                  1/1       Running            0          50m
kube-system   kube-proxy-dfgtr                       1/1       Running            0          51m
kube-system   kube-proxy-nfqtb                       1/1       Running            0          50m
kube-system   kube-proxy-xdx2t                       1/1       Running            0          54m
kube-system   kube-scheduler-iecas-30-6              1/1       Running            0          53m
kube-system   nvidia-device-plugin-daemonset-h87m9   1/1       Running            0          50m
kube-system   nvidia-device-plugin-daemonset-mpvzg   1/1       Running            0          50m
kubeflow      ambassador-64dcb6694f-qnvvk            1/2       CrashLoopBackOff   11         38m
kubeflow      ambassador-6dffffbc5c-9vb59            1/2       CrashLoopBackOff   11         37m
kubeflow      ambassador-6dffffbc5c-qh2qj            1/2       CrashLoopBackOff   5          6m
kubeflow      ambassador-6dffffbc5c-w2gk9            1/2       CrashLoopBackOff   11         37m
kubeflow      inception-nfs-657769bbd5-w4cv2         1/1       Running            0          37m
kubeflow      spartakus-volunteer-66564f9679-s4gjn   1/1       Running            0          37m
kubeflow      tf-hub-0                               1/1       Running            0          37m
kubeflow      tf-job-dashboard-7d48f6456c-hd6n8      1/1       Running            0          38m
kubeflow      tf-job-operator-68cd79c8b5-rpxlp       1/1       Running            0          38m

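The ambassador pods are the only ones failing (1/2 ready, CrashLoopBackOff). To find out why, we assume the pod events and the crashed container's logs can be inspected, for example (container name assumed to be ambassador; --previous shows the last crashed instance):

kubectl describe pod ambassador-6dffffbc5c-9vb59 -n kubeflow
kubectl logs ambassador-6dffffbc5c-9vb59 -n kubeflow -c ambassador --previous
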
Thanks!
