Hi,
I was following your tutorial at https://www.youtube.com/watch?v=6wWdNg0GMV4
I have Kubeflow set up on an EKS cluster running Kubernetes 1.23 (ebs-csi driver set up as instructed).
Kubeflow itself seems to be working: I tried the demo XGBoost pipeline and it completed successfully.
I created the notebook with access to Kubeflow Pipelines allowed, and applied access_kfp_from_jupyter_notebook.yaml and set-minio-kserve-secret.yaml.
I am also able to access MinIO and see some of the generated artifacts.
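For reference, this is roughly how I check MinIO access from the notebook (a minimal sketch; the minio-service endpoint and the minio/minio123 credentials are the tutorial defaults, so treat them as assumptions):

```python
# Minimal sketch of verifying MinIO access from the notebook pod.
# Endpoint and credentials are the tutorial defaults (assumptions).
from minio import Minio

minio_client = Minio(
    "minio-service.kubeflow:9000",  # in-cluster MinIO service (assumed default)
    access_key="minio",             # default credentials from the manifests (assumption)
    secret_key="minio123",
    secure=False,
)

# Listing buckets/objects works fine from the notebook.
print([b.name for b in minio_client.list_buckets()])
for obj in minio_client.list_objects("mlpipeline"):
    print(obj.object_name)
```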
When I run digits_recognizer_pipeline.ipynb, the "get latest data" step finishes quickly, but the "get data batch" step gets stuck and times out.
Here's the log:
time="2023-02-21T23:47:25.234Z" level=info msg="capturing logs" argo=true
getting data
2023-02-21 23:47:25.569490: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-02-21 23:47:25.569521: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/urllib3/connection.py", line 169, in _new_conn
conn = connection.create_connection(
File "/opt/conda/lib/python3.8/site-packages/urllib3/util/connection.py", line 96, in create_connection
raise err
File "/opt/conda/lib/python3.8/site-packages/urllib3/util/connection.py", line 86, in create_connection
sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py", line 699, in urlopen
httplib_response = self._make_request(
File "/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py", line 394, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/opt/conda/lib/python3.8/site-packages/urllib3/connection.py", line 234, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/opt/conda/lib/python3.8/http/client.py", line 1252, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/opt/conda/lib/python3.8/http/client.py", line 1298, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/opt/conda/lib/python3.8/http/client.py", line 1247, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/opt/conda/lib/python3.8/http/client.py", line 1007, in _send_output
self.send(msg)
File "/opt/conda/lib/python3.8/http/client.py", line 947, in send
self.connect()
File "/opt/conda/lib/python3.8/site-packages/urllib3/connection.py", line 200, in connect
conn = self._new_conn()
File "/opt/conda/lib/python3.8/site-packages/urllib3/connection.py", line 181, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fa9c03eff10>: Failed to establish a new connection: [Errno 110] Connection timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tmp/tmp.oceiY470Q3", line 76, in <module>
_outputs = get_data_batch(**_parsed_args)
File "/tmp/tmp.oceiY470Q3", line 19, in get_data_batch
minio_client.fget_object(minio_bucket,"mnist.npz","/tmp/mnist.npz")
File "/opt/conda/lib/python3.8/site-packages/minio/api.py", line 787, in fget_object
stat = self.stat_object(
File "/opt/conda/lib/python3.8/site-packages/minio/api.py", line 1195, in stat_object
response = self._url_open(
File "/opt/conda/lib/python3.8/site-packages/minio/api.py", line 2189, in _url_open
region = self._get_bucket_region(bucket_name)
File "/opt/conda/lib/python3.8/site-packages/minio/api.py", line 2067, in _get_bucket_region
region = self._get_bucket_location(bucket_name)
File "/opt/conda/lib/python3.8/site-packages/minio/api.py", line 2100, in _get_bucket_location
response = self._http.urlopen(method, url,
File "/opt/conda/lib/python3.8/site-packages/urllib3/poolmanager.py", line 375, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py", line 783, in urlopen
return self.urlopen(
File "/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py", line 783, in urlopen
return self.urlopen(
File "/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py", line 783, in urlopen
return self.urlopen(
[Previous line repeated 2 more times]
File "/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py", line 755, in urlopen
retries = retries.increment(
File "/opt/conda/lib/python3.8/site-packages/urllib3/util/retry.py", line 574, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='100.65.11.110', port=9000): Max retries exceeded with url: /mlpipeline?location= (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fa9c03eff10>: Failed to establish a new connection: [Errno 110] Connection timed out'))
time="2023-02-22T00:00:37.229Z" level=error msg="cannot save artifact /tmp/outputs/datapoints_test/data" argo=true error="stat /tmp/outputs/datapoints_test/data: no such file or directory"
time="2023-02-22T00:00:37.229Z" level=error msg="cannot save artifact /tmp/outputs/datapoints_training/data" argo=true error="stat /tmp/outputs/datapoints_training/data: no such file or directory"
time="2023-02-22T00:00:37.229Z" level=error msg="cannot save artifact /tmp/outputs/dataset_version/data" argo=true error="stat /tmp/outputs/dataset_version/data: no such file or directory"
Error: exit status 1
What might be wrong? The Kubeflow version is 1.6.1.
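For context, the failing call in the traceback corresponds to roughly this part of the get_data_batch component (a sketch reconstructed from the traceback; the host and bucket come straight from the error message, while the credentials are the tutorial defaults and therefore assumptions on my part):

```python
# Rough reconstruction of the failing part of get_data_batch.
# 100.65.11.110:9000 and the "mlpipeline" bucket are taken from the error above;
# the credentials are the default ones from the manifests (assumption).
from minio import Minio

def get_data_batch(minio_address: str = "100.65.11.110:9000",
                   minio_bucket: str = "mlpipeline"):
    minio_client = Minio(
        minio_address,
        access_key="minio",
        secret_key="minio123",
        secure=False,
    )
    # This is the call that times out (line 19 in the traceback):
    minio_client.fget_object(minio_bucket, "mnist.npz", "/tmp/mnist.npz")
```

So it looks like the pipeline pod cannot reach that MinIO address, even though the notebook can access MinIO.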