Unable to train pytorch-based model using k8s mode #3408

Open
gganduu opened this issue Nov 5, 2021 · 2 comments

gganduu commented Nov 5, 2021

My az k8s mode configuration is:

model = model_creator(None)
compute_loss = loss_creator(None)
optimizer = optim_creator(model, None)
train_loader = train_loader_creator(None, batch_size)
val_loader = val_data_creator(None, batch_size)

init_orca_context(
    cluster_mode="k8s",
    master="k8s://https://172.16.212.214:6443",
    container_image="ielym/test:az-k8s-v2",
    num_nodes=2,
    memory="30g",
    cores=8,
    conf={
        "spark.driver.host": "172.16.212.214",
        "spark.driver.port": "54323",
        "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName": "nfsvolumeclaim",
        "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/zoo/",
        "spark.kubernetes.executor.label.aztest": "1"
    }
)

est = Estimator.from_torch(model=model_creator, optimizer=optim_creator, loss=loss_creator, backend="torch_distributed")
est.fit(data=train_loader_creator, epochs=epochs, batch_size=batch_size, validation_data=val_data_creator)

When I tried to run this code, the following error occurred:

javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1946)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:316)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:310)
at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1639)
at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:223)
at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1037)
at sun.security.ssl.Handshaker.process_record(Handshaker.java:965)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1064)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)
at okhttp3.internal.connection.RealConnection.connectTls(RealConnection.java:319)
at okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:283)
at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:168)
at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:257)
at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:135)
at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:114)
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:126)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:112)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:254)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:200)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

With the same PyTorch YOLOv5 code, however, I can train successfully in az local mode:

model = model_creator(None)
compute_loss = loss_creator(None)
optimizer = optim_creator(model, None)
train_loader = train_loader_creator(None, batch_size)
val_loader = val_data_creator(None, batch_size)

init_orca_context(cluster_mode="local", cores=8, num_nodes=1, memory='30g', init_ray_on_spark=False, object_store_memory='30g')
est = Estimator.from_torch(model=model_creator, optimizer=optim_creator, loss=loss_creator, backend="torch_distributed")

est.fit(data=train_loader_creator, epochs=epochs, batch_size=batch_size)

The k8s cluster has two nodes: one controller node and one worker node. In the same k8s environment, I can train a TF-based YOLOv3 model without error.

glorysdj (Contributor) commented Nov 5, 2021

Please check whether the kubeconfig is mounted/set, and check the Spark version of the driver and of the executor image.
The error may be caused by a kubeconfig that was not loaded correctly, or by a wrong version of okhttp/kubernetes-client.
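
As a rough way to act on the first suggestion, the sketch below lists where a kubeconfig would conventionally be looked up (the `KUBECONFIG` environment variable, falling back to `~/.kube/config`) and reports whether each file is present and readable. The helper names `kubeconfig_candidates` and `report_kubeconfig` are hypothetical and not part of Analytics Zoo or Spark. This only checks that a file exists, not that its certificate data is valid, so treat it as a first diagnostic step rather than a fix.

```python
import os
from pathlib import Path


def kubeconfig_candidates(env=None, home=None):
    """Return the kubeconfig paths a Kubernetes client would conventionally consult.

    Hypothetical helper: `env` and `home` can be overridden for testing.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    raw = env.get("KUBECONFIG", "")
    # KUBECONFIG may list several files separated by the OS path separator.
    paths = [Path(p) for p in raw.split(os.pathsep) if p]
    # Fall back to the conventional default location when the variable is unset.
    return paths or [home / ".kube" / "config"]


def report_kubeconfig(env=None, home=None):
    """Map each candidate path to True if it is an existing, readable file."""
    return {
        str(p): p.is_file() and os.access(p, os.R_OK)
        for p in kubeconfig_candidates(env, home)
    }


if __name__ == "__main__":
    for path, ok in report_kubeconfig().items():
        print(f"{path}: {'found' if ok else 'missing or unreadable'}")
```

Running this on the machine (or inside the container) that calls `init_orca_context` shows whether a kubeconfig is visible there at all. If it is reported missing, mounting or setting it would be the first thing to try before looking at the okhttp/kubernetes-client versions.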

dding3 pushed a commit to dding3/BigDL that referenced this issue Nov 17, 2021
glorysdj (Contributor) commented

@gganduu have you fixed this issue?
