The KFP preloaded XGboost sample is broken and out-dated. #5089

Closed
chensun opened this issue Feb 4, 2021 · 2 comments · Fixed by #5093 or #5100
Comments

@chensun
Member

chensun commented Feb 4, 2021

TL;DR: The preloaded XGBoost sample is currently broken.
I propose we remove this sample from the KFP preloaded pipelines and from the sample tests until we get a chance to refresh it.


The direct cause was that it used the Dataproc 1.2 image which is based on Python 2.7, and pip 21.0 dropped support for Python 2.7.
The symptom is that dataproc_create_cluster fails during cluster initialization.
[image: screenshot of the failed dataproc_create_cluster step]
The specific error is mentioned here.

#5062 attempted a fix by upgrading to the Dataproc 1.5 image. That fixed the Dataproc cluster creation issue, but we then hit an error later, at the Trainer step.

We were advised that newer versions of Dataproc images likely don't have XGBoost library preinstalled, as there's now an initialization action that goes through extra steps to install XGBoost libraries.

Following that route, I tried installing the default XGBoost version using the rapids script, and then hit the following error:

21/02/03 18:34:20 INFO org.spark_project.jetty.util.log: Logging initialized @3037ms
21/02/03 18:34:20 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
21/02/03 18:34:20 INFO org.spark_project.jetty.server.Server: Started @3169ms
21/02/03 18:34:20 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@4159e81b{HTTP/1.1,[http/1.1]}{0.0.0.0:37489}
21/02/03 18:34:20 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at xgb-bdd8f29b-fb13-4ec2-abcf-38b3699e7ca3-m/10.128.0.101:8032
21/02/03 18:34:21 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at xgb-bdd8f29b-fb13-4ec2-abcf-38b3699e7ca3-m/10.128.0.101:10200
21/02/03 18:34:21 INFO org.apache.hadoop.conf.Configuration: resource-types.xml not found
21/02/03 18:34:21 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Unable to find 'resource-types.xml'.
21/02/03 18:34:21 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
21/02/03 18:34:21 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
21/02/03 18:34:23 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1612377093662_0003
21/02/03 18:34:30 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat: Total input files to process : 1
21/02/03 18:34:30 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat: Total input files to process : 1
21/02/03 18:34:30 INFO org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 1, size left: 0
21/02/03 18:34:36 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat: Total input files to process : 1
21/02/03 18:34:36 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat: Total input files to process : 1
21/02/03 18:34:36 INFO org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 1, size left: 0
Exception in thread "main" java.lang.NoSuchMethodError: ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainWithDataFrame$default$5()Lml/dmlc/xgboost4j/scala/ObjectiveTrait;
	at ml.dmlc.xgboost4j.scala.example.spark.XGBoostTrainer$.main(XGBoostTrainer.scala:120)
	at ml.dmlc.xgboost4j.scala.example.spark.XGBoostTrainer.main(XGBoostTrainer.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/02/03 18:34:39 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@4159e81b{HTTP/1.1,[http/1.1]}{0.0.0.0:0}
Job output is complete

I then realized that the sample is based on the code from the deprecated component path, which was deleted by #5045.

Specifically, the method reported as missing in the error above is called here:

val xgboostModel = XGBoost.trainWithDataFrame(

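trainWithDataFrame exists only in XGBoost 0.72; I could not find it in any later version. The Trainer jar was compiled against 0.72, so running it against a newer XGBoost library fails with the NoSuchMethodError above.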

XGBoost 0.72 is too old and is not even available from https://repo1.maven.org/maven2/com/nvidia/, which the rapids script uses to download XGBoost.
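
For comparison, here is a rough sketch of what the training call might look like against a newer XGBoost4J-Spark release, where (to my knowledge) the estimator-style XGBoostClassifier replaced XGBoost.trainWithDataFrame. This is not the code from the sample or from the eventual fix. The object name, the feature column names f0 and f1, the label column, the parameter values, and the command-line arguments are all assumptions made for illustration, and it needs Spark and xgboost4j-spark on the classpath.

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical trainer; not the XGBoostTrainer referenced in the stack trace.
object XGBoostTrainerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("xgboost-trainer-sketch").getOrCreate()

    // Assumed input: a CSV with a header, numeric columns f0 and f1,
    // and a 0/1 column named "label". args(0) is the input path.
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(args(0))
      .withColumn("label", col("label").cast("double"))

    // Spark ML estimators expect the features packed into one vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("f0", "f1"))
      .setOutputCol("features")
    val trainingData = assembler.transform(raw)

    // Illustrative parameter values only.
    val params = Map[String, Any](
      "objective" -> "binary:logistic",
      "num_round" -> 10,
      "num_workers" -> 2
    )

    // Estimator-style API: configure the classifier, then call fit().
    val classifier = new XGBoostClassifier(params)
      .setFeaturesCol("features")
      .setLabelCol("label")
    val model = classifier.fit(trainingData)

    // args(1) is the assumed output path for the fitted model.
    model.write.overwrite().save(args(1))
    spark.stop()
  }
}
```

The point of the sketch is only that the entry point changed shape: a rewritten sample would need to be recompiled against the newer API rather than reusing the old Trainer jar.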

At this point, if we think an XGBoost-on-Dataproc pipeline is worth demoing, I feel we should invest in rewriting the XGBoost sample with the latest XGBoost library rather than patching the existing one.
Until we have the sample working, I propose we remove it from the KFP preloaded pipelines and sample-tests.

@Bobgy
Contributor

Bobgy commented Feb 4, 2021

Until we have the sample working, I propose we remove it from the KFP preloaded pipelines and sample-tests.

Completely agree with this!
The sample is a nice-to-have, but it's more important to get KFP releases healthy and running.

@chensun
Member Author

chensun commented Feb 5, 2021

While the issue is mitigated by temporarily removing the sample from preloads and tests, we may still want to rewrite the XGBoost Spark-Dataproc sample using the latest XGBoost library. Keeping this issue open for tracking.

@chensun chensun reopened this Feb 5, 2021
google-oss-robot pushed a commit that referenced this issue Feb 12, 2021
* Revert "fix(samples): Remove broken xgboost sample (#5091)"

This reverts commit 1dcda80.

* fix(backend): Replaced the XGBoost sample

* Fixed the backend image build

* Updated the frontend tests
chensun pushed a commit to chensun/pipelines that referenced this issue Feb 12, 2021
…low#5100)

* Revert "fix(samples): Remove broken xgboost sample (kubeflow#5091)"

This reverts commit 1dcda80.

* fix(backend): Replaced the XGBoost sample

* Fixed the backend image build

* Updated the frontend tests
chensun pushed a commit that referenced this issue Feb 12, 2021
* Revert "fix(samples): Remove broken xgboost sample (#5091)"

This reverts commit 1dcda80.

* fix(backend): Replaced the XGBoost sample

* Fixed the backend image build

* Updated the frontend tests