The KFP preloaded XGboost sample is broken and out-dated. #5089

chensun · 2021-02-04T00:36:45Z

TL;DR: The preloaded XGBoost sample is currently broken.
I propose we remove this sample from the KFP preloads and from the sample tests until we get a chance to refresh it.

The direct cause was that it used the Dataproc 1.2 image which is based on Python 2.7, and pip 21.0 dropped support for Python 2.7.
The symptom is that dataproc_create_cluster fails on initialization.

The specific error is mentioned here.
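
As a rough illustration of the root cause described above, the following minimal sketch checks whether a given pip release can still run on a Python 2 interpreter. The function names are hypothetical, and the check only models the single cutoff mentioned in this issue (pip 21.0 dropping Python 2.7); it is not a general compatibility table.

```python
def pip_supports_python2(pip_version: str) -> bool:
    """Return True if this pip release can still run on Python 2.7.

    Assumption: only the cutoff from this issue is modeled, namely that
    pip 21.0 dropped Python 2.7 support.
    """
    major = int(pip_version.split(".")[0])
    return major < 21


def check_image(python_version: str, pip_version: str) -> str:
    """Say whether installing pip_version on this interpreter hits the cutoff."""
    if python_version.startswith("2.") and not pip_supports_python2(pip_version):
        return "incompatible: pip %s cannot run on Python %s" % (
            pip_version,
            python_version,
        )
    # Python 3 cutoffs are deliberately not modeled in this sketch.
    return "ok"


if __name__ == "__main__":
    # The Dataproc 1.2 image is based on Python 2.7.
    print(check_image("2.7", "21.0"))
    print(check_image("2.7", "20.3.4"))
```

A check like this could run before any step that upgrades pip on an older image, so that the failure surfaces as a clear message rather than as a cluster initialization error.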

#5062 attempted a fix by upgrading to the Dataproc 1.5 image. That fixed the Dataproc cluster creation issue, but we then hit an error later, at the Trainer step.

We were advised that newer versions of Dataproc images likely don't have XGBoost library preinstalled, as there's now an initialization action that goes through extra steps to install XGBoost libraries.

Following that route, I tried installing the default XGBoost version using the rapids script, then hit the error as follows:

21/02/03 18:34:20 INFO org.spark_project.jetty.util.log: Logging initialized @3037ms
21/02/03 18:34:20 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
21/02/03 18:34:20 INFO org.spark_project.jetty.server.Server: Started @3169ms
21/02/03 18:34:20 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@4159e81b{HTTP/1.1,[http/1.1]}{0.0.0.0:37489}
21/02/03 18:34:20 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at xgb-bdd8f29b-fb13-4ec2-abcf-38b3699e7ca3-m/10.128.0.101:8032
21/02/03 18:34:21 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at xgb-bdd8f29b-fb13-4ec2-abcf-38b3699e7ca3-m/10.128.0.101:10200
21/02/03 18:34:21 INFO org.apache.hadoop.conf.Configuration: resource-types.xml not found
21/02/03 18:34:21 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Unable to find 'resource-types.xml'.
21/02/03 18:34:21 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
21/02/03 18:34:21 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
21/02/03 18:34:23 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1612377093662_0003
21/02/03 18:34:30 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat: Total input files to process : 1
21/02/03 18:34:30 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat: Total input files to process : 1
21/02/03 18:34:30 INFO org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 1, size left: 0
21/02/03 18:34:36 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat: Total input files to process : 1
21/02/03 18:34:36 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat: Total input files to process : 1
21/02/03 18:34:36 INFO org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 1, size left: 0
Exception in thread "main" java.lang.NoSuchMethodError: ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainWithDataFrame$default$5()Lml/dmlc/xgboost4j/scala/ObjectiveTrait;
	at ml.dmlc.xgboost4j.scala.example.spark.XGBoostTrainer$.main(XGBoostTrainer.scala:120)
	at ml.dmlc.xgboost4j.scala.example.spark.XGBoostTrainer.main(XGBoostTrainer.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/02/03 18:34:39 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@4159e81b{HTTP/1.1,[http/1.1]}{0.0.0.0:0}
Job output is complete

I then realized that the sample is based on the code from the deprecated component path, which was deleted by #5045.

Specifically, the method that could not be found in the above error was used here:

pipelines/components/deprecated/dataproc/train/src/XGBoostTrainer.scala

Line 121 in 32ce8d8

val xgboostModel = XGBoost.trainWithDataFrame(

trainWithDataFrame exists only in XGBoost 0.72; I could not find it in any later version.
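
For readers unfamiliar with the shape of that error, here is a minimal sketch that decodes the NoSuchMethodError line from the job output. It relies on two Scala compilation conventions: a trailing `$` on the class name marks a singleton object, and `name$default$N` is the synthetic accessor for the default value of the N-th parameter of `name`. The function name and the returned keys are hypothetical.

```python
import re

# The failing line, copied from the job output in this issue.
ERROR_LINE = (
    'Exception in thread "main" java.lang.NoSuchMethodError: '
    "ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainWithDataFrame$default$5()"
    "Lml/dmlc/xgboost4j/scala/ObjectiveTrait;"
)

# owner class, method name, default-argument index, JVM return type descriptor
PATTERN = re.compile(
    r"java\.lang\.NoSuchMethodError: "
    r"(?P<owner>[\w.$]+)\.(?P<method>\w+)\$default\$(?P<index>\d+)"
    r"\(\)L(?P<ret>[\w/$]+);"
)


def parse_missing_default(line):
    """Decode a NoSuchMethodError for a Scala default-argument accessor.

    Returns None when the line does not have that shape.
    """
    match = PATTERN.search(line)
    if match is None:
        return None
    owner = match.group("owner")
    return {
        "owner": owner.rstrip("$"),
        "is_scala_object": owner.endswith("$"),
        "method": match.group("method"),
        "default_arg_index": int(match.group("index")),
        "return_type": match.group("ret").replace("/", "."),
    }


if __name__ == "__main__":
    print(parse_missing_default(ERROR_LINE))
```

Read this way, the error says that the trainer was compiled against an `XGBoost` object whose `trainWithDataFrame` method has a fifth parameter with a default value of type `ObjectiveTrait`, and that the library found on the cluster at run time does not provide it.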

XGBoost 0.72 is too old and not even available from https://repo1.maven.org/maven2/com/nvidia/, which is used by rapids to download XGBoost.

At this point, I feel we would be better off rewriting the XGBoost sample against the latest XGBoost library than patching the existing one, if we do think it's worth demoing an XGBoost-on-Dataproc pipeline.
Until we have the sample working, I propose we remove it from the KFP preloaded pipelines and sample-tests.


Bobgy · 2021-02-04T02:11:10Z

Until we have the sample working, I propose we remove it from the KFP preloaded pipelines and sample-tests.

Completely agree with this!
The sample is a nice-to-have, but it's more important to get KFP releases healthy and running.

chensun · 2021-02-05T19:36:40Z

While the issue is mitigated by temporarily removing the sample from preloads and tests, we may still want to rewrite the XGBoost Spark-Dataproc sample using the latest XGBoost library. Keeping this issue open for tracking.

* Revert "fix(samples): Remove broken xgboost sample (#5091)". This reverts commit 1dcda80.
* fix(backend): Replaced the XGBoost sample
* Fixed the backend image build
* Updated the frontend tests


chensun added the area/samples label Feb 4, 2021

chensun mentioned this issue Feb 4, 2021

Kubeflow-pipeline-postsubmit-integration-test failure #5007

Closed

Ark-kun mentioned this issue Feb 4, 2021

fix(backend): Replaced the XGBoost sample. Fixes #5089 #5090

Closed

Bobgy mentioned this issue Feb 4, 2021

fix(samples): Remove broken xgboost sample #5091

Merged

2 tasks

numerology mentioned this issue Feb 4, 2021

test: Temporarily disable XGBoost tests #5093

Merged

2 tasks

Bobgy closed this as completed in #5093 Feb 4, 2021

Ark-kun mentioned this issue Feb 4, 2021

fix(backend): Replaced the XGBoost sample. Fixes #5089 #5100

Merged

chensun reopened this Feb 5, 2021

google-oss-robot closed this as completed in #5100 Feb 12, 2021
