Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mleap and python 3.8 #842

Closed
drei34 opened this issue Jan 26, 2023 · 27 comments
Closed

Mleap and python 3.8 #842

drei34 opened this issue Jan 26, 2023 · 27 comments

Comments

@drei34
Copy link

drei34 commented Jan 26, 2023

Hi,

I have more of a question than a specific issue.

I was trying to use python 3.8 but my question is does mleap support this? What version would run? I know some changes were made which seems to suggest this, but unsure if they are in the stable release yet.

Thank you!

@jsleight
Copy link
Contributor

I've been using mleap and python 3.8 for quite a while. Both mleap v0.20 and v0.21.x should work, I can't remember of 0.19 did or not (probably yes).

@drei34
Copy link
Author

drei34 commented Feb 10, 2023

@jsleight Thanks! Have you ever used serialize to bundle? I'm on python 3.8 and 0.20 mleap. Java 8 and Scala 2.12. I make the context as below, but I can't serialize a pipeline. I pasted some code below.

image

`
def gen_spark_session():
return SparkSession.builder.appName("happy").config(
"hive.exec.dynamic.partition", "True").config(
"hive.exec.dynamic.partition.mode", "nonstrict").config(
"spark.jars.excludes", "net.sourceforge.f2j:arpack_combined_all").config(
"spark.jars.packages",
"ml.combust.mleap:mleap-spark_2.12:0.20.0,"
"ml.combust.mleap:mleap-spark-base_2.12:0.20.0"
).enableHiveSupport().getOrCreate()

spark = gen_spark_session()

dfTrainFake = spark.createDataFrame([
(7,20,3,6,1,10,3,53948,245351,1),
(7,20,3,6,1,10,3,53948,245351,1)
], ['dayOfWeek','hour','channel','platform','deviceType','adUnit','pageType','zip','advertiserId','label'])
print(type(dfTrainFake))

# Do some stuff
pipelineModel.serializeToBundle(locMLeapModelPath, trainPredictions)
`

@jsleight
Copy link
Contributor

I suspect you need an additional jar(s). I have all of these as dependencies:

ml.combust.mleap:mleap-spark_2.12
ml.combust.mleap:mleap-core_2.12
ml.combust.mleap:mleap-runtime_2.12
ml.combust.mleap:mleap-avro_2.12
ml.combust.mleap:mleap-spark-extension_2.12
ml.combust.mleap:mleap-spark-base_2.12
ml.combust.bundle:bundle-ml_2.12
ml.combust.bundle:bundle-hdfs_2.12
ml.combust.mleap:mleap-tensorflow_2.12
ml.combust.mleap:mleap-xgboost-spark_2.12

That full list is probably overkill, but probably you need one/both of bundle ones.

@drei34
Copy link
Author

drei34 commented Feb 10, 2023

So unfortunately this still does not work. I get the same error. I did as below to get the spark session (is it right to do jars.packages?). I also did the two commands below to set the PATHS. It was complaining about a Python 3.7 / Python 3.8 (prob can add this to bash_rc). Is the only thing left to try all the jars.packages above?

export PYSPARK_PYTHON=/opt/conda/miniconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/conda/miniconda3/bin/python

image

@drei34
Copy link
Author

drei34 commented Feb 13, 2023

I added all of these except the below and still no luck.

ml.combust.mleap:mleap-tensorflow_2.12 ml.combust.mleap:mleap-xgboost-spark_2.12 ml.combust.mleap:mleap-avro_2.12

I am doing this:

`
def gen_spark_session():
    return SparkSession.builder.appName("happy").config(
        "hive.exec.dynamic.partition", "True").config(
        "hive.exec.dynamic.partition.mode", "nonstrict").config(
        "spark.jars.excludes", "net.sourceforge.f2j:arpack_combined_all").config(
        "spark.jars.packages",
        "ml.combust.mleap:mleap-spark_2.12:0.20.0,"
        "ml.combust.mleap:mleap-spark-base_2.12:0.20.0,"
        "ml.combust.mleap:mleap-spark_extension_2.12:0.20.0,"
        "ml.combust.mleap:mleap-runtime_2.12:0.20.0,"
        "ml.combust.mleap:mleap-core_2.12:0.20.0,"
        "ml.combust.bundle:bundle-ml_2.12:0.20.0,"
        "ml.combust.bundle:bundle-hdfs_2.12:0.20.0"
        ).enableHiveSupport().getOrCreate()
       
spark = gen_spark_session()
`

@drei34
Copy link
Author

drei34 commented Feb 14, 2023

@jsleight Is it possible the problem is conda and SPARK? I.e. I am thinking the issue might be as in the link below. It seems they very carefully select some env variables to make this error go away (note it's not the same error as I have, but the JVM not having something is the complaint ... Thinking this is related) ... Thanks in advance or do you know someone who might know the issue here, sort of stuck. Also, it seems that I have py4j-0.10.9-src.zip ... I imagine is this OK?

Additionally, are these jars needed to build and ultimately serialize a xgboost model in your pipeline?

ml.combust.mleap:mleap-tensorflow_2.12
ml.combust.mleap:mleap-xgboost-spark_2.12

ENV variables:
https://sparkbyexamples.com/pyspark/pyspark-py4j-protocol-py4jerror-org-apache-spark-api-python-pythonutils-jvm/

@jsleight
Copy link
Contributor

jsleight commented Feb 14, 2023

Hmmm, I can't reproduce this. Can you post a complete minimal example?
E.g., using python 3.8.0, pyspark 3.2.2, and mleap 0.20.0, py4j 0.10.9.5, this code works for me.

import pyspark
import pyspark.ml
spark = (
    pyspark.sql.SparkSession.builder.appName("happy")
    .config("hive.exec.dynamic.partition", "True")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .config("spark.jars.excludes","net.sourceforge.f2j:arpack_combined_all")
    .config("spark.jars.packages",
        "ml.combust.mleap:mleap-spark_2.12:0.20.0,"
        "ml.combust.mleap:mleap-spark-base_2.12:0.20.0,"
        "ml.combust.bundle:bundle-ml_2.12:0.20.0,"
        "ml.combust.bundle:bundle-hdfs_2.12:0.20.0"
    )
    .enableHiveSupport().getOrCreate()
)

features = ['dayOfWeek','hour','channel','platform','deviceType','adUnit','pageType','zip','advertiserId']
df = spark.createDataFrame(
    [
        (7,20,3,6,1,10,3,53948,245351,1),
        (7,20,3,6,1,10,3,53948,245351,1)
    ],
    features + ['label']
)

pipeline = pyspark.ml.Pipeline(stages=[
    pyspark.ml.feature.VectorAssembler(inputCols=features, outputCol="features"),
    pyspark.ml.classification.LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)
predictions = model.transform(df)

from mleap.pyspark import spark_support
local_path = "jar:file:/tmp/pyspark.example.zip"
model.serializeToBundle(local_path, predictions)
deserialized_model = pyspark.ml.PipelineModel.deserializeFromBundle(local_path)
deserialized_model.transform(df).show()

@drei34
Copy link
Author

drei34 commented Feb 14, 2023

Thanks! So I have pyspark 3.1.3, mleap 0.20.0, python 3.8.15. I am on a google data proc 2.0 box and I put the jars above in /usr/lib/spark/jars .

I took your code (let's fix that) and it gave me this JVM error Py4JError: ml.combust.mleap.spark.SimpleSparkSerializer does not exist in the JVM. I have py4j 10.9, is that also what you have? My speculation is this: https://sparkbyexamples.com/pyspark/pyspark-py4j-protocol-py4jerror-org-apache-spark-api-python-pythonutils-jvm/ but I have not been able to fix my problem ... What are your system variables. Are you using anaconda?

@drei34
Copy link
Author

drei34 commented Feb 14, 2023

Specifically, if I close a terminal I get this error first: Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 477, in main ("%d.%d" % sys.version_info[:2], version)) Exception: Python in worker has different version 3.7 than that in driver 3.8, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

I fix this by setting as below, then I get the JVM one.

export PYSPARK_PYTHON=/opt/conda/miniconda3/bin/python export PYSPARK_DRIVER_PYTHON=/opt/conda/miniconda3/bin/python

I.e. this:

image

image

@jsleight
Copy link
Contributor

Additionally, are these jars needed to build and ultimately serialize a xgboost model in your pipeline?

you'll need the xgboost jar to serialize an xgboost model. The tensorflow jar is for if you want to serialize a tensorflow model.

@jsleight jsleight reopened this Feb 14, 2023
@jsleight
Copy link
Contributor

I'm using pip and virtualenv, don't have any relevant env variables.

I have py4j 0.10.9.5, but it is just being pulled in as a dependency from pyspark

Mleap 0.20.0 is built for spark 3.2.0 which might cause some of these problems

@drei34
Copy link
Author

drei34 commented Feb 14, 2023

OK thank you again - will try 0.19.0. This should work with 3.1.3 right? Seems so by the website: https://github.com/combust/mleap

Update: I downgraded mleap to 0.19.0 and pulled the old jars from maven, but still have the same error.

import pyspark
import pyspark.ml
spark = (
    pyspark.sql.SparkSession.builder.appName("happy")
    .config("hive.exec.dynamic.partition", "True")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .config("spark.jars.excludes","net.sourceforge.f2j:arpack_combined_all")
    .config("spark.jars.packages",
        "ml.combust.mleap:mleap-spark_2.12:0.19.0,"
        "ml.combust.mleap:mleap-spark-base_2.12:0.19.0,"
        "ml.combust.bundle:bundle-ml_2.12:0.19.0,"
        "ml.combust.bundle:bundle-hdfs_2.12:0.19.0"
    )
    .enableHiveSupport().getOrCreate()
)

features = ['dayOfWeek','hour','channel','platform','deviceType','adUnit','pageType','zip','advertiserId']
df = spark.createDataFrame(
    [
        (7,20,3,6,1,10,3,53948,245351,1),
        (7,20,3,6,1,10,3,53948,245351,1)
    ],
    features + ['label']
)

pipeline = pyspark.ml.Pipeline(stages=[
    pyspark.ml.feature.VectorAssembler(inputCols=features, outputCol="features"),
    pyspark.ml.classification.LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)
predictions = model.transform(df)

from mleap.pyspark import spark_support
local_path = "jar:file:/tmp/pyspark.example.zip"
model.serializeToBundle(local_path, predictions)
deserialized_model = pyspark.ml.PipelineModel.deserializeFromBundle(local_path)
deserialized_model.transform(df).show()

@jsleight
Copy link
Contributor

https://github.com/combust/mleap#mleapspark-version has the version compatibility. 0.19.0 was spark 3.0.2. All of these compatibilities are just the versions which are explicitly tested, so other things might work but I don't have any conclusive evidence one way or another.

The class not being in the JVM would support your idea of something being weird with your env variables.

@drei34
Copy link
Author

drei34 commented Feb 14, 2023

Gotcha I'll keep looking. Another thing that's weird is I see the error is from here as below. Is there some request being sent somewhere? This seems to complaint that the Java side is empty so I wonder if the problem is some request that gets blocked due to some set ip (very unsure).

image

@drei34
Copy link
Author

drei34 commented Feb 15, 2023

Digging around more I think this should work: pyspark --packages ml.combust.mleap:mleap-spark_2.12:0.16.0 predict.py where here I am specifying the Scala and Mleap versions. This is from here: combust/mleap-docs#8 (comment) and similar to just specifying the packages when building the context, I think. I made the predict file be as above, but still no luck. I moved the MLEAP version down but this seems like a bad thing: I should not be going back to MLEAP versions from years ago to make this all work. This is Python 3.8, Mleap 0.16, PySpark 3.1.3 ... It would seem that this should all be OK but I'm at a loss as to what to do about this error ... I guess the only thing left to do is to make a conda env with: python 3.8.0, pyspark 3.2.2, and mleap 0.20.0, py4j 0.10.9.5 and hope it works. I believe there is something wrong w the paths on Google machines or something, I can't come up with anything else on this frustrating issue.

image

import pyspark
import pyspark.ml
 
spark=pyspark.sql.SparkSession.builder.config('spark.jars.packages', 'ml.combust.mleap:mleap-spark_2.12:0.16.0,ml.combust.mleap:mleap-runtime_2.12:0.16.0,ml.combust.bundle:bundle-ml_2.12:0.16.0,ml.combust.bundle:bundle-hdfs_2.12:0.16.0"').config("spark.jars.excludes", "net.sourceforge.f2j:arpack_combined_all").getOrCreate()

features = ['dayOfWeek','hour','channel','platform','deviceType','adUnit','pageType','zip','advertiserId']
df = spark.createDataFrame(
    [
        (7,20,3,6,1,10,3,53948,245351,1),
        (7,20,3,6,1,10,3,53948,245351,1)
    ],
    features + ['label']
)
 
# Works just file on 1.4, not on 2.0.
pipeline = pyspark.ml.Pipeline(stages=[
    pyspark.ml.feature.VectorAssembler(inputCols=features, outputCol="features"),
    pyspark.ml.classification.LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)
predictions = model.transform(df)
 
from mleap.pyspark import spark_support
local_path = "jar:file:/tmp/pyspark.example.zip"
model.serializeToBundle(local_path, predictions)
deserialized_model = pyspark.ml.PipelineModel.deserializeFromBundle(local_path)
deserialized_model.transform(df).show()

@jsleight
Copy link
Contributor

My experience with the py4j Answer from java side is empty errors is usually that spark's jvm died. The test case I did above was with java 8, but I'm pretty sure java 11 will work too (and certainly java 11 works with latest mleap). Notably java 17 won't work because spark doesn't support that java version.

@drei34
Copy link
Author

drei34 commented Feb 15, 2023

Actually looking above in the stack trace I see this error as the root: Exception in thread "Thread-4" java.lang.NoClassDefFoundError: ml/combust/bundle/serializer/SerializationFormat ...

I added jars directly like below, and this is still happening. Really unsure why. Would it be possible to zoom with anyone on the team over this?

image

`
import pyspark
import pyspark.ml

spark=pyspark.sql.SparkSession.builder.config('spark.jars', '/usr/lib/spark/jars/mleap-spark-base_2.12-0.16.0.jar,/usr/lib/spark/jars/mleap-spark_2.12-0.16.0.jar,/usr/lib/spark/jars/mleap-runtime_2.12-0.16.0.jar,/usr/lib/spark/jars/bundle-ml_2.12-0.16.0.jar,/usr/lib/spark/jars/bundle-hdfs_2.12-0.16.0.jar').config("spark.jars.excludes", "net.sourceforge.f2j:arpack_combined_all").getOrCreate()

features = ['dayOfWeek','hour','channel','platform','deviceType','adUnit','pageType','zip','advertiserId']

df = spark.createDataFrame(
[
(7,20,3,6,1,10,3,53948,245351,1),
(7,20,3,6,1,10,3,53948,245351,1)
],
features + ['label']
)

pipeline = pyspark.ml.Pipeline(stages=[
pyspark.ml.feature.VectorAssembler(inputCols=features, outputCol="features"),
pyspark.ml.classification.LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)
predictions = model.transform(df)

from mleap.pyspark import spark_support
local_path = "jar:file:/tmp/pyspark.example.zip"
model.serializeToBundle(local_path, predictions)
deserialized_model = pyspark.ml.PipelineModel.deserializeFromBundle(local_path)
deserialized_model.transform(df).show()
`

@drei34
Copy link
Author

drei34 commented Feb 15, 2023

Basically my error looks like this and the issue seems to be some path problem: combust/mleap-docs#8 (comment)

@jsleight
Copy link
Contributor

It is interesting that your error is complaining about a path with / in it. I think it is usually with . separators? Maybe I'm mis-remembering things though.

@drei34
Copy link
Author

drei34 commented Feb 21, 2023

So the .jar bundle-ml_2.12-0.19.0.jar (or 0.16.0) has this Class in it. And the path seems like it uses '/'. Another question I have I know the jars need to be added to the spark jars folder. Do they need to also be added to pyspark in some way? It's like pyspark does not see the right thing (pyspark 3.1.3, mleap <= 0.19.0).

image

@drei34
Copy link
Author

drei34 commented Feb 21, 2023

My speculation is py4j is not using the right jars in some way ... Conda might have it's own py4j and this is the confusion. But I'm still unsure. Just wondering if you saw this before ...

@drei34
Copy link
Author

drei34 commented Feb 21, 2023

The error seems to be from: /usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j and I see that there are imports happening. I put some print statements in the java_gateway.py file in this directory since that's where the exception is coming from and I want to see what is being imported. The weird part is most look like (I'm using your example still) org.apache.spark.ml.classification.LogisticRegression but the mleap ones do not have the prefix org.apache.spark but mleap was installed via pip. Why would this not be appended as such to the path? Basically the exception is complaining that ml.combust.mleap.spark.SimpleSparkSerializer is not found but I think it has access to org.apache.spark.ml.combust.mleap.spark.SimpleSparkSerializer and that's what it should be using ... Trying to figure out why this is happening also + add this hack fix to see if it even goes anywhere ...

@drei34
Copy link
Author

drei34 commented Feb 21, 2023

@jsleight Yeah so I fix that problem with the import (I know there's a better solution, but adding the prefix seems to bring in the needed class), but then I get another error. 'JavaPackage' object is not callable ... Unsure why, this error seems to be a problem if maybe you have wrong versions but I don't think that's the case. I'm on Java8, Scala 2.12, Mleap 0.19.0, PySpark 3.1.3 and Python 3.8.15 ...

image

@drei34
Copy link
Author

drei34 commented Feb 21, 2023

This is my spark context btw ... I added all the right jars, I think spark=pyspark.sql.SparkSession.builder.config('spark.jars', '/usr/lib/spark/jars/mleap-spark-base_2.12-0.19.0.jar,/usr/lib/spark/jars/mleap-spark_2.12-0.19.0.jar,/usr/lib/spark/jars/mleap-runtime_2.12-0.19.0.jar,/usr/lib/spark/jars/bundle-ml_2.12-0.19.0.jar,/usr/lib/spark/jars/bundle-hdfs_2.12-0.19.0.jar,/usr/lib/spark/jars/mleap-spark-extension_2.12-0.19.0.jar').config("spark.jars.excludes", "net.sourceforge.f2j:arpack_combined_all").getOrCreate()

@jsleight
Copy link
Contributor

in my experience the JavaPackage object is not callable has always come from the Spark jvm not having the jars which it needs.

When I run your code examples above (using mleap 0.19.0) then it works for me.

@drei34
Copy link
Author

drei34 commented Feb 21, 2023

Is there a way to check that the context I pass has what it needs? I mean given the jars above, I should have everything. This gives a context so those jars are where they need to be.

@drei34
Copy link
Author

drei34 commented Mar 7, 2023

See #845

@drei34 drei34 closed this as completed Mar 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants