Closed as not planned
Description
Is there an existing issue for this?
- I have searched the existing issues and did not find a match.
Who can help?
Following up on discussion thread #13812.
Spark config used:
"spark.sql.warehouse.dir" : "s3a://bucket-example/nlp",
"spark.hadoop.fs.s3a.access.key":"<masked>",
"spark.hadoop.fs.s3a.secret.key": "<masked>",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryoserializer.buffer.max": "2000M",
"spark.driver.maxResultSize": "0",
"spark.kubernetes.container.image": "tested on pyspark 3.3 and 3.4",
"spark.kubernetes.container.image.pullPolicy" : "Always",
"spark.jsl.settings.pretrained.cache_folder": "/opt/spark/work-dir",
"spark.kubernetes.driver.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.options.claimName": "lighter-sparknlptest-pvc",
"spark.kubernetes.driver.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.mount.path": "/opt/spark/work-dir",
"spark.kubernetes.executor.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.options.claimName": "lighter-sparknlptest-pvc",
"spark.kubernetes.executor.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.mount.path": "/opt/spark/work-dir",
"spark.jsl.settings.annotator.log_folder": "/opt/spark/work-dir/logs"
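For anyone trying to reproduce this with `spark-submit` rather than a JSON payload, here is a minimal sketch that turns a dict of the settings above into `--conf key=value` arguments. The helper name `to_spark_submit_args` is hypothetical, and only a subset of the keys is shown.

```python
from typing import Dict, List


def to_spark_submit_args(conf: Dict[str, str]) -> List[str]:
    """Turn a Spark config dict into spark-submit `--conf key=value` arguments.

    Hypothetical helper for illustration; it is not part of Spark or Spark NLP.
    """
    args: List[str] = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


# A subset of the settings from this report, copied as-is.
conf = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryoserializer.buffer.max": "2000M",
    "spark.jsl.settings.pretrained.cache_folder": "/opt/spark/work-dir",
}

print(" ".join(to_spark_submit_args(conf)))
```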
When I save the model to the PVC, there is no issue:
model.write().overwrite().save('/path_to_pvc/test_model_greview_bert')
But when I save to s3a:
model.write().overwrite().save("s3a://bucket-example/nlp/models/greview_bert")
I get the error below.
Please note that if I use PySpark without Spark NLP, there is no issue saving or loading a DataFrame to or from s3a.
An error was encountered:
Py4JJavaError
[Traceback (most recent call last):
, File "/tmp/spark-5ad2d697-515d-43d8-82da-cbc35328adcb/shell_wrapper.py", line 113, in exec
self._exec_then_eval(code)
, File "/tmp/spark-5ad2d697-515d-43d8-82da-cbc35328adcb/shell_wrapper.py", line 106, in _exec_then_eval
exec(compile(last, '<string>', 'single'), self.globals)
, File "<string>", line 2, in <module>
, File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 197, in save
self._jwrite.save(path)
, File "/opt/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
return_value = get_return_value(
, File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
return f(*a, **kw)
, File "/opt/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
, py4j.protocol.Py4JJavaError: An error occurred while calling o427.save.
: org.apache.hadoop.fs.PathIOException: `Cannot get relative path for URI:file:///tmp/1ceaa5db4f81_bert_sentence4029999211103636104/bert_sentence_tensorflow': Input/output error
at org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.getFinalPath(CopyFromLocalOperation.java:360)
at org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.uploadSourceFromFS(CopyFromLocalOperation.java:222)
at org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.execute(CopyFromLocalOperation.java:169)
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$copyFromLocalFile$25(S3AFileSystem.java:3920)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444)
at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2337)
at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2356)
at org.apache.hadoop.fs.s3a.S3AFileSystem.copyFromLocalFile(S3AFileSystem.java:3913)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2448)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2411)
at com.johnsnowlabs.ml.tensorflow.WriteTensorflowModel.writeTensorflowModelV2(TensorflowSerializeModel.scala:85)
at com.johnsnowlabs.ml.tensorflow.WriteTensorflowModel.writeTensorflowModelV2$(TensorflowSerializeModel.scala:61)
at com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings.writeTensorflowModelV2(BertSentenceEmbeddings.scala:151)
at com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings.onWrite(BertSentenceEmbeddings.scala:399)
at com.johnsnowlabs.nlp.ParamsAndFeaturesWritable.$anonfun$write$1(ParamsAndFeaturesWritable.scala:51)
at com.johnsnowlabs.nlp.ParamsAndFeaturesWritable.$anonfun$write$1$adapted(ParamsAndFeaturesWritable.scala:51)
at com.johnsnowlabs.nlp.FeaturesWriter.saveImpl(ParamsAndFeaturesWritable.scala:38)
at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:168)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.base/java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Unknown Source)
]
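Since saving to the PVC works, one possible workaround is to save locally first and then copy the saved directory to the bucket. The sketch below only covers the copy step, and I have not verified that a model uploaded this way loads back from s3a. The names `iter_upload_pairs` and `upload_dir` are hypothetical, and the uploader is passed in as a callable so the S3 client is an assumption rather than a requirement (boto3's `upload_file(Filename, Bucket, Key)` would fit).

```python
from pathlib import Path
from typing import Callable, Iterator, Tuple


def iter_upload_pairs(local_dir: str, key_prefix: str) -> Iterator[Tuple[str, str]]:
    """Yield (local_file, object_key) for every file under local_dir."""
    root = Path(local_dir)
    prefix = key_prefix.strip("/")
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Keep the directory layout produced by model.write().save(...)
            rel = path.relative_to(root).as_posix()
            yield str(path), f"{prefix}/{rel}" if prefix else rel


def upload_dir(
    local_dir: str,
    bucket: str,
    key_prefix: str,
    upload_file: Callable[[str, str, str], None],
) -> int:
    """Upload every file under local_dir and return how many were sent.

    upload_file is called as upload_file(local_file, bucket, key).
    """
    count = 0
    for local_file, key in iter_upload_pairs(local_dir, key_prefix):
        upload_file(local_file, bucket, key)
        count += 1
    return count


# Example wiring (assumes boto3 is installed and credentials are configured):
#   import boto3
#   s3 = boto3.client("s3")
#   model.write().overwrite().save("/path_to_pvc/test_model_greview_bert")
#   upload_dir("/path_to_pvc/test_model_greview_bert",
#              "bucket-example", "nlp/models/greview_bert", s3.upload_file)
```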
What are you working on?
Current Behavior
Expected Behavior
Steps To Reproduce
Spark NLP version and Apache Spark
spark-nlp==4.3.0
https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.3.0.jar
Type of Spark Application
Python Application
Java Version
No response
Java Home Directory
No response
Setup and installation
No response
Operating System and Version
No response
Link to your project (if available)
No response
Additional Information
No response