300 release candidate 1 #2324
Closed
# Conflicts: # src/main/scala/com/johnsnowlabs/ml/tensorflow/TensorflowBert.scala
…or-handler Added error message handler at catalyst level in NorvigSweetingApproach
…eddings-batch-annotate
…exception-handling-norvigapproach Added Spark cross version exception handler in Norvig Approach
…eddings-batch-annotate
…ntenceembeds Added batch annotator in BertSentenceEmbeddings Python
…ddings-batch-annotate Introducing BatchAnnotate in BertSentenceEmbeddings
Bug fix ner confidence
fix shadding package
…Labs/spark-nlp into 300-release-candidate-1
Spark NLP 3.0.0-rc1 has been released.
This PR proposes support for Apache Spark 3.0.1 / Scala 2.12.10 and TensorFlow 2.3.x via the native Java library.
Apache Spark 3.0.x support
Since the majority of the Apache Spark community is still on 2.4.x (some even on 2.3.x), we have to make sure that any changes to support Apache Spark 3.0.x on Scala 2.12 remain compatible with the other two major versions.
spark24 same as spark23 to start the SparkSession on the appropriate version. (Apache Spark 3.0.x will be the default from now on)
NOTE: At the moment, some pretrained models/pipelines compiled on Scala 2.11 are not passing the unit tests on Scala 2.12. We already have a list of them and will start retraining (or, more precisely, extracting their weights) on Scala 2.12.
Spark NLP packages
Starting with the Spark NLP 3.0.0 release, the default Spark NLP packages will be based on Apache Spark 3.0.x and Scala 2.12. This means the following packages will target Apache Spark 3.0.x/Scala 2.12 by default:
- spark-nlp
- spark-nlp-gpu
For Apache Spark 2.3.x and Apache Spark 2.4.x:
- spark-nlp-spark23
- spark-nlp-spark24
- spark-nlp-gpu-spark23
- spark-nlp-gpu-spark24
Spark NLP start functions in Python and Scala are also updated for these changes.
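The package naming above can be summarized as a small lookup. The following is a minimal sketch, not the library's actual start function: `artifact_name` is a hypothetical helper, and the flag names `gpu`, `spark23`, and `spark24` are assumed here only to mirror the packages listed above.

```python
def artifact_name(gpu: bool = False, spark23: bool = False, spark24: bool = False) -> str:
    """Return the Spark NLP package name for the requested Spark line.

    Hypothetical helper for illustration only: it encodes the naming
    scheme listed above and does not start a SparkSession.
    """
    if spark23 and spark24:
        raise ValueError("choose at most one of spark23 and spark24")

    # With no Spark flag set, the default (Apache Spark 3.0.x) package is used.
    name = "spark-nlp-gpu" if gpu else "spark-nlp"
    if spark23:
        name += "-spark23"
    elif spark24:
        name += "-spark24"
    return name
```

For example, `artifact_name(gpu=True, spark24=True)` selects the GPU package for Apache Spark 2.4.x, while calling it with no arguments selects the default package.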
Native TensorFlow 2.x support
New APIs, Compile, Runtime, and Load
The result of the model is not important in this section. What we care about is whether a model exported in TF v1 is compatible with, and can be loaded in, TF v2:
Evaluation passed
Evaluation failed
Save, Load, Pretrained model(s), and actual Results
We care about the result of the model, whether or not the embeddings are still the same, translations are still the same, etc.
Evaluation passed
Evaluation failed
MarianTransformer is not outputting the desired translation in the unit test. There must be an issue with offset and the new NdArray buffers, which behave differently from Java buffers with flip.
Models require changes
Deprecated Models
These don't have TF2.0 SavedModel on TF Hub yet: (issue reported: tensorflow/hub#735)
Additional steps
inter_op_parallelism_threads and other params via configProtoBytes)
Example:
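One way such bytes could be produced is shown below as a minimal, dependency-free sketch. It hand-encodes two thread-pool settings in protobuf wire format. The helper names are hypothetical, and the field numbers (2 for intra_op_parallelism_threads, 5 for inter_op_parallelism_threads) are an assumption taken from TensorFlow's config.proto that should be checked against the TensorFlow version in use.

```python
def encode_varint(value: int) -> list:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = []
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(low7)
            return out


def config_proto_bytes(intra_op: int, inter_op: int) -> list:
    """Serialize two thread-pool settings as a list of byte values.

    Assumed field numbers (from TensorFlow's config.proto):
    intra_op_parallelism_threads = 2, inter_op_parallelism_threads = 5.
    """
    out = []
    for field_number, value in [(2, intra_op), (5, inter_op)]:
        if value == 0:
            continue  # proto3 omits fields left at their default value
        out.append(field_number << 3)  # tag byte: field number with wire type 0 (varint)
        out.extend(encode_varint(value))
    return out
```

In practice the same bytes would more likely come from TensorFlow's own ConfigProto message via `SerializeToString()`; the hand-rolled encoder only makes visible what ends up in the list passed as configProtoBytes.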
Final Tests
Devices
ENVs
Training and predicting on:
Local:
Others:
Databricks:
AWS:
Batch Annotate
Introducing BatchAnnotate for BertEmbeddings and NerDLModel. The goal is to tune how many rows are processed at once, so that larger and more computationally demanding models can make better use of accelerated hardware such as GPUs.
NOTE: Be careful not to misuse this on CPU. Always run a simple benchmark on a subset of your dataset before using it in production.
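The kind of simple benchmark suggested in the note can be sketched generically. This is an illustration under stated assumptions, not Spark NLP code: `batches` and `time_batch_size` are hypothetical helpers, and `annotate_batch` stands in for whatever call runs the model on one batch of rows.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(rows: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


def time_batch_size(
    rows: Sequence[str],
    batch_size: int,
    annotate_batch: Callable[[List[str]], object],
) -> float:
    """Return the seconds spent annotating all rows at the given batch size."""
    started = time.perf_counter()
    for batch in batches(rows, batch_size):
        annotate_batch(batch)  # stand-in for the real model call
    return time.perf_counter() - started
```

Running `time_batch_size` over a small sample with a few candidate batch sizes, on the same hardware that will be used in production, gives a rough basis for choosing a value.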