@maziyarpanahi (Contributor) commented Feb 22, 2021

This PR proposes support for Apache Spark 3.0.1 / Scala 2.12.10 and TensorFlow 2.3.x via the native Java library.

Apache Spark 3.0.x support

Since the majority of the Apache Spark community is still on 2.4.x (some even on 2.3.x), we have to make sure that any changes made to support Apache Spark 3.0.x on Scala 2.12 remain compatible with the other two supported release lines, 2.3.x and 2.4.x.

  • Unit tests for Scala and Python must pass for Apache Spark 2.3 on Scala 2.11
  • Unit tests for Scala and Python must pass for Apache Spark 2.4 on Scala 2.11
  • Unit tests for Scala and Python must pass for Apache Spark 3.0.1 on Scala 2.12
  • Update naming in packages/artifacts based on Apache Spark 3.0 via a condition
  • The Spark NLP start() functions in Python and Scala now have a spark24 option, analogous to spark23, to start the SparkSession on the appropriate version. (Apache Spark 3.0.x will be the default from now on.)

NOTE: At the moment, some pretrained models/pipelines compiled on Scala 2.11 do not pass the unit tests on Scala 2.12. We already have a list of them and will start retraining them (or, more precisely, extracting their weights) on Scala 2.12.

Spark NLP packages

Starting with the Spark NLP 3.0.0 release, the default Spark NLP packages will be based on Apache Spark 3.0.x and Scala 2.12. This means the following packages will target Apache Spark 3.0.x/Scala 2.12 by default:

  • spark-nlp
  • spark-nlp-gpu

For Apache Spark 2.3.x and Apache Spark 2.4.x:

  • spark-nlp-spark23
  • spark-nlp-spark24
  • spark-nlp-gpu-spark23
  • spark-nlp-gpu-spark24

Spark NLP start functions in Python and Scala are also updated for these changes.
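
The naming scheme above can be sketched as a small lookup. This is only an illustration of the convention described in this PR, not the actual start() implementation: the function name resolve_artifact is hypothetical, and the flag names gpu, spark23, and spark24 are assumed to mirror the options mentioned earlier.

```python
from typing import Tuple


def resolve_artifact(gpu: bool = False,
                     spark23: bool = False,
                     spark24: bool = False) -> Tuple[str, str]:
    """Return (artifact name, Scala binary version) for the requested setup.

    Hypothetical helper: it only encodes the naming convention listed in
    this PR (default artifacts target Apache Spark 3.0.x / Scala 2.12).
    """
    if spark23 and spark24:
        raise ValueError("choose at most one of spark23 / spark24")

    # The GPU build is a separate artifact with a "-gpu" suffix.
    name = "spark-nlp-gpu" if gpu else "spark-nlp"

    # Older Spark lines get an explicit suffix and stay on Scala 2.11.
    if spark23:
        return name + "-spark23", "2.11"
    if spark24:
        return name + "-spark24", "2.11"

    # No flag: Apache Spark 3.0.x on Scala 2.12 is the default.
    return name, "2.12"
```

For example, resolve_artifact(gpu=True, spark24=True) yields the spark-nlp-gpu-spark24 artifact on Scala 2.11, while calling it with no arguments yields the default spark-nlp artifact on Scala 2.12.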

Native TensorFlow 2.x support

New APIs, Compile, Runtime, and Load

The result of the model is not important in this section; we only care whether a model exported in TF v1 can be loaded in TF v2:

Evaluation passed

  • AlbertEmbeddings
  • BertEmbeddings
  • BertSentenceEmbeddings
  • ElmoEmbeddings
  • UniversalSentenceEncoder
  • XlnetEmbeddings
  • MarianTransformer
  • T5Transformers
  • ClassifierDLApproach / ClassifierDLModel
  • SentimentDLApproach / SentimentDLModel
  • MultiClassifierDLApproach / MultiClassifierDLModel
  • LanguageDetectorDL
  • SentenceDetectorDLApproach / SentenceDetectorDLModel
  • NerDLApproach / NerDLModel
  • ContextSpellCheckerApproach / ContextSpellCheckerModel

Evaluation failed

Save, Load, Pretrained model(s), and actual Results

Here we care about the result of the model: whether the embeddings are still the same, the translations are still the same, etc.

Evaluation passed

  • AlbertEmbeddings
  • BertEmbeddings
  • BertSentenceEmbeddings
  • ElmoEmbeddings
  • UniversalSentenceEncoder
  • XlnetEmbeddings
  • MarianTransformer
  • T5Transformers
  • ClassifierDLApproach / ClassifierDLModel
  • SentimentDLApproach / SentimentDLModel
  • MultiClassifierDLApproach / MultiClassifierDLModel
  • LanguageDetectorDL
  • SentenceDetectorDLApproach / SentenceDetectorDLModel
  • NerDLApproach / NerDLModel
  • ContextSpellCheckerApproach / ContextSpellCheckerModel

Evaluation failed

  • MarianTransformer is not producing the desired translation in the unit test. There must be an issue with the offset and the new NdArray buffers, which work differently from Java buffers with flip.
  • Multilingual USE models require SentencePiece. Exporting these models in TF v2 with TF Text could be a solution.

Models require changes

  • Part of Speech models to be compatible with Apache Spark 3.0.1 on Scala 2.12.x
  • WordSegmenter models to be compatible with Apache Spark 3.0.1 on Scala 2.12.x
  • tfhub_use_multi
  • tfhub_use_multi_lg

Deprecated Models

These don't have a TF 2.0 SavedModel on TF Hub yet (issue reported: tensorflow/hub#735):

  • tfhub_use_xling_many
  • tfhub_use_xling_en_es
  • tfhub_use_xling_en_fr
  • tfhub_use_xling_en_de

Additional steps

  • Configurable TensorFlow Config (in TF v1 we used a binary array; this should work the same way as in TensorFlow, e.g. setting inter_op_parallelism_threads and other parameters via configProtoBytes)

Example:

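// Assumed import for the TF 2.x Java bindings; in TF 1.x the class lived in org.tensorflow.framework.
import org.tensorflow.proto.framework.ConfigProto;

// Builds a session config that restricts TensorFlow to a single thread per pool.
private static ConfigProto singleThreadConfigProto() {
    return ConfigProto.newBuilder()
        .setInterOpParallelismThreads(1)  // threads used to run independent ops in parallel
        .setIntraOpParallelismThreads(1)  // threads used inside a single op
        .build();
}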
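
To make the "binary array" behind configProtoBytes more concrete, here is a minimal sketch of what the serialized form of the config above looks like on the wire. It hand-encodes the two thread settings as protobuf varint fields. The field numbers (2 for intra_op_parallelism_threads, 5 for inter_op_parallelism_threads) are an assumption taken from TensorFlow's config.proto, and the helper names are hypothetical. In real code, the protobuf or TensorFlow library's own serializer should produce these bytes.

```python
from typing import List

# Assumed field numbers from TensorFlow's config.proto (ConfigProto message).
INTRA_OP_FIELD = 2
INTER_OP_FIELD = 5
VARINT_WIRE_TYPE = 0  # protobuf wire type for int32 values


def encode_varint(value: int) -> List[int]:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = []
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return out


def config_proto_bytes(intra_op: int, inter_op: int) -> List[int]:
    """Return the serialized thread settings as a list of byte values."""
    out: List[int] = []
    # Fields are written in ascending field-number order.
    for field, value in ((INTRA_OP_FIELD, intra_op), (INTER_OP_FIELD, inter_op)):
        if value == 0:
            continue  # proto3 omits fields that hold their default value
        out += encode_varint((field << 3) | VARINT_WIRE_TYPE)  # field tag
        out += encode_varint(value)
    return out
```

With both settings at 1, config_proto_bytes(1, 1) gives the four byte values 16, 1, 40, 1, which is the kind of array that would be handed over as configProtoBytes.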

Final Tests

Devices

  • CPU only
  • CPU with MKL (requires testing)
  • GPU
  • Fat JAR for offline use with all the above situations

ENVs

Training and predicting on:

Local:

  • macOS
  • Windows 8 and 10
  • Linux (Ubuntu 16, 18, 20 - Debian - ?)

Others:

  • Cloudera 6.x (this should be Spark 2.4.x only)

Databricks:

  • Databricks 6.4
  • Databricks 6.4 GPU
  • Databricks 7.3
  • Databricks 7.3 GPU
  • Databricks 7.4
  • Databricks 7.4 GPU
  • Databricks 7.5
  • Databricks 7.5 GPU
  • Databricks 7.6
  • Databricks 7.6 GPU

AWS:

  • EMR 5.20.0 (Apache Spark 2.4.0)
  • EMR 5.29.0 (Apache Spark 2.4.4)
  • EMR 5.30.1 (Apache Spark 2.4.5)
  • EMR 5.31.0 (Apache Spark 2.4.6)
  • EMR 5.32.0 (Apache Spark 2.4.7)
  • EMR 6.0.0 (Apache Spark 3.0.x)
  • EMR 6.1.0 (Apache Spark 3.0.0 / Hadoop 3.2.1)
  • EMR 6.2.0 (Apache Spark 3.0.1 / Hadoop 3.2.1)
  • AWS Glue 1.0 (2.0 is not running on YARN so it cannot be supported as a cluster)

Batch Annotate

Introducing BatchAnnotate for BertEmbeddings and NerDLModel. The goal is to adjust how many rows are processed at once so that accelerated hardware such as GPUs is better utilized by larger, more computationally demanding models.

NOTE: Be careful not to misuse this on CPU; always run a simple benchmark on a subset of your dataset before using it in production.
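
The kind of simple benchmark suggested above could look roughly like the following. This is a generic sketch rather than Spark NLP code: annotate_batch is a stand-in for whatever batched call is being measured, and the helper names batched and benchmark are hypothetical.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, Sequence


def batched(rows: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of rows with at most `size` items each."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def benchmark(annotate_batch: Callable[[Sequence[str]], object],
              rows: Sequence[str],
              batch_sizes: Iterable[int]) -> Dict[int, float]:
    """Time one full pass over rows for each candidate batch size."""
    timings: Dict[int, float] = {}
    for size in batch_sizes:
        started = time.perf_counter()
        for batch in batched(rows, size):
            annotate_batch(batch)  # stand-in for the real batched call
        timings[size] = time.perf_counter() - started
    return timings
```

Running benchmark on a small sample with a few candidate sizes, once on CPU and once on GPU, gives a rough idea of whether larger batches actually help on the target hardware.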

saif-ellafi and others added 30 commits October 5, 2020 14:18
# Conflicts:
#	src/main/scala/com/johnsnowlabs/ml/tensorflow/TensorflowBert.scala
maziyarpanahi and others added 26 commits March 6, 2021 13:50
…or-handler

Added error message handler at catalyst level in NorvigSweetingApproach
…exception-handling-norvigapproach

Added Spark cross version exception handler in Norvig Approach
…ntenceembeds

Added batch annotator in BertSentenceEmbeddings Python
…ddings-batch-annotate

Introducing BatchAnnotate in BertSentenceEmbeddings
@maziyarpanahi (Contributor, Author) commented:

Spark NLP 3.0.0-rc1 has been released.

@maziyarpanahi changed the title from "WIP: 300 release candidate 1" to "300 release candidate 1" on Mar 12, 2021
@KshitizGIT deleted the 300-release-candidate-1 branch on March 2, 2023 09:40

Labels: enhancement; new-feature (Introducing a new feature); on-hold (cannot be merged right away)
