300 release candidate 1 #2324

maziyarpanahi · 2021-02-22T13:26:36Z

This PR proposes support for Apache Spark 3.0.1 / Scala 2.12.10 and TensorFlow 2.3.x via the native Java library.

Apache Spark 3.0.x support

Since the majority of the Apache Spark community is still on 2.4.x (some even on 2.3.x), we have to make sure any changes to support Apache Spark 3.0.x on Scala 2.12 are compatible with the other two supported versions.

Unit tests for Scala and Python must pass for Apache Spark 2.3 on Scala 2.11
Unit tests for Scala and Python must pass for Apache Spark 2.4 on Scala 2.11
Unit tests for Scala and Python must pass for Apache Spark 3.0.1 on Scala 2.12
Update naming in packages/artifacts based on Apache Spark 3.0 via condition
Spark NLP start() functions in Python and Scala now take a spark24 option, like the existing spark23 option, to start the SparkSession on the appropriate version. (Apache Spark 3.0.x will be the default from now on.)

NOTE: At the moment some pretrained models/pipelines compiled on Scala 2.11 are not passing the unit tests on Scala 2.12. We already have a list of them and will start retraining (extracting their weights more precisely) on Scala 2.12.

Spark NLP packages

Starting with the Spark NLP 3.0.0 release, the default Spark NLP packages will be based on Apache Spark 3.0.x and Scala 2.12. This means the following packages will target Apache Spark 3.0.x/Scala 2.12 by default:

spark-nlp
spark-nlp-gpu

For Apache Spark 2.3.x and Apache Spark 2.4.x:

spark-nlp-spark23
spark-nlp-spark24
spark-nlp-gpu-spark23
spark-nlp-gpu-spark24

Spark NLP start functions in Python and Scala are also updated for these changes.
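
The package naming above can be summarized as a small lookup. The following is a minimal sketch, not Spark NLP's actual implementation: `spark_nlp_artifact` is a hypothetical helper, and its `gpu`, `spark23`, and `spark24` flags are assumptions chosen to mirror the package suffixes listed in this section.

```python
def spark_nlp_artifact(gpu: bool = False,
                       spark23: bool = False,
                       spark24: bool = False) -> str:
    """Return the package name for the requested Spark line (hypothetical helper).

    With no flags set, the default package (Apache Spark 3.0.x / Scala 2.12)
    is returned, matching the naming scheme listed in this PR.
    """
    if spark23 and spark24:
        raise ValueError("Choose at most one of spark23 and spark24")

    name = "spark-nlp"
    if gpu:
        name += "-gpu"
    if spark23:
        name += "-spark23"
    elif spark24:
        name += "-spark24"
    return name


if __name__ == "__main__":
    # Print the package name for every valid combination of flags.
    for gpu in (False, True):
        print(spark_nlp_artifact(gpu=gpu))
        print(spark_nlp_artifact(gpu=gpu, spark23=True))
        print(spark_nlp_artifact(gpu=gpu, spark24=True))
```

Rejecting `spark23` and `spark24` together is a design choice of this sketch; the PR does not say how the real start() functions treat that combination.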

Native TensorFlow 2.x support

New APIs, Compile, Runtime, and Load

The result of the model is not important in this section; we care only about whether a model exported in TF v1 can be loaded in TF v2:

Evaluation passed

Evaluation failed

Save, Load, Pretrained model(s), and actual Results

We care about the result of the model, whether or not the embeddings are still the same, translations are still the same, etc.

Evaluation passed

Evaluation failed

~~MarianTransformer is not outputting desired translation in the unit test. There must be an issue with offset and the new NdArray buffers which are different than Java buffers with flip~~

Multilingual USE models require SentencePiece. Maybe exporting these models in TF v2 with TF Text could be a solution.

Models require changes

Part of Speech models to be compatible with Apache Spark 3.0.1 on Scala 2.12.x
WordSegmenter models to be compatible with Apache Spark 3.0.1 on Scala 2.12.x
tfhub_use_multi
tfhub_use_multi_lg

Deprecated Models

These don't have TF2.0 SavedModel on TF Hub yet: (issue reported: tensorflow/hub#735)

tfhub_use_xling_many
tfhub_use_xling_en_es
tfhub_use_xling_en_fr
tfhub_use_xling_en_de

Additional steps

Configurable TensorFlow Config (in TF v1 we used a binary array; this should work the same way as in TensorFlow, such as setting inter_op_parallelism_threads and other params via configProtoBytes)

Example:

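// Builds a ConfigProto that limits TensorFlow to a single thread
// for both inter-op and intra-op parallelism.
private static ConfigProto singleThreadConfigProto() {
    return ConfigProto.newBuilder()
        .setInterOpParallelismThreads(1)
        .setIntraOpParallelismThreads(1)
        .build();
}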

Explore distributed training and prediction over multiple GPUs: https://www.tensorflow.org/guide/distributed_training

Final Tests

Devices

CPU only
CPU with MKL (requires testing)
GPU
Fat JAR for offline use in all of the above configurations

ENVs

Training and predicting on:

Local:

macOS
Windows 8 and 10
Linux (Ubuntu 16, 18, 20 - Debian - ?)

Others:

Cloudera 6.x (this should be Spark 2.4.x only)

Databricks:

AWS:

EMR 5.20.0 (Apache Spark 2.4.0)
EMR 5.29.0 (Apache Spark 2.4.4)
EMR 5.30.1 (Apache Spark 2.4.5)
EMR 5.31.0 (Apache Spark 2.4.6)
EMR 5.32.0 (Apache Spark 2.4.7)
EMR 6.0.0 (Apache Spark 3.0.x)
EMR 6.1.0 (Apache Spark 3.0.0 / Hadoop 3.2.1)
EMR 6.2.0 (Apache Spark 3.0.1 / Hadoop 3.2.1)
AWS Glue 1.0 (2.0 is not running on YARN so it cannot be supported as a cluster)

Batch Annotate

Introducing BatchAnnotate for BertEmbeddings and NerDLModel. The goal is to adjust the throughput of rows so that accelerated hardware such as GPUs is better utilized by larger, more computationally demanding models.

NOTE: Be careful not to misuse this on CPU. Always run a simple benchmark on a subset of your dataset before using it in production.
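
The idea of grouping rows before handing them to a model can be illustrated with a short, library-independent sketch. It is not the BatchAnnotate implementation from this PR; `batched` and `batch_size` are hypothetical names used only to show how rows might be grouped into fixed-size batches, with a smaller final batch when the row count is not a multiple of the batch size.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` rows (hypothetical helper)."""
    if batch_size < 1:
        # Raised on first iteration, because this function is a generator.
        raise ValueError("batch_size must be at least 1")

    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the remaining rows as a final, smaller batch.
    if batch:
        yield batch


if __name__ == "__main__":
    sentences = ["first row", "second row", "third row", "fourth row", "fifth row"]
    for group in batched(sentences, batch_size=2):
        print(len(group), group)
```

For the benchmark suggested in the note, one option is to time the same subset of rows with a few different batch sizes and compare the results before settling on a value.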


maziyarpanahi · 2021-03-12T13:11:24Z

Spark NLP 3.0.0-rc1 has been released.

saif-ellafi and others added 30 commits October 5, 2020 14:18

First design for batch annotators

80eb077

Uncommented forgotten comment

8a0928e

Renamed interfaces

719b8c4

Small cleanup

86ac55c

containsIndex to reuse existing lookup function

c655d32

Improved batching logic, bert support, can be better

2a7aa78

Revert unsuccessful improvements

02aee56

Test bert in batch mode

537dc1e

Fixed bert vector matching

1d0ecbf

Minor improvements

a2ad41a

Minor improvements

2e17707

A few upstream changes

7088ecc

Save repetition

d0c7bf2

Setting recommended batch size values

be38d90

Matched accuracy with master, deal with empties

7cb9ed0

Replaced group by with a presized array

b6f1766

Merge branch 'annotator-algorithm-optimization' into batching-bert-test

2b19682

# Conflicts: # src/main/scala/com/johnsnowlabs/ml/tensorflow/TensorflowBert.scala

Sentence Bert adapted to batched mode

dbc1744

Merge branch '270-release-candidate' into annotator-algorithm-optimiz…

3f23250

…ation

Merge branch 'master' into annotator-algorithm-optimization

c96d94b

Update TensorflowBert.scala based on 2.6.3

2913306

Merge branch 'master' into annotator-algorithm-optimization

81bb953

Merge branch 'master' into annotator-algorithm-optimization

cebede8

Merge branch 'master' into annotator-algorithm-optimization

bb576f1

Merge branch '270-release-candidate' into annotator-algorithm-optimiz…

bdf7fc9

…ation

Merge branch 'master' into annotator-algorithm-optimization

c0ded3f

Added Spark3.0.1 with Scala 2.12 compilation

e9dc4ab

Removing pmml False value

cb8c6ff

back2 spark2 test

81c921b

Writing POS to TXT Test Added

963cb64

maziyarpanahi and others added 26 commits March 6, 2021 13:50

Update supported Python versions [skip ci]

ab1131d

Set default Spark version to 3.1.1

950aefd

Added error message handler at catalyst level

0fd2e36

Merge pull request #2370 from JohnSnowLabs/bugfix/norvig-sweeting-err…

f4af72b

…or-handler Added error message handler at catalyst level in NorvigSweetingApproach

Merge branch '300-release-candidate-1' into feature/bert-sentence-emb…

45da21e

…eddings-batch-annotate

Added Spark cross version exception handler in Norvig Approach

4448e1f

Refactoring match statement

95849b2

Merge pull request #2371 from JohnSnowLabs/bugfix/spark-multiversion-…

e605529

…exception-handling-norvigapproach Added Spark cross version exception handler in Norvig Approach

Merge branch '300-release-candidate-1' into feature/bert-sentence-emb…

0ea70be

…eddings-batch-annotate

Add BatchAnnotate to BertSentenceEmbeddings [skip ci]

cdaf241

Update unit tests on BertEmbeddings [skip ci]

34dbcf5

Update benchmark unit test in BertSentenceEmbeddings

a116ea3

Fix misspelling in unit test [skip ci]

cbe3cd4

Added batch annotator in BertSentenceEmbeddings Python

107f36a

Merge pull request #2398 from JohnSnowLabs/feature/batch-annot-bertse…

5b422ac

…ntenceembeds Added batch annotator in BertSentenceEmbeddings Python

Merge branch 'master' into bug-fix-ner-confidence

eb4f948

add try catch instead of options in NER confidence

a82bd68

fix shadding package

4e75335

Merge pull request #2372 from JohnSnowLabs/feature/bert-sentence-embe…

1eeb8a0

…ddings-batch-annotate Introducing BatchAnnotate in BertSentenceEmbeddings

Merge pull request #2445 from JohnSnowLabs/bug-fix-ner-confidence

7de751f

Bug fix ner confidence

Merge pull request #2446 from JohnSnowLabs/bugfix/shading_package_name

0afa07b

fix shadding package

boosted sdk

4e2b8d2

Merge branch '300-release-candidate-1' of https://github.com/JohnSnow…

e518fa1

…Labs/spark-nlp into 300-release-candidate-1

Remove unused keys from build.sbt

4a1bee7

Remove -target:jvm-1.8 from build.sbt

e308f4c

Make java target conditional

7bd3c71

maziyarpanahi closed this Mar 12, 2021

maziyarpanahi changed the title ~~WIP: 300 release candidate 1~~ 300 release candidate 1 Mar 12, 2021

KshitizGIT deleted the 300-release-candidate-1 branch March 2, 2023 09:40
