Commit 4a37687
Merge pull request #14227 from JohnSnowLabs/release/533-release-candidate
* example notebook for DocumentCharacterTextSplitter
* example notebook for DeBertaForZeroShotClassification
* example notebooks for BGEEmbeddings and MPNetEmbeddings
* example notebook for MPNetForQuestionAnswering
* example notebook + path for MPNetForSequenceClassification
* Delete examples/python/annotation/text/english/language-translation/Multilingual_Translation_with_M2M100.ipynb
* Add files via upload
* Delete examples/python/annotation/text/english/language-translation/Multilingual_Translation_with_M2M100.ipynb
* fixing colab link for M2M100 notebook

Co-authored-by: Abdullah mubeen <77073730+AbdullahMubeenAnwar@users.noreply.github.com>
2 parents: b046c9c + ec2ab95

1,564 files changed: +19,201 / −5,565 lines

CHANGELOG

Lines changed: 23 additions & 0 deletions
@@ -1,3 +1,26 @@
+========
+5.3.3
+========
+----------------
+New Features & Enhancements
+----------------
+* **NEW:** Introduce UAEEmbeddings for sentence embeddings using Universal AnglE Embedding, aimed at improving semantic textual similarity tasks
+* Introduce critical enhancements and optimizations to the processing of the CoNLL-U format for Dependency Parsers training, including enhanced multiword token handling and improved handling of missing uPos values
+* Add example notebook for `DocumentCharacterTextSplitter`
+* Add example notebook for `DeBertaForZeroShotClassification`
+* Add example notebooks for `BGEEmbeddings` and `MPNetEmbeddings`
+* Add example notebook for `MPNetForQuestionAnswering`
+* Add example notebook for `MPNetForSequenceClassification`
+* Implement cache mechanism for `metadata.json`, enhancing efficiency by avoiding unnecessary downloads
+
+----------------
+Bug Fixes
+----------------
+* Address a bug with serializing ONNX models that lack a `.onnx_data` file, ensuring better reliability in model serialization processes
+* Delete redundant `Multilingual_Translation_with_M2M100.ipynb` notebook entries
+* Fix Colab link for the M2M100 notebook
+
+
 ========
 5.3.2
 ========
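For reference, a minimal usage sketch of the new UAEEmbeddings annotator headlined in this release. This is an illustrative example, not code from the commit itself: the pretrained model name `uae_large_v1` and the `en` language tag are assumptions based on Spark NLP's usual naming, so check the Models Hub for the exact identifier.

```python
# Minimal sketch: UAEEmbeddings (new in 5.3.3) in a Spark NLP pipeline.
# Assumes spark-nlp==5.3.3 and pyspark are installed; the model name
# "uae_large_v1" is an assumption -- verify it on the Models Hub.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UAEEmbeddings
from pyspark.ml import Pipeline

spark = sparknlp.start()

document_assembler = (
    DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)

embeddings = (
    UAEEmbeddings.pretrained("uae_large_v1", "en")
    .setInputCols(["document"])
    .setOutputCol("embeddings")
)

pipeline = Pipeline(stages=[document_assembler, embeddings])
data = spark.createDataFrame(
    [["Spark NLP provides sentence embeddings for similarity tasks."]]
).toDF("text")

# Each row gets one sentence-level embedding vector in the "embeddings" column.
pipeline.fit(data).transform(data).select("embeddings.embeddings").show(truncate=False)
```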

README.md

Lines changed: 45 additions & 44 deletions
@@ -114,6 +114,7 @@ documentation and examples
 - INSTRUCTOR Embeddings (HuggingFace models)
 - E5 Embeddings (HuggingFace models)
 - MPNet Embeddings (HuggingFace models)
+- UAE Embeddings (HuggingFace models)
 - OpenAI Embeddings
 - Sentence & Chunk Embeddings
 - Unsupervised keywords extraction
@@ -165,7 +166,7 @@ To use Spark NLP you need the following requirements:
 
 **GPU (optional):**
 
-Spark NLP 5.3.2 is built with ONNX 1.17.0 and TensorFlow 2.7.1 deep learning engines. The minimum following NVIDIA® software are only required for GPU support:
+Spark NLP 5.3.3 is built with ONNX 1.17.0 and TensorFlow 2.7.1 deep learning engines. The minimum following NVIDIA® software are only required for GPU support:
 
 - NVIDIA® GPU drivers version 450.80.02 or higher
 - CUDA® Toolkit 11.2
@@ -181,7 +182,7 @@ $ java -version
 $ conda create -n sparknlp python=3.7 -y
 $ conda activate sparknlp
 # spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==5.3.2 pyspark==3.3.1
+$ pip install spark-nlp==5.3.3 pyspark==3.3.1
 ```
 
 In Python console or Jupyter `Python3` kernel:
@@ -226,7 +227,7 @@ For more examples, you can visit our dedicated [examples](https://github.com/Joh
 
 ## Apache Spark Support
 
-Spark NLP *5.3.2* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
+Spark NLP *5.3.3* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
 
 | Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
 |-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
@@ -270,7 +271,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github
 
 ## Databricks Support
 
-Spark NLP 5.3.2 has been tested and is compatible with the following runtimes:
+Spark NLP 5.3.3 has been tested and is compatible with the following runtimes:
 
 **CPU:**
 
@@ -343,7 +344,7 @@ Spark NLP 5.3.2 has been tested and is compatible with the following runtimes:
 
 ## EMR Support
 
-Spark NLP 5.3.2 has been tested and is compatible with the following EMR releases:
+Spark NLP 5.3.3 has been tested and is compatible with the following EMR releases:
 
 - emr-6.2.0
 - emr-6.3.0
@@ -393,11 +394,11 @@ Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x,
 ```sh
 # CPU
 
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
 
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
 
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
 ```
 
 The `spark-nlp` has been published to
@@ -406,11 +407,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
 ```sh
 # GPU
 
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.3
 
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.3
 
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.3
 
 ```
 
@@ -420,11 +421,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
 ```sh
 # AArch64
 
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.3
 
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.3
 
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.3
 
 ```
 
@@ -434,11 +435,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
 ```sh
 # M1/M2 (Apple Silicon)
 
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.3
 
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.3
 
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.3
 
 ```
 
@@ -452,7 +453,7 @@ set in your SparkSession:
 spark-shell \
   --driver-memory 16g \
   --conf spark.kryoserializer.buffer.max=2000M \
-  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.2
+  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
 ```
 
 ## Scala
@@ -470,7 +471,7 @@ coordinates:
 <dependency>
     <groupId>com.johnsnowlabs.nlp</groupId>
     <artifactId>spark-nlp_2.12</artifactId>
-    <version>5.3.2</version>
+    <version>5.3.3</version>
 </dependency>
 ```
 
@@ -481,7 +482,7 @@ coordinates:
 <dependency>
     <groupId>com.johnsnowlabs.nlp</groupId>
     <artifactId>spark-nlp-gpu_2.12</artifactId>
-    <version>5.3.2</version>
+    <version>5.3.3</version>
 </dependency>
 ```
 
@@ -492,7 +493,7 @@ coordinates:
 <dependency>
     <groupId>com.johnsnowlabs.nlp</groupId>
     <artifactId>spark-nlp-aarch64_2.12</artifactId>
-    <version>5.3.2</version>
+    <version>5.3.3</version>
 </dependency>
 ```
 
@@ -503,7 +504,7 @@ coordinates:
 <dependency>
     <groupId>com.johnsnowlabs.nlp</groupId>
     <artifactId>spark-nlp-silicon_2.12</artifactId>
-    <version>5.3.2</version>
+    <version>5.3.3</version>
 </dependency>
 ```
 
@@ -513,28 +514,28 @@ coordinates:
 
 ```sbtshell
 // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.3.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.3.3"
 ```
 
 **spark-nlp-gpu:**
 
 ```sbtshell
 // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.3.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.3.3"
 ```
 
 **spark-nlp-aarch64:**
 
 ```sbtshell
 // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.3.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.3.3"
 ```
 
 **spark-nlp-silicon:**
 
 ```sbtshell
 // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.3.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.3.3"
 ```
 
 Maven
@@ -556,7 +557,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through
 Pip:
 
 ```bash
-pip install spark-nlp==5.3.2
+pip install spark-nlp==5.3.3
 ```
 
 Conda:
@@ -585,7 +586,7 @@ spark = SparkSession.builder
     .config("spark.driver.memory", "16G")
     .config("spark.driver.maxResultSize", "0")
     .config("spark.kryoserializer.buffer.max", "2000M")
-    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.2")
+    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3")
     .getOrCreate()
 ```
 
@@ -656,7 +657,7 @@ Use either one of the following options
 - Add the following Maven Coordinates to the interpreter's library list
 
 ```bash
-com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.2
+com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
 ```
 
 - Add a path to pre-built jar from [here](#compiled-jars) in the interpreter's library list making sure the jar is
@@ -667,7 +668,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.2
 Apart from the previous step, install the python module through pip
 
 ```bash
-pip install spark-nlp==5.3.2
+pip install spark-nlp==5.3.3
 ```
 
 Or you can install `spark-nlp` from inside Zeppelin by using Conda:
@@ -695,7 +696,7 @@ launch the Jupyter from the same Python environment:
 $ conda create -n sparknlp python=3.8 -y
 $ conda activate sparknlp
 # spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==5.3.2 pyspark==3.3.1 jupyter
+$ pip install spark-nlp==5.3.3 pyspark==3.3.1 jupyter
 $ jupyter notebook
 ```
 
@@ -712,7 +713,7 @@ export PYSPARK_PYTHON=python3
 export PYSPARK_DRIVER_PYTHON=jupyter
 export PYSPARK_DRIVER_PYTHON_OPTS=notebook
 
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
 ```
 
 Alternatively, you can mix in using `--jars` option for pyspark + `pip install spark-nlp`
@@ -739,7 +740,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
 # -s is for spark-nlp
 # -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage
 # by default they are set to the latest
-!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.3.2
+!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.3.3
 ```
 
 [Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb)
@@ -762,7 +763,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
 # -s is for spark-nlp
 # -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage
 # by default they are set to the latest
-!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.3.2
+!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.3.3
 ```
 
 [Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live
@@ -781,9 +782,9 @@ demo on Kaggle Kernel that performs named entity recognitions by using Spark NLP
 
 3. In `Libraries` tab inside your cluster you need to follow these steps:
 
-    3.1. Install New -> PyPI -> `spark-nlp==5.3.2` -> Install
+    3.1. Install New -> PyPI -> `spark-nlp==5.3.3` -> Install
 
-    3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.2` -> Install
+    3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3` -> Install
 
 4. Now you can attach your notebook to the cluster and use Spark NLP!
 
@@ -834,7 +835,7 @@ A sample of your software configuration in JSON on S3 (must be public access):
         "spark.kryoserializer.buffer.max": "2000M",
         "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
         "spark.driver.maxResultSize": "0",
-        "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.2"
+        "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3"
       }
     }]
 ```
@@ -843,7 +844,7 @@ A sample of AWS CLI to launch EMR cluster:
 
 ```.sh
 aws emr create-cluster \
---name "Spark NLP 5.3.2" \
+--name "Spark NLP 5.3.3" \
 --release-label emr-6.2.0 \
 --applications Name=Hadoop Name=Spark Name=Hive \
 --instance-type m4.4xlarge \
@@ -907,7 +908,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \
   --enable-component-gateway \
   --metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \
   --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \
-  --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.2
+  --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
 ```
 
 2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI.
@@ -950,7 +951,7 @@ spark = SparkSession.builder
     .config("spark.kryoserializer.buffer.max", "2000m")
    .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained")
     .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage")
-    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.2")
+    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3")
     .getOrCreate()
 ```
 
@@ -964,7 +965,7 @@ spark-shell \
   --conf spark.kryoserializer.buffer.max=2000M \
   --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
   --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
-  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.2
+  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
 ```
 
 **pyspark:**
@@ -977,7 +978,7 @@ pyspark \
   --conf spark.kryoserializer.buffer.max=2000M \
   --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
   --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
-  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.2
+  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
 ```
 
 **Databricks:**
@@ -1249,7 +1250,7 @@ spark = SparkSession.builder
     .config("spark.driver.memory", "16G")
     .config("spark.driver.maxResultSize", "0")
     .config("spark.kryoserializer.buffer.max", "2000M")
-    .config("spark.jars", "/tmp/spark-nlp-assembly-5.3.2.jar")
+    .config("spark.jars", "/tmp/spark-nlp-assembly-5.3.3.jar")
     .getOrCreate()
 ```
 
@@ -1258,7 +1259,7 @@ spark = SparkSession.builder
   version (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x)
 - If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need
   to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. (
-  i.e., `hdfs:///tmp/spark-nlp-assembly-5.3.2.jar`)
+  i.e., `hdfs:///tmp/spark-nlp-assembly-5.3.3.jar`)
 
 Example of using pretrained Models and Pipelines in offline:
 
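The README changes above are uniformly version bumps from 5.3.2 to 5.3.3. As a quick post-upgrade sanity check, a minimal sketch (assuming `spark-nlp==5.3.3` and `pyspark` are installed):

```python
# Confirm the upgraded library is the one actually loaded on the classpath.
import sparknlp

spark = sparknlp.start()  # starts (or reuses) a SparkSession with the spark-nlp jar attached
print("Spark NLP version:", sparknlp.version())  # expected: 5.3.3
print("Apache Spark version:", spark.version)
```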
build.sbt

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ name := getPackageName(is_silicon, is_gpu, is_aarch64)
 
 organization := "com.johnsnowlabs.nlp"
 
-version := "5.3.2"
+version := "5.3.3"
 
 (ThisBuild / scalaVersion) := scalaVer
 
conda/meta.yaml

Lines changed: 2 additions & 2 deletions
@@ -1,13 +1,13 @@
 {% set name = "spark-nlp" %}
-{% set version = "5.3.2" %}
+{% set version = "5.3.3" %}
 
 package:
   name: {{ name|lower }}
   version: {{ version }}
 
 source:
   url: https://pypi.io/packages/source/{{ name[0] }}/{{ name }}/spark-nlp-{{ version }}.tar.gz
-  sha256: c98d14d51778c799ef43526e2eaeb5d76245ed1eda2ac3205c11e58cbbe825b6
+  sha256: 8ca71ac2584c0a172ba3e966c0e1072aae15bb9fabe774d99b658d45aac5217d
 
 build:
   noarch: python
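The `sha256` update pins the 5.3.3 source tarball published to PyPI. A small sketch for checking a downloaded tarball against the checksum recorded in `conda/meta.yaml` (the local file path is a placeholder):

```python
# Verify the spark-nlp source tarball against the sha256 pinned above.
import hashlib

EXPECTED = "8ca71ac2584c0a172ba3e966c0e1072aae15bb9fabe774d99b658d45aac5217d"

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large tarballs don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

actual = sha256_of("spark-nlp-5.3.3.tar.gz")  # placeholder: path to the downloaded tarball
print("OK" if actual == EXPECTED else f"Checksum mismatch: {actual}")
```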