Release Spark NLP 4.2.5: New CamemBERT for sequence classification, better pipeline validation in LightPipeline, new Databricks 11.3 runtime, new EMR 6.8/6.9 versions with Spark 3.3, updated notebooks with latest TensorFlow 2.11, 400+ state-of-the-art models and many more! · JohnSnowLabs/spark-nlp

📢 Overview

Spark NLP 4.2.5 🚀 comes with a new CamemBERT for sequence classification annotator (multi-class & multi-label), new pipeline validation for LightPipeline in Python, 26 notebooks updated to use the latest TensorFlow and Transformers libraries, support for the new Databricks 11.3 runtime, support for the new EMR versions 6.8 and 6.9 (only EMR versions with Spark 3.3), 400+ state-of-the-art multilingual pretrained models, and bug fixes.

Do not forget to visit Models Hub with 11700+ free and open-source models & pipelines. As always, we would like to thank our community for their feedback, questions, and feature requests. 🎉

⭐ New Features & improvements

NEW: Introducing CamemBertForSequenceClassification annotator in Spark NLP 🚀. CamemBertForSequenceClassification can load CamemBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using CamembertForSequenceClassification for PyTorch or TFCamembertForSequenceClassification for TensorFlow in HuggingFace 🤗
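A minimal sketch of how the new annotator might be wired into a pipeline, assuming the usual DocumentAssembler and Tokenizer stages feed it document and token columns. The function name `build_camembert_pipeline` and the column names are illustrative choices, not part of the release; `model_name` is a placeholder for a model picked from Models Hub. Imports are kept inside the function so the snippet can be loaded without Spark NLP installed.

```python
def build_camembert_pipeline(model_name=None, lang="fr"):
    """Build a Spark ML Pipeline ending in CamemBertForSequenceClassification.

    model_name is a placeholder: pass the name of a model from Models Hub,
    or leave it as None to fall back to the annotator's default model.
    """
    # Local imports: Spark and Spark NLP are only needed when the function is called.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, CamemBertForSequenceClassification

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

    if model_name is None:
        classifier = CamemBertForSequenceClassification.pretrained()
    else:
        classifier = CamemBertForSequenceClassification.pretrained(model_name, lang)

    # The classifier consumes both the document and the token annotations.
    classifier = classifier.setInputCols(["document", "token"]).setOutputCol("class")

    return Pipeline(stages=[document, tokenizer, classifier])
```

Calling `build_camembert_pipeline().fit(df).transform(df)` on a DataFrame with a `text` column would then add a `class` column holding the predicted labels.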
NEW: Add AnnotatorType validation in Spark NLP LightPipeline. Until now, a misconfigured inputCols in an annotator raised an exception when the pipeline used the transform method, but LightPipeline only output empty values. This behavior can confuse users, so this change introduces a validation that now raises an exception in LightPipeline too.
- Add outputAnnotatorType for all annotators in Python
- Add inputAnnotatorTypes and outputAnnotatorType requirement validation for all subclasses derived from AnnotatorApproach and AnnotatorModel
- Add AnnotatorType validation in LightPipeline
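The idea behind this validation can be sketched in plain Python. This is a simplified illustration, not Spark NLP's implementation: the `Stage` class and `validate_pipeline` function are hypothetical stand-ins, and the check only compares the annotator types produced by earlier stages with the types each stage declares as required.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """Hypothetical stand-in for an annotator; not a Spark NLP class."""
    name: str
    output_col: str
    output_type: str
    input_cols: list = field(default_factory=list)
    input_types: list = field(default_factory=list)  # required annotator types


def validate_pipeline(stages):
    """Raise ValueError when a stage's input columns carry the wrong annotator types."""
    produced = {}  # output column name -> annotator type
    for stage in stages:
        if stage.input_types:
            missing = [c for c in stage.input_cols if c not in produced]
            if missing:
                raise ValueError(f"{stage.name}: unknown input column(s) {missing}")
            actual = sorted(produced[c] for c in stage.input_cols)
            expected = sorted(stage.input_types)
            if actual != expected:
                raise ValueError(
                    f"{stage.name}: expected input types {expected}, got {actual}"
                )
        # Entry stages (no required input types) read raw text and are not checked.
        produced[stage.output_col] = stage.output_type


document = Stage("DocumentAssembler", "document", "document")
tokenizer = Stage("Tokenizer", "token", "token", ["document"], ["document"])
good_classifier = Stage(
    "Classifier", "class", "category", ["document", "token"], ["document", "token"]
)
# Misconfigured on purpose: wired to the document column twice instead of document + token.
bad_classifier = Stage(
    "Classifier", "class", "category", ["document", "document"], ["document", "token"]
)

validate_pipeline([document, tokenizer, good_classifier])  # passes silently
try:
    validate_pipeline([document, tokenizer, bad_classifier])
except ValueError as err:
    print(err)
```

Failing fast like this surfaces the wiring mistake at the point where it happens, instead of leaving the user with empty output to debug.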
NEW: Migrate 26 notebooks for importing external Transformer models into Spark NLP. These notebooks now come with the latest TensorFlow 2.11.0 and HuggingFace 4.25.1 releases. The notebooks also have TF signatures with data input types explicitly set to guarantee model sanity once imported into Spark NLP
Add validation for the number and type of columns set in the TFNerDLGraphBuilder annotator, to avoid wrong column definitions when using Spark NLP annotators in Python
Add more details to Alphabet error message in EntityRuler annotator to better guide users
Add instructions on how to resolve RocksDB incompatibilities when using Spark NLP with an M1 machine
Welcoming new Databricks runtimes support
- 11.3
- 11.3 ML
- 11.3 GPU
Welcoming new EMR versions support
- 6.8.0
- 6.9.0
Refactor ResourceDownloader and implement better error handling. This change removes getObjectFromS3, allowing the AWS SDK to raise the corresponding error. In addition, it refactors ResourceDownloader to reflect the intention of each credential type on the downloader
Implement full build and test of all unit tests based on the Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x major releases
Upgrade sbt-assembly to 1.2.0, which comes with many performance improvements. This benefits those who package Spark NLP as a Fat JAR
Update sbt to 1.8.0, which brings improvements and bug fixes, but mostly CVE fixes:
- Updates to Coursier 2.1.0-RC1 to address GHSA-wv7w-rj2x-556x
- Updates to Ivy 2.3.0-sbt-a8f9eb5bf09d0539ea3658a2c2d4e09755b5133e to address GHSA-wv7w-rj2x-556x
Use the new withIncludeScala in assemblyOption instead of value

🐛 Bug Fixes

Fix an issue with the BigTextMatcher annotator, where it would not match entities with overlapping definitions. For example, if both lung and lung cancer are defined, lung would not be matched in a given text. This was due to an abstraction error in one of the subclasses of the BigTextMatcher during construction of the underlying data structure
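To illustrate the expected behavior for overlapping definitions, here is a small word-level trie matcher in plain Python. It is a sketch of the general technique only and makes no claim about how BigTextMatcher is implemented; `build_trie` and `match_all` are hypothetical helper names.

```python
END = object()  # sentinel key marking the end of a phrase; cannot clash with a word


def build_trie(phrases):
    """Build a word-level trie from whitespace-separated phrases."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split():
            node = node.setdefault(word, {})
        node[END] = phrase
    return root


def match_all(trie, text):
    """Return (phrase, first_word_index, last_word_index) for every match,
    including phrases that are prefixes of longer phrases."""
    words = text.split()
    matches = []
    for start in range(len(words)):
        node = trie
        for pos in range(start, len(words)):
            node = node.get(words[pos])
            if node is None:
                break
            if END in node:
                matches.append((node[END], start, pos))
    return matches


trie = build_trie(["lung", "lung cancer"])
print(match_all(trie, "early lung cancer screening"))
# [('lung', 1, 1), ('lung cancer', 1, 2)]
```

The key point is that reaching the end of a shorter phrase records a match without stopping the walk, so the longer phrase is still found.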
Fix indexing issue for RegexTokenizer annotator. If the document was split into sentences, the index of the sentence inside the document was not taken into consideration for the indexes of the tokens. This would lead to further issues down the pipeline, where tokens would be filtered while unpacking them for other Annotators
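The RegexTokenizer indexing fix comes down to adding the sentence's offset within the document to each token's offset within the sentence. The following plain-Python sketch shows that arithmetic; it is not the annotator's code, the `tokenize` helper and its `\S+` token pattern are assumptions made for illustration, and end offsets are treated as inclusive.

```python
import re


def tokenize(sentences, token_pattern=r"\S+"):
    """Tokenize sentences given as (begin_offset_in_document, sentence_text) pairs.

    Returns (token_text, begin, end) with offsets relative to the whole
    document and an inclusive end offset.
    """
    tokens = []
    for sentence_begin, sentence_text in sentences:
        for match in re.finditer(token_pattern, sentence_text):
            begin = sentence_begin + match.start()   # shift by the sentence offset
            end = sentence_begin + match.end() - 1   # inclusive end
            tokens.append((match.group(), begin, end))
    return tokens


document = "Hello world. Second one."
sentences = [(0, "Hello world."), (13, "Second one.")]

for text, begin, end in tokenize(sentences):
    # Every token can be recovered from the document using its offsets.
    assert document[begin:end + 1] == text
    print(text, begin, end)
```

Without the `sentence_begin` shift, the tokens of the second sentence would report offsets starting at 0 again, which is the kind of mismatch that caused tokens to be filtered out further down the pipeline.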
Refactor the Resolvers object in Spark NLP's dependency to avoid the conflict with the Resolvers inside the new sbt

🛑 Known Issues

TypedDependencyParserModel annotator fails in Python in this release (will be fixed in 4.2.6 release next week)

Models

Spark NLP 4.2.5 comes with 400+ state-of-the-art pre-trained transformer models in many languages.

Featured Models

Model	Name	Lang
RoBertaForSequenceClassification	roberta_classifier_autotrain_neurips_chanllenge_1287149282	`en`
RoBertaForSequenceClassification	roberta_classifier_autonlp_imdb_rating_625417974	`en`
RoBertaForSequenceClassification	RoBertaForSequenceClassification	`bn`
RoBertaForSequenceClassification	roberta_classifier_autotrain_citizen_nlu_hindi_1370952776	`hi`
RoBertaForSequenceClassification	roberta_classifier_detect_acoso_twitter	`es`
RoBertaForQuestionAnswering	roberta_qa_deepset_base_squad2	`en`
RoBertaForQuestionAnswering	roberta_qa_icebert	`is`
RoBertaForQuestionAnswering	roberta_qa_mrm8488_base_bne_finetuned_s_c	`es`
RoBertaForQuestionAnswering	roberta_qa_base_bne_squad2	`es`
BertEmbeddings	bert_embeddings_rbt3	`zh`
BertEmbeddings	bert_embeddings_base_it_cased	`it`
BertEmbeddings	bert_embeddings_base_indonesian_522m	`id`
BertEmbeddings	bert_embeddings_base_german_uncased	`de`
BertEmbeddings	bert_embeddings_base_japanese_char	`ja`
BertEmbeddings	bert_embeddings_bangla_base	`bn`
BertEmbeddings	bert_embeddings_base_arabertv01	`ar`

Spark NLP covers the following languages:

English; Multilingual; Afrikaans; Afro-Asiatic languages; Albanian; Altaic languages; American Sign Language; Amharic; Arabic; Argentine Sign Language; Armenian; Artificial languages; Atlantic-Congo languages; Austro-Asiatic languages; Austronesian languages; Azerbaijani; Baltic languages; Bantu languages; Basque; Basque (family); Belarusian; Bemba (Zambia); Bengali, Bangla; Berber languages; Bihari; Bislama; Bosnian; Brazilian Sign Language; Breton; Bulgarian; Catalan; Caucasian languages; Cebuano; Celtic languages; Central Bikol; Chichewa, Chewa, Nyanja; Chilean Sign Language; Chinese; Chuukese; Colombian Sign Language; Congo Swahili; Croatian; Cushitic languages; Czech; Danish; Dholuo, Luo (Kenya and Tanzania); Dravidian languages; Dutch; East Slavic languages; Eastern Malayo-Polynesian languages; Efik; Esperanto; Estonian; Ewe; Fijian; Finnish; Finnish Sign Language; Finno-Ugrian languages; French; French-based creoles and pidgins; Ga; Galician; Ganda; Georgian; German; Germanic languages; Gilbertese; Greek (modern); Greek languages; Gujarati; Gun; Haitian, Haitian Creole; Hausa; Hebrew (modern); Hiligaynon; Hindi; Hiri Motu; Hungarian; Icelandic; Igbo; Iloko; Indic languages; Indo-European languages; Indo-Iranian languages; Indonesian; Irish; Isoko; Isthmus Zapotec; Italian; Italic languages; Japanese; Kabyle; Kalaallisut, Greenlandic; Kannada; Kaonde; Kinyarwanda; Kirundi; Kongo; Korean; Kwangali; Kwanyama, Kuanyama; Latin; Latvian; Lingala; Lithuanian; Louisiana Creole; Lozi; Luba-Katanga; Luba-Lulua; Lunda; Lushai; Luvale; Macedonian; Malagasy; Malay; Malayalam; Malayo-Polynesian languages; Maltese; Manx; Marathi (Marāṭhī); Marshallese; Mexican Sign Language; Mon-Khmer languages; Morisyen; Mossi; Multiple languages; Ndonga; Nepali; Niger-Kordofanian languages; Nigerian Pidgin; Niuean; North Germanic languages; Northern Sotho, Pedi, Sepedi; Norwegian; Norwegian Bokmål; Norwegian Nynorsk; Nyaneka; Oromo; Pangasinan; Papiamento; Persian (Farsi); Peruvian Sign Language; Philippine languages; Pijin; Pohnpeian; Polish; Portuguese; Portuguese-based creoles and pidgins; Punjabi (Eastern); Romance languages; Romanian; Rundi; Russian; Ruund; Salishan languages; Samoan; San Salvador Kongo; Sango; Semitic languages; Serbo-Croatian; Seselwa Creole French; Shona; Sindhi; Sino-Tibetan languages; Slavic languages; Slovak; Slovene; Somali; South Caucasian languages; South Slavic languages; Southern Sotho; Spanish; Spanish Sign Language; Sranan Tongo; Swahili; Swati; Swedish; Tagalog; Tahitian; Tai; Tamil; Telugu; Tetela; Tetun Dili; Thai; Tigrinya; Tiv; Tok Pisin; Tonga (Tonga Islands); Tonga (Zambia); Tsonga; Tswana; Tumbuka; Turkic languages; Turkish; Tuvalu; Tzotzil; Ukrainian; Umbundu; Uralic languages; Urdu; Venda; Venezuelan Sign Language; Vietnamese; Wallisian; Walloon; Waray (Philippines); Welsh; West Germanic languages; West Slavic languages; Western Malayo-Polynesian languages; Wolaitta, Wolaytta; Wolof; Xhosa; Yapese; Yiddish; Yoruba; Yucatec Maya, Yucateco; Zande (individual language); Zulu

The complete list of all 11700+ models & pipelines in 230+ languages is available on Models Hub

📓 New Notebooks

Spark NLP	Notebooks	Colab
CamemBertForSequenceClassification	HuggingFace in Spark NLP - CamemBertForSequenceClassification

📓 Updated Notebooks

The following notebooks have been updated to use the latest releases of the TensorFlow 2.11 and Hugging Face 4.25 libraries

Spark NLP	Notebooks	Colab
BertEmbeddings	HuggingFace in Spark NLP - BERT
BertSentenceEmbeddings	HuggingFace in Spark NLP - BERT Sentence
DistilBertEmbeddings	HuggingFace in Spark NLP - DistilBERT
CamemBertEmbeddings	HuggingFace in Spark NLP - CamemBERT
RoBertaEmbeddings	HuggingFace in Spark NLP - RoBERTa
DeBertaEmbeddings	HuggingFace in Spark NLP - DeBERTa
XlmRoBertaEmbeddings	HuggingFace in Spark NLP - XLM-RoBERTa
AlbertEmbeddings	HuggingFace in Spark NLP - ALBERT
BertForTokenClassification	HuggingFace in Spark NLP - BertForTokenClassification
DistilBertForTokenClassification	HuggingFace in Spark NLP - DistilBertForTokenClassification
AlbertForTokenClassification	HuggingFace in Spark NLP - AlbertForTokenClassification
RoBertaForTokenClassification	HuggingFace in Spark NLP - RoBertaForTokenClassification
XlmRoBertaForTokenClassification	HuggingFace in Spark NLP - XlmRoBertaForTokenClassification
CamemBertForTokenClassification	HuggingFace in Spark NLP - CamemBertForTokenClassification
CamemBertForSequenceClassification	HuggingFace in Spark NLP - CamemBertForSequenceClassification
BertForSequenceClassification	HuggingFace in Spark NLP - BertForSequenceClassification
DistilBertForSequenceClassification	HuggingFace in Spark NLP - DistilBertForSequenceClassification
AlbertForSequenceClassification	HuggingFace in Spark NLP - AlbertForSequenceClassification
RoBertaForSequenceClassification	HuggingFace in Spark NLP - RoBertaForSequenceClassification
XlmRoBertaForSequenceClassification	HuggingFace in Spark NLP - XlmRoBertaForSequenceClassification
AlbertForQuestionAnswering	HuggingFace in Spark NLP - AlbertForQuestionAnswering
BertForQuestionAnswering	HuggingFace in Spark NLP - BertForQuestionAnswering
DeBertaForQuestionAnswering	HuggingFace in Spark NLP - DeBertaForQuestionAnswering
DistilBertForQuestionAnswering	HuggingFace in Spark NLP - DistilBertForQuestionAnswering
RoBertaForQuestionAnswering	HuggingFace in Spark NLP - RoBertaForQuestionAnswering
XlmRoBertaForQuestionAnswering	HuggingFace in Spark NLP - XlmRoBertaForQuestionAnswering

You can visit Import Transformers in Spark NLP
You can visit Spark NLP Workshop for 100+ examples

📖 Documentation

TF Hub & HuggingFace to Spark NLP
Models Hub with new models
Spark NLP documentation
Spark NLP Scala APIs
Spark NLP Python APIs
Spark NLP Workshop notebooks
Spark NLP publications
Spark NLP in Action
Spark NLP training certification notebooks for Google Colab and Databricks
Spark NLP Display for visualization of different types of annotations
Discussions: engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

#PyPI

pip install spark-nlp==4.2.5

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.5

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.5

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.5

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.5

M1

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.5

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.5

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.5

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.5
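
The package coordinates above all follow the same pattern, so they can also be set programmatically through Spark's `spark.jars.packages` option instead of the `--packages` flag. The sketch below assumes that approach; `maven_coordinate` and `start_session` are hypothetical helper names, and the session settings are kept to the bare minimum rather than tuned.

```python
GROUP_ID = "com.johnsnowlabs.nlp"
SCALA_VERSION = "2.12"
SPARK_NLP_VERSION = "4.2.5"

# Artifact names as listed in the Spark Packages section of these notes.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "m1": "spark-nlp-m1",
    "aarch64": "spark-nlp-aarch64",
}


def maven_coordinate(variant="cpu", version=SPARK_NLP_VERSION):
    """Return the group:artifact:version coordinate for a Spark NLP variant."""
    artifact = ARTIFACTS[variant]
    return f"{GROUP_ID}:{artifact}_{SCALA_VERSION}:{version}"


def start_session(variant="cpu"):
    """Start a SparkSession that pulls the chosen Spark NLP package (needs pyspark)."""
    from pyspark.sql import SparkSession  # local import: only needed when called

    return (
        SparkSession.builder
        .appName("Spark NLP")
        .config("spark.jars.packages", maven_coordinate(variant))
        .getOrCreate()
    )


print(maven_coordinate("gpu"))
# com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.5
```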

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>4.2.5</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>4.2.5</version>
</dependency>

spark-nlp-m1:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-m1_2.12</artifactId>
    <version>4.2.5</version>
</dependency>

spark-nlp-aarch64:

<!-- https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64 -->
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>4.2.5</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.2.5.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.2.5.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.5.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-4.2.5.jar

What's Changed

Contributors

@Damla-Gurbaz @Cabir40 @josejuanmartinez @danilojsl @mhnavid @DevinTDHa @jsl-builder @KshitizGIT @suvrat-joshi @maziyarpanahi @agsfer

New Contributors

@mhnavid made their first contribution in #12977

Full Changelog: 4.2.4...4.2.5
