
Spark NLP 4.2.5: New CamemBERT for sequence classification, better pipeline validation in LightPipeline, new Databricks 11.3 runtime, new EMR 6.8/6.9 versions with Spark 3.3, updated notebooks with latest TensorFlow 2.11, 400+ state-of-the-art models and many more!

@maziyarpanahi maziyarpanahi released this 16 Dec 09:03
· 964 commits to master since this release

📢 Overview

Spark NLP 4.2.5 🚀 comes with a new CamemBERT for sequence classification annotator (multi-class & multi-label), new pipeline validation for LightPipeline in Python, 26 notebooks updated to use the latest TensorFlow and Transformers libraries, support for the new Databricks 11.3 runtime, support for the new EMR versions 6.8 and 6.9 (only EMR versions with Spark 3.3), 400+ state-of-the-art multi-lingual pretrained models, and bug fixes.

Do not forget to visit Models Hub with 11700+ free and open-source models & pipelines. As always, we would like to thank our community for their feedback, questions, and feature requests. 🎉


⭐ New Features & improvements

  • NEW: Introducing CamemBertForSequenceClassification annotator in Spark NLP 🚀. CamemBertForSequenceClassification can load CamemBERT models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using CamembertForSequenceClassification for PyTorch or TFCamembertForSequenceClassification for TensorFlow in HuggingFace 🤗
  • NEW: Add AnnotatorType validation in Spark NLP LightPipeline. Currently, a misconfiguration of inputCols in an annotator in a pipeline raises an exception when using the transform method, but in LightPipeline it only outputs empty values. This behavior can confuse users, so this change introduces a validation that now raises an exception in LightPipeline too.
    • Add outputAnnotatorType for all annotators in Python
    • Add inputAnnotatorTypes and outputAnnotatorType requirement validation for all subclasses derived from AnnotatorApproach and AnnotatorModel
    • Adding AnnotatorType validation in LightPipeline
  • NEW: Migrate 26 notebooks to import external Transformer models into Spark NLP. These notebooks now come with the latest TensorFlow 2.11.0 and HuggingFace 4.25.1 releases. The notebooks also have TF signatures with data input types explicitly set to guarantee model sanity once imported into Spark NLP
  • Add validation for the number and type of columns set in the TFNerDLGraphBuilder annotator, in an effort to avoid incorrectly defined columns when using Spark NLP annotators in Python
  • Add more details to Alphabet error message in EntityRuler annotator to better guide users
  • Add instructions on how to resolve RocksDB incompatibilities when using Spark NLP with an M1 machine
  • Welcoming new Databricks runtimes support
    • 11.3
    • 11.3 ML
    • 11.3 GPU
  • Welcoming new EMR versions support
    • 6.8.0
    • 6.9.0
  • Refactor and implement better error handling in ResourceDownloader. This change removes getObjectFromS3, allowing the AWS SDK to raise the corresponding error. In addition, this change also refactors ResourceDownloader to reflect the intention of each credential type on the downloader
  • Implement full build and test of all unit tests based on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x major releases
  • Upgrade sbt-assembly to 1.2.0, which comes with lots of performance improvements. This benefits those who are trying to package Spark NLP as a Fat JAR
  • Update sbt to 1.8.0, which brings improvements and bug fixes, mostly CVE fixes
  • Use the new withIncludeScala in assemblyOption instead of value
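
The AnnotatorType validation described in the list above can be pictured with a small sketch in plain Python. This is not Spark NLP's implementation: the Stage class, its fields, and validate_pipeline are hypothetical names chosen for illustration, and the sketch only assumes that each stage declares the annotator types it needs and the one type it produces.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Stage:
    """Hypothetical stand-in for an annotator; not a Spark NLP class."""
    name: str
    input_cols: List[str]
    input_annotator_types: List[str]
    output_col: str
    output_annotator_type: str


def validate_pipeline(stages: List[Stage]) -> None:
    """Raise ValueError when a stage's input columns do not supply the types it requires."""
    produced: Dict[str, str] = {}  # output column name -> annotator type
    for stage in stages:
        available = []
        for col in stage.input_cols:
            if col not in produced:
                raise ValueError(
                    f"{stage.name}: input column '{col}' is not produced by an earlier stage"
                )
            available.append(produced[col])
        missing = [t for t in stage.input_annotator_types if t not in available]
        if missing:
            raise ValueError(
                f"{stage.name}: missing input annotator types {missing}, got {available}"
            )
        produced[stage.output_col] = stage.output_annotator_type


# Illustrative stages: the classifier is wired to "document" twice instead of "document" and "token".
document = Stage("DocumentAssembler", [], [], "document", "document")
tokenizer = Stage("Tokenizer", ["document"], ["document"], "token", "token")
bad_classifier = Stage(
    "Classifier", ["document", "document"], ["document", "token"], "class", "category"
)

validate_pipeline([document, tokenizer])  # well-formed, returns without error
try:
    validate_pipeline([document, tokenizer, bad_classifier])
except ValueError as err:
    print(err)  # reports the missing "token" input instead of silently producing empty output
```

Failing early with a message that names the stage and the missing type is the point of the change: the misconfiguration surfaces at validation time rather than as empty results.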

🐛 Bug Fixes

  • Fix an issue with the BigTextMatcher annotator, where it would not match entities with overlapping definitions. For example, if both lung and lung cancer are defined, lung would not be matched in a given text. This was due to an abstraction error in one of the subclasses of the BigTextMatcher during construction of the underlying data structure
  • Fix indexing issue for RegexTokenizer annotator. If the document was split into sentences, the index of the sentence inside the document was not taken into consideration for the indexes of the tokens. This would lead to further issues down the pipeline, where tokens would be filtered while unpacking them for other Annotators
  • Refactor the Resolvers object in Spark NLP's dependency to avoid the conflict with the Resolvers inside the new sbt
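
The RegexTokenizer indexing fix above comes down to adding each sentence's start position in the document to the token positions found inside that sentence. The following is a minimal sketch of that idea in plain Python, not the annotator's actual code: the function name is hypothetical, tokens are found by matching a regex rather than by splitting on one, and inclusive end indices are an assumption made for the example.

```python
import re
from typing import List, Tuple


def tokenize_with_document_offsets(
    sentences: List[Tuple[int, str]], pattern: str = r"\S+"
) -> List[Tuple[str, int, int]]:
    """Return (token, begin, end) with indices relative to the whole document.

    Each sentence is given as (begin index of the sentence in the document, sentence text).
    """
    tokens = []
    for sentence_begin, text in sentences:
        for match in re.finditer(pattern, text):
            # Shift the sentence-local position by the sentence's own offset.
            begin = sentence_begin + match.start()
            end = sentence_begin + match.end() - 1  # inclusive end (assumption)
            tokens.append((match.group(), begin, end))
    return tokens


document = "Hello world. Second one."
sentences = [(0, "Hello world."), (13, "Second one.")]
tokens = tokenize_with_document_offsets(sentences)

# Every token can be recovered from the document using its own indices.
for token, begin, end in tokens:
    assert document[begin:end + 1] == token
```

Without the sentence offset, the tokens of the second sentence would start again at index 0 and collide with those of the first, which is the kind of mismatch that causes tokens to be dropped later in a pipeline.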

🛑 Known Issues

  • TypedDependencyParserModel annotator fails in Python in this release (will be fixed in 4.2.6 release next week)

Models

Spark NLP 4.2.5 comes with 400+ state-of-the-art pre-trained transformer models in many languages.

Featured Models

Model Name Lang
RoBertaForSequenceClassification roberta_classifier_autotrain_neurips_chanllenge_1287149282 en
RoBertaForSequenceClassification roberta_classifier_autonlp_imdb_rating_625417974 en
RoBertaForSequenceClassification RoBertaForSequenceClassification bn
RoBertaForSequenceClassification roberta_classifier_autotrain_citizen_nlu_hindi_1370952776 hi
RoBertaForSequenceClassification roberta_classifier_detect_acoso_twitter es
RoBertaForQuestionAnswering roberta_qa_deepset_base_squad2 en
RoBertaForQuestionAnswering roberta_qa_icebert is
RoBertaForQuestionAnswering roberta_qa_mrm8488_base_bne_finetuned_s_c es
RoBertaForQuestionAnswering roberta_qa_base_bne_squad2 es
BertEmbeddings bert_embeddings_rbt3 zh
BertEmbeddings bert_embeddings_base_it_cased it
BertEmbeddings bert_embeddings_base_indonesian_522m id
BertEmbeddings bert_embeddings_base_german_uncased de
BertEmbeddings bert_embeddings_base_japanese_char ja
BertEmbeddings bert_embeddings_bangla_base bn
BertEmbeddings bert_embeddings_base_arabertv01 ar

Spark NLP covers the following languages:

English ,Multilingual ,Afrikaans ,Afro-Asiatic languages ,Albanian ,Altaic languages ,American Sign Language ,Amharic ,Arabic ,Argentine Sign Language ,Armenian ,Artificial languages ,Atlantic-Congo languages ,Austro-Asiatic languages ,Austronesian languages ,Azerbaijani ,Baltic languages ,Bantu languages ,Basque ,Basque (family) ,Belarusian ,Bemba (Zambia) ,Bengali, Bangla ,Berber languages ,Bihari ,Bislama ,Bosnian ,Brazilian Sign Language ,Breton ,Bulgarian ,Catalan ,Caucasian languages ,Cebuano ,Celtic languages ,Central Bikol ,Chichewa, Chewa, Nyanja ,Chilean Sign Language ,Chinese ,Chuukese ,Colombian Sign Language ,Congo Swahili ,Croatian ,Cushitic languages ,Czech ,Danish ,Dholuo, Luo (Kenya and Tanzania) ,Dravidian languages ,Dutch ,East Slavic languages ,Eastern Malayo-Polynesian languages ,Efik ,Esperanto ,Estonian ,Ewe ,Fijian ,Finnish ,Finnish Sign Language ,Finno-Ugrian languages ,French ,French-based creoles and pidgins ,Ga ,Galician ,Ganda ,Georgian ,German ,Germanic languages ,Gilbertese ,Greek (modern) ,Greek languages ,Gujarati ,Gun ,Haitian, Haitian Creole ,Hausa ,Hebrew (modern) ,Hiligaynon ,Hindi ,Hiri Motu ,Hungarian ,Icelandic ,Igbo ,Iloko ,Indic languages ,Indo-European languages ,Indo-Iranian languages ,Indonesian ,Irish ,Isoko ,Isthmus Zapotec ,Italian ,Italic languages ,Japanese ,Japanese ,Kabyle ,Kalaallisut, Greenlandic ,Kannada ,Kaonde ,Kinyarwanda ,Kirundi ,Kongo ,Korean ,Kwangali ,Kwanyama, Kuanyama ,Latin ,Latvian ,Lingala ,Lithuanian ,Louisiana Creole ,Lozi ,Luba-Katanga ,Luba-Lulua ,Lunda ,Lushai ,Luvale ,Macedonian ,Malagasy ,Malay ,Malayalam ,Malayo-Polynesian languages ,Maltese ,Manx ,Marathi (Marāṭhī) ,Marshallese ,Mexican Sign Language ,Mon-Khmer languages ,Morisyen ,Mossi ,Multiple languages ,Ndonga ,Nepali ,Niger-Kordofanian languages ,Nigerian Pidgin ,Niuean ,North Germanic languages ,Northern Sotho, Pedi, Sepedi ,Norwegian ,Norwegian Bokmål ,Norwegian Nynorsk ,Nyaneka ,Oromo ,Pangasinan ,Papiamento ,Persian (Farsi) 
,Peruvian Sign Language ,Philippine languages ,Pijin ,Pohnpeian ,Polish ,Portuguese ,Portuguese-based creoles and pidgins ,Punjabi (Eastern) ,Romance languages ,Romanian ,Rundi ,Russian ,Ruund ,Salishan languages ,Samoan ,San Salvador Kongo ,Sango ,Semitic languages ,Serbo-Croatian ,Seselwa Creole French ,Shona ,Sindhi ,Sino-Tibetan languages ,Slavic languages ,Slovak ,Slovene ,Somali ,South Caucasian languages ,South Slavic languages ,Southern Sotho ,Spanish ,Spanish Sign Language ,Sranan Tongo ,Swahili ,Swati ,Swedish ,Tagalog ,Tahitian ,Tai ,Tamil ,Telugu ,Tetela ,Tetun Dili ,Thai ,Tigrinya ,Tiv ,Tok Pisin ,Tonga (Tonga Islands) ,Tonga (Zambia) ,Tsonga ,Tswana ,Tumbuka ,Turkic languages ,Turkish ,Tuvalu ,Tzotzil ,Ukrainian ,Umbundu ,Uralic languages ,Urdu ,Venda ,Venezuelan Sign Language ,Vietnamese ,Wallisian ,Walloon ,Waray (Philippines) ,Welsh ,West Germanic languages ,West Slavic languages ,Western Malayo-Polynesian languages ,Wolaitta, Wolaytta ,Wolof ,Xhosa ,Yapese ,Yiddish ,Yoruba ,Yucatec Maya, Yucateco ,Zande (individual language) ,Zulu

The complete list of all 11700+ models & pipelines in 230+ languages is available on Models Hub

📓 New Notebooks

Spark NLP Notebooks Colab
CamemBertForSequenceClassification HuggingFace in Spark NLP - CamemBertForSequenceClassification Open In Colab

📓 Updated Notebooks

The following notebooks have been updated to use the latest releases of the TensorFlow 2.11 and Hugging Face 4.25 libraries

Spark NLP Notebooks Colab
BertEmbeddings HuggingFace in Spark NLP - BERT Open In Colab
BertSentenceEmbeddings HuggingFace in Spark NLP - BERT Sentence Open In Colab
DistilBertEmbeddings HuggingFace in Spark NLP - DistilBERT Open In Colab
CamemBertEmbeddings HuggingFace in Spark NLP - CamemBERT Open In Colab
RoBertaEmbeddings HuggingFace in Spark NLP - RoBERTa Open In Colab
DeBertaEmbeddings HuggingFace in Spark NLP - DeBERTa Open In Colab
XlmRoBertaEmbeddings HuggingFace in Spark NLP - XLM-RoBERTa Open In Colab
AlbertEmbeddings HuggingFace in Spark NLP - ALBERT Open In Colab
BertForTokenClassification HuggingFace in Spark NLP - BertForTokenClassification Open In Colab
DistilBertForTokenClassification HuggingFace in Spark NLP - DistilBertForTokenClassification Open In Colab
AlbertForTokenClassification HuggingFace in Spark NLP - AlbertForTokenClassification Open In Colab
RoBertaForTokenClassification HuggingFace in Spark NLP - RoBertaForTokenClassification Open In Colab
XlmRoBertaForTokenClassification HuggingFace in Spark NLP - XlmRoBertaForTokenClassification Open In Colab
CamemBertForTokenClassification HuggingFace in Spark NLP - CamemBertForTokenClassification Open In Colab
CamemBertForSequenceClassification HuggingFace in Spark NLP - CamemBertForSequenceClassification Open In Colab
BertForSequenceClassification HuggingFace in Spark NLP - BertForSequenceClassification Open In Colab
DistilBertForSequenceClassification HuggingFace in Spark NLP - DistilBertForSequenceClassification Open In Colab
AlbertForSequenceClassification HuggingFace in Spark NLP - AlbertForSequenceClassification Open In Colab
RoBertaForSequenceClassification HuggingFace in Spark NLP - RoBertaForSequenceClassification Open In Colab
XlmRoBertaForSequenceClassification HuggingFace in Spark NLP - XlmRoBertaForSequenceClassification Open In Colab
AlbertForQuestionAnswering HuggingFace in Spark NLP - AlbertForQuestionAnswering Open In Colab
BertForQuestionAnswering HuggingFace in Spark NLP - BertForQuestionAnswering Open In Colab
DeBertaForQuestionAnswering HuggingFace in Spark NLP - DeBertaForQuestionAnswering Open In Colab
DistilBertForQuestionAnswering HuggingFace in Spark NLP - DistilBertForQuestionAnswering Open In Colab
RoBertaForQuestionAnswering HuggingFace in Spark NLP - RoBertaForQuestionAnswering Open In Colab
XlmRobertaForQuestionAnswering HuggingFace in Spark NLP - XlmRobertaForQuestionAnswering Open In Colab

📖 Documentation


Installation

Python

#PyPI

pip install spark-nlp==4.2.5

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.5

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.5

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.5

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.5

M1

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.5

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.5

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.5

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.5

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>4.2.5</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>4.2.5</version>
</dependency>

spark-nlp-m1:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-m1_2.12</artifactId>
    <version>4.2.5</version>
</dependency>

spark-nlp-aarch64:

<!-- https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64 -->
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>4.2.5</version>
</dependency>

FAT JARs

What's Changed

Contributors

@Damla-Gurbaz @Cabir40 @josejuanmartinez @danilojsl @mhnavid @DevinTDHa @jsl-builder @KshitizGIT @suvrat-joshi @maziyarpanahi @agsfer

New Contributors

Full Changelog: 4.2.4...4.2.5