Spark NLP 4.2.5: New CamemBERT for sequence classification, better pipeline validation in LightPipeline, new Databricks 11.3 runtime, new EMR 6.8/6.9 versions with Spark 3.3, updated notebooks with latest TensorFlow 2.11, 400+ state-of-the-art models and many more!
📢 Overview
Spark NLP 4.2.5 🚀 comes with a new CamemBERT for sequence classification annotator (multi-class & multi-label), new pipeline validation for LightPipeline in Python, 26 updated notebooks that use the latest TensorFlow and Transformers libraries, support for the new Databricks 11.3 runtime, support for the new EMR 6.8 and 6.9 versions (only EMR versions with Spark 3.3), 400+ state-of-the-art multilingual pretrained models, and bug fixes.
Do not forget to visit Models Hub with 11700+ free and open-source models & pipelines. As always, we would like to thank our community for their feedback, questions, and feature requests. 🎉
⭐ New Features & improvements
- NEW: Introducing the CamemBertForSequenceClassification annotator in Spark NLP 🚀. CamemBertForSequenceClassification can load CamemBERT models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using CamembertForSequenceClassification for PyTorch or TFCamembertForSequenceClassification for TensorFlow in HuggingFace 🤗
- NEW: Add AnnotatorType validation in Spark NLP LightPipeline. Currently, a misconfiguration of inputCols in an annotator in a pipeline raises an exception when using the transform method, but in LightPipeline it only outputs empty values. Because this behavior can confuse users, this change introduces a validation that now raises an exception in LightPipeline too.
  - Add outputAnnotatorType for all annotators in Python
  - Add inputAnnotatorTypes and outputAnnotatorType requirement validation for all subclasses derived from AnnotatorApproach and AnnotatorModel
  - Add AnnotatorType validation in LightPipeline
- NEW: Migrate 26 notebooks to import external Transformer models into Spark NLP. These notebooks now come with the latest TensorFlow 2.11.0 and HuggingFace 4.25.1 releases. The notebooks also have TF signatures with data input types explicitly set to guarantee model sanity once imported into Spark NLP
- Add validation for the number and type of columns set in the TFNerDLGraphBuilder annotator, to avoid wrong column definitions when using Spark NLP annotators in Python
- Add more details to the Alphabet error message in the EntityRuler annotator to better guide users
- Add instructions on how to resolve RocksDB incompatibilities when using Spark NLP with an M1 machine
- Welcoming new Databricks runtimes support
  - 11.3
  - 11.3 ML
  - 11.3 GPU
- Welcoming new EMR versions support
  - 6.8.0
  - 6.9.0
- Refactor and implement better error handling in ResourceDownloader. This change removes getObjectFromS3, allowing the AWS SDK to raise the corresponding error. In addition, this change refactors ResourceDownloader to reflect the intention of each credential type on the downloader
- Implement full build and test of all unit tests based on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x major releases
- Upgrade sbt-assembly to 1.2.0, which comes with lots of performance improvements. This benefits those who are trying to package Spark NLP as a Fat JAR
- Update sbt to 1.8.0 with improvements and bug fixes, but mostly for CVE fixes:
  - Updates to Coursier 2.1.0-RC1 to address GHSA-wv7w-rj2x-556x
  - Updates to Ivy 2.3.0-sbt-a8f9eb5bf09d0539ea3658a2c2d4e09755b5133e to address GHSA-wv7w-rj2x-556x
- Use the new withIncludeScala in assemblyOption instead of value
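The AnnotatorType validation for LightPipeline described above can be pictured with a small plain-Python sketch. It does not use or reproduce Spark NLP's implementation: `Stage` and `validate_stages` are hypothetical names introduced only for illustration, and the annotator types assigned to each stage are assumptions made for the example. The idea shown is simply that each stage declares the annotator types it needs, and a check fails loudly when its input columns do not carry them, instead of silently producing empty output.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Stage:
    """Hypothetical stand-in for an annotator; this is not a Spark NLP class."""
    name: str
    input_cols: List[str]
    output_col: str
    input_annotator_types: List[str]  # types the stage requires on its inputs
    output_annotator_type: str        # type the stage writes to output_col


def validate_stages(stages: List[Stage]) -> None:
    """Raise ValueError if a stage's input columns lack a required annotator type."""
    produced: Dict[str, str] = {}  # column name -> annotator type it carries
    for stage in stages:
        available = [produced.get(col) for col in stage.input_cols]
        missing = [t for t in stage.input_annotator_types if t not in available]
        if missing:
            raise ValueError(
                f"{stage.name}: inputCols {stage.input_cols} do not provide "
                f"required annotator type(s) {missing}"
            )
        produced[stage.output_col] = stage.output_annotator_type


# Illustrative stages; the types listed here are assumptions for the example.
document = Stage("DocumentAssembler", ["text"], "document", [], "document")
tokenizer = Stage("Tokenizer", ["document"], "token", ["document"], "token")
# Misconfigured on purpose: the classifier needs a token column but gets "sentence".
classifier = Stage(
    "CamemBertForSequenceClassification",
    ["document", "sentence"], "class", ["document", "token"], "category",
)

error = None
try:
    validate_stages([document, tokenizer, classifier])
except ValueError as exc:
    error = str(exc)
print(error)
```

In this sketch the first two stages validate cleanly, while the misconfigured classifier triggers an error that names the missing annotator type, which is the kind of feedback the new validation is meant to give.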
🐛 Bug Fixes
- Fix an issue with the BigTextMatcher annotator, where it would not match entities with overlapping definitions. For example, if both lung and lung cancer are defined, lung would not be matched in a given text. This was due to an abstraction error of one of the subclasses of the BigTextMatcher during construction of the underlying data structure
- Fix an indexing issue in the RegexTokenizer annotator. If the document was split into sentences, the index of the sentence inside the document was not taken into consideration for the indexes of the tokens. This would lead to further issues down the pipeline, where tokens would be filtered while unpacking them for other annotators
- Refactor the Resolvers object in Spark NLP's dependency to avoid the conflict with the Resolvers inside the new sbt
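The RegexTokenizer indexing fix comes down to simple offset arithmetic, which the following plain-Python sketch illustrates. It is not the RegexTokenizer implementation: `split_sentences` and `tokenize` are hypothetical helpers, the sentence splitter is deliberately naive, tokens are matched with an assumed `\S+` pattern, and inclusive end offsets are a choice made for this example. The point is that a token's position inside its sentence must be shifted by the sentence's start offset to obtain its position inside the document.

```python
import re
from typing import List, Tuple


def split_sentences(document: str) -> List[Tuple[int, str]]:
    """Naive splitter: return (start offset in the document, sentence text) pairs."""
    pattern = r"[^.!?\s][^.!?]*[.!?]?"
    return [(m.start(), m.group()) for m in re.finditer(pattern, document)]


def tokenize(document: str, token_pattern: str = r"\S+") -> List[Tuple[str, int, int]]:
    """Return (token, begin, end) with document-level, inclusive offsets."""
    tokens = []
    for sentence_start, sentence in split_sentences(document):
        for m in re.finditer(token_pattern, sentence):
            # Shift sentence-local offsets by the sentence's position in the document.
            begin = sentence_start + m.start()
            end = sentence_start + m.end() - 1
            tokens.append((m.group(), begin, end))
    return tokens


document = "Hello world. Second one here."
tokens = tokenize(document)
for token, begin, end in tokens:
    # Each offset pair should slice the same token back out of the document.
    print(token, begin, end, document[begin:end + 1] == token)
```

Dropping the `sentence_start` term would reproduce the kind of bug described above: tokens in the second sentence would receive offsets that point into the first sentence.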
🛑 Known Issues
TypedDependencyParserModel annotator fails in Python in this release (will be fixed in the 4.2.6 release next week)
Models
Spark NLP 4.2.5 comes with 400+ state-of-the-art pre-trained transformer models in many languages.
Featured Models
Spark NLP covers the following languages:
English; Multilingual; Afrikaans; Afro-Asiatic languages; Albanian; Altaic languages; American Sign Language; Amharic; Arabic; Argentine Sign Language; Armenian; Artificial languages; Atlantic-Congo languages; Austro-Asiatic languages; Austronesian languages; Azerbaijani; Baltic languages; Bantu languages; Basque; Basque (family); Belarusian; Bemba (Zambia); Bengali, Bangla; Berber languages; Bihari; Bislama; Bosnian; Brazilian Sign Language; Breton; Bulgarian; Catalan; Caucasian languages; Cebuano; Celtic languages; Central Bikol; Chichewa, Chewa, Nyanja; Chilean Sign Language; Chinese; Chuukese; Colombian Sign Language; Congo Swahili; Croatian; Cushitic languages; Czech; Danish; Dholuo, Luo (Kenya and Tanzania); Dravidian languages; Dutch; East Slavic languages; Eastern Malayo-Polynesian languages; Efik; Esperanto; Estonian; Ewe; Fijian; Finnish; Finnish Sign Language; Finno-Ugrian languages; French; French-based creoles and pidgins; Ga; Galician; Ganda; Georgian; German; Germanic languages; Gilbertese; Greek (modern); Greek languages; Gujarati; Gun; Haitian, Haitian Creole; Hausa; Hebrew (modern); Hiligaynon; Hindi; Hiri Motu; Hungarian; Icelandic; Igbo; Iloko; Indic languages; Indo-European languages; Indo-Iranian languages; Indonesian; Irish; Isoko; Isthmus Zapotec; Italian; Italic languages; Japanese; Kabyle; Kalaallisut, Greenlandic; Kannada; Kaonde; Kinyarwanda; Kirundi; Kongo; Korean; Kwangali; Kwanyama, Kuanyama; Latin; Latvian; Lingala; Lithuanian; Louisiana Creole; Lozi; Luba-Katanga; Luba-Lulua; Lunda; Lushai; Luvale; Macedonian; Malagasy; Malay; Malayalam; Malayo-Polynesian languages; Maltese; Manx; Marathi (Marāṭhī); Marshallese; Mexican Sign Language; Mon-Khmer languages; Morisyen; Mossi; Multiple languages; Ndonga; Nepali; Niger-Kordofanian languages; Nigerian Pidgin; Niuean; North Germanic languages; Northern Sotho, Pedi, Sepedi; Norwegian; Norwegian Bokmål; Norwegian Nynorsk; Nyaneka; Oromo; Pangasinan; Papiamento; Persian (Farsi); Peruvian Sign Language; Philippine languages; Pijin; Pohnpeian; Polish; Portuguese; Portuguese-based creoles and pidgins; Punjabi (Eastern); Romance languages; Romanian; Rundi; Russian; Ruund; Salishan languages; Samoan; San Salvador Kongo; Sango; Semitic languages; Serbo-Croatian; Seselwa Creole French; Shona; Sindhi; Sino-Tibetan languages; Slavic languages; Slovak; Slovene; Somali; South Caucasian languages; South Slavic languages; Southern Sotho; Spanish; Spanish Sign Language; Sranan Tongo; Swahili; Swati; Swedish; Tagalog; Tahitian; Tai; Tamil; Telugu; Tetela; Tetun Dili; Thai; Tigrinya; Tiv; Tok Pisin; Tonga (Tonga Islands); Tonga (Zambia); Tsonga; Tswana; Tumbuka; Turkic languages; Turkish; Tuvalu; Tzotzil; Ukrainian; Umbundu; Uralic languages; Urdu; Venda; Venezuelan Sign Language; Vietnamese; Wallisian; Walloon; Waray (Philippines); Welsh; West Germanic languages; West Slavic languages; Western Malayo-Polynesian languages; Wolaitta, Wolaytta; Wolof; Xhosa; Yapese; Yiddish; Yoruba; Yucatec Maya, Yucateco; Zande (individual language); Zulu
The complete list of all 11700+ models & pipelines in 230+ languages is available on Models Hub
📓 New Notebooks
Spark NLP | Notebooks | Colab |
---|---|---|
CamemBertForSequenceClassification | HuggingFace in Spark NLP - CamemBertForSequenceClassification |
📓 Updated Notebooks
The following notebooks have been updated to use the latest releases of the TensorFlow 2.11 and Hugging Face 4.25 libraries
- You can visit Import Transformers in Spark NLP
- You can visit Spark NLP Workshop for 100+ examples
📖 Documentation
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
#PyPI
pip install spark-nlp==4.2.5
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.5
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.5
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.5
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.5
M1
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.5
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.5
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.5
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.5
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>4.2.5</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>4.2.5</version>
</dependency>
spark-nlp-m1:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-m1_2.12</artifactId>
<version>4.2.5</version>
</dependency>
spark-nlp-aarch64:
<!-- https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64 -->
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>4.2.5</version>
</dependency>
FAT JARs
- CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.2.5.jar
- GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.2.5.jar
- M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.5.jar
- AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-4.2.5.jar
What's Changed
Contributors
@Damla-Gurbaz @Cabir40 @josejuanmartinez @danilojsl @mhnavid @DevinTDHa @jsl-builder @KshitizGIT @suvrat-joshi @maziyarpanahi @agsfer
New Contributors
Full Changelog: 4.2.4...4.2.5