Skip to content

2024-11-01-distilbart_xsum_12_6_en #14447

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

74 changes: 74 additions & 0 deletions docs/_posts/ahmedlone127/2024-11-01-distilbart_xsum_12_6_en.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
---
layout: model
title: Abstractive Summarization by BART - DistilBART XSUM
author: John Snow Labs
name: distilbart_xsum_12_6
date: 2024-11-01
tags: [en, summarization, text_to_text, distil, open_source, openvino]
task: Summarization
language: en
edition: Spark NLP 5.5.0
spark_version: 3.0
supported: true
engine: openvino
annotator: BartTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

“BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension Transformer” The Facebook BART (Bidirectional and Auto-Regressive Transformer) model is a state-of-the-art language generation model that was introduced by Facebook AI in 2019. It is based on the transformer architecture and is designed to handle a wide range of natural language processing tasks such as text generation, summarization, and machine translation.

This pre-trained model is DistilBART fine-tuned on the Extreme Summarization (XSum) Dataset.

## Predicted Entities



{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbart_xsum_12_6_en_5.5.0_3.0_1730492024334.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbart_xsum_12_6_en_5.5.0_3.0_1730492024334.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
bart = BartTransformer.pretrained("distilbart_xsum_12_6") \
.setTask("summarize:") \
.setMaxOutputLength(200) \
.setInputCols(["documents"]) \
.setOutputCol("summaries")
```
```scala
val bart = BartTransformer.pretrained("distilbart_xsum_12_6")
.setTask("summarize:")
.setMaxOutputLength(200)
.setInputCols("documents")
.setOutputCol("summaries")
```
</div>

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbart_xsum_12_6|
|Compatibility:|Spark NLP 5.5.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[generation]|
|Language:|en|
|Size:|853.7 MB|

## References

https://huggingface.co/sshleifer/distilbart-xsum-12-6
93 changes: 93 additions & 0 deletions docs/_posts/ahmedlone127/2024-11-03-gpt2_en.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,93 @@
---
layout: model
title: GPT2 text-to-text model (Base)
author: John Snow Labs
name: gpt2
date: 2024-11-03
tags: [gpt2, en, open_source, onnx, openvino]
task: Text Generation
language: en
edition: Spark NLP 5.5.0
spark_version: 3.0
supported: true
engine: openvino
annotator: GPT2Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

“GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where the model is primed with an input and it generates a lengthy continuation.

## Predicted Entities



{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/gpt2_en_5.5.0_3.0_1730653115205.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/gpt2_en_5.5.0_3.0_1730653115205.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("documents")

gpt2 = GPT2Transformer.pretrained("gpt2") \
.setInputCols(["documents"]) \
.setMaxOutputLength(50) \
.setOutputCol("generation")

pipeline = Pipeline().setStages([documentAssembler, gpt2])
data = spark.createDataFrame([["My name is Leonardo."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("summaries.generation").show(truncate=False)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("documents")

val gpt2 = GPT2Transformer.pretrained("gpt2")
.setInputCols(Array("documents"))
.setMinOutputLength(10)
.setMaxOutputLength(50)
.setDoSample(false)
.setTopK(50)
.setNoRepeatNgramSize(3)
.setOutputCol("generation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, gpt2))

val data = Seq("My name is Leonardo.").toDF("text")
val result = pipeline.fit(data).transform(data)
results.select("generation.result").show(truncate = false)
```
</div>

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|gpt2|
|Compatibility:|Spark NLP 5.5.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[generation]|
|Language:|en|
|Size:|467.4 MB|

## References

https://huggingface.co/openai-community/gpt2
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
---
layout: model
title: Japanese hubert_large_japanese_asr HubertForCTC from TKU410410103
author: John Snow Labs
name: hubert_large_japanese_asr
date: 2024-11-08
tags: [ja, open_source, onnx, asr, hubert]
task: Automatic Speech Recognition
language: ja
edition: Spark NLP 5.5.1
spark_version: 3.0
supported: true
engine: onnx
annotator: HubertForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained HubertForCTC model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`hubert_large_japanese_asr` is a Japanese model originally trained by TKU410410103.

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/hubert_large_japanese_asr_ja_5.5.1_3.0_1731106819898.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/hubert_large_japanese_asr_ja_5.5.1_3.0_1731106819898.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python

audioAssembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")

speechToText = HubertForCTC.pretrained("hubert_large_japanese_asr","ja") \
.setInputCols(["audio_assembler"]) \
.setOutputCol("text")

pipeline = Pipeline().setStages([audioAssembler, speechToText])
pipelineModel = pipeline.fit(data)
pipelineDF = pipelineModel.transform(data)

```
```scala

val audioAssembler = new DocumentAssembler()
.setInputCols("audio_content")
.setOutputCols("audio_assembler")

val speechToText = HubertForCTC.pretrained("hubert_large_japanese_asr", "ja")
.setInputCols(Array("audio_assembler"))
.setOutputCol("text")

val pipeline = new Pipeline().setStages(Array(documentAssembler, speechToText))
val pipelineModel = pipeline.fit(data)
val pipelineDF = pipelineModel.transform(data)

```
</div>

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|hubert_large_japanese_asr|
|Compatibility:|Spark NLP 5.5.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|ja|
|Size:|2.4 GB|

## References

https://huggingface.co/TKU410410103/hubert-large-japanese-asr
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
---
layout: model
title: Japanese hubert_large_japanese_asr_pipeline pipeline HubertForCTC from TKU410410103
author: John Snow Labs
name: hubert_large_japanese_asr_pipeline
date: 2024-11-08
tags: [ja, open_source, pipeline, onnx]
task: Automatic Speech Recognition
language: ja
edition: Spark NLP 5.5.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained HubertForCTC, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`hubert_large_japanese_asr_pipeline` is a Japanese model originally trained by TKU410410103.

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/hubert_large_japanese_asr_pipeline_ja_5.5.1_3.0_1731106937966.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/hubert_large_japanese_asr_pipeline_ja_5.5.1_3.0_1731106937966.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python

pipeline = PretrainedPipeline("hubert_large_japanese_asr_pipeline", lang = "ja")
annotations = pipeline.transform(df)

```
```scala

val pipeline = new PretrainedPipeline("hubert_large_japanese_asr_pipeline", lang = "ja")
val annotations = pipeline.transform(df)

```
</div>

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|hubert_large_japanese_asr_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 5.5.1+|
|License:|Open Source|
|Edition:|Official|
|Language:|ja|
|Size:|2.4 GB|

## References

https://huggingface.co/TKU410410103/hubert-large-japanese-asr

## Included Models

- AudioAssembler
- HubertForCTC
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
---
layout: model
title: Ukrainian hubert_ukrainian_pipeline pipeline HubertForCTC from Yehor
author: John Snow Labs
name: hubert_ukrainian_pipeline
date: 2024-11-08
tags: [uk, open_source, pipeline, onnx]
task: Automatic Speech Recognition
language: uk
edition: Spark NLP 5.5.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained HubertForCTC, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`hubert_ukrainian_pipeline` is a Ukrainian model originally trained by Yehor.

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/hubert_ukrainian_pipeline_uk_5.5.1_3.0_1731106461400.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/hubert_ukrainian_pipeline_uk_5.5.1_3.0_1731106461400.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python

pipeline = PretrainedPipeline("hubert_ukrainian_pipeline", lang = "uk")
annotations = pipeline.transform(df)

```
```scala

val pipeline = new PretrainedPipeline("hubert_ukrainian_pipeline", lang = "uk")
val annotations = pipeline.transform(df)

```
</div>

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|hubert_ukrainian_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 5.5.1+|
|License:|Open Source|
|Edition:|Official|
|Language:|uk|
|Size:|708.6 MB|

## References

https://huggingface.co/Yehor/hubert-uk

## Included Models

- AudioAssembler
- HubertForCTC
Loading