
Commit 4e44281

prabod authored and DevinTDHa committed
[SPARKNLP-1115] Introducing SmolVLM (#14552)
* smolvlm utils
* add smolvlm image preprocess
* added scala api
* added scala API and python API
* documentation
* update default model
* add notebook
* add resource downloader
* revert the changes to BPE tokenizer and move changes to SmolVLMTokenizer
* removed configProtoBytes as we do not support TF for this model
* remove images in openvino notebook dir

Signed-off-by: Prabod Rathnayaka <prabod@rathnayaka.me>
1 parent bc3f105 commit 4e44281

File tree: 18 files changed, +5590 and -7 lines
Lines changed: 110 additions & 0 deletions
{%- capture title -%}
JanusForMultiModal
{%- endcapture -%}

{%- capture description -%}
Unified Multimodal Understanding and Generation using Janus.

JanusForMultiModal can load Janus Vision models for unified multimodal understanding and generation.
The model consists of a vision encoder, a text encoder, and a text decoder. Janus decouples visual encoding for enhanced flexibility, leveraging a unified transformer architecture for both understanding and generation tasks.

Janus uses SigLIP-L as the vision encoder, supporting 384 x 384 image inputs. For image generation, it utilizes a tokenizer with a downsample rate of 16. The framework is based on DeepSeek-LLM-1.3b-base, trained on approximately 500B text tokens.

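As a back-of-envelope illustration (assuming, hypothetically, that the downsample rate of 16 applies per spatial dimension, which the description above does not state explicitly), a 384 x 384 input would correspond to a 24 x 24 grid of image tokens:

```python
# Hypothetical calculation: image-token count for a 384 x 384 input with a
# downsample rate of 16 per spatial dimension (an assumption for illustration).
image_size = 384
downsample_rate = 16
grid_side = image_size // downsample_rate  # 384 // 16 = 24
num_image_tokens = grid_side * grid_side   # 24 * 24 = 576
print(num_image_tokens)  # 576
```

This is only a sketch of the arithmetic implied by the numbers above, not the model's actual tokenization code.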
Pretrained models can be loaded with `pretrained` of the companion object:

```scala
val visualQA = JanusForMultiModal.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("answer")
```
{%- endcapture -%}

{%- capture input_anno -%}
IMAGE
{%- endcapture -%}

{%- capture output_anno -%}
DOCUMENT
{%- endcapture -%}

{%- capture python_example -%}
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import lit

image_df = spark.read.format("image").load(path=images_path) # Replace with your image path
test_df = image_df.withColumn(
    "text",
    lit("User: <image_placeholder>Describe image in details\n\nAssistant:")
)
imageAssembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")
visualQAClassifier = JanusForMultiModal.pretrained() \
    .setInputCols("image_assembler") \
    .setOutputCol("answer")
pipeline = Pipeline().setStages([
    imageAssembler,
    visualQAClassifier
])
result = pipeline.fit(test_df).transform(test_df)
result.select("image_assembler.origin", "answer.result").show(truncate=False)
{%- endcapture -%}
54+
55+
{%- capture scala_example -%}
56+
import spark.implicits._
57+
import com.johnsnowlabs.nlp.base._
58+
import com.johnsnowlabs.nlp.annotator._
59+
import org.apache.spark.ml.Pipeline
60+
import org.apache.spark.sql.DataFrame
61+
import org.apache.spark.sql.functions.lit
62+
63+
val imageDF: DataFrame = spark.read
64+
.format("image")
65+
.option("dropInvalid", value = true)
66+
.load(imageFolder) // Replace with your image folder
67+
68+
val testDF: DataFrame = imageDF.withColumn("text", lit("User: <image_placeholder>Describe image in details\n\nAssistant:"))
69+
70+
val imageAssembler: ImageAssembler = new ImageAssembler()
71+
.setInputCol("image")
72+
.setOutputCol("image_assembler")
73+
74+
val visualQAClassifier = JanusForMultiModal.pretrained()
75+
.setInputCols("image_assembler")
76+
.setOutputCol("answer")
77+
78+
val pipeline = new Pipeline().setStages(Array(
79+
imageAssembler,
80+
visualQAClassifier
81+
))
82+
83+
val result = pipeline.fit(testDF).transform(testDF)
84+
85+
result.select("image_assembler.origin", "answer.result").show(truncate=false)
86+
{%- endcapture -%}
87+
88+
{%- capture api_link -%}
89+
[JanusForMultiModal](/api/com/johnsnowlabs/nlp/annotators/cv/JanusForMultiModal)
90+
{%- endcapture -%}
91+
92+
{%- capture python_api_link -%}
93+
[JanusForMultiModal](/api/python/reference/autosummary/sparknlp/annotator/cv/janus_for_multimodal/index.html#sparknlp.annotator.cv.janus_for_multimodal.JanusForMultiModal)
94+
{%- endcapture -%}
95+
96+
{%- capture source_link -%}
97+
[JanusForMultiModal](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/nlp/annotators/cv/JanusForMultiModal.scala)
98+
{%- endcapture -%}
99+
100+
{% include templates/anno_template.md
101+
title=title
102+
description=description
103+
input_anno=input_anno
104+
output_anno=output_anno
105+
python_example=python_example
106+
scala_example=scala_example
107+
api_link=api_link
108+
python_api_link=python_api_link
109+
source_link=source_link
110+
%}
Lines changed: 109 additions & 0 deletions
{%- capture title -%}
SmolVLMTransformer
{%- endcapture -%}

{%- capture description -%}
Compact Multimodal Model for Visual Question Answering using SmolVLM.

SmolVLMTransformer can load SmolVLM models for visual question answering. The model consists of a vision encoder, a text encoder, and a text decoder. The vision encoder encodes the input image, the text encoder processes the input question alongside the image encoding, and the text decoder generates the answer to the question.

SmolVLM is a compact open multimodal model that accepts arbitrary sequences of image and text inputs to produce text outputs. Designed for efficiency, SmolVLM can answer questions about images, describe visual content, create stories grounded on multiple images, or function as a pure language model without visual inputs.

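The usage examples in this entry pass a chat-formatted prompt in the `text` column. The format can be sketched with a small helper (a hypothetical convenience function, not part of the Spark NLP API):

```python
def smolvlm_prompt(question: str) -> str:
    # Hypothetical helper reproducing the chat-prompt format used in the
    # examples in this entry; not part of the Spark NLP API.
    return f"<|im_start|>User:<image>{question}<end_of_utterance>\nAssistant:"

print(smolvlm_prompt("Can you describe the image?"))
```

The `<image>` placeholder marks where the image features are injected, and the trailing `Assistant:` cues the decoder to generate the answer.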
Pretrained models can be loaded with `pretrained` of the companion object:

```scala
val visualQA = SmolVLMTransformer.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("answer")
```
{%- endcapture -%}

{%- capture input_anno -%}
IMAGE
{%- endcapture -%}

{%- capture output_anno -%}
DOCUMENT
{%- endcapture -%}

{%- capture python_example -%}
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import lit

image_df = spark.read.format("image").load(path=images_path) # Replace with your image path
test_df = image_df.withColumn(
    "text",
    lit("<|im_start|>User:<image>Can you describe the image?<end_of_utterance>\nAssistant:")
)
imageAssembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")
visualQAClassifier = SmolVLMTransformer.pretrained() \
    .setInputCols("image_assembler") \
    .setOutputCol("answer")
pipeline = Pipeline().setStages([
    imageAssembler,
    visualQAClassifier
])
result = pipeline.fit(test_df).transform(test_df)
result.select("image_assembler.origin", "answer.result").show(truncate=False)
{%- endcapture -%}
53+
54+
{%- capture scala_example -%}
55+
import spark.implicits._
56+
import com.johnsnowlabs.nlp.base._
57+
import com.johnsnowlabs.nlp.annotator._
58+
import org.apache.spark.ml.Pipeline
59+
import org.apache.spark.sql.DataFrame
60+
import org.apache.spark.sql.functions.lit
61+
62+
val imageDF: DataFrame = spark.read
63+
.format("image")
64+
.option("dropInvalid", value = true)
65+
.load(imageFolder) // Replace with your image folder
66+
67+
val testDF: DataFrame = imageDF.withColumn("text", lit("<|im_start|>User:<image>Can you describe the image?<end_of_utterance>\nAssistant:"))
68+
69+
val imageAssembler: ImageAssembler = new ImageAssembler()
70+
.setInputCol("image")
71+
.setOutputCol("image_assembler")
72+
73+
val visualQAClassifier = SmolVLMTransformer.pretrained()
74+
.setInputCols("image_assembler")
75+
.setOutputCol("answer")
76+
77+
val pipeline = new Pipeline().setStages(Array(
78+
imageAssembler,
79+
visualQAClassifier
80+
))
81+
82+
val result = pipeline.fit(testDF).transform(testDF)
83+
84+
result.select("image_assembler.origin", "answer.result").show(truncate=false)
85+
{%- endcapture -%}
86+
87+
{%- capture api_link -%}
88+
[SmolVLMTransformer](/api/com/johnsnowlabs/nlp/annotators/cv/SmolVLMTransformer)
89+
{%- endcapture -%}
90+
91+
{%- capture python_api_link -%}
92+
[SmolVLMTransformer](/api/python/reference/autosummary/sparknlp/annotator/cv/smolvlm_transformer/index.html#sparknlp.annotator.cv.smolvlm_transformer.SmolVLMTransformer)
93+
{%- endcapture -%}
94+
95+
{%- capture source_link -%}
96+
[SmolVLMTransformer](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/nlp/annotators/cv/SmolVLMTransformer.scala)
97+
{%- endcapture -%}
98+
99+
{% include templates/anno_template.md
100+
title=title
101+
description=description
102+
input_anno=input_anno
103+
output_anno=output_anno
104+
python_example=python_example
105+
scala_example=scala_example
106+
api_link=api_link
107+
python_api_link=python_api_link
108+
source_link=source_link
109+
%}
