WIP: multimodal support #227

sohamparikh · 2025-04-08T06:51:40Z

✨ Description

Please provide a brief summary of the changes, relevant motivation, and context.
Include any related issue numbers or links to discussions, and explain why this change is necessary.

Closes #

🔍 Type of change

Select all that apply:

🐛 Bug fix (non-breaking change that addresses a specific issue)
🚀 New feature (non-breaking change that adds functionality)
⚠️ Breaking change (a change that could affect existing functionality)
📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
📝 Documentation change (updates documentation, including new content or typo fixes)
🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

Change A
Change B

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

📜 I have read and followed the contributing guidelines.
🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
🎉 The functionality is complete, and I have tested the changes.
📝 I have updated the documentation if needed.
⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

🐋 I have updated the Docker configuration or dependencies, if applicable.
🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

🧪 I have added or updated tests to cover my changes.
✔️ New and existing tests pass locally with my changes.
🚦 I have tested these changes on GPUs and verified training stability.
🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

📊 I have run benchmarks where applicable to evaluate the performance impact.
✅ The benchmarks show no performance regression.
🚀 The benchmarks indicate a potential performance improvement.
⚠️ The benchmarks indicate a potential performance degradation.
📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

If there is any impact on performance, describe it and provide benchmark results, if applicable:

🗒️ Additional Notes

Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.

RaymondLi0

Thanks for the great work! 🚀
Couldn't look at everything yet, but here are some first comments, will continue tomorrow

Dockerfile

.github/workflows/docs.yaml

fast_llm/models/gpt/model.py

fast_llm/layers/vision_encoder/preprocessing.py

fast_llm/layers/vision_encoder/patch_conv.py

fast_llm/data/dataset/gpt/memmap.py

fast_llm/data/dataset/gpt/sampled.py

RaymondLi0 · 2025-06-10T23:52:52Z

fast_llm/data/preparator/gpt_memmap/prepare.py

@@ -298,11 +305,7 @@ def run(self) -> None:
            raise ValueError(f"Both chosen and rejected loss masking spans must be specified if one is specified.")

        # route tokenize function
-        if self._config.dataset.loss_masking_spans is not None:
-            if self._config.dataset.loss_masking_spans not in dataset.column_names:
-                raise ValueError(f"Dataset does not have spans field '{self._config.dataset.loss_masking_spans}'.")


Why remove this?

I moved everything into a single tokenize function, since the number of combinations was getting out of hand (images, loss spans, and audio next).

Should we still keep the check on the column names?
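One way to keep the column check while routing everything through a single tokenize function is to validate every configured optional column up front. This is a minimal sketch, not the PR's code: `check_columns` and the `images` key are hypothetical names, and only `loss_masking_spans` and the shape of the error message come from the removed lines quoted above.

```python
def check_columns(column_names, optional_fields):
    """Raise if a configured optional field is missing from the dataset.

    `optional_fields` maps a config key (e.g. "loss_masking_spans") to the
    column name set in the config, or None when the field is unused.
    """
    for key, column in optional_fields.items():
        # Unused fields (None) are skipped; configured ones must exist.
        if column is not None and column not in column_names:
            raise ValueError(f"Dataset does not have {key} field '{column}'.")


# Hypothetical usage: the dataset has text and spans, and no image column is configured.
columns = ["text", "spans"]
check_columns(columns, {"loss_masking_spans": "spans", "images": None})
```

Running the check once before tokenization would keep the early, readable error without reintroducing one branch per combination of fields.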

fast_llm/data/preparator/gpt_memmap/prepare.py

fast_llm/data/dataset/gpt/memmap.py

fast_llm/data/tokenizer.py

fast_llm/layers/transformer/vision_transformer.py

Co-authored-by: sohamparikh <sohamparikh47@gmail.com>

fast_llm/layers/vision_encoder/config.py

Co-authored-by: RaymondLi0 <raymond.li@servicenow.com>


RaymondLi0 · 2025-06-19T15:19:45Z

fast_llm/models/gpt/conversion.py

@@ -548,6 +563,350 @@ def _get_mlp_converters(self, fast_llm_prefix: str, hf_prefix: str) -> list[Weig
        ]


+class PixtralHuggingfaceCheckpointHandler(WeightAndBiasConverterMixin, HuggingfaceStateDictCheckpointHandler):


Is my understanding correct that this class handles conversion for the vision encoder (to a PixtralVisionModel),
whereas LlavaHuggingfaceCheckpointHandler handles conversion of the full model, and is the one to use to convert pixtral-12b?

Yes, exactly! It's inspired by the config.json on HF, and more and more omni models seem to be converging on a similar format.
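The split described above (one handler for the vision encoder, one for the full model) could be pictured as a composite converter that routes weights to sub-handlers by name prefix. This is only an illustration under stated assumptions: `split_state_dict`, the prefixes `vision_tower.` and `language_model.`, and the weight names are hypothetical here and are not taken from the PR's code.

```python
def split_state_dict(state_dict, prefixes):
    """Group weights by the first matching prefix, stripping that prefix."""
    groups = {prefix: {} for prefix in prefixes}
    for name, tensor in state_dict.items():
        for prefix in prefixes:
            if name.startswith(prefix):
                groups[prefix][name[len(prefix):]] = tensor
                break
        else:
            # No prefix matched: fail loudly rather than drop a weight.
            raise KeyError(f"No sub-handler for weight '{name}'.")
    return groups


# Hypothetical weight names; strings stand in for tensors.
weights = {
    "vision_tower.patch_conv.weight": "w0",
    "language_model.embed_tokens.weight": "w1",
}
groups = split_state_dict(weights, ["vision_tower.", "language_model."])
```

Each group could then be handed to its own converter, so the vision-encoder handler stays reusable on its own.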

sohamparikh added 19 commits April 8, 2025 06:51

WIP: multimodal support

7709e65

rough idea for memmap

0db2bd2

faster image size reading

0d89f68

solidify prepare

3866a53

wip

8413983

vision model

6521e41

wip

daf586f

wip

ef4488d

missing files

6d9d595

make it work, barely

6cb8f5d

fix

5761a2d

fixes

d45d600

changes

74a99b8

patches and fixes

99ad5d9

fix dependency

bcb557a

remove for testing

a6f5364

mising

73b431b

fix

6d65676

Merge branch 'main' into soham/pixtral-support

46aefc1

tscholak mentioned this pull request May 9, 2025

StarDoc model training #5

Closed

sohamparikh added 10 commits May 9, 2025 18:39

fixes

66e7081

fix

7f86a7f

more fixes after merge

3a8a99d

conv cleanup

d16284e

more conv cleanup

b3134aa

images + loss-masks

c8aa66e

minor fixes

0baae59

cleanup

48855be

cleanup

f35e003

cleanup

4eb34cb

sohamparikh added 11 commits June 2, 2025 23:35

fix span offset with images

ff8fecc

move image logic to sampled

c663cbb

cleanup

f52f02b

merge main

5436357

cleanup

02f6d8f

jpeg dependency

6843129

install libjpeg-dev in gh actions

b94b1ee

fix sampling test

9e4f14f

fix

d1c804f

fix data cache reloading

75d64a6

fix tokenization

cba6986

RaymondLi0 reviewed Jun 11, 2025

fast_llm/data/tokenizer.py (outdated, resolved)

RaymondLi0 reviewed Jun 11, 2025

fast_llm/data/tokenizer.py (outdated, resolved)

RaymondLi0 reviewed Jun 11, 2025

fast_llm/layers/transformer/vision_transformer.py (outdated, resolved)

pixtral SFT (#296)

275fefa

Co-authored-by: sohamparikh <sohamparikh47@gmail.com>

sohamparikh mentioned this pull request Jun 11, 2025

pixtral SFT #296

Merged

25 tasks

review comments

605cc7f

RaymondLi0 reviewed Jun 12, 2025

fast_llm/layers/vision_encoder/config.py (outdated, resolved)

sohamparikh and others added 9 commits June 12, 2025 17:40

simplified tokenization with spans

06aa740

Update fast_llm/data/preparator/gpt_memmap/prepare.py

30e3d34

Co-authored-by: RaymondLi0 <raymond.li@servicenow.com>

rename

c1aa709

Merge branch 'soham/pixtral-support' of github.com:ServiceNow/Fast-LLM into soham/pixtral-support

0ada42b

merge main

4e7afd8

fix conversion

8e106f7

fix sequence lengths, parallel conv

080dcb5

minor

f186868

fix image at beginning

6b9ea2e

RaymondLi0 reviewed Jun 19, 2025

pixtral fix conversion (#315)

ad18ea1
