Granitemoe (#33207)
* first commit

* drop tokenizer

* drop tokenizer

* drop tokenizer

* drop convert

* granite

* drop tokenization test

* mup

* fix

* reformat

* reformat

* reformat

* fix docs

* stop checking for checkpoint

* update support

* attention multiplier

* update model

* tiny drop

* saibo drop

* skip test

* fix test

* fix test

* drop

* drop useless imports

* update docs

* drop flash function

* copied from

* drop pretraining tp

* drop pretraining tp

* drop pretraining tp

* drop unused import

* drop code path

* change name

* softmax scale

* head dim

* drop legacy cache

* rename params

* cleanup

* fix copies

* comments

* add back legacy cache

* multipliers

* multipliers

* multipliers

* text fix

* fix copies

* merge

* multipliers

* attention multiplier

* drop unused imports

* add granitemoe

* add decoration

* remove moe from sequenceclassification

* fix test

* fix

* fix

* fix

* move rope?

* merge

* drop bias

* drop bias

* Update src/transformers/models/granite/configuration_granite.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fix

* Update src/transformers/models/granite/modeling_granite.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fix

* fix

* fix

* fix

* drop

* drop

* fix

* fix

* cleanup

* cleanup

* fix

* fix granite tests

* fp32 test

* fix

* drop jitter

* fix

* rename

* rename

* fix config

* add gen test

---------

Co-authored-by: Yikang Shen <yikang.shn@gmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
3 people authored Sep 20, 2024
1 parent 49a0bef commit e472e07
Showing 16 changed files with 2,393 additions and 58 deletions.
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -424,6 +424,8 @@
title: GPTSw3
- local: model_doc/granite
title: Granite
- local: model_doc/granitemoe
title: GraniteMoe
- local: model_doc/herbert
title: HerBERT
- local: model_doc/ibert
1 change: 1 addition & 0 deletions docs/source/en/index.md
@@ -159,6 +159,7 @@ Flax), PyTorch, and/or TensorFlow.
| [GPTBigCode](model_doc/gpt_bigcode) ||||
| [GPTSAN-japanese](model_doc/gptsan-japanese) ||||
| [Granite](model_doc/granite) ||||
| [GraniteMoeMoe](model_doc/granitemoe) ||||
| [Graphormer](model_doc/graphormer) ||||
| [Grounding DINO](model_doc/grounding-dino) ||||
| [GroupViT](model_doc/groupvit) ||||
74 changes: 74 additions & 0 deletions docs/source/en/model_doc/granitemoe.md
@@ -0,0 +1,74 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->

# GraniteMoe

## Overview

The GraniteMoe model was proposed in [Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler](https://arxiv.org/abs/2408.13359) by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.

PowerMoE-3B is a 3B-parameter sparse Mixture-of-Experts (sMoE) language model trained with the Power learning rate scheduler. It sparsely activates 800M parameters for each token and is trained on a mix of open-source and proprietary datasets. PowerMoE-3B has shown promising results compared to dense models with 2x the active parameters across various benchmarks, including natural language multiple-choice, code generation, and math reasoning.
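
To illustrate what sparse activation means here, the following is a minimal, generic sketch of top-k expert routing in plain Python. It is not the GraniteMoe implementation: the helper names `top_k_route` and `moe_forward`, the toy experts, and the choice to softmax only over the selected logits are all assumptions made for illustration.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    # Indices of the k largest logits, highest first (ties keep original order).
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, shifted by the max for stability.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; the others are never called."""
    return sum(weight * experts[i](x) for i, weight in top_k_route(router_logits, k))


# Four toy "experts" (hypothetical stand-ins for feed-forward blocks).
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]

routes = top_k_route(logits, k=2)
y = moe_forward(3.0, experts, logits, k=2)
print(routes, y)
```

Only two of the four experts run for this input, which is the sense in which a sparse MoE model uses a fraction of its parameters per token.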

The abstract from the paper is the following:

*Finding the optimal learning rate for language model pretraining is a challenging task.
This is not only because there is a complicated correlation between learning rate, batch size, number of training tokens, model size, and other hyperparameters but also because it is prohibitively expensive to perform a hyperparameter search for large language models with Billions or Trillions of parameters. Recent studies propose using small proxy models and small corpus to perform hyperparameter searches and transposing the optimal parameters to large models and large corpus. While the zero-shot transferability is theoretically and empirically proven for model size related hyperparameters, like depth and width, the zero-shot transfer from small corpus to large corpus is underexplored.
In this paper, we study the correlation between optimal learning rate, batch size, and number of training tokens for the recently proposed WSD scheduler. After thousands of small experiments, we found a power-law relationship between variables and demonstrated its transferability across model sizes. Based on the observation, we propose a new learning rate scheduler, Power scheduler, that is agnostic about the number of training tokens and batch size. The experiment shows that combining the Power scheduler with Maximum Update Parameterization (μP) can consistently achieve impressive performance with one set of hyperparameters regardless of the number of training tokens, batch size, model size, and even model architecture. Our 3B dense and MoE models trained with the Power scheduler achieve comparable performance as state-of-the-art small language models.
We [open source](https://huggingface.co/collections/ibm/power-lm-66be64ae647ddf11b9808000) these pretrained models.*

Tips:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm/PowerMoE-3b"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
model.eval()

# change input text as desired
prompt = "Write a code to find the maximum value in a list of numbers."

# tokenize the text
input_tokens = tokenizer(prompt, return_tensors="pt").to(model.device)
# generate output tokens
output = model.generate(**input_tokens, max_new_tokens=100)
# decode output tokens into text
output = tokenizer.batch_decode(output)
# loop over the batch to print, in this example the batch size is 1
for i in output:
    print(i)
```

This model was contributed by [mayank-mishra](https://huggingface.co/mayank-mishra).


## GraniteMoeConfig

[[autodoc]] GraniteMoeConfig

## GraniteMoeModel

[[autodoc]] GraniteMoeModel
- forward

## GraniteMoeForCausalLM

[[autodoc]] GraniteMoeForCausalLM
- forward
2 changes: 2 additions & 0 deletions docs/source/en/perf_infer_gpu_one.md
@@ -52,6 +52,7 @@ FlashAttention-2 is currently supported for the following architectures:
* [GPTNeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox#transformers.GPTNeoXModel)
* [GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj#transformers.GPTJModel)
* [Granite](https://huggingface.co/docs/transformers/model_doc/granite#transformers.GraniteModel)
* [GraniteMoe](https://huggingface.co/docs/transformers/model_doc/granitemoe#transformers.GraniteMoeModel)
* [Idefics2](https://huggingface.co/docs/transformers/model_doc/idefics2#transformers.Idefics2Model)
* [Falcon](https://huggingface.co/docs/transformers/model_doc/falcon#transformers.FalconModel)
* [JetMoe](https://huggingface.co/docs/transformers/model_doc/jetmoe#transformers.JetMoeModel)
@@ -226,6 +227,7 @@ For now, Transformers supports SDPA inference and training for the following arc
* [Hubert](https://huggingface.co/docs/transformers/model_doc/hubert#transformers.HubertModel)
* [Idefics](https://huggingface.co/docs/transformers/model_doc/idefics#transformers.IdeficsModel)
* [Granite](https://huggingface.co/docs/transformers/model_doc/granite#transformers.GraniteModel)
* [GraniteMoe](https://huggingface.co/docs/transformers/model_doc/granitemoe#transformers.GraniteMoeModel)
* [JetMoe](https://huggingface.co/docs/transformers/model_doc/jetmoe#transformers.JetMoeModel)
* [Jamba](https://huggingface.co/docs/transformers/model_doc/jamba#transformers.JambaModel)
* [Llama](https://huggingface.co/docs/transformers/model_doc/llama#transformers.LlamaModel)
14 changes: 14 additions & 0 deletions src/transformers/__init__.py
@@ -465,6 +465,7 @@
    "models.gpt_sw3": [],
    "models.gptj": ["GPTJConfig"],
    "models.granite": ["GraniteConfig"],
    "models.granitemoe": ["GraniteMoeConfig"],
    "models.grounding_dino": [
        "GroundingDinoConfig",
        "GroundingDinoProcessor",
@@ -2343,6 +2344,13 @@
            "GranitePreTrainedModel",
        ]
    )
    _import_structure["models.granitemoe"].extend(
        [
            "GraniteMoeForCausalLM",
            "GraniteMoeModel",
            "GraniteMoePreTrainedModel",
        ]
    )
    _import_structure["models.grounding_dino"].extend(
        [
            "GroundingDinoForObjectDetection",
@@ -5237,6 +5245,7 @@
    )
    from .models.gptj import GPTJConfig
    from .models.granite import GraniteConfig
    from .models.granitemoe import GraniteMoeConfig
    from .models.grounding_dino import (
        GroundingDinoConfig,
        GroundingDinoProcessor,
@@ -6976,6 +6985,11 @@
            GraniteModel,
            GranitePreTrainedModel,
        )
        from .models.granitemoe import (
            GraniteMoeForCausalLM,
            GraniteMoeModel,
            GraniteMoePreTrainedModel,
        )
        from .models.grounding_dino import (
            GroundingDinoForObjectDetection,
            GroundingDinoModel,
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -106,6 +106,7 @@
    gpt_sw3,
    gptj,
    granite,
    granitemoe,
    grounding_dino,
    groupvit,
    herbert,
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -123,6 +123,7 @@
        ("gptj", "GPTJConfig"),
        ("gptsan-japanese", "GPTSanJapaneseConfig"),
        ("granite", "GraniteConfig"),
        ("granitemoe", "GraniteMoeConfig"),
        ("graphormer", "GraphormerConfig"),
        ("grounding-dino", "GroundingDinoConfig"),
        ("groupvit", "GroupViTConfig"),
@@ -417,6 +418,7 @@
("gptj", "GPT-J"),
("gptsan-japanese", "GPTSAN-japanese"),
("granite", "Granite"),
("granitemoe", "GraniteMoeMoe"),
("graphormer", "Graphormer"),
("grounding-dino", "Grounding DINO"),
("groupvit", "GroupViT"),
2 changes: 2 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -120,6 +120,7 @@
        ("gptj", "GPTJModel"),
        ("gptsan-japanese", "GPTSanJapaneseForConditionalGeneration"),
        ("granite", "GraniteModel"),
        ("granitemoe", "GraniteMoeModel"),
        ("graphormer", "GraphormerModel"),
        ("grounding-dino", "GroundingDinoModel"),
        ("groupvit", "GroupViTModel"),
@@ -485,6 +486,7 @@
("gpt_neox_japanese", "GPTNeoXJapaneseForCausalLM"),
("gptj", "GPTJForCausalLM"),
("granite", "GraniteForCausalLM"),
("granitemoe", "GraniteMoeForCausalLM"),
("jamba", "JambaForCausalLM"),
("jetmoe", "JetMoeForCausalLM"),
("llama", "LlamaForCausalLM"),
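
The two hunks above register `granitemoe` in name-to-class tables so that the auto classes can resolve a model from its `model_type` string. The sketch below shows that lookup idea in isolation, using only entries visible in the hunk. The table name `CAUSAL_LM_NAMES` and the function `resolve_class_name` are hypothetical; the real auto classes go on to import the named class lazily, which this sketch does not attempt.

```python
from collections import OrderedDict

# Subset of the (model_type, class name) pairs visible in the hunk above.
CAUSAL_LM_NAMES = OrderedDict(
    [
        ("granite", "GraniteForCausalLM"),
        ("granitemoe", "GraniteMoeForCausalLM"),
        ("jamba", "JambaForCausalLM"),
    ]
)


def resolve_class_name(model_type, mapping=CAUSAL_LM_NAMES):
    """Return the class name registered for a config's model_type (hypothetical helper)."""
    try:
        return mapping[model_type]
    except KeyError:
        known = ", ".join(mapping)
        raise ValueError(f"Unrecognized model type {model_type!r}; known types: {known}") from None


print(resolve_class_name("granitemoe"))
```

Without the new `("granitemoe", ...)` entries, a checkpoint whose config declares `model_type = "granitemoe"` would have no class to map to.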
57 changes: 57 additions & 0 deletions src/transformers/models/granitemoe/__init__.py
@@ -0,0 +1,57 @@
# Copyright 2024 EleutherAI and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import (
    OptionalDependencyNotAvailable,
    _LazyModule,
    is_torch_available,
)


_import_structure = {
    "configuration_granitemoe": ["GraniteMoeConfig"],
}

try:
    if not is_torch_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["modeling_granitemoe"] = [
        "GraniteMoeForCausalLM",
        "GraniteMoeModel",
        "GraniteMoePreTrainedModel",
    ]

if TYPE_CHECKING:
    from .configuration_granitemoe import GraniteMoeConfig

    try:
        if not is_torch_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .modeling_granitemoe import (
            GraniteMoeForCausalLM,
            GraniteMoeModel,
            GraniteMoePreTrainedModel,
        )

else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
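
The `__init__.py` above declares which names live in which submodule and defers the actual imports until a name is first used. The following is a simplified, standard-library-only sketch of that deferred-import idea. `LazyNamespace` is a hypothetical stand-in, not the library's `_LazyModule`, and it maps names to stdlib modules (`json`, `math`) purely so the example runs anywhere.

```python
import importlib
import types


class LazyNamespace(types.ModuleType):
    """Hypothetical, simplified stand-in for a lazily populated module."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Invert {module: [names]} into {name: module} for quick lookup.
        self._origin = {attr: mod for mod, attrs in import_structure.items() for attr in attrs}
        self.loaded = []  # records which modules were actually imported

    def __getattr__(self, attr):
        # Only reached when normal lookup fails, i.e. on first access to a name.
        origin = self.__dict__.get("_origin", {})
        if attr not in origin:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module = importlib.import_module(origin[attr])
        self.loaded.append(origin[attr])
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        return value


lazy = LazyNamespace("demo", {"json": ["dumps"], "math": ["sqrt"]})
print(lazy.loaded)      # nothing imported yet
print(lazy.sqrt(9.0))   # first access triggers the import of "math"
print(lazy.loaded)
```

The practical benefit in the real package is that importing the top-level module stays cheap and torch-dependent code is only touched when a modeling class is requested.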