
Exact command to reproduce the curve in MOD is Vibe? #10

Open
jzhang38 opened this issue Apr 13, 2024 · 5 comments

Comments

jzhang38 commented Apr 13, 2024

Hi Joey,

Thank you for such wonderful open-source work!

Could you share the exact command to reproduce the curve in your MOD is Vibe blog? For example, did you use DDP and how many GPUs?

joey00072 (Owner) commented Apr 13, 2024

I trained a 300M model on a single A6000 (from Paperspace Gradient) with bf16-mixed precision.

The training script is at experiments/mixture_of_depth/train_mod.py; you can look into it and change the sizes, or keep the defaults to reproduce.

1. First install this repo: git clone it and run pip install -e .
2. Then pretokenize the dataset with python examples/prepare-dataset.py (open this file and change the dataset to minipile).
3. Then run train_mod.py.

I am using Lightning Fabric, so multi-node training should be pretty easy, but I trained on a single 48 GB A6000.

I'll add a README with details under experiments/mixture_of_depth/ tonight or tomorrow 😅

jzhang38 (Author) commented Apr 14, 2024

> I'll add a README with details under experiments/mixture_of_depth/ tonight or tomorrow

Thank you so much!

Yeah I've pretty much read your code related to MoD.

One concern is that the dataset is implemented as an iterator object, so I am not sure whether Lightning Fabric would handle it correctly in a multi-GPU setup, since we would need a distributed sampler.
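
Just to make the concern concrete, here is a rough sketch of the kind of manual sharding I mean (names like ShardedTokenStream are hypothetical, not code from this repo); as far as I know, Fabric's setup_dataloaders does not inject a DistributedSampler when the underlying dataset is an IterableDataset:

```python
import torch
from torch.utils.data import DataLoader, IterableDataset
from lightning.fabric import Fabric


class ShardedTokenStream(IterableDataset):
    """Hypothetical iterable dataset that shards pre-tokenized blocks by rank."""

    def __init__(self, token_blocks, rank: int, world_size: int):
        self.token_blocks = token_blocks
        self.rank = rank
        self.world_size = world_size

    def __iter__(self):
        # Each process yields every world_size-th block, offset by its rank,
        # so GPUs see disjoint slices of the stream.
        for i, block in enumerate(self.token_blocks):
            if i % self.world_size == self.rank:
                yield block


if __name__ == "__main__":
    fabric = Fabric(accelerator="auto", devices="auto")
    fabric.launch()

    blocks = [torch.randint(0, 50_000, (1024,)) for _ in range(8)]  # dummy data
    dataset = ShardedTokenStream(blocks, fabric.global_rank, fabric.world_size)
    loader = fabric.setup_dataloaders(DataLoader(dataset, batch_size=2))

    for batch in loader:
        print(fabric.global_rank, batch.shape)
```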

jzhang38 (Author) commented:

[Screenshot: Figure 7 from the MoD paper]

Looking at Figure 7 from the paper, I think they multiply the skipped tokens by the router weights as well.

joey00072 (Owner) commented:

We take the softmax over a long sequence length, so most values end up close to zero.
If we multiplied all tokens by the router weights, the pass-through tokens would become really tiny, like 1e-5 or so.
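
A toy illustration of the scale (not the actual router code from this repo):

```python
import torch

seq_len = 2048
router_logits = torch.randn(seq_len)            # stand-in for per-token router scores
weights = torch.softmax(router_logits, dim=-1)  # softmax over the full sequence

print(weights.mean().item())  # exactly 1 / seq_len ≈ 4.9e-4
print(weights.max().item())   # typically only a few times 1e-3
```

So scaling every token, including the pass-through ones, by these weights would shrink them to near zero.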

WuNein commented Apr 28, 2024

@joey00072
One more thing about MoD.

filtered_x = torch.gather(input=x, dim=1, index=indices_expanded) # -> batch, capacity, dim

Since MoD only routes a small fraction of the tokens through attention, I have concerns about model performance in some extreme cases, such as very short inputs.

top_k = int(seq_len * self.capacity_factor)  # maybe I should use math.ceil

I think top_k should have a minimal value, like 10, or seq_len when seq_len < 10.
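
Something like this sketch is what I have in mind (hypothetical, not code from this repo):

```python
import math

def capped_top_k(seq_len: int, capacity_factor: float, min_tokens: int = 10) -> int:
    # math.ceil instead of int() so the capacity never rounds down to zero,
    # plus a floor of min_tokens (or seq_len itself for very short inputs).
    top_k = math.ceil(seq_len * capacity_factor)
    top_k = max(top_k, min(min_tokens, seq_len))
    return min(top_k, seq_len)  # never route more tokens than exist

# e.g. capped_top_k(6, 0.125) -> 6; capped_top_k(2048, 0.125) -> 256
```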

And for long text, this behaves somewhat like sparse attention or a sliding window, which I think is acceptable.

I'd like to hear your thoughts.
