
Training losses of the models #1

Open
borgr opened this issue Mar 29, 2024 · 6 comments

borgr commented Mar 29, 2024

Hi,
Can you share the logs or training losses of the different models (per architecture, isoFLOP size, etc.)? I will be sure to cite ;-)

borgr (Author) commented Apr 24, 2024

Please?

Zymrael (Collaborator) commented May 2, 2024

Hi @borgr, I added a sample here, with losses per isoFLOP group and an indication of model size. The sample includes Transformer++ (Llama) and StripedHyena (SH) with ~8.3% striping.

Hopefully it is useful :) What type of analysis are you planning? If you let me know, I can try to collect the relevant data.

borgr (Author) commented May 2, 2024

I am collecting a meta-dataset with losses/downstream evals for different architectures, pretraining schemes, data, etc. The goal is to let people ask questions about pretraining without doing a lot of pretraining themselves: to get results on understanding scaling laws or on the A/B tests themselves (this part is quite mature; we have results on things like how to fit a scaling law more efficiently, what affects results, and how good the predictions are), and to enable other questions (we just started another effort related to economics... so quite diverse). Data is power, and architectures are often less diverse than what you have :-)
The minimum useful data is loss/eval throughout pretraining (per model family is a big plus), plus model size; metadata is also helpful (architecture, data, or whatever exists).
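
To make the minimum concrete, here is a sketch of the row shape I have in mind (field names are just illustrative, not a fixed schema):

    # Illustrative only: one row per logged point; field names are hypothetical.
    record = {
        "scaled_set": "StripedHyena",  # architecture / model family
        "model_params": 0.29e9,        # model size in parameters
        "flops": 1.0e19,               # training compute so far (or step / tokens seen)
        "loss": 2.41,                  # pretraining loss at that point
        "arch": "...", "data": "...",  # any metadata that exists
    }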

Zymrael (Collaborator) commented May 2, 2024

Great, then it sounds like the sample above should be good. Take a look and let me know!

borgr (Author) commented May 2, 2024

Looks great! If you have anything else, I would be happy if you sent it my way (since you said "sample", you might have more Hyenas and such, right? In the paper you had quite a few more; A/B tests are a good thing and quite rare, and you had a few of those, as they allow future researchers to check whether they can predict which one is better).
If the size is an issue, you can cap the maximum steps per model to something smaller (e.g., 10K?).
If useful for anyone to parse:


    import numpy as np
    import pandas as pd

    # The .npy file wraps a pickled nested dict: {family: {flop_budget: {column: per-model values}}}.
    numpy_dict = np.load("raw_data/loss_size_flops_llama_sh.npy", allow_pickle=True)
    rows = []
    for model, mod_dict in numpy_dict.tolist().items():  # .tolist() unwraps the 0-d object array into the dict
        for flops, isoflop_dict in mod_dict.items():
            # Each isoFLOP group holds parallel per-model arrays; zip them back into per-model records.
            for model_data in zip(*isoflop_dict.values()):
                metadata = {key.lower(): val for key, val in zip(isoflop_dict.keys(), model_data)}
                # Expand the loss curve, assigning each logged loss an evenly spaced FLOP count up to the budget.
                for loss, cur_flops in zip(metadata["loss"], np.linspace(0, float(flops), len(metadata["loss"]))):
                    row = metadata.copy()
                    row["loss"] = loss
                    row["scaled_set"] = model
                    row["flops"] = cur_flops
                    rows.append(row)
    df = pd.DataFrame.from_records(rows)
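
And a quick plotting sketch on the resulting frame, in case it helps anyone sanity-check the parse (only the loss, scaled_set, and flops columns are guaranteed by the snippet above; anything else depends on the keys stored in the .npy):

    import matplotlib.pyplot as plt

    # Rough visual check: loss against the (evenly interpolated) FLOP axis, one series per family.
    for name, group in df.groupby("scaled_set"):
        plt.scatter(group["flops"], group["loss"], s=2, label=name)
    plt.xlabel("FLOPs (interpolated)")
    plt.ylabel("training loss")
    plt.legend()
    plt.show()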

borgr (Author) commented May 8, 2024

Also, is it possible that the number of tokens seen is only ~1e10, and at most ~1e11?
model (params, max FLOPs)          tokens_seen
Striped Hyena 1-11_0.10B_20.00E 3.207023e+10
Striped Hyena 1-11_0.10B_40.00E 6.414046e+10
Striped Hyena 1-11_0.16B_80.00E 8.338310e+10
Striped Hyena 1-11_0.17B_20.00E 1.913711e+10
Striped Hyena 1-11_0.17B_40.00E 3.827422e+10
Striped Hyena 1-11_0.22B_20.00E 1.501130e+10
Striped Hyena 1-11_0.22B_200.00E 1.502729e+11
Striped Hyena 1-11_0.29B_20.00E 1.154081e+10
Striped Hyena 1-11_0.29B_40.00E 2.308162e+10
Striped Hyena 1-11_0.29B_80.00E 4.616324e+10
Striped Hyena 1-11_0.36B_20.00E 9.219595e+09
Striped Hyena 1-11_0.36B_200.00E 9.227016e+10
Striped Hyena 1-11_0.36B_40.00E 1.843919e+10
Striped Hyena 1-11_0.44B_20.00E 7.539229e+09
Striped Hyena 1-11_0.54B_40.00E 1.243781e+10
Striped Hyena 1-11_0.55B_80.00E 2.405991e+10
Striped Hyena 1-11_0.65B_200.00E 5.109001e+10
Striped Hyena 1-11_0.65B_40.00E 1.021174e+10
Striped Hyena 1-11_0.76B_80.00E 1.755822e+10
Striped Hyena 1-11_1.03B_80.00E 1.295666e+10
Striped Hyena 1-11_1.16B_20.00E 2.869705e+09
Striped Hyena 1-11_1.16B_200.00E 2.871053e+10
Striped Hyena 1-11_1.16B_40.00E 5.739410e+09
Striped Hyena 1-11_1.90B_200.00E 1.751262e+10
Striped Hyena 1-11_1.90B_40.00E 3.501019e+09
Striped Hyena 1-11_1.90B_80.00E 7.002038e+09
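
For reference, these numbers look consistent with the common C ≈ 6·N·D rule of thumb; this is my assumption about how tokens_seen relates to the FLOP budget, reading the "20.00E" suffix as exaFLOPs:

    # Hypothetical check: assumes tokens ≈ FLOPs / (6 * params) and that
    # "20.00E" in the model name means a 20 exaFLOP training budget.
    flops = 20e18    # 20 EFLOPs
    params = 0.10e9  # the 0.10B model
    print(f"{flops / (6 * params):.3e}")  # ~3.333e+10, close to the 3.207023e+10 listed above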
