
Accelerate Bradley Terry MLE model fitting #3523

Merged: 8 commits into lm-sys:main on Sep 23, 2024

Conversation

@cthorrez (Contributor) commented Sep 13, 2024

Why are these changes needed?

The bootstrap Bradley Terry model takes upwards of 15 minutes to run for 100 samples. This is costly on resources and hinders experiments such as studying hyperparameters like scale and base. With these changes the BT MLE bootstrap takes around 10 seconds. Parity tests are conducted in this repo: https://github.com/cthorrez/faster-arena-elo/tree/main

  • Added functions to:
    • preprocess battles into deduplicated unique outcomes (model_a, model_b, winner), weighted by occurrence count
    • compute and optimize the Bradley-Terry log-likelihood on the deduplicated data
    • do bootstrap sampling directly in the deduplicated space by drawing counts for each possible outcome from a multinomial, then fit the bootstrap samples in parallel with multiprocessing (see the sketch after this list)
  • Modified the call in __main__ to use the new BT bootstrap function
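
For illustration, here is a minimal sketch of the dedup-plus-multinomial approach described above (function names, the Elo-style anchoring at 1000, and the serial fitting loop are simplifications for this sketch, not this PR's actual code, which also fits the bootstrap samples in parallel with multiprocessing):

    import numpy as np
    import pandas as pd
    from scipy.optimize import minimize
    from scipy.special import expit

    def preprocess(battles: pd.DataFrame):
        # Collapse identical (model_a, model_b, winner) rows into one row plus a count.
        grouped = battles.groupby(["model_a", "model_b", "winner"]).size().reset_index(name="count")
        models = pd.unique(battles[["model_a", "model_b"]].values.ravel())
        idx = {m: i for i, m in enumerate(models)}
        a = grouped["model_a"].map(idx).to_numpy()
        b = grouped["model_b"].map(idx).to_numpy()
        # Outcome encoded as probability mass on model_a; any tie variant counts as 0.5.
        y = grouped["winner"].map(lambda w: {"model_a": 1.0, "model_b": 0.0}.get(w, 0.5)).to_numpy()
        counts = grouped["count"].to_numpy(dtype=np.float64)
        return a, b, y, counts, models

    def fit_bt(a, b, y, weights, n_models, scale=400.0, base=10.0):
        # Weighted Bradley-Terry negative log-likelihood over the deduplicated outcomes.
        alpha = np.log(base) / scale
        def nll(ratings):
            p = expit(alpha * (ratings[a] - ratings[b]))
            return -np.sum(weights * (y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
        res = minimize(nll, np.zeros(n_models), method="L-BFGS-B")
        # Ratings are translation-invariant; shift onto the familiar Elo-style scale.
        return res.x - res.x.mean() + 1000.0

    def bootstrap_bt(a, b, y, counts, n_models, num_rounds=100, seed=0):
        rng = np.random.default_rng(seed)
        n = counts.sum()
        # One multinomial draw over the deduplicated outcome counts is equivalent
        # to resampling all n original battle rows with replacement.
        boot_weights = rng.multinomial(int(n), counts / n, size=num_rounds)
        return np.stack([fit_bt(a, b, y, w, n_models) for w in boot_weights])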

Checks

  • I've run format.sh to lint the changes in this PR.
  • [N/A] I've included any doc changes needed. (interface in leaderboard.md is unchanged)
  • [N/A?] I've made sure the relevant tests are passing (if applicable).

@cthorrez (Contributor, Author)

One question I have is whether the scipy import is OK. In the old version, sklearn is used (which brings in scipy), but it is imported within the function rather than at the top. Neither scipy nor sklearn is in the overall package dependencies. Is it OK as it is now, should the scipy imports go into the respective functions, or should sklearn and scipy be added as fschat dependencies?

@cthorrez marked this pull request as ready for review on September 13, 2024 07:18
@CodingWithTim self-assigned this on Sep 13, 2024
@CodingWithTim (Collaborator) commented Sep 13, 2024

@cthorrez This is amazing stuff. Really appreciate you doing this!

I will look into running this on my end and see if I can reproduce the Arena rankings.

Would it be possible for you to also implement this for style control? The style control code is in the same file; it is roughly the same procedure, but you add a few additional coefficients (token length difference, etc.), which are already pre-computed. More information on style control is in our blog post, linked here.
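
For reference, a minimal sketch of what a contextual ("style control") Bradley-Terry win probability looks like; the function name and the exact feature set are illustrative assumptions, not the code in this repo:

    import numpy as np
    from scipy.special import expit

    # P(model_a wins) = sigmoid(alpha * (r_a - r_b) + style_features . beta), where
    # style_features holds pre-computed per-battle covariates such as the
    # token-length difference between the two responses.
    def contextual_bt_prob(r_a, r_b, style_features, beta, alpha=np.log(10.0) / 400.0):
        return expit(alpha * (r_a - r_b) + style_features @ beta)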

Feel free to let me know if you have any questions.

@CodingWithTim (Collaborator)

@infwinston Correct me if I am wrong:

"Neither scipy nor sklearn is in the overall package dependencies. Is it OK as it is now, should the scipy imports go into the respective functions, or should sklearn and scipy be added as fschat dependencies?"

The scipy imports should ideally be in their respective functions?

@cthorrez (Contributor, Author)

Moved the scipy imports.

For the style control one, I just started looking at that today; previously I had only been looking at the colab notebook when I wrote these optimizations. I think I can probably speed it up significantly too, but not in exactly the same way, due to the non-sparsity of the covariates.

Also, yes, please let me know if you can reproduce. I just ran it again now and noticed some discrepancies which didn't show up when I did parity tests against the jupyter notebook version.

One difference is that the notebook sets tol in the LogisticRegression to 1e-6 rather than the default of 1e-4. Though even when I rerun the main branch with 1e-6, it is slightly out of sync with this PR branch.
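
For reference, the setting being compared is roughly the following (a minimal sketch; fit_intercept=False mirrors the existing fit, but the notebook's exact arguments are an assumption here):

    from sklearn.linear_model import LogisticRegression

    # sklearn's default tol is 1e-4; the notebook tightens it to 1e-6.
    lr_default = LogisticRegression(fit_intercept=False)
    lr_notebook = LogisticRegression(fit_intercept=False, tol=1e-6)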

I can start doing a more thorough comparison, as I could have introduced a bug while porting from my repo to this one, or there may be something else different between the notebook code and this code.

@CodingWithTim (Collaborator)

@cthorrez Thanks man!

So yeah, the style control code in elo_analysis.py is located here:

def get_bootstrap_result_style_control(

If it makes it easier, the colab notebook for style control is also linked here.

Style control is very important because we have to compute the style control version of the leaderboard for every single category displayed on the Chatbot Arena Leaderboard. If you go to our leaderboard, you can see that you can click on "style control" in the "apply filter" box. Even if we speed up the computation for the normal leaderboard, we still have to wait for the style control leaderboard to finish. It would be really great if style control could be sped up as well.

@cthorrez (Contributor, Author)

OK, I just pushed a new change which now has parity with the main branch, if you modify main to have tol=1e-6 on line 126.
The change I made is to optimize with unnormalized weights; I think that when the weights are normalized to sum to 1, some of them become too small and we lose precision.

As for style control, do you think it makes sense to do that in another PR, or to hold off on this one until that is also done?
I can't guarantee when I could have that ready.

@aangelopoulos (Collaborator)

Hey @cthorrez. You are awesome. Great job implementing this. It is super helpful.

@aangelopoulos (Collaborator) commented Sep 20, 2024

Also -- Awesome idea with the multinomial trick! That's very creative.

@cthorrez marked this pull request as a draft on September 20, 2024 05:28
@cthorrez (Contributor, Author)

@aangelopoulos Thanks! For the multinomial, I literally went to sleep thinking about how to do bootstrap sampling in the unique matchup space and woke up with it, lol.

Anyway, I've converted this back to a draft for now; I have some more refactors going on in my own repo to accelerate the online Elo and style control methods as well: https://github.com/cthorrez/faster-arena-elo

@aangelopoulos (Collaborator)

Cool, thanks!

It would be great to have a fast version for continuous-valued features. For those, the multinomial trick won't work, but I'm looking forward to whatever creative ideas you come up with :)

@cthorrez (Contributor, Author)

Yeah, don't get your hopes too high, haha. The biggest win there is not having to duplicate the data to handle ties. For online Elo, it's a bit better: a single pass over as many bootstrap samples of the dataset as you want, doing all the updates vectorized (rough sketch below).
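
A rough sketch of that single-pass vectorized idea (the Poisson(1) per-replicate weights, function name, and hyperparameter defaults are assumptions for illustration, not this PR's code; Poisson weights approximate multinomial row resampling):

    import numpy as np

    def vectorized_bootstrap_elo(battles, models, num_boot=100, k=4.0,
                                 scale=400.0, base=10.0, init=1000.0):
        # One pass over the games, updating every bootstrap replicate at once.
        idx = {m: i for i, m in enumerate(models)}
        ratings = np.full((num_boot, len(models)), init)
        rng = np.random.default_rng(0)
        for model_a, model_b, outcome in battles:  # outcome: 1.0 a wins, 0.0 b wins, 0.5 tie
            a, b = idx[model_a], idx[model_b]
            w = rng.poisson(1.0, size=num_boot)        # per-replicate weight for this game
            e_a = 1.0 / (1.0 + base ** ((ratings[:, b] - ratings[:, a]) / scale))
            delta = k * w * (outcome - e_a)            # vectorized across all replicates
            ratings[:, a] += delta
            ratings[:, b] -= delta
        return ratings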

If you really want style control to be fast, you could probably do it in JAX and get a really big speedup on GPU/TPU, but that can be another investigation.

@aangelopoulos (Collaborator)

By online Elo, do you mean constructing intervals using online gradient descent on the logistic loss? If so, I don't think there's really a need to do bootstrapping. We really shouldn't even be constructing intervals in that setting, and should delete the whole thing lol.

@cthorrez (Contributor, Author)

Yeah I mean this part:

    elif rating_system == "elo":
        bootstrap_df = get_bootstrap_result(
            battles, compute_elo, num_round=num_bootstrap
        )
        elo_rating_median = get_median_elo_from_bootstrap(bootstrap_df)
        elo_rating_final = elo_rating_median

I agree it's not really good practice when the underlying strength of the competitors isn't changing over time and when BT is available, but it's still in the notebook and it's still in the codebase, so I thought I'd try to accelerate it haha

@cthorrez marked this pull request as ready for review on September 21, 2024 07:52
@cthorrez (Contributor, Author)

I did a refactor of Elo, BT, and style control. In my tests, style control only gets a 3x speedup, but it has a much larger memory reduction, from [2N, M+D] to [N, 2+D], where N is the number of rows, M is the number of models, and D is the number of style control features. On my PC, the original bootstrap style control code hits OOM when run on the full 1.8M-row dataset, but the new version runs, though it still takes a while (~15 min to run python elo_analysis.py --clean-battle-file clean_battle_20240826_public.json --rating-system bt --style-control).
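
To make the shape claim concrete, a back-of-the-envelope comparison (the sizes M and D below are illustrative assumptions, not the actual leaderboard counts):

    N, M, D = 1_800_000, 150, 4  # rows, models, style features (illustrative)

    # Original layout: battles duplicated (to handle ties), models one-hot encoded.
    dense_bytes = (2 * N) * (M + D) * 8   # float64 design matrix of shape [2N, M+D]

    # Refactored layout: two integer model indices plus the style features per row.
    compact_bytes = N * (2 + D) * 8       # shape [N, 2+D]

    print(f"{dense_bytes / 1e9:.1f} GB vs {compact_bytes / 1e9:.2f} GB")
    # roughly 4.4 GB vs 0.09 GB at these sizes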

Please let me know if you want to see any changes or have any questions.


# this one is still memory and cpu intensive so don't make too many processes
with mp.Pool(4) as pool:
@CodingWithTim (Collaborator) commented on the diff:

Let's:

  1. Add a progress bar here. I found the method below to work well.

     results = list(tqdm(pool.imap(contextual_bt_fn, boot_idxs), total=num_rounds))

  2. Allow argparse to set the number of CPUs. Something like:

     with mp.Pool(num_cpu) as pool:
         results = list(tqdm(pool.imap(contextual_bt_fn, boot_idxs), total=num_rounds))
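
For completeness, a minimal, self-contained sketch of the suggested wiring (the --num-cpu flag name, its default, and the stand-in contextual_bt_fn are assumptions for illustration):

    import argparse
    import multiprocessing as mp

    import numpy as np
    from tqdm import tqdm

    def contextual_bt_fn(boot_idx):
        # Stand-in for the real per-replicate contextual BT fit.
        return np.zeros(3)

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--num-cpu", type=int, default=4)
        args = parser.parse_args()

        num_rounds = 100
        boot_idxs = range(num_rounds)

        # Fit bootstrap replicates in a worker pool, with a progress bar over completions.
        with mp.Pool(args.num_cpu) as pool:
            results = list(tqdm(pool.imap(contextual_bt_fn, boot_idxs), total=num_rounds))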

@CodingWithTim (Collaborator)

Hey @cthorrez, the code looks amazing. I was able to reproduce matching results on both overall and style control bootstrapping. I added a comment. The machine we are using to compute Elo has a ton of CPUs; using your code, we were able to speed up bootstrapping for style control from 50 minutes to less than 10 minutes (though I haven't tried more than pool=12).

@cthorrez (Contributor, Author)

@CodingWithTim Good suggestions about tqdm and making the number of cores configurable. I did a test locally and confirmed that with 24 cores my PC runs the bootstrap in 7 min and the full script in 11.
Good to see the large speedup for style control on the prod server.

I pushed a new commit with both of those changes.

@CodingWithTim (Collaborator)

@cthorrez Thank you very much for your contribution. This is super helpful for us and very important to our operation! If we start updating our leaderboard more frequently, it is all thanks to you!

@CodingWithTim merged commit e208d56 into lm-sys:main on Sep 23, 2024
1 check passed