
Accelerate Bradley Terry MLE model fitting #3523

Merged: 8 commits into lm-sys:main on Sep 23, 2024

Conversation

@cthorrez (Contributor) commented Sep 13, 2024

Why are these changes needed?

The bootstrap Bradley Terry model takes upwards of 15 minutes to run for 100 samples. This is costly on resources and hinders experiments such as studying hyperparameters like scale and base. With these changes the BT MLE bootstrap takes around 10 seconds. Parity tests are conducted in this repo: https://github.com/cthorrez/faster-arena-elo/tree/main

  • Added functions to:
    • preprocess battles into deduplicated unique outcomes (model_a, model_b, winner), weighted by occurrence count
    • compute and optimize the Bradley-Terry log-likelihood on the deduplicated data
    • do bootstrap sampling directly in the deduplicated space by drawing counts for each possible outcome from a multinomial, then fit the bootstrap samples in parallel with multiprocessing (see the sketch after this list)
  • Modified the call in __main__ to use the new BT bootstrap function
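
For illustration, here is a minimal sketch of the dedup-plus-multinomial approach described above (function names, the Elo-style anchoring at 1000, and the serial fitting loop are simplifications for this sketch, not this PR's actual code, which also fits the bootstrap samples in parallel with multiprocessing):

    import numpy as np
    import pandas as pd
    from scipy.optimize import minimize
    from scipy.special import expit

    def preprocess(battles: pd.DataFrame):
        # Collapse identical (model_a, model_b, winner) rows into one row plus a count.
        grouped = battles.groupby(["model_a", "model_b", "winner"]).size().reset_index(name="count")
        models = pd.unique(battles[["model_a", "model_b"]].values.ravel())
        idx = {m: i for i, m in enumerate(models)}
        a = grouped["model_a"].map(idx).to_numpy()
        b = grouped["model_b"].map(idx).to_numpy()
        # Outcome encoded as probability mass on model_a; any tie variant counts as 0.5.
        y = grouped["winner"].map(lambda w: {"model_a": 1.0, "model_b": 0.0}.get(w, 0.5)).to_numpy()
        counts = grouped["count"].to_numpy(dtype=np.float64)
        return a, b, y, counts, models

    def fit_bt(a, b, y, weights, n_models, scale=400.0, base=10.0):
        # Weighted Bradley-Terry negative log-likelihood over the deduplicated outcomes.
        alpha = np.log(base) / scale
        def nll(ratings):
            p = expit(alpha * (ratings[a] - ratings[b]))
            return -np.sum(weights * (y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
        res = minimize(nll, np.zeros(n_models), method="L-BFGS-B")
        # Ratings are translation-invariant; shift onto the familiar Elo-style scale.
        return res.x - res.x.mean() + 1000.0

    def bootstrap_bt(a, b, y, counts, n_models, num_rounds=100, seed=0):
        rng = np.random.default_rng(seed)
        n = counts.sum()
        # One multinomial draw over the deduplicated outcome counts is equivalent
        # to resampling all n original battle rows with replacement.
        boot_weights = rng.multinomial(int(n), counts / n, size=num_rounds)
        return np.stack([fit_bt(a, b, y, w, n_models) for w in boot_weights])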

Checks

  • I've run format.sh to lint the changes in this PR.
  • [N/A] I've included any doc changes needed. (interface in leaderboard.md is unchanged)
  • [N/A?] I've made sure the relevant tests are passing (if applicable).

@cthorrez (Contributor, Author)

One question I have is whether the scipy import is OK. In the old version, sklearn is used (which brings in scipy), but it is imported within the function rather than at the top. Neither scipy nor sklearn is in the overall package dependencies. Is it OK as it is now, should the scipy imports go into the respective functions, or should sklearn and scipy be added as fschat dependencies?

@cthorrez marked this pull request as ready for review on September 13, 2024 07:18
@CodingWithTim self-assigned this on Sep 13, 2024
@CodingWithTim (Collaborator) commented Sep 13, 2024

@cthorrez This is amazing stuff. Really appreciate you doing this!

I will look into running this on my end and see if I can reproduce the Arena rankings.

Would it be possible for you to also implement this for style control? The style control code is in the same file; it is roughly the same procedure, but you add a few additional coefficients (token length difference, etc.), which are already pre-computed. More information on style control is in our blog post, linked here.
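
For reference, a minimal sketch of what a contextual ("style control") Bradley-Terry win probability looks like; the function name and the exact feature set are illustrative assumptions, not the code in this repo:

    import numpy as np
    from scipy.special import expit

    # P(model_a wins) = sigmoid(alpha * (r_a - r_b) + style_features . beta), where
    # style_features holds pre-computed per-battle covariates such as the
    # token-length difference between the two responses.
    def contextual_bt_prob(r_a, r_b, style_features, beta, alpha=np.log(10.0) / 400.0):
        return expit(alpha * (r_a - r_b) + style_features @ beta)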

Feel free to let me know if you have any questions.

@CodingWithTim (Collaborator)

@infwinston Correct me if I am wrong:

"Neither scipy nor sklearn is in the overall package dependencies. Is it OK as it is now, should the scipy imports go into the respective functions, or should sklearn and scipy be added as fschat dependencies?"

The scipy imports should ideally be in their respective functions?

@cthorrez (Contributor, Author)

Moved the scipy imports.

For the style control one, I just started looking at that today; previously I had only been looking at the colab notebook when I wrote these optimizations. I think I can probably speed it up significantly too, but not in exactly the same way, due to the non-sparsity of the covariates.

Also, yes, please let me know if you can reproduce. I just ran it again now and noticed some discrepancies which didn't show up when I did parity tests against the jupyter notebook version.

One difference is that the notebook sets tol in the LogisticRegression to 1e-6 rather than the default of 1e-4. Though even when I rerun the main branch with 1e-6, it is slightly out of sync with this PR branch.
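
For reference, the setting being compared is roughly the following (a minimal sketch; fit_intercept=False mirrors the existing fit, but the notebook's exact arguments are an assumption here):

    from sklearn.linear_model import LogisticRegression

    # sklearn's default tol is 1e-4; the notebook tightens it to 1e-6.
    lr_default = LogisticRegression(fit_intercept=False)
    lr_notebook = LogisticRegression(fit_intercept=False, tol=1e-6)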

I can start doing a more thorough comparison, as I could have introduced a bug while porting from my repo to this one, or there may be something else different between the notebook code and this code.

@CodingWithTim (Collaborator)

@cthorrez Thanks man!

So yeah, the style control code in elo_analysis.py is located here:

def get_bootstrap_result_style_control(

If it makes it easier, the colab notebook for style control is also linked here.

Style control is very important because we have to compute the style control version of the leaderboard for every single category displayed on the Chatbot Arena Leaderboard. If you go to our leaderboard, you can see that you can click on "style control" in the "apply filter" box. Even if we speed up the computation for the normal leaderboard, we still have to wait for the style control leaderboard to finish. It would be really great if style control could be sped up as well.

@cthorrez (Contributor, Author)

OK, I just pushed a new change which now has parity with the main branch, if you modify main to have tol=1e-6 on line 126.
The change I made is to optimize with unnormalized weights; I think that when the weights are normalized to sum to 1, some of them become too small and we lose precision.

As for style control, do you think it makes sense to do that in another PR, or to hold off on this one until that is also done?
I can't guarantee when I could have that ready.

@aangelopoulos (Collaborator)

Hey @cthorrez. You are awesome. Great job implementing this. It is super helpful.

@aangelopoulos (Collaborator) commented Sep 20, 2024

Also -- Awesome idea with the multinomial trick! That's very creative.

@cthorrez marked this pull request as a draft on September 20, 2024 05:28
@cthorrez (Contributor, Author)

@aangelopoulos Thanks! For the multinomial, I literally went to sleep thinking about how to do bootstrap sampling in the unique matchup space and woke up with it, lol.

Anyway, I've converted this back to a draft for now; I have some more refactors going on in my own repo to accelerate the online Elo and style control methods as well: https://github.com/cthorrez/faster-arena-elo

@aangelopoulos (Collaborator)

Cool, thanks!

It would be great to have a fast version for continuous-valued features. For those, the multinomial trick won't work, but I'm looking forward to whatever creative ideas you come up with :)

@cthorrez (Contributor, Author)

Yeah, don't get your hopes too high, haha. The biggest win there is not having to duplicate the data to handle ties. For online Elo, it's a bit better: a single pass over as many bootstrap samples of the dataset as you want, doing all the updates vectorized (rough sketch below).
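
A rough sketch of that single-pass vectorized idea (the Poisson(1) per-replicate weights, function name, and hyperparameter defaults are assumptions for illustration, not this PR's code; Poisson weights approximate multinomial row resampling):

    import numpy as np

    def vectorized_bootstrap_elo(battles, models, num_boot=100, k=4.0,
                                 scale=400.0, base=10.0, init=1000.0):
        # One pass over the games, updating every bootstrap replicate at once.
        idx = {m: i for i, m in enumerate(models)}
        ratings = np.full((num_boot, len(models)), init)
        rng = np.random.default_rng(0)
        for model_a, model_b, outcome in battles:  # outcome: 1.0 a wins, 0.0 b wins, 0.5 tie
            a, b = idx[model_a], idx[model_b]
            w = rng.poisson(1.0, size=num_boot)        # per-replicate weight for this game
            e_a = 1.0 / (1.0 + base ** ((ratings[:, b] - ratings[:, a]) / scale))
            delta = k * w * (outcome - e_a)            # vectorized across all replicates
            ratings[:, a] += delta
            ratings[:, b] -= delta
        return ratings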

If you really want style control to be fast, you could probably do it in JAX and get a really big speedup on GPU/TPU, but that can be another investigation.

@aangelopoulos (Collaborator)

By online Elo, do you mean constructing intervals using online gradient descent on the logistic loss? If so, I don't think there's really a need to do bootstrapping. We really shouldn't even be constructing intervals in that setting, and should delete the whole thing lol.

@cthorrez (Contributor, Author)

Yeah I mean this part:

    elif rating_system == "elo":
        bootstrap_df = get_bootstrap_result(
            battles, compute_elo, num_round=num_bootstrap
        )
        elo_rating_median = get_median_elo_from_bootstrap(bootstrap_df)
        elo_rating_final = elo_rating_median

I agree it's not really good practice when the underlying strength of the competitors isn't changing over time and when BT is available, but it's still in the notebook and it's still in the codebase, so I thought I'd try to accelerate it haha

@cthorrez marked this pull request as ready for review on September 21, 2024 07:52
@cthorrez (Contributor, Author)

I did a refactor of Elo, BT, and style control. In my tests, style control only gets a 3x speedup, but it has a much larger memory reduction, from [2N, M+D] to [N, 2+D], where N is the number of rows, M is the number of models, and D is the number of style control features. On my PC, the original bootstrap style control code hits OOM when run on the full 1.8M-row dataset, but the new version runs, though it still takes a while (~15 min to run python elo_analysis.py --clean-battle-file clean_battle_20240826_public.json --rating-system bt --style-control).
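
To make the shape claim concrete, a back-of-the-envelope comparison (the sizes M and D below are illustrative assumptions, not the actual leaderboard counts):

    N, M, D = 1_800_000, 150, 4  # rows, models, style features (illustrative)

    # Original layout: battles duplicated (to handle ties), models one-hot encoded.
    dense_bytes = (2 * N) * (M + D) * 8   # float64 design matrix of shape [2N, M+D]

    # Refactored layout: two integer model indices plus the style features per row.
    compact_bytes = N * (2 + D) * 8       # shape [N, 2+D]

    print(f"{dense_bytes / 1e9:.1f} GB vs {compact_bytes / 1e9:.2f} GB")
    # roughly 4.4 GB vs 0.09 GB at these sizes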

Please let me know if you want to see any changes or have any questions.


# this one is still memory and cpu intensive so don't make too many processes
with mp.Pool(4) as pool:
@CodingWithTim (Collaborator) commented on the diff:

Let's:

  1. Add a progress bar here. I found the method below to work well.

     results = list(tqdm(pool.imap(contextual_bt_fn, boot_idxs), total=num_rounds))

  2. Allow argparse to set the number of CPUs. Something like:

     with mp.Pool(num_cpu) as pool:
         results = list(tqdm(pool.imap(contextual_bt_fn, boot_idxs), total=num_rounds))
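
For completeness, a minimal, self-contained sketch of the suggested wiring (the --num-cpu flag name, its default, and the stand-in contextual_bt_fn are assumptions for illustration):

    import argparse
    import multiprocessing as mp

    import numpy as np
    from tqdm import tqdm

    def contextual_bt_fn(boot_idx):
        # Stand-in for the real per-replicate contextual BT fit.
        return np.zeros(3)

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--num-cpu", type=int, default=4)
        args = parser.parse_args()

        num_rounds = 100
        boot_idxs = range(num_rounds)

        # Fit bootstrap replicates in a worker pool, with a progress bar over completions.
        with mp.Pool(args.num_cpu) as pool:
            results = list(tqdm(pool.imap(contextual_bt_fn, boot_idxs), total=num_rounds))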

@CodingWithTim (Collaborator)

Hey @cthorrez, the code looks amazing. I was able to reproduce matching results on both overall and style control bootstrapping. I added a comment. The machine we are using to compute Elo has a ton of CPUs; using your code, we were able to speed up bootstrapping for style control from 50 minutes to less than 10 minutes (though I haven't tried more than pool=12).

@cthorrez (Contributor, Author)

@CodingWithTim Good suggestions about tqdm and making the number of cores configurable. I did a test locally and confirmed that with 24 cores my PC runs the bootstrap in 7 min and the full script in 11.
Good to see the large speedup for style control on the prod server.

I pushed a new commit with both of those changes.

@CodingWithTim (Collaborator)

@cthorrez Thank you very much for your contribution. This is super helpful for us and very important to our operation! If we start updating our leaderboard more frequently, it is all thanks to you!

@CodingWithTim merged commit e208d56 into lm-sys:main on Sep 23, 2024
1 check passed