Accelerate Bradley Terry MLE model fitting #3523
Conversation
One question I have is whether the scipy import is OK. In the old version sklearn is used (which brings in scipy), but it is imported within the function, not at the top. Neither scipy nor sklearn are in the overall package dependencies. Is it OK how it is now, or should the scipy imports go into the respective functions, or should sklearn and scipy be added as fschat dependencies?
@cthorrez This is amazing stuff. Really appreciate you for doing this! I will look into running this on my end and see if I can reproduce Arena rankings. Would it be possible for you to also implement this for style control? The style control code is also in the same file; it is roughly the same procedure, but you add a few additional coefficients (token length difference, etc.) which are already precomputed. More information on style control is in our blog post, linked here. Feel free to let me know if you have any questions.
@infwinston Correct me if I am wrong: the scipy imports should ideally be in their respective functions?
Moved the scipy imports. For the style one, I just started looking at that today; I had previously only been looking at the colab notebook when I wrote these optimizations. I think I can probably speed them up significantly too, but not in the exact same way, due to the non-sparsity of the covariates. Also, yeah, please let me know if you can reproduce. I just ran again now and noticed some discrepancies which didn't show up when I did parity tests against the jupyter notebook version. One difference is that the notebook sets the tol in the LogisticRegression to 1e-6 rather than the default of 1e-4. Though even when I rerun the main branch with 1e-6, it is slightly out of sync with this PR branch. I can start doing a more thorough comparison, as I could have introduced a bug while porting from my repo to this one, or there is something else different between the notebook code and this code.
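To illustrate the function-local import pattern discussed above: keeping the scipy import inside the function means importing the module never pulls in scipy, so it can stay an optional dependency. The function below is a hypothetical stand-in for the PR's rating code, fitting a one-feature logistic model by MLE; the names and conventions are illustrative, not fastchat's.

```python
import numpy as np

def fit_logistic_1d(xs, ys):
    """Fit a one-feature logistic model p = sigmoid(w * x) by MLE.

    scipy is imported inside the function (deferred import), so users
    who never call this code path never need scipy installed.
    Hypothetical sketch, not the PR's actual function.
    """
    from scipy.optimize import minimize_scalar  # deferred, function-local import

    xs = np.asarray(xs, dtype=float)
    signs = 2.0 * np.asarray(ys, dtype=float) - 1.0  # map labels {0,1} -> {-1,+1}

    def nll(w):
        # negative log-likelihood: sum over i of log(1 + exp(-sign_i * w * x_i))
        return np.logaddexp(0.0, -signs * w * xs).sum()

    return minimize_scalar(nll).x
```

The trade-off is that an ImportError surfaces at call time instead of import time, which matches how the PR keeps sklearn/scipy out of the top-level dependency list.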
@cthorrez Thanks man! So yeah, the style control code in
If it makes it easier, the colab notebook for style control is also linked here. Style control is very important because we have to compute the style control version of the leaderboard for every single category displayed on the Chatbot Arena Leaderboard. If you go to our leaderboard, you can see that you can click on "style control" in the "apply filter" box. Even if we speed up the computation for the normal leaderboard, we still have to wait for the new style control leaderboard to finish. It would be really great if style control could be sped up as well.
Ok, I just pushed a new change which now has parity with the main branch if you modify main to have tol=1e-6 on line 126. As for style, do you think it makes sense to do that in another PR, or to hold off on this one until that is also done?
Hey @cthorrez . You are awesome. Great job implementing this. It is super helpful. |
Also -- Awesome idea with the multinomial trick! That's very creative. |
@aangelopoulos Thanks! For the multinomial trick, I literally went to sleep thinking about how to do bootstrap sampling in the unique matchup space and woke up with it lol. Anyway, I've converted this back to draft for now; I have some more refactors going on in my own repo to accelerate the online Elo and style control methods as well: https://github.com/cthorrez/faster-arena-elo
Cool, thanks! It would be great to have a fast version for continuous-valued features. For those, the multinomial trick won't work, but I'm looking forward to whatever creative ideas you come up with :) |
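The multinomial trick discussed above can be sketched in a few lines. The idea: resampling N battles with replacement is distributionally identical to drawing new counts for the unique (model_a, model_b, result) triples from Multinomial(N, counts / N), so no resampled rows are ever materialized. The function name and data layout below are illustrative assumptions, not the PR's exact code.

```python
import numpy as np

def multinomial_bootstrap_counts(outcome_counts, num_rounds, seed=0):
    """Bootstrap in the unique-outcome space via the multinomial trick.

    outcome_counts[i] is how many times the i-th unique
    (model_a, model_b, result) triple occurs in the full dataset.
    Returns one row of resampled counts per bootstrap round, which can
    then be fed into a weighted MLE fit instead of duplicated rows.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(outcome_counts)
    n = int(counts.sum())
    # shape (num_rounds, num_unique_outcomes): each row is one bootstrap "dataset"
    return rng.multinomial(n, counts / n, size=num_rounds)
```

This is also why the trick breaks for continuous-valued features: with continuous covariates nearly every row is unique, so the aggregation into counts saves nothing.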
yeah, don't get your hopes too high haha; the biggest win there is not having to duplicate the data to handle ties. For online Elo it's a bit better: do a single pass over as many bootstrap samples of the dataset as you want, with all the updates vectorized. If you really want style control fast, you could probably do it in JAX and get some really big speedups on GPU/TPU, but that can be another investigation.
By online elo do you mean constructing intervals using online gradient descent on the logistic loss? If so I don't think there's really a need to do bootstrapping. We really shouldn't even be constructing intervals in that setting, and should delete the whole thing lol. |
Yeah, I mean this part: FastChat/fastchat/serve/monitor/elo_analysis.py, lines 611 to 616 (at a04072e)
I agree it's not really good practice when the underlying strength of the competitors isn't changing over time and when BT is available, but it's still in the notebook and it's still in the codebase, so I thought I'd try to accelerate it haha
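The "single pass with vectorized updates" idea for online Elo can be sketched as follows. One assumption to flag loudly: the sketch weights each match by a per-replicate Poisson(1) draw, a standard approximation to multinomial bootstrap weights, so one vectorized pass over the data updates all bootstrap replicates at once. This is an illustration of the idea, not FastChat's actual implementation.

```python
import numpy as np

def online_elo_bootstrap(matches, num_models, num_boot, k=4.0,
                         scale=400.0, base=10.0, init=1000.0, seed=0):
    """One pass of online Elo that updates num_boot bootstrap replicates at once.

    matches: iterable of (model_a, model_b, outcome) with outcome 1.0 if
    a wins, 0.0 if b wins, 0.5 for a tie. Each replicate weights every
    match by a Poisson(1) draw (Poisson bootstrap approximation), so a
    single vectorized pass replaces num_boot sequential passes.
    """
    rng = np.random.default_rng(seed)
    ratings = np.full((num_boot, num_models), init)
    for a, b, outcome in matches:
        w = rng.poisson(1.0, size=num_boot)  # per-replicate match weight
        expected = 1.0 / (1.0 + base ** ((ratings[:, b] - ratings[:, a]) / scale))
        delta = k * w * (outcome - expected)
        ratings[:, a] += delta
        ratings[:, b] -= delta
    return ratings
```

The spread of each model's column across replicates then gives the bootstrap interval, for whatever that interval is worth in the online setting.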
I did a refactor of Elo, BT, and style control. In my tests the style control only gets a 3x speedup, but it has a much better memory reduction, from [2N, M+D] to [N, 2+D], where N is the number of rows, M is the number of models, and D is the number of style control features. On my PC the original bootstrap style control code hits OOM when run on the full 1.8M dataset, but the new version runs for me, though it still takes a while (~15 min). Please let me know if you want to see any changes or have any questions.
# this one is still memory and cpu intensive so don't make too many processes
with mp.Pool(4) as pool:
Let's:
- Add a progress bar here. I found the method below to work well.
  results = list(tqdm(pool.imap(contextual_bt_fn, boot_idxs), total=num_rounds))
- Allow the argparse to set the number of cpu. Something like
  with mp.Pool(num_cpu) as pool:
      results = list(tqdm(pool.imap(contextual_bt_fn, boot_idxs), total=num_rounds))
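Both suggestions fit together in a short self-contained sketch. The worker function here is a trivial stand-in for the real per-round bootstrap fit, and the tqdm fallback is only there to keep the sketch runnable when tqdm isn't installed; the argparse flag names are assumptions, not the PR's.

```python
import argparse
import multiprocessing as mp

try:
    from tqdm import tqdm
except ImportError:  # keep the sketch runnable without tqdm installed
    def tqdm(iterable, total=None):
        return iterable

def contextual_bt_fn(boot_idx):
    # Stand-in for the per-round bootstrap fit; here it just squares its input.
    return boot_idx * boot_idx

def run_bootstrap(num_rounds, num_cpu):
    boot_idxs = range(num_rounds)
    with mp.Pool(num_cpu) as pool:
        # imap yields results as each round finishes (in order),
        # so tqdm can show live progress across the pool
        results = list(tqdm(pool.imap(contextual_bt_fn, boot_idxs),
                            total=num_rounds))
    return results

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--num-rounds", type=int, default=20)
    parser.add_argument("--num-cpu", type=int, default=2)
    args = parser.parse_args()
    run_bootstrap(args.num_rounds, args.num_cpu)
```

pool.map would also work here, but imap is what lets tqdm tick as rounds complete rather than jumping from 0 to 100% at the end.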
Hey @cthorrez, the code looks amazing. I was able to reproduce the variance results on both overall and style control bootstrapping. I added a comment. The machine we are using to compute Elo has a ton of CPUs; using your code, we were able to speed up bootstrapping for style control from 50 minutes to less than 10 minutes (though I haven't tried more than pool=12).
@CodingWithTim Good suggestions about tqdm and making the number of cores configurable. I did a test locally and confirmed that with 24 cores my PC runs the bootstrap in 7 min and the full script in 11. I pushed a new commit with both of those changes.
@cthorrez Thank you very much for your contribution. This is super helpful for us and very important to our operation! If we start updating our leaderboard more frequently, it is all thanks to you! |
Why are these changes needed?
The bootstrap Bradley Terry model takes upwards of 15 minutes to run for 100 samples. This is costly on resources and hinders experiments such as studying hyperparameters like scale and base. With these changes the BT MLE bootstrap takes around 10 seconds. Parity tests are conducted in this repo: https://github.com/cthorrez/faster-arena-elo/tree/main
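The core of the speedup described above is fitting the Bradley-Terry MLE over unique matchups with counts as weights, so each bootstrap round only changes the count vector, never the design. The sketch below shows that weighted fit under assumed conventions (argument names, the Elo-style scale/base parametrization, and the tiny ridge term are illustrative choices, not the PR's exact code).

```python
import numpy as np
from scipy.optimize import minimize

def fit_bt(matchups, counts, num_models, scale=400.0, base=10.0):
    """Weighted Bradley-Terry MLE over unique (winner, loser) matchups.

    matchups[i] = (winner, loser); counts[i] = how often that outcome
    occurred. Aggregating duplicate battles into counts is what makes
    bootstrap rounds cheap: only `counts` changes per round.
    """
    winners = np.array([m[0] for m in matchups])
    losers = np.array([m[1] for m in matchups])
    counts = np.asarray(counts, dtype=float)
    alpha = np.log(base) / scale  # Elo-style parametrization of the logistic

    def nll(ratings):
        diff = ratings[winners] - ratings[losers]
        # count-weighted log-loss; the tiny ridge pins down the
        # translation-invariant solution (ratings are only identified
        # up to an additive constant)
        return (counts * np.logaddexp(0.0, -alpha * diff)).sum() \
               + 1e-6 * (ratings ** 2).sum()

    return minimize(nll, np.zeros(num_models), method="L-BFGS-B").x
```

Combined with the multinomial counts trick, 100 bootstrap rounds become 100 cheap re-weighted optimizations over the same small design, which is consistent with the 15-minutes-to-10-seconds figure reported above.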
Updated __main__ to look at the new BT bootstrap function.

Checks
- I've run format.sh to lint the changes in this PR.