Description of the experiment: the Preset Arena collected 17,205 comparisons between 241 different presets.
Some numbers:
- 7215 valid votes
- 951 voting sessions
- 288 users with usernames
Preset definitions: https://huggingface.co/datasets/oobabooga/preset-arena
The first step in the analysis of the votes was to try to identify suspicious voters. Each voting session received a unique uuid string, allowing the frequency of left/right votes to be analyzed.
I used the following code to estimate the probability that a voting session was biased toward one side. It was obtained by asking ChatGPT for a fair-coin test:
```python
from scipy.stats import beta

def compute_bias_probability(outcomes, prior_alpha=1, prior_beta=1, _print=False):
    # Count the number of heads (left votes) and tails (right votes)
    num_heads = outcomes.count('left')
    num_tails = outcomes.count('right')
    if _print:
        print(num_heads, num_tails)

    # Update the prior with the observed outcomes
    posterior_alpha = prior_alpha + num_heads
    posterior_beta = prior_beta + num_tails

    # Calculate the bias probability using the Beta distribution
    bias_probability = beta.cdf(0.5, posterior_alpha, posterior_beta)
    return bias_probability
```
A session was disregarded if `bias_probability > 0.99`, which happened for 0.6% of all sessions.
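For illustration, the filter might be applied per voting session like this; `session_votes` is a made-up stand-in for however the vote log is actually grouped by session uuid, not the real variable name:

```python
# Hypothetical example of applying the filter per session.
# session_votes maps each session uuid to its list of 'left'/'right' outcomes.
session_votes = {
    'uuid-1': ['left', 'right', 'left', 'right', 'left'],
    'uuid-2': ['right'] * 12,  # suspiciously one-sided
}

valid_sessions = {
    uuid: outcomes
    for uuid, outcomes in session_votes.items()
    if compute_bias_probability(outcomes) <= 0.99
}
```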
The valid votes were then converted into elo scores for the presets. The basic update formula is:
```python
def update_rating(rating, opponent_rating, outcome, k=32):
    # Expected score against this opponent, given the rating difference
    expected_score = 1 / (1 + 10**((opponent_rating - rating) / 400))
    # Move the rating toward the actual outcome, scaled by the k-factor
    new_rating = rating + k * (outcome - expected_score)
    return new_rating
```
where the ratings are initialized at 1000 for all presets, and `outcome` is 1 for a win and 0 for a loss.
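As a quick worked example (not part of the original analysis): two presets that both sit at the initial rating of 1000 have an expected score of 0.5 against each other, so with k=32 the winner of their first match gains 16 points and the loser drops by the same amount:

```python
>>> update_rating(1000, 1000, 1)  # winner
1016.0
>>> update_rating(1000, 1000, 0)  # loser
984.0
```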
To make things more robust, instead of calculating the elo scores just once, I used the following procedure (a code sketch of it is included below):
- take a random subsample containing 90% of votes
- using that sample, calculate the elo scores for chat and instruct prompts separately
- repeat 200 times
- take the averages of the elo scores for each preset
Additionally, I have not counted votes where both completions are identical.
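Here is a minimal sketch of that bootstrap, assuming the votes have already been loaded as (winner, loser) preset-name pairs for a single prompt type, with votes on identical completions already dropped; the function and variable names are placeholders rather than the actual analysis script:

```python
import random
from collections import defaultdict

def bootstrap_elo(matches, n_rounds=200, sample_frac=0.9, k=32):
    # matches: list of (winner_preset, loser_preset) tuples for one prompt type,
    # already excluding votes where both completions were identical.
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(n_rounds):
        ratings = defaultdict(lambda: 1000.0)
        sample = random.sample(matches, int(sample_frac * len(matches)))
        for winner, loser in sample:
            r_w, r_l = ratings[winner], ratings[loser]
            # update_rating() is the function defined above
            ratings[winner] = update_rating(r_w, r_l, 1, k)
            ratings[loser] = update_rating(r_l, r_w, 0, k)
        for preset, score in ratings.items():
            totals[preset] += score
            counts[preset] += 1
    # Average each preset's elo score over the rounds in which it appeared
    return {preset: totals[preset] / counts[preset] for preset in totals}
```

Averaging over many random subsamples makes the final scores less sensitive to vote ordering and to any single voting session.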
- I find that the top chat presets are all kind of the same. It may be due to the chat prompts being too simple and short, causing presets with low `top_p` to be favored.
- 5 variations of the Mirostat preset were included. It turned out that `Mirostat-5` was a bit better than the `Mirostat` preset originally included in text-generation-webui:
preset | params | elo score (chat) | elo score (instruct) | elo score (all) | matches (chat) | matches (instruct) |
---|---|---|---|---|---|---|
Mirostat-5 | 2 | 1012.7 | 1100.0 | 1056.4 | 36 | 23 |
Mirostat | 1 | 993.1 | 1109.2 | 1051.1 | 27 | 22 |
Mirostat-2 | 2 | 1067.9 | 1028.2 | 1048.0 | 29 | 25 |
Mirostat-4 | 2 | 1031.9 | 1020.2 | 1026.1 | 37 | 35 |
Mirostat-3 | 2 | 988.2 | 1021.2 | 1004.7 | 29 | 29 |
- Similarly, 5 Contrastive Search variations were included, and `Contrastive Search-3` ended up being a bit better than the original `Contrastive Search`:
preset | params | elo score (chat) | elo score (instruct) | elo score (all) | matches (chat) | matches (instruct) |
---|---|---|---|---|---|---|
Special-Contrastive Search-3 | 3 | 1077.7 | 1115.8 | 1096.7 | 27 | 18 |
Special-Contrastive Search | 3 | 1077.3 | 1095.5 | 1086.4 | 35 | 31 |
Special-Contrastive Search-1 | 3 | 899.7 | 851.9 | 875.8 | 16 | 10 |
Special-Contrastive Search-4 | 3 | 765.8 | 791.0 | 778.4 | 33 | 19 |
Special-Contrastive Search-2 | 3 | 801.0 | 736.9 | 768.9 | 27 | 25 |
- Eta Sampling (another special technique) did not perform very well (but its parameters are present in other top-performing presets):
preset | params | elo score (chat) | elo score (instruct) | elo score (all) | matches (chat) | matches (instruct) |
---|---|---|---|---|---|---|
Special-Eta Sampling | 3 | 1018.5 | 1016.5 | 1017.5 | 29 | 25 |
- The best preset overall, considering the average of the chat and instruct elo scores, was also perhaps the most obvious one. I originally named it `simple-1`, not expecting it to get anywhere:
```yaml
temperature: 0.7
top_p: 0.9
repetition_penalty: 1.15
top_k: 20
```
The StarChat preset, which is also very simple, performed well too:
```yaml
temperature: 0.2
top_p: 0.95
top_k: 50
```
This suggests that fancy samplers may not be all that necessary.
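For context, these simple presets map directly onto the standard sampling arguments of Hugging Face `generate()`; the snippet below only illustrates the `simple-1` parameters (the model name is an arbitrary example, and this is not how text-generation-webui itself applies presets):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Arbitrary small model, just to keep the illustration self-contained
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,           # simple-1 parameters
    top_p=0.9,
    top_k=20,
    repetition_penalty=1.15,
    max_new_tokens=50,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```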
For the purpose of including better presets in text-generation-webui, I removed presets with `top_p < 0.05` or `top_k < 3`, since those values seemed too low and artificial. That left me with the following (in decreasing order of elo score):
Preset | New name |
---|---|
random_preset_066 | Divine Intellect |
random_preset_134 | Big O |
simple-1 | |
random_preset_035 | Space Alien |
starchat | StarChat |
random_preset_183 | Titanic |
tfs-with-top-a | |
random_preset_002 | Asterism |
Special-Contrastive Search-3 | Contrastive Search |
random_preset_101 | Midnight Enigma |
random_preset_161 | Yara |
random_preset_120 | Shortwave |
Kobold-Godlike | |
I took the liberty of giving the new random presets some cheesy names.
In those 13 new presets, these are the sampling parameters that are present and how many times they appear:
```
12 temperature
11 top_p
11 top_k
11 repetition_penalty
5 top_a
3 tfs
2 typical_p
2 eta_cutoff
1 penalty_alpha
1 epsilon_cutoff
1 encoder_repetition_penalty
```
In a follow-up analysis, I tried removing individual samplers from the presets and checking whether the resulting logits changed.
For that, I took a random story copied from the internet, split it by spaces, and computed the logits using the first N words as input, for each N <= 200. That is, 200 logit vectors were computed for each preset. I then considered a parameter redundant if its removal kept the logits identical 90% of the time or more.
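A minimal sketch of that redundancy criterion, assuming the 200 processed logit vectors have already been computed once with the full preset and once with the parameter in question removed (the argument names are placeholders, not the actual script):

```python
import numpy as np

def is_redundant(logits_full_preset, logits_param_removed, threshold=0.9):
    # Count how often removing the parameter left the logits exactly unchanged
    identical = sum(
        np.array_equal(a, b)
        for a, b in zip(logits_full_preset, logits_param_removed)
    )
    # Redundant if the logits were identical for at least 90% of the prefixes
    return identical / len(logits_full_preset) >= threshold
```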
The resulting parameter frequency after this clean-up was:
```
12 temperature
11 top_p
11 top_k
11 repetition_penalty
2 typical_p
2 tfs
1 top_a
1 penalty_alpha
1 encoder_repetition_penalty
```
Note that the eta sampling parameters (`epsilon_cutoff` and `eta_cutoff`) disappeared completely.