Feature request: Per-token logits/logprobs in server (for reward models) #13697
mashdragon started this conversation in Ideas
Replies: 1 comment
-
How to implement it yourself if you need it:

diff --git a/tools/server/server.cpp b/tools/server/server.cpp
index 7424da52..b9abbcf3 100644
--- a/tools/server/server.cpp
+++ b/tools/server/server.cpp
@@ -559,6 +559,7 @@ struct completion_token_output {
float prob;
};
std::vector<prob_info> probs;
+ float token_0_prob = 0.0f; // stays 0.0f if token id 0 never appears among the sampled candidates
json to_json(bool post_sampling_probs) const {
json probs_for_token = json::array();
@@ -595,6 +596,10 @@ struct completion_token_output {
post_sampling_probs ? "top_probs" : "top_logprobs",
p.to_json(post_sampling_probs)
},
+ {
+ post_sampling_probs ? "token_0_prob" : "token_0_logprob",
+ post_sampling_probs ? p.token_0_prob : logarithm(p.token_0_prob)
+ },
});
}
return out;
@@ -2379,6 +2384,7 @@ struct server_context {
void populate_token_probs(const server_slot & slot, completion_token_output & result, bool post_sampling, bool special, int idx) {
size_t n_probs = slot.params.sampling.n_probs;
size_t n_vocab = llama_vocab_n_tokens(vocab);
+
if (post_sampling) {
const auto * cur_p = common_sampler_get_candidates(slot.smpl);
const size_t max_probs = cur_p->size;
@@ -2390,6 +2396,12 @@ struct server_context {
break;
}
}
+ for (size_t i = 0; i < max_probs; i++) { // record the probability assigned to token id 0
+ if (cur_p->data[i].id == 0) {
+ result.token_0_prob = cur_p->data[i].p;
+ break;
+ }
+ }
// set probability for top n_probs tokens
result.probs.reserve(max_probs);
@@ -2412,6 +2424,13 @@ struct server_context {
break;
}
}
+ for (size_t i = 0; i < n_vocab; i++) {
+ // set probability for token id 0
+ if (cur[i].id == 0) {
+ result.token_0_prob = cur[i].p;
+ break;
+ }
+ }
// set probability for top n_probs tokens
result.probs.reserve(n_probs);
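
With the patch applied, the new field appears next to the existing per-token probability output. Below is a minimal client-side sketch of how it could be read; the localhost URL, the placeholder prompt, and the use of the plain /completion endpoint are assumptions, while token_0_logprob is the key added by the diff above (token_0_prob when post-sampling probabilities are requested).

# Sketch: read the token-id-0 logprob from a llama-server patched as above.
# Assumes the server runs at http://localhost:8080 and serves a reward model.
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "<reward-model formatted conversation>",  # placeholder prompt
        "n_predict": 1,  # reward scoring only needs a single generated token
        "n_probs": 1,    # any value > 0 enables completion_probabilities
    },
)
resp.raise_for_status()
entry = resp.json()["completion_probabilities"][0]

# The patch adds the logprob of token id 0 to every entry, even though that
# token would never make it into a small top-n list on its own.
print("reward logprob:", entry["token_0_logprob"])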
-
Reward models such as https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward-HF output their score by reading the logit for token id = 0 after generating a single token.
I can handle this (as logprobs instead of logits, I guess...) in llama-server today using the completion API, but the only knob exposed is n_probs, and I have to request the logprobs of the entire vocabulary (over 128,000 tokens), because the logprob for token id = 0 is lower than that of every other token, so it never appears in a small top-n list. It would be nice if we could specify a per-token-ID logprob output so I don't have to send 13 MB over the network on every query.
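
For comparison, a rough sketch of the workaround described above, i.e. requesting the whole vocabulary through n_probs and scanning the top-n list for token id 0. The localhost URL and placeholder prompt are assumptions, and 128256 is the Llama 3.1 vocabulary size (adjust for your model):

# Current workaround: ask for every token's logprob and pick out token id 0.
# This is the ~13 MB response per query mentioned above.
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "<reward-model formatted conversation>",  # placeholder prompt
        "n_predict": 1,
        "n_probs": 128256,  # full Llama 3.1 vocab, so token id 0 is guaranteed to be included
    },
)
resp.raise_for_status()
entry = resp.json()["completion_probabilities"][0]
token0 = next(p for p in entry["top_logprobs"] if p["id"] == 0)
print("reward logprob:", token0["logprob"])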