Feature request: Per-token logits/logprobs in server (for reward models) #13697
mashdragon started this conversation in Ideas
Replies: 1 comment
-
How to implement it yourself if you need it:

diff --git a/tools/server/server.cpp b/tools/server/server.cpp
index 7424da52..b9abbcf3 100644
--- a/tools/server/server.cpp
+++ b/tools/server/server.cpp
@@ -559,6 +559,7 @@ struct completion_token_output {
float prob;
};
std::vector<prob_info> probs;
+ float token_0_prob = 0.0f; // stays 0.0f if token id 0 never appears among the sampled candidates
json to_json(bool post_sampling_probs) const {
json probs_for_token = json::array();
@@ -595,6 +596,10 @@ struct completion_token_output {
post_sampling_probs ? "top_probs" : "top_logprobs",
p.to_json(post_sampling_probs)
},
+ {
+ post_sampling_probs ? "token_0_prob" : "token_0_logprob",
+ post_sampling_probs ? p.token_0_prob : logarithm(p.token_0_prob)
+ },
});
}
return out;
@@ -2379,6 +2384,7 @@ struct server_context {
void populate_token_probs(const server_slot & slot, completion_token_output & result, bool post_sampling, bool special, int idx) {
size_t n_probs = slot.params.sampling.n_probs;
size_t n_vocab = llama_vocab_n_tokens(vocab);
+
if (post_sampling) {
const auto * cur_p = common_sampler_get_candidates(slot.smpl);
const size_t max_probs = cur_p->size;
@@ -2390,6 +2396,12 @@ struct server_context {
break;
}
}
+ for (size_t i = 0; i < max_probs; i++) { // record the probability assigned to token id 0
+ if (cur_p->data[i].id == 0) {
+ result.token_0_prob = cur_p->data[i].p;
+ break;
+ }
+ }
// set probability for top n_probs tokens
result.probs.reserve(max_probs);
@@ -2412,6 +2424,13 @@ struct server_context {
break;
}
}
+ for (size_t i = 0; i < n_vocab; i++) {
+ // set probability for token id 0
+ if (cur[i].id == 0) {
+ result.token_0_prob = cur[i].p;
+ break;
+ }
+ }
// set probability for top n_probs tokens
result.probs.reserve(n_probs);
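
With the patch applied, the new field appears next to the existing per-token probability output. Below is a minimal client-side sketch of how it could be read; the localhost URL, the placeholder prompt, and the use of the plain /completion endpoint are assumptions, while token_0_logprob is the key added by the diff above (token_0_prob when post-sampling probabilities are requested).

# Sketch: read the token-id-0 logprob from a llama-server patched as above.
# Assumes the server runs at http://localhost:8080 and serves a reward model.
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "<reward-model formatted conversation>",  # placeholder prompt
        "n_predict": 1,  # reward scoring only needs a single generated token
        "n_probs": 1,    # any value > 0 enables completion_probabilities
    },
)
resp.raise_for_status()
entry = resp.json()["completion_probabilities"][0]

# The patch adds the logprob of token id 0 to every entry, even though that
# token would never make it into a small top-n list on its own.
print("reward logprob:", entry["token_0_logprob"])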
-
Reward models such as https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward-HF output their score by reading the logit for token id = 0 after generating a single token.
I can handle this (as logprobs instead of logits, I guess...) in llama-server today using the completion API, but the only knob exposed is n_probs, and I have to request the logprobs of the entire vocabulary (over 128,000 tokens), because the logprob for token id = 0 is lower than that of every other token, so it never appears in a small top-n list. It would be nice if we could specify a per-token-ID logprob output so I don't have to send 13 MB over the network on every query.
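
For comparison, a rough sketch of the workaround described above, i.e. requesting the whole vocabulary through n_probs and scanning the top-n list for token id 0. The localhost URL and placeholder prompt are assumptions, and 128256 is the Llama 3.1 vocabulary size (adjust for your model):

# Current workaround: ask for every token's logprob and pick out token id 0.
# This is the ~13 MB response per query mentioned above.
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "<reward-model formatted conversation>",  # placeholder prompt
        "n_predict": 1,
        "n_probs": 128256,  # full Llama 3.1 vocab, so token id 0 is guaranteed to be included
    },
)
resp.raise_for_status()
entry = resp.json()["completion_probabilities"][0]
token0 = next(p for p in entry["top_logprobs"] if p["id"] == 0)
print("reward logprob:", token0["logprob"])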