
[discussion] Completion Statistic Metrics #306

Closed
solyarisoftware opened this issue Sep 9, 2023 · 7 comments

@solyarisoftware

solyarisoftware commented Sep 9, 2023

Hi guys,

here's just some tinkering to share with you.

BTW, yesterday I published prompter.vim, my vim plugin that turns the vim editor into an LLM playground. As stated in the readme, I want to replace the Azure/OpenAI LLM interface with LiteLLM in the near future.

In my plugin, after each completion I print a one-line statistic, something like:

Latency: 1480ms (1.5s) Tokens: 228 (prompt: 167 completion: 61) Speed: 154 Words: 28 Chars: 176, Lines: 7

Back to the point of converging on common metrics to measure a completion, the obvious variables are:

Input (prompt):

  • max_tokens
  • temperature
  • etc.

Output (completion):

Beyond the token counts already returned, I find some other basic LLM measurements interesting:

  • Speed (tokens / latency ratio), measured in tokens per second.

    I implemented it as in https://github.com/solyarisoftware/prompter.vim/blob/master/python/utils.py#L46
    If we measure time in seconds, the speed is typically a positive integer, maybe from 1 up to a few thousand.

    I'm not totally sure about the practical impact of having this speed metric, but in theory it makes sense, I guess.

  • Prompt/completion ratio, i.e. prompt tokens / completion tokens.

    This metric relates the length of the prompt to the length of the completion, which may have some practical meaning in vertical applications (I'm not fully sure whether this metric is really relevant).

  • many other metrics / statistics

So the general proposal is to integrate into LiteLLM support for these, and maybe smarter, common metrics to measure and compare LLM behavior. I'm thinking about some statistics functions and a nice (terminal-based) pretty printer for these statistics.
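For concreteness, here is a minimal sketch of what such a statistics helper on top of litellm could look like. It is not an existing LiteLLM function; it only assumes the OpenAI-style usage block that litellm already returns, and measures latency locally with a timer:

```python
# Minimal sketch, not an existing LiteLLM API: wrap litellm.completion()
# and derive a few completion statistics from the usage block it returns.
import time
import litellm

def completion_with_stats(model, messages, **kwargs):
    start = time.perf_counter()
    response = litellm.completion(model=model, messages=messages, **kwargs)
    latency_ms = (time.perf_counter() - start) * 1000.0
    latency_s = latency_ms / 1000.0

    usage = response.usage  # OpenAI-style: prompt_tokens, completion_tokens, total_tokens
    stats = {
        "latency_ms": latency_ms,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "total_tokens": usage.total_tokens,
        # speed as completion tokens generated per second of wall-clock latency
        "completion_tokens_per_s": usage.completion_tokens / latency_s,
        # the prompter.vim one-liner uses total tokens per second instead
        "total_tokens_per_s": usage.total_tokens / latency_s,
        # unitless prompt/completion ratio discussed above
        "prompt_completion_ratio": usage.prompt_tokens / max(usage.completion_tokens, 1),
    }
    return response, stats
```

Calling it with the usual model/messages arguments would return both the normal response and the derived stats dictionary.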

But above all: what do you think about the need for these possible metrics as a shared standard in the LLM community (and afterwards in LiteLLM)?

Thanks
giorgio

@solyarisoftware solyarisoftware changed the title [discussion] Completion Statstic Metrics [discussion] Completion Statistic Metrics Sep 9, 2023
@ishaan-jaff
Contributor

https://github.com/solyarisoftware/prompter.vim looks amazing! Just took a look at it.

@ishaan-jaff
Contributor

@solyarisoftware so here are the proposed action items, let me know if I understood correctly:

Add metrics calculation to litellm:

  • Tokens/second
  • Prompt tokens/completion tokens ratio

@ishaan-jaff
Contributor

does https://github.com/solyarisoftware/prompter.vim use streaming ?

@krrishdholakia
Contributor

krrishdholakia commented Sep 9, 2023

I'm confused - don't we already return the token usage, prompt tokens, completion tokens as part of each completion call?

I want to replace the Azure/OpenAI LLM interface with LiteLLM in the near future.

@solyarisoftware was there a reason you didn't start with litellm?

@solyarisoftware
Author

Thanks @ishaan-jaff and all for the feedback.
I have to correct myself.

Add metrics calculation to litellm:

  • Tokens/second
  • Prompt tokens/completion tokens ratio

yes, when it comes to completion metrics,
in addition to latency (or response time), calculated as the elapsed time in milliseconds of the LLM completion,
these two are maybe the relevant ones. But there are maybe other variable ratios that could be interesting for a statistical evaluation of a model's behavior.

By the way, I have to apologize and correct my previous definition of "speed", which is misleading. The completion tokens/second is maybe more precisely defined as "Token Throughput (token-per-millisecond ratio)".
Please read this table as a possible elaboration:

Concept: Token Throughput (token-per-millisecond ratio)
Definition: Token Throughput measures the efficiency of a large language model (LLM) in terms of token generation relative to its response time. It represents the rate at which the model can produce tokens given a specific latency or response time.
Token aspect: It accounts for the number of tokens generated by the model. For example, if the LLM produces 1000 tokens, that reflects its language generation capacity.
Throughput aspect: It considers the latency or response time required to generate those tokens. A lower latency with a higher number of tokens signifies higher efficiency.
Calculation formula: Token Throughput = Number of Tokens / Latency (in milliseconds)
Units: tokens per millisecond (tokens/ms)
Comprehensive metric: Token Throughput doesn't focus solely on speed (tokens per second) or frequency (how often responses occur) but combines both aspects. It reflects the LLM's ability to deliver meaningful responses efficiently, balancing generating more content with doing so in a timely manner.
Use cases: A high Token Throughput indicates that the LLM can generate a substantial amount of content within a short time, crucial for real-time applications like chatbots or content generation. A lower Token Throughput may be acceptable where response time is less critical, such as batch processing of text.
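To make the numbers concrete, taking the one-line statistic quoted at the top of this issue: 228 total tokens / 1480 ms ≈ 0.154 tokens/ms, which is the "Speed: 154" (tokens per second) value shown there.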

And another possible metric is this:

Concept: Tokens Prompt / Tokens Completion Ratio
Definition: The Tokens Prompt / Tokens Completion Ratio measures the relationship between the number of tokens in the input prompt and the number of tokens in the model's generated completion. It quantifies how efficiently the model responds to input in terms of token economy.
Token prompt aspect: The number of tokens in the input prompt provided to the language model. A longer prompt may contain more context or instructions for the model.
Token completion aspect: The number of tokens generated by the model in its response or completion to the input prompt. This includes the content produced as a result of the prompt.
Calculation formula: Tokens Prompt / Tokens Completion Ratio = Number of Tokens in Prompt / Number of Tokens in Completion
Units: unitless ratio (no units)
Interpretation: A higher ratio indicates that the prompt is long relative to the completion, i.e. the model is comparatively concise; a lower ratio indicates that the model produces more content relative to the length of the prompt.
Use cases: Monitoring the efficiency of language models in utilizing input information. Understanding how the model balances the length of responses with the length of input. Assessing the relevance and completeness of model-generated content relative to the prompt.
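Again using the numbers from the one-line statistic at the top: 167 prompt tokens / 61 completion tokens ≈ 2.7, i.e. the prompt was almost three times longer than the completion.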

does https://github.com/solyarisoftware/prompter.vim use streaming ?

good (expected) question :-) So far I haven't implemented streaming completion in prompter.vim, just to keep things simple in my alpha release and avoid visualization complications (while coding the plugin).
Nevertheless you are right, streaming completion would require common metric definitions to be analyzed.

I have never used the API completion streaming mode, but I imagine a streaming completion is composed of a list of completion "chunks" (events):

C1    C2        C3                                Cn  (last chunk) -> full completion
███ ████ █████████ █████████ █████████████ ██████████

So maybe you can apply the token throughput metric to each chunk, and maybe compute a mean value.
Having the average token size of each chunk (or the mean over all the chunks of a completion) could be useful for vertical applications on top of the LLM, such as a streaming text-to-speech engine...
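Purely as an illustration of the per-chunk idea (not an existing LiteLLM feature), here is a rough sketch assuming OpenAI-style streaming chunks where chunk.choices[0].delta.content carries the text delta; real token counts aren't available per chunk, so character counts stand in for them:

```python
# Rough sketch: per-chunk timing for a streaming completion.
# Assumes OpenAI-style chunks; uses characters as a stand-in for tokens,
# since per-chunk token counts are not returned while streaming.
import time
import litellm

def stream_with_chunk_stats(model, messages, **kwargs):
    per_chunk = []  # (chunk_latency_ms, chunk_chars)
    start = last = time.perf_counter()
    for chunk in litellm.completion(model=model, messages=messages,
                                    stream=True, **kwargs):
        now = time.perf_counter()
        piece = getattr(chunk.choices[0].delta, "content", None) or ""
        per_chunk.append(((now - last) * 1000.0, len(piece)))
        last = now

    total_ms = (last - start) * 1000.0
    total_chars = sum(chars for _, chars in per_chunk)
    return {
        "chunks": len(per_chunk),
        "total_ms": total_ms,
        "mean_chunk_ms": total_ms / len(per_chunk) if per_chunk else 0.0,
        "chars_per_ms": total_chars / total_ms if total_ms else 0.0,
    }
```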


I'm confused - don't we already return the token usage, prompt tokens, completion tokens as part of each completion call?

Right, and I'm not criticizing. My proposal now is just to brainstorm WHICH "derivative" variables, on top of the I/O variables you mentioned (which are already available, sure), are common/interesting/useful.
After aligning on that set of variables, my proposal would be to integrate into LiteLLM some statistics variables as an "on-top" component, maybe also supplying a nice terminal-oriented tool to show these statistics. Just an idea.
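As a picture of the "terminal-oriented" part, a formatter that reproduces the one-line statistic quoted at the top could be as small as this (the function and its fields are hypothetical, not part of LiteLLM):

```python
# Hypothetical one-line pretty printer mirroring the prompter.vim output
# shown at the top of this issue (speed = total tokens per second there).
def format_stats(prompt_tokens, completion_tokens, latency_ms, text):
    total = prompt_tokens + completion_tokens
    speed = total / (latency_ms / 1000.0) if latency_ms else 0.0
    return (
        f"Latency: {latency_ms:.0f}ms ({latency_ms / 1000.0:.1f}s) "
        f"Tokens: {total} (prompt: {prompt_tokens} completion: {completion_tokens}) "
        f"Speed: {speed:.0f} "
        f"Words: {len(text.split())} Chars: {len(text)}, Lines: {len(text.splitlines())}"
    )
```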

I want to replace the Azure/OpenAI LLM interface with LiteLLM in the near future.

@solyarisoftware was there a reason you didn't start with litellm?

That's easy: I had already coded the OpenAI API interface when I discovered LiteLLM :) but I think it makes sense to replace my ugly code with LiteLLM in the near future.

@krrishdholakia
Contributor

Thanks for the great feedback @solyarisoftware

@ishaan-jaff
Contributor

closing due to inactivity
