
[discussion] Completion Statistic Metrics #306

Closed
solyarisoftware opened this issue Sep 9, 2023 · 7 comments

@solyarisoftware

solyarisoftware commented Sep 9, 2023

Hi guys,

here's just some tinkering to share with you.

BTW, yesterday I published prompter.vim, my vim plugin that turns the vim editor into an LLM playground. As stated in the readme, I want to replace the Azure/OpenAI LLM interface with LiteLLM in the near future.

In my plugin, after each completion I print a one-line statistic, something like:

Latency: 1480ms (1.5s) Tokens: 228 (prompt: 167 completion: 61) Speed: 154 Words: 28 Chars: 176, Lines: 7

Back to the point of converging on common metrics to measure a completion, the obvious variables are:

Input (prompt):

  • max_tokens
  • temperature
  • etc.

Output (completion):

Beyond the token counts already returned, I find some other basic LLM measurements interesting:

  • Speed (tokens / latency ratio), measured in tokens per second.

    I implemented it as in https://github.com/solyarisoftware/prompter.vim/blob/master/python/utils.py#L46
    If we measure time in seconds, the speed is typically a positive integer, maybe from 1 up to a few thousand.

    I'm not totally sure about the practical impact of having this speed metric, but in theory it makes sense, I guess.

  • Prompt/completion ratio, i.e. prompt tokens / completion tokens.

    This metric relates the length of the prompt to the length of the completion, which may have some practical meaning in vertical applications (I'm not fully sure whether this metric is really relevant).

  • many other metrics / statistics

So the general proposal is to integrate into LiteLLM support for these, and maybe smarter, common metrics to measure and compare LLM behavior. I'm thinking about some statistics functions and a nice (terminal-based) pretty printer for these statistics.
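For concreteness, here is a minimal sketch of what such a statistics helper on top of litellm could look like. It is not an existing LiteLLM function; it only assumes the OpenAI-style usage block that litellm already returns, and measures latency locally with a timer:

```python
# Minimal sketch, not an existing LiteLLM API: wrap litellm.completion()
# and derive a few completion statistics from the usage block it returns.
import time
import litellm

def completion_with_stats(model, messages, **kwargs):
    start = time.perf_counter()
    response = litellm.completion(model=model, messages=messages, **kwargs)
    latency_ms = (time.perf_counter() - start) * 1000.0
    latency_s = latency_ms / 1000.0

    usage = response.usage  # OpenAI-style: prompt_tokens, completion_tokens, total_tokens
    stats = {
        "latency_ms": latency_ms,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "total_tokens": usage.total_tokens,
        # speed as completion tokens generated per second of wall-clock latency
        "completion_tokens_per_s": usage.completion_tokens / latency_s,
        # the prompter.vim one-liner uses total tokens per second instead
        "total_tokens_per_s": usage.total_tokens / latency_s,
        # unitless prompt/completion ratio discussed above
        "prompt_completion_ratio": usage.prompt_tokens / max(usage.completion_tokens, 1),
    }
    return response, stats
```

Calling it with the usual model/messages arguments would return both the normal response and the derived stats dictionary.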

But above all: what do you think about the need for these possible metrics as a shared standard in the LLM community (and afterwards in LiteLLM)?

Thanks
giorgio

@solyarisoftware solyarisoftware changed the title [discussion] Completion Statstic Metrics [discussion] Completion Statistic Metrics Sep 9, 2023
@ishaan-jaff
Contributor

https://github.com/solyarisoftware/prompter.vim looks amazing! Just took a look at it.

@ishaan-jaff
Contributor

@solyarisoftware so here are the proposed action items, let me know if I understood correctly:

Add metrics calculation to litellm:

  • Tokens/second
  • Prompt tokens/completion tokens ratio

@ishaan-jaff
Contributor

does https://github.com/solyarisoftware/prompter.vim use streaming ?

@krrishdholakia
Contributor

krrishdholakia commented Sep 9, 2023

I'm confused - don't we already return the token usage, prompt tokens, completion tokens as part of each completion call?

I want to replace the Azure/OpenAI LLM interface with LiteLLM in the near future.

@solyarisoftware was there a reason you didn't start with litellm?

@solyarisoftware
Author

Thanks @ishaan-jaff and all for the feedback.
I have to correct myself.

Add metrics calculation to litellm:

  • Tokens/second
  • Prompt tokens/completion tokens ratio

yes, when it comes to completion metrics,
in addition to latency (or response time), calculated as the elapsed time in milliseconds of the LLM completion,
these two are maybe the relevant ones. But there are maybe other variable ratios that could be interesting for a statistical evaluation of a model's behavior.

By the way, I have to apologize and correct my previous definition of "speed", which is misleading. The completion tokens/second is maybe more precisely defined as "Token Throughput (token-per-millisecond ratio)".
Please read this table as a possible elaboration:

Concept: Token Throughput (token-per-millisecond ratio)
Definition: Token Throughput measures the efficiency of a large language model (LLM) in terms of token generation relative to its response time. It represents the rate at which the model can produce tokens given a specific latency or response time.
Token aspect: It accounts for the number of tokens generated by the model. For example, if the LLM produces 1000 tokens, that reflects its language generation capacity.
Throughput aspect: It considers the latency or response time required to generate those tokens. A lower latency with a higher number of tokens signifies higher efficiency.
Calculation formula: Token Throughput = Number of Tokens / Latency (in milliseconds)
Units: tokens per millisecond (tokens/ms)
Comprehensive metric: Token Throughput doesn't focus solely on speed (tokens per second) or frequency (how often responses occur) but combines both aspects. It reflects the LLM's ability to deliver meaningful responses efficiently, balancing generating more content with doing so in a timely manner.
Use cases: A high Token Throughput indicates that the LLM can generate a substantial amount of content within a short time, crucial for real-time applications like chatbots or content generation. A lower Token Throughput may be acceptable where response time is less critical, such as batch processing of text.
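To make the numbers concrete, taking the one-line statistic quoted at the top of this issue: 228 total tokens / 1480 ms ≈ 0.154 tokens/ms, which is the "Speed: 154" (tokens per second) value shown there.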

And another possible metric is this:

Concept: Tokens Prompt / Tokens Completion Ratio
Definition: The Tokens Prompt / Tokens Completion Ratio measures the relationship between the number of tokens in the input prompt and the number of tokens in the model's generated completion. It quantifies how efficiently the model responds to input in terms of token economy.
Token prompt aspect: The number of tokens in the input prompt provided to the language model. A longer prompt may contain more context or instructions for the model.
Token completion aspect: The number of tokens generated by the model in its response or completion to the input prompt. This includes the content produced as a result of the prompt.
Calculation formula: Tokens Prompt / Tokens Completion Ratio = Number of Tokens in Prompt / Number of Tokens in Completion
Units: unitless ratio (no units)
Interpretation: A higher ratio indicates that the prompt is long relative to the completion, i.e. the model is comparatively concise; a lower ratio indicates that the model produces more content relative to the length of the prompt.
Use cases: Monitoring the efficiency of language models in utilizing input information. Understanding how the model balances the length of responses with the length of input. Assessing the relevance and completeness of model-generated content relative to the prompt.
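Again using the numbers from the one-line statistic at the top: 167 prompt tokens / 61 completion tokens ≈ 2.7, i.e. the prompt was almost three times longer than the completion.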

does https://github.com/solyarisoftware/prompter.vim use streaming ?

good (expected) question :-) So far I haven't implemented streaming completion in prompter.vim, just to keep things simple in my alpha release and avoid visualization complications (while coding the plugin).
Nevertheless you are right, streaming completion would require common metric definitions to be analyzed.

I have never used the API completion streaming mode, but I imagine a streaming completion is composed of a list of completion "chunks" (events):

C1    C2        C3                                Cn  (last chunk) -> full completion
███ ████ █████████ █████████ █████████████ ██████████

So maybe you can apply the token throughput metric to each chunk, and maybe compute a mean value.
Having the average token size of each chunk (or the mean over all the chunks of a completion) could be useful for vertical applications on top of the LLM, such as a streaming text-to-speech engine...
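Purely as an illustration of the per-chunk idea (not an existing LiteLLM feature), here is a rough sketch assuming OpenAI-style streaming chunks where chunk.choices[0].delta.content carries the text delta; real token counts aren't available per chunk, so character counts stand in for them:

```python
# Rough sketch: per-chunk timing for a streaming completion.
# Assumes OpenAI-style chunks; uses characters as a stand-in for tokens,
# since per-chunk token counts are not returned while streaming.
import time
import litellm

def stream_with_chunk_stats(model, messages, **kwargs):
    per_chunk = []  # (chunk_latency_ms, chunk_chars)
    start = last = time.perf_counter()
    for chunk in litellm.completion(model=model, messages=messages,
                                    stream=True, **kwargs):
        now = time.perf_counter()
        piece = getattr(chunk.choices[0].delta, "content", None) or ""
        per_chunk.append(((now - last) * 1000.0, len(piece)))
        last = now

    total_ms = (last - start) * 1000.0
    total_chars = sum(chars for _, chars in per_chunk)
    return {
        "chunks": len(per_chunk),
        "total_ms": total_ms,
        "mean_chunk_ms": total_ms / len(per_chunk) if per_chunk else 0.0,
        "chars_per_ms": total_chars / total_ms if total_ms else 0.0,
    }
```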


I'm confused - don't we already return the token usage, prompt tokens, completion tokens as part of each completion call?

Right, and I'm not criticizing. My proposal now is just to brainstorm WHICH "derivative" variables, on top of the I/O variables you mentioned (which are already available, sure), are common/interesting/useful.
After aligning on that set of variables, my proposal would be to integrate into LiteLLM some statistics variables as an "on-top" component, maybe also supplying a nice terminal-oriented tool to show these statistics. Just an idea.
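As a picture of the "terminal-oriented" part, a formatter that reproduces the one-line statistic quoted at the top could be as small as this (the function and its fields are hypothetical, not part of LiteLLM):

```python
# Hypothetical one-line pretty printer mirroring the prompter.vim output
# shown at the top of this issue (speed = total tokens per second there).
def format_stats(prompt_tokens, completion_tokens, latency_ms, text):
    total = prompt_tokens + completion_tokens
    speed = total / (latency_ms / 1000.0) if latency_ms else 0.0
    return (
        f"Latency: {latency_ms:.0f}ms ({latency_ms / 1000.0:.1f}s) "
        f"Tokens: {total} (prompt: {prompt_tokens} completion: {completion_tokens}) "
        f"Speed: {speed:.0f} "
        f"Words: {len(text.split())} Chars: {len(text)}, Lines: {len(text.splitlines())}"
    )
```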

I want to replace the Azure/OpenAI LLM interface with LiteLLM in the near future.

@solyarisoftware was there a reason you didn't start with litellm?

That's easy: I had already coded the OpenAI API interface when I discovered LiteLLM :) but I think it makes sense to replace my ugly code with LiteLLM in the near future.

@krrishdholakia
Contributor

Thanks for the great feedback @solyarisoftware

@ishaan-jaff
Contributor

closing due to inactivity
