What happened?
- When using `gemini-2.5-pro` through LiteLLM with context caching enabled, a single request produced the following usage: `prompt_tokens=262,960`, `prompt_tokens_details.cached_tokens=257,955`, `completion_tokens=1,744`.
- LiteLLM reported `response_cost = 0.7642 USD`, but recalculating with Google Vertex pricing (cache miss tokens × $1.25/million + cache hit tokens × $0.625/million + output tokens × $10/million) gives `0.1849 USD`.
- The gap matches charging cache hits twice: once via `text_tokens * input_cost_per_token` and once via `cache_hit_tokens * cache_read_input_token_cost`.
- Reading `litellm/litellm_core_utils/llm_cost_calc/utils.py` shows `_parse_prompt_tokens_details` keeps `text_tokens` equal to the full prompt count, and `_calculate_input_cost` adds both terms, so Gemini cache hits are double-counted (see the sketch below).
- Expected behaviour: cache hit tokens should only be charged at the cache-read rate (after removing them from the normal prompt bucket).
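A minimal sketch of the behaviour described above, using the per-token rates quoted later in this report. The function names only loosely mirror the LiteLLM helpers; this is not the actual implementation in `llm_cost_calc/utils.py`, which handles many more fields.

```python
# Simplified reconstruction of the double-count described above.
# Rates are the Google Vertex prices quoted in this report (per token).
INPUT_COST_PER_TOKEN = 1.25 / 1e6          # cache-miss input rate
CACHE_READ_INPUT_TOKEN_COST = 0.625 / 1e6  # cache-hit (cached) input rate


def buggy_input_cost(prompt_tokens: int, cached_tokens: int) -> float:
    """What the report describes: text_tokens stays at the full prompt count."""
    text_tokens = prompt_tokens  # cached tokens are NOT subtracted here
    return (
        text_tokens * INPUT_COST_PER_TOKEN
        + cached_tokens * CACHE_READ_INPUT_TOKEN_COST  # charged a second time
    )


def expected_input_cost(prompt_tokens: int, cached_tokens: int) -> float:
    """Expected behaviour: cache hits leave the normal prompt bucket."""
    uncached_tokens = prompt_tokens - cached_tokens
    return (
        uncached_tokens * INPUT_COST_PER_TOKEN
        + cached_tokens * CACHE_READ_INPUT_TOKEN_COST
    )


print(buggy_input_cost(262_960, 257_955))     # ~0.4899 (input side only)
print(expected_input_cost(262_960, 257_955))  # ~0.1675 (input side only)
```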
Relevant log output
LiteLLM usage block:
```json
{
  "total_tokens": 264704,
  "prompt_tokens": 262960,
  "completion_tokens": 1744,
  "prompt_tokens_details": {
    "text_tokens": 262960,
    "cached_tokens": 257955
  },
  "completion_tokens_details": {
    "reasoning_tokens": 0
  },
  "response_cost": 0.7642
}
```

Manual recomputation:
- cache miss tokens = 262,960 - 257,955 = 5,005 → 5,005 × 1.25 / 1e6 = 0.0062563
- cache hit tokens = 257,955 × 0.625 / 1e6 = 0.1612219
- output tokens = 1,744 × 10 / 1e6 = 0.01744
- expected total = 0.1849182 USD (verified in the snippet below)
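The same arithmetic as a quick check in plain Python (nothing LiteLLM-specific):

```python
# Recompute the expected charge from the usage block above, using the
# Vertex rates quoted in this report (per million tokens).
prompt_tokens = 262_960
cached_tokens = 257_955
completion_tokens = 1_744

cache_miss_tokens = prompt_tokens - cached_tokens  # 5,005
expected_cost = (
    cache_miss_tokens * 1.25 / 1e6     # 0.0062563
    + cached_tokens * 0.625 / 1e6      # 0.1612219
    + completion_tokens * 10 / 1e6     # 0.01744
)
print(round(expected_cost, 4))  # 0.1849 -- versus the reported response_cost of 0.7642
```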
Are you an ML Ops Team?
No
What LiteLLM version are you on?
v1.77.3
Twitter / LinkedIn details
N/A