
[Bug]: Gemini cache hit tokens double counted in cost calculator #14849

@tjx666

Description


What happened?

  • When using gemini-2.5-pro through LiteLLM with context caching enabled, a single request produced the following usage: prompt_tokens=262,960, prompt_tokens_details.cached_tokens=257,955, completion_tokens=1,744.
  • LiteLLM reported response_cost = 0.7642 USD, but recalculating with Google Vertex pricing (cache miss tokens × 1.25/million + cache hit tokens × 0.625/million + output tokens × 10/million) gives 0.1849 USD.
  • The gap matches charging cache hits twice: once via text_tokens * input_cost_per_token and once via cache_hit_tokens * cache_read_input_token_cost.
  • Reading litellm/litellm_core_utils/llm_cost_calc/utils.py shows _parse_prompt_tokens_details keeps text_tokens equal to the full prompt count, and _calculate_input_cost adds both terms, so Gemini cache hits are double-counted.
  • Expected behaviour: cache hit tokens should only be charged at the cache-read rate, after removing them from the normal prompt bucket (see the sketch after this list).
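
Not the actual LiteLLM code, just a minimal sketch of the current versus expected input-cost formula described above, assuming the Vertex rates quoted in this report; the function names are hypothetical:

```python
# Minimal sketch (not LiteLLM's implementation) of the double-count vs. the
# expected input-cost formula, assuming the Vertex AI rates quoted above.
INPUT_COST_PER_TOKEN = 1.25 / 1_000_000        # cache-miss prompt tokens
CACHE_READ_COST_PER_TOKEN = 0.625 / 1_000_000  # cache-hit prompt tokens

def buggy_input_cost(prompt_tokens: int, cached_tokens: int) -> float:
    # Current behaviour: text_tokens stays equal to the full prompt count,
    # so cached tokens are billed at both the input and the cache-read rate.
    text_tokens = prompt_tokens
    return text_tokens * INPUT_COST_PER_TOKEN + cached_tokens * CACHE_READ_COST_PER_TOKEN

def expected_input_cost(prompt_tokens: int, cached_tokens: int) -> float:
    # Expected behaviour: remove cache hits from the normal prompt bucket and
    # charge them only at the cache-read rate.
    text_tokens = prompt_tokens - cached_tokens
    return text_tokens * INPUT_COST_PER_TOKEN + cached_tokens * CACHE_READ_COST_PER_TOKEN

print(buggy_input_cost(262_960, 257_955))     # ~0.4899 USD (input portion only)
print(expected_input_cost(262_960, 257_955))  # ~0.1675 USD (input portion only)
```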

Relevant log output

LiteLLM usage block:

{
  "total_tokens": 264704,
  "prompt_tokens": 262960,
  "completion_tokens": 1744,
  "prompt_tokens_details": {
    "text_tokens": 262960,
    "cached_tokens": 257955
  },
  "completion_tokens_details": {
    "reasoning_tokens": 0
  },
  "response_cost": 0.7642
}

Manual recomputation:

  • cache miss tokens = 262,960 - 257,955 = 5,005 → 5,005 × 1.25 / 1e6 = 0.0062563
  • cache hit tokens = 257,955 × 0.625 / 1e6 = 0.1612219
  • output tokens = 1,744 × 10 / 1e6 = 0.01744
  • expected total = 0.1849182 USD (reproduced by the snippet below)
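
For reference, a short script reproducing the expected figure, using the per-million-token rates quoted above:

```python
# Recompute the expected cost for this request with the Vertex AI rates
# quoted above (USD per million tokens).
prompt_tokens = 262_960
cached_tokens = 257_955
completion_tokens = 1_744

cache_miss_tokens = prompt_tokens - cached_tokens  # 5,005

expected_cost = (cache_miss_tokens * 1.25
                 + cached_tokens * 0.625
                 + completion_tokens * 10) / 1_000_000

print(f"{expected_cost:.7f} USD")  # 0.1849181, vs. the reported 0.7642
```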

Are you an ML Ops Team?

No

What LiteLLM version are you on?

v1.77.3

Twitter / LinkedIn details

N/A
