Perplexity Calculator

A command-line tool to locally calculate the perplexity (PPL) of a given text using a specified language model.

...perplexity is a measure of uncertainty in the value of a sample from a discrete probability distribution. The larger the perplexity, the less likely it is that an observer can guess the value which will be drawn from the distribution.
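Concretely, for a language model, perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch of that arithmetic (the `perplexity` helper below is illustrative, not part of this repo):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    n = len(token_logprobs)
    avg_nll = -sum(token_logprobs) / n
    return math.exp(avg_nll)

# Sanity check: if every token is drawn uniformly from k equally likely
# outcomes, the perplexity is exactly k ("as hard to guess as a k-sided die").
k = 8
logprobs = [math.log(1 / k)] * 10
print(perplexity(logprobs))  # ~8.0
```

A lower perplexity means the model assigns higher probability to the observed text.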

This is not to be confused with Perplexity, the search engine product.

This repo largely follows the code provided on the excellent HuggingFace documentation on perplexity.

Supports CUDA, MLX (Mac M-series), and CPU inference on recurrent LLMs (Llama, Mistral, etc.) and encoder-decoder LLMs (BERT).

Coming Soon

Non-HF hosted models (OpenAI, Anthropic, Gemini-series)

Masked LLMs

Installation

  1. Clone this repository:

    git clone https://github.com/cezarc1/perplexity
    cd perplexity
  2. (Optional) If you plan to use the shell script, ensure uv is installed; if it's not found, the script will prompt you to install it. See here for more info on uv.

Usage

Option 1: Running as a Shell Script

  1. Make the script executable:

    chmod +x calculate_perplexity.sh
  2. Run the script with text:

    ./calculate_perplexity.sh --model_id "google/gemma-2-2b-it" \
      --text "It's simple: Overspecialize, and you breed in weakness. It's slow death."

    Or with a text file:

    ./calculate_perplexity.sh --model_id "google/gemma-2-2b-it" \
      --text_file "path/to/your/text_file.txt"

Option 2a: Running as a Python Script

Run the Python script directly with uv:

uv run --with-requirements requirements.txt calculate_perplexity.py \
  --model_id "google/gemma-2-2b-it" \
  --text "It's simple: Overspecialize, and you breed in weakness. It's slow death."

Or with a text file:

uv run --with-requirements requirements.txt calculate_perplexity.py \
  --model_id "google/gemma-2-2b-it" \
  --text_file "path/to/your/text_file.txt"

Option 2b: Running as a Python Script (venv)

python -m venv .venv
source .venv/bin/activate 
pip install -r requirements.txt
python calculate_perplexity.py --model_id "google/gemma-2-2b-it" \
  --text "It's simple: Overspecialize, and you breed in weakness. It's slow death."

Or with a text file:

python calculate_perplexity.py --model_id "google/gemma-2-2b-it" \
  --text_file "path/to/your/text_file.txt"

Arguments

  • --model_id: The ID of the model to use (e.g., "meta-llama/Meta-Llama-3-8B")
  • --model_type: The type of model to use (choices: "recurrent", "encoder_decoder", "masked")
  • --text: The text to calculate perplexity on
  • --text_file: Path to a text file to calculate perplexity on
  • --stride (optional): The stride length to use for calculating perplexity (default: 512)

Note: You must provide either --text or --text_file, but not both.
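For texts longer than the model's context window, the --stride option controls the sliding-window evaluation described in the HuggingFace perplexity guide: each window overlaps the previous one, but only the tokens past the previous window's end contribute to the loss. A rough sketch of the window arithmetic (the `stride_windows` helper and the `max_length` of 1024 are illustrative assumptions, not this tool's actual code):

```python
def stride_windows(n_tokens, max_length=1024, stride=512):
    """Yield (begin, end, target_len) windows over a token sequence.

    Each window spans up to `max_length` tokens; only the last `target_len`
    tokens of a window are scored, the rest serve as context. The scored
    spans are disjoint and together cover every token exactly once.
    """
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_length, n_tokens)
        target_len = end - prev_end  # tokens not yet scored by earlier windows
        yield begin, end, target_len
        prev_end = end
        if end == n_tokens:
            break

windows = list(stride_windows(1200, max_length=1024, stride=512))
```

A smaller stride gives each scored token more context (and a perplexity closer to the fully-factorized value) at the cost of more forward passes.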

Notes

  • The shell script (Option 1) and the direct uv invocation (Option 2a) use uv to manage dependencies and run the Python script.
  • The venv route (Option 2b) requires you to manually install the dependencies listed in requirements.txt.
  • Make sure you have sufficient permissions to download and use the specified model on HuggingFace.
