A comprehensive toolkit for Large Language Model (LLM) inference, evaluation, and data processing.
- Unified Inference & Judgment: Single command-line tool supporting both single-round and multi-round inference and evaluation
- Model Evaluation: Flexible evaluation framework with customizable metrics and comparison capabilities
- Data Processing: Utilities for merging, converting, and preparing training data (SFT, DPO)
- Batch Processing: Efficient parallel processing with ThreadPoolExecutor
- Format Support: Automatic detection and handling of both JSON and JSONL formats
```bash
git clone https://github.com/aexcellent/llm-cli-tools.git
cd llm-cli-tools
pip install -e .
```

Or install from PyPI:

```bash
pip install llm-cli-tools
```

The following parameters are available for tools that interact with LLM APIs:
| Parameter | Description | Default |
|---|---|---|
| `--model` | Model name (e.g., `qwen-plus`, `gpt-4`, `deepseek-chat`) | Required |
| `--api-key` | API key (reads from environment variable if not specified) | None |
| `--base-url` | API base URL (uses the default if not specified) | None |
| `--temperature` | Sampling temperature for generation | 0.6 |
| `--max-tokens` | Maximum number of tokens to generate | 4096 |
| `--max-workers` | Number of parallel worker threads | 10 |
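To picture how `--max-workers`, `--temperature`, and `--max-tokens` interact, here is a minimal sketch of parallel chat-completion calls with `ThreadPoolExecutor`. It is an illustration, not the toolkit's actual implementation, and assumes the `openai>=1.0.0` client with `OPENAI_API_KEY` set in the environment:

```python
# Minimal sketch (not the toolkit's code): fan requests out across worker threads,
# mirroring --max-workers / --temperature / --max-tokens.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_one(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",                                    # --model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.6,                                  # --temperature
        max_tokens=4096,                                  # --max-tokens
    )
    return response.choices[0].message.content

prompts = ["Summarize A.", "Summarize B.", "Summarize C."]
with ThreadPoolExecutor(max_workers=10) as pool:          # --max-workers
    outputs = list(pool.map(run_one, prompts))
```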
Modes:
- `inference`: Single-round inference
- `inference-round`: Multi-round inference
- `judge`: Single-round judgment
- `judge-round`: Multi-round judgment
Common Parameters:
| Parameter | Description | Default |
|---|---|---|
| `--mode` | Running mode (`inference`, `inference-round`, `judge`, `judge-round`) | Required |
| `--input-path` | Input file path (JSON or JSONL format) | Required |
| `--output-path` | Output file path (auto-generated if not specified) | None |
| `--preserve-fields` | Fields to preserve from the input data (comma-separated) | None |
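As an illustration of what `--input-path` might point at, the snippet below writes a small JSONL file with one request per line plus extra fields that could be kept via `--preserve-fields`. The field names here (`id`, `messages`, `label`) are assumptions for the example, not the documented schema:

```python
# Hypothetical input records -- the toolkit's real field names may differ.
import json
import os

records = [
    {"id": "sample-1",
     "messages": [{"role": "user", "content": "Classify this ticket ..."}],
     "label": "spam"},
    {"id": "sample-2",
     "messages": [{"role": "user", "content": "Classify that ticket ..."}],
     "label": "ham"},
]

os.makedirs("data", exist_ok=True)
with open("data/input.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```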
Multi-round Parameters:
| Parameter | Description | Default |
|---|---|---|
| `--rounds` | Number of result rounds to generate for each sample | 1 |
Judgment Parameters:
| Parameter | Description | Default |
|---|---|---|
| `--write-path` | Path to the inference results file to judge | Required for judge modes |
| `--prompt-file` | Custom judgment prompt file path | None |
| `--skip-no-output` | Skip samples whose output is None | False |
| `--save-original` | Save the original input and output alongside the results | False |
Examples:
Run single-round inference:
```bash
llm-cli inference \
    --model gpt-4 \
    --input-path data/input.jsonl \
    --output-path results/output.jsonl \
    --api-key YOUR_API_KEY
```

Run multi-round inference:

```bash
llm-cli inference-round \
    --model gpt-4 \
    --input-path data/input.jsonl \
    --output-path results/output.jsonl \
    --rounds 3 \
    --api-key YOUR_API_KEY
```

Run single-round judgment:

```bash
llm-cli judge \
    --model gpt-4 \
    --input-path data/input.jsonl \
    --output-path results/judgment.jsonl \
    --api-key YOUR_API_KEY
```

Run multi-round judgment:

```bash
llm-cli judge-round \
    --model gpt-4 \
    --input-path data/input.jsonl \
    --output-path results/judgment.jsonl \
    --rounds 3 \
    --api-key YOUR_API_KEY
```
`llm-eval` parameters:

| Parameter | Description | Default |
|---|---|---|
| `--input-path` | Input test file (JSON or JSONL format) | test_data.json |
| `--output-path` | Output file path for results | outputs.jsonl |
| `--api-url` | API URL | http://localhost:8101/v1/chat/completions |
| `--timeout` | API request timeout (seconds) | 300 |
| `--json-mode` | Force JSON output | False |
| `--result-key` | Result field name in the model output (supports nested fields such as `prediction,class`) | auditresult |
| `--expected-key` | Expected-value field name in the test data (supports nested fields such as `output,label`) | auditresult |
| `--eval-mode` | Evaluation mode: binary, multiclass, regression, or exact match | binary |
| `--only-infer` | Only run inference, without evaluation | False |
| `--only-eval` | Only evaluate existing results | False |
| `--resume` | Skip samples whose trace_id already exists | False |
| `--limit` | Limit the number of test cases | None |
| `--verbose` | Show detailed logs | False |
Example:
```bash
llm-eval \
    --input-path data/input.jsonl \
    --output-path results/evaluation.jsonl \
    --model gpt-4 \
    --api-key YOUR_API_KEY \
    --eval-mode binary \
    --result-key auditresult \
    --expected-key auditresult
```
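A rough sketch of what `--result-key`/`--expected-key` resolution and binary-mode scoring could look like. This is an illustration under assumptions (comma-separated keys walk nested dicts; binary mode compares normalized string values on records that carry both the model result and the expected value), not the actual `llm_eval.py` logic:

```python
# Illustrative only: resolve comma-separated nested keys (e.g. "prediction,class")
# and score binary agreement between model results and expected values.
from typing import Any, Optional

def get_nested(record: dict, key_path: str) -> Optional[Any]:
    """Walk a dict along a comma-separated key path; None if any step is missing."""
    value: Any = record
    for key in key_path.split(","):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

def binary_accuracy(records, result_key="auditresult", expected_key="auditresult"):
    """Fraction of records whose normalized result matches the expected value."""
    hits = total = 0
    for rec in records:
        predicted = get_nested(rec, result_key)
        expected = get_nested(rec, expected_key)
        if predicted is None or expected is None:
            continue
        total += 1
        hits += str(predicted).strip().lower() == str(expected).strip().lower()
    return hits / total if total else 0.0
```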
`llm-merge` parameters:

| Parameter | Description |
|---|---|
| `input_files` | Input file paths (multiple files, JSON or JSONL format) |
| `--output-path` | Output file path (format auto-detected by extension) |
| `--dedupe` | Deduplicate by the specified key (e.g., `--dedupe id`) |
| `--dedupe-all` | Deduplicate by complete content |
| `--verbose` | Show detailed statistics |
Example:
```bash
llm-merge \
    file1.jsonl file2.jsonl file3.json \
    --output-path merged.jsonl \
    --dedupe id \
    --verbose
```
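The two deduplication modes can be pictured roughly as follows; this is a sketch under assumptions, not the `merge_jsonl.py` implementation:

```python
# Illustrative sketch of the two dedup strategies: by a single key (--dedupe id)
# or by full record content (--dedupe-all).
import json

def dedupe_by_key(records, key):
    seen, kept = set(), []
    for rec in records:
        value = rec.get(key)
        if value in seen:
            continue
        seen.add(value)
        kept.append(rec)
    return kept

def dedupe_by_content(records):
    seen, kept = set(), []
    for rec in records:
        fingerprint = json.dumps(rec, sort_keys=True, ensure_ascii=False)
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        kept.append(rec)
    return kept
```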
`llm-clean` parameters:

| Parameter | Description | Default |
|---|---|---|
| `input_file` | Input file path (JSON or JSONL format) | Required |
| `--check-fields` | Comma-separated list of fields to check (a sample is removed if the value is None, False, or "null") | output |
| `--output-path` | Output file path (defaults to the input filename with a `_cleaned` suffix) | None |
| `--overwrite` | Overwrite the original file (a backup is created automatically) | False |
| `--verbose` | Show detailed information | False |
Example:
```bash
# Clean with the default check field (output)
llm-clean input.jsonl --output-path cleaned.jsonl

# Check multiple fields
llm-clean input.jsonl --check-fields output,score --output-path cleaned.jsonl --verbose
```
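As a rough picture of the cleaning rule (not the actual `llm-clean` code): a record is kept only when every checked field holds a usable value, i.e. not None, False, or the string "null":

```python
# Illustrative sketch of the field check described in the table above.
def is_clean(record: dict, check_fields=("output",)) -> bool:
    for field in check_fields:
        value = record.get(field)
        if value is None or value is False or value == "null":
            return False
    return True

records = [{"output": "ok"}, {"output": None}, {"output": "null"}]
cleaned = [rec for rec in records if is_clean(rec)]  # keeps only the first record
```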
`llm-convert` parameters:

| Parameter | Description |
|---|---|
| `input_files` | Input file paths (multiple files, JSON or JSONL format) |
| `--output-path` | Output file path (can be specified multiple times for multiple outputs) |
| `--verbose` | Show detailed processing information |
Example:
```bash
llm-convert \
    input.jsonl \
    --output-path data/sft.jsonl \
    --verbose
```
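The target SFT schema is not spelled out here, so the following is purely illustrative (the `messages`, `instruction`, and `output` field names are assumptions, and `convert2sftdata.py` may use a different mapping): converting chat-style records into instruction/output pairs.

```python
# Purely illustrative -- the real conversion schema may differ.
import json

def to_sft(record: dict) -> dict:
    """Map a chat-style record to a simple instruction/output pair (assumed schema)."""
    user_turns = [m["content"] for m in record.get("messages", []) if m.get("role") == "user"]
    return {"instruction": "\n".join(user_turns), "output": record.get("output", "")}

with open("input.jsonl", encoding="utf-8") as src, open("sft.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        if line.strip():
            dst.write(json.dumps(to_sft(json.loads(line)), ensure_ascii=False) + "\n")
```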
`llm-dpo` parameters:

| Parameter | Description | Default |
|---|---|---|
| `score_file` | Score file path (JSON or JSONL format) | Required |
| `ref_file` | Reference file path containing messages and outputs | Required |
| `--output-path` | Output file path | Required |
| `--min-margin` | Minimum score-difference threshold | 20.0 |
| `--min-chosen-score` | Minimum score for chosen samples | 60.0 |
| `--save-filtered` | Save a log of filtered samples to the specified file | None |
| `--id-key` | ID field name | id |
| `--round-key` | Round field name | round |
| `--verbose` | Show detailed statistics | False |
Note: For each sample ID, the highest-scoring output is used as chosen, and all other outputs that meet the margin threshold are used as rejected. This means a single sample can generate multiple DPO pairs with the same instruction, input, and chosen but different rejected outputs.
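To make the pairing rule above concrete, here is a hedged sketch (not `build_dpo.py` itself; the `score` field name and the record shapes are assumptions): per sample ID, the top-scoring round becomes the chosen output, and every other round whose score trails by at least `--min-margin` becomes a rejected counterpart.

```python
# Illustrative sketch of the pairing rule described above.
# Assumes each score record looks like {"id": ..., "round": ..., "score": ...} and
# outputs_by_id_round maps (id, round) -> the generated text taken from ref_file.
from collections import defaultdict

def build_pairs(score_records, outputs_by_id_round, min_margin=20.0, min_chosen_score=60.0):
    by_id = defaultdict(list)
    for rec in score_records:
        by_id[rec["id"]].append(rec)

    pairs = []
    for sample_id, rounds in by_id.items():
        rounds.sort(key=lambda r: r["score"], reverse=True)
        best = rounds[0]
        if best["score"] < min_chosen_score:
            continue  # the best output is not good enough to serve as "chosen"
        chosen = outputs_by_id_round[(sample_id, best["round"])]
        for other in rounds[1:]:
            if best["score"] - other["score"] >= min_margin:
                pairs.append({
                    "chosen": chosen,
                    "rejected": outputs_by_id_round[(sample_id, other["round"])],
                })
    return pairs
```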
Example:
```bash
llm-dpo \
    score_data.jsonl \
    ref_data.jsonl \
    --output-path data/dpo.jsonl \
    --min-margin 15 \
    --min-chosen-score 70 \
    --verbose
```
`llm-compare` parameters:

| Parameter | Description | Default |
|---|---|---|
| `--current-model-output` | Path to the current model's outputs JSONL file | Required |
| `--evaluation-files` | Evaluation detail files containing ground truth and predictions | Required |
| `--difficulty-files` | Difficulty files (used for mapping and weighting) | Required |
| `--result-key` | Field name for the result key in the model output | auditresult |
| `--eval-mode` | Evaluation mode: binary, multiclass, or regression | binary |
| `--trace-id-key` | Field name for the trace ID | trace_id |
| `--difficulty-key` | Field name for difficulty | difficulty |
| `--evaluations-key` | Field name for evaluations | evaluations |
| `--ground-truth-key` | Field name for ground truth in evaluations | ground_truth |
| `--predicted-key` | Field name for the predicted value in evaluations | predicted |
| `--output-path` | Output file to save results (JSON format) | None |
| `--model-name` | Custom name for the current model | current_model |
| `--log-level` | Logging level: DEBUG, INFO, WARNING, ERROR | INFO |
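One way to picture how the difficulty files might feed into the comparison (an assumption-laden sketch, not `compare_models_metrics.py`): map each trace ID to a difficulty, then aggregate accuracy per difficulty bucket before weighting.

```python
# Illustrative sketch: per-difficulty accuracy, assuming simplified record shapes.
from collections import defaultdict

def accuracy_by_difficulty(evaluations, difficulty_by_trace_id):
    """evaluations: iterable of {"trace_id", "ground_truth", "predicted"} dicts (assumed shape)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ev in evaluations:
        bucket = difficulty_by_trace_id.get(ev["trace_id"], "unknown")
        totals[bucket] += 1
        hits[bucket] += ev["predicted"] == ev["ground_truth"]
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}
```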
Example:
```bash
llm-compare \
    --current-model-output current_model.jsonl \
    --evaluation-files eval1.jsonl eval2.jsonl \
    --difficulty-files diff1.jsonl diff2.jsonl \
    --output-path comparison.jsonl \
    --eval-mode binary
```

Project structure:

```
llm-cli-tools/
├── llm_cli_tools/
│ ├── cli/ # Command-line interface tools
│ │ └── llm_cli_unified.py
│ ├── eval/ # Evaluation and comparison tools
│ │ ├── llm_eval.py
│ │ └── compare_models_metrics.py
│ ├── data/ # Data processing utilities
│ │ ├── merge_jsonl.py
│ │ ├── convert2sftdata.py
│ │ └── build_dpo.py
│ └── utils/ # Common utilities
│ ├── file_utils.py
│ ├── nested_utils.py
│ └── normalize.py
├── pyproject.toml
└── README.md
```
Set your OpenAI API key as an environment variable:
```bash
export OPENAI_API_KEY="your-api-key-here"
```

Or pass it directly via command-line arguments:

```bash
llm-cli inference --api-key YOUR_API_KEY ...
```

You can customize the judgment prompt by modifying `DEFAULT_JUDGE_PROMPT` in the source code or by passing a custom prompt file via `--prompt-file`.
- Python 3.8+
- openai>=1.0.0
- requests>=2.31.0
- tqdm>=4.65.0
For development, install with the dev extras:

```bash
pip install -e ".[dev]"
```

Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
- Author: Deyou Jiang
- Email: jiangdeyou@inspur.com
- GitHub: https://github.com/aexcellent/llm-cli-tools
- Built with OpenAI API
- Inspired by various LLM evaluation frameworks