This project tests the recall capabilities of mainstream large language models using a "needle in a haystack" methodology. This testing approach may correlate with the performance of model attention mechanisms to some extent.
Primary models tested:
- gemini-2.5-pro
- gemini-2.5-flash-preview-09-2025
- claude-sonnet-4-5-thinking
- DeepSeek_V3_0324
- gemini-deepthink
Additionally, we tested recently popular hidden models:
- lithiumflow
- orionmist
Complete test results are available in the test_results/ directory.
The testing process follows these steps:
- Construct Test Text: Randomly insert multiple four-digit numbers (1000-9999) into a fixed-token-length context (see the sketch after this list)
- Model Task: Require the model to extract all four-digit numbers and output them in JSON format in order of appearance
- Scoring Algorithm: Use an algorithm based on Edit Distance (Levenshtein Distance) to score model responses
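For illustration, a minimal sketch of the construction step might look like the following (the actual implementation is 收集数据/generate_text.py; the function name and structure here are illustrative, not a copy of it):

```python
import random

def build_test_text(context_length=50000, insertion_count=40, base_pattern="a|"):
    """Insert random four-digit numbers into a repetitive base string.

    Returns the test text and the ground-truth numbers in order of appearance.
    """
    # Repeat the base pattern until the target context length is reached
    repeats = context_length // len(base_pattern) + 1
    base = (base_pattern * repeats)[:context_length]

    # Choose distinct, sorted insertion points so the answer order is well-defined
    positions = sorted(random.sample(range(len(base)), insertion_count))
    numbers = [random.randint(1000, 9999) for _ in range(insertion_count)]

    pieces, prev = [], 0
    for pos, num in zip(positions, numbers):
        pieces.append(base[prev:pos])
        pieces.append(str(num))
        prev = pos
    pieces.append(base[prev:])
    return "".join(pieces), numbers
```

In the real pipeline, the ground-truth list plays the role of numbers.json and the generated text the role of output.md.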
The scoring system is based on the Edit Distance algorithm and comprehensively evaluates model performance:
- ✅ Penalizes extra keys (Hallucination errors): Model outputs non-existent numbers
- ✅ Penalizes missing keys (Omission errors): Model misses numbers that should be extracted
- ✅ Penalizes incorrect values: Extracted number values are incorrect
- ✅ Penalizes order errors: Number sequence doesn't match the original text
Accuracy Calculation Formula:

```
Accuracy = (1 - Edit Distance / Max Sequence Length) × 100%
```
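As a concrete illustration, here is a minimal sketch of sequence-level Levenshtein distance plus the accuracy formula (the project's actual implementation lives in grading_utils.py; this sketch is not a copy of it):

```python
def edit_distance(predicted, expected):
    """Levenshtein distance between two sequences of numbers.

    Insertions correspond to hallucinated numbers, deletions to omissions,
    and substitutions to wrong values; order errors also raise the distance
    because the alignment is positional.
    """
    m, n = len(predicted), len(expected)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if predicted[i - 1] == expected[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # drop a predicted number
                           dp[i][j - 1] + 1,         # miss an expected number
                           dp[i - 1][j - 1] + cost)  # match or wrong value
    return dp[m][n]

def accuracy(predicted, expected):
    """Accuracy = (1 - EditDistance / MaxSequenceLength) * 100%."""
    max_len = max(len(predicted), len(expected), 1)
    return (1 - edit_distance(predicted, expected) / max_len) * 100
```

For example, accuracy([1234, 5678], [1234, 9999, 5678]) is about 66.7%: one omission against a maximum sequence length of 3.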
```
.
├── 收集数据/                        # Data generation and collection module
│   ├── generate_text.py             # Generate test text
│   ├── run_batch_test.py            # Batch API testing script
│   ├── numbers.json                 # Standard answers
│   ├── output.md                    # Generated test text
│   └── 数据库/                      # Test results database
│
├── 数据分析/                        # Data analysis module
│   ├── analyze_database.py          # Basic database analysis
│   ├── analyze_summary.py           # Model overview statistics
│   ├── analyze_errors.py            # Error type analysis (misorder/hallucination/missing)
│   ├── analyze_position_accuracy.py # Position accuracy analysis (LCS algorithm)
│   ├── create_heatmap.py            # Generate position accuracy heatmap
│   ├── create_hallucination_heatmap.py      # Generate hallucination error heatmap
│   ├── create_missing_heatmap.py            # Generate missing error heatmap
│   ├── create_misorder_position_heatmap.py  # Generate misorder heatmap
│   ├── generate_all_heatmaps.py     # Batch generate all heatmaps
│   ├── grading_utils.py             # Scoring utility functions
│   └── 分析结果/                    # Analysis results database
│
├── test_results/                    # Test results (classified by model)
│   ├── gemini-2.5-pro/
│   ├── gemini_2_5_flash_preview_09_2025/
│   ├── claude_sonnet_4_5_thinking/
│   ├── lithiumflow/
│   └── orionmist/
│
├── grading_utils.py                 # Core scoring algorithm
├── evaluate_test.py                 # Single test evaluation
├── 答案.json                        # Standard answer example
├── test.json                        # Test answer example
└── README.md                        # Project documentation
```
Use generate_text.py to generate test text:
```
cd 收集数据
python generate_text.py [context_length] [insertion_count]
```

Parameters:
- `context_length`: Base string length (default: 50000)
- `insertion_count`: Number of four-digit numbers to insert (default: 40)

Example:

```
python generate_text.py 30000 50  # Generate 30000 characters, insert 50 numbers
```

Use run_batch_test.py for batch API testing:
```
cd 收集数据
python run_batch_test.py [runs] [concurrency] [delay] [context_length] [insertions] [base_pattern]
```

Parameters:
- `runs`: Number of test runs (default: 10)
- `concurrency`: Number of concurrent requests (default: 10)
- `delay`: Delay between requests in seconds (default: 0)
- `context_length`: Context byte count (default: 30000)
- `insertions`: Number of numbers to insert (default: 40)
- `base_pattern`: Base string pattern (default: "a|")

Example:

```
python run_batch_test.py 20 5 1 30000 50  # 20 tests, 5 concurrent, 1 second delay
```

Note: the following must be configured in the script:
- API URL (`API_URL`)
- Model ID (`MODEL_ID`)
- API Key (`HEADERS['authorization']`)
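For orientation, a minimal sketch of the concurrent request loop might look like this (the real run_batch_test.py also handles database writes, response parsing, and CLI arguments; the endpoint, placeholder values, and payload shape below assume an OpenAI-compatible chat API and are illustrative only):

```python
import asyncio
import aiohttp

API_URL = "https://example.com/v1/chat/completions"  # placeholder endpoint
MODEL_ID = "your-model-id"                           # placeholder model ID
HEADERS = {"authorization": "Bearer YOUR_API_KEY"}   # placeholder API key

async def run_one(session, prompt):
    """Send one chat request; payload shape assumes an OpenAI-compatible API."""
    payload = {"model": MODEL_ID, "messages": [{"role": "user", "content": prompt}]}
    async with session.post(API_URL, json=payload, headers=HEADERS) as resp:
        data = await resp.json()
        return data["choices"][0]["message"]["content"]

async def run_batch(prompts, concurrency=10, delay=0.0):
    """Run all prompts with bounded concurrency and an optional launch delay."""
    sem = asyncio.Semaphore(concurrency)  # cap in-flight requests

    async with aiohttp.ClientSession() as session:
        async def guarded(prompt):
            async with sem:
                return await run_one(session, prompt)

        tasks = []
        for prompt in prompts:
            tasks.append(asyncio.create_task(guarded(prompt)))
            if delay:
                await asyncio.sleep(delay)  # stagger request launches
        return await asyncio.gather(*tasks)
```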
After collecting results, use the scripts in 数据分析/ to analyze them.

Basic database analysis:

```
python 数据分析/analyze_database.py <database_path>
```

Model overview statistics:

```
python 数据分析/analyze_summary.py <database_path>
```

Analyze three types of errors (misorder, hallucination, missing):

```
python 数据分析/analyze_errors.py <database_path>
```

Analyze position accuracy using the LCS (Longest Common Subsequence) algorithm:

```
python 数据分析/analyze_position_accuracy.py <database_path>
```

Generate heatmaps:

```
# Generate all heatmaps
python 数据分析/generate_all_heatmaps.py <database_path>

# Or generate individually
python 数据分析/create_heatmap.py <position_accuracy_database_path>
python 数据分析/create_hallucination_heatmap.py <error_stats_database_path>
python 数据分析/create_missing_heatmap.py <error_stats_database_path>
```

Evaluate a single test:

```
python evaluate_test.py
```

This command evaluates the accuracy of test.json relative to 答案.json.
Test results are stored in SQLite databases, separated by byte count:
- Raw Data: `bytes_{byte_count}` tables store raw test records
- Statistical Summary: the `bytes_stats` table records answered count and parse failure count
Analysis results are stored in separate databases:
- Model Overview: `model_summary_{model_id}.db`
- Error Statistics: `error_stats_{model_id}.db`
- Position Accuracy: `position_accuracy_{model_id}.db`
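These are plain SQLite files, so they can be inspected with Python's built-in sqlite3 module. A hypothetical example (the database filename and column layout below are assumptions; only the `bytes_{byte_count}` naming convention comes from the description above):

```python
import sqlite3

conn = sqlite3.connect("test_results.db")  # assumed filename
cur = conn.cursor()

# List the tables first, since table names depend on which byte counts were tested
cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cur.fetchall())  # e.g. [('bytes_30000',), ('bytes_stats',)]

# Peek at a few raw records from one byte-count table
for row in cur.execute("SELECT * FROM bytes_30000 LIMIT 5"):
    print(row)

conn.close()
```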
The project generates various visualization charts (stored in each model's test_results/ subdirectory):
- 📊 Average Accuracy Score Chart: Shows average accuracy at different byte counts
- 🔥 Position Accuracy Heatmap: Shows correct answer probability at each position
- 🎯 Hallucination Test Heatmap: Shows distribution of hallucination errors
- 📉 Missing Test Heatmap: Shows distribution of missing errors
- 📈 JSON Output Failure Probability Chart: Shows probability distribution of parse failures
- 📑 Statistical Tables: Detailed statistics in Excel format
The project uses the following algorithms to analyze three types of errors (see the sketch after this list):
- Misorder Errors: Uses LCS algorithm to identify correctly ordered numbers as anchors; correct values not in anchors are misorders
- Hallucination Errors: Identifies numbers output by the model that don't exist in standard answers, and locates their intervals between anchors
- Missing Errors: Counts numbers present in standard answers but not correctly output by the model
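A minimal sketch of this anchor-based classification (the project's actual logic is in analyze_errors.py; function names here are illustrative, and duplicate numbers are ignored for simplicity):

```python
def lcs_pairs(predicted, expected):
    """Index pairs (i, j) of one longest common subsequence between the
    model output `predicted` and the standard answer `expected`."""
    m, n = len(predicted), len(expected)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if predicted[i] == expected[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    pairs, i, j = [], m, n
    while i > 0 and j > 0:  # backtrack to recover one alignment
        if predicted[i - 1] == expected[j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

def classify_errors(predicted, expected):
    """Split the model output into misorder, hallucination, and missing errors."""
    anchors = {i for i, _ in lcs_pairs(predicted, expected)}  # in-order matches
    expected_set = set(expected)
    misorder = [v for i, v in enumerate(predicted)
                if i not in anchors and v in expected_set]  # right value, wrong place
    hallucination = [v for v in predicted if v not in expected_set]
    missing = [v for v in expected if v not in predicted]
    return misorder, hallucination, missing
```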
Position accuracy analysis is based on the Longest Common Subsequence (LCS) algorithm:
- Finds subsequences in model responses that are completely consistent with standard answer order
- Counts the frequency of correct identification at each position
- Generates position accuracy distribution charts
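As a sketch of the aggregation, reusing lcs_pairs from the error-analysis sketch above (the real implementation is analyze_position_accuracy.py):

```python
def position_hit_rates(runs, expected):
    """Fraction of runs in which each answer position was recalled in order.

    `runs` is a list of model outputs (each a list of numbers) for the same
    test text; lcs_pairs() is defined in the error-analysis sketch above.
    """
    hits = [0] * len(expected)
    for predicted in runs:
        for _, j in lcs_pairs(predicted, expected):
            hits[j] += 1  # position j was matched in this run's LCS
    return [hit / len(runs) for hit in hits]
```

These per-position rates are what the position accuracy heatmaps visualize.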
```
pip install aiohttp numpy matplotlib seaborn openpyxl
```

- aiohttp: Async HTTP requests
- sqlite3: Database operations (Python built-in)
- numpy: Numerical computation
- matplotlib: Chart plotting
- seaborn: Heatmap visualization
- openpyxl: Excel file operations
These tests only represent results from inserting needles into repetitive `a|` sequences. In actual usage, model recall and attention are significantly influenced by the characteristics of the input text. Therefore, the accuracy measured in these tests has comparative value only, not absolute value.
Do not directly apply these test results to claim that a model's recall and attention at a certain context length are limited to what these results show. Different input content, text structures, and linguistic features can all significantly affect actual model performance.
These tests are suitable for:
- Horizontal comparison: Comparing different models under identical testing conditions
- Longitudinal comparison: Evaluating the same model's recall performance across different context window sizes
For DeepSeek models, a special attention mechanism has been introduced internally since version V3.1. This makes test results using the simple `a|` pattern inaccurate: the models show abnormally high accuracy across the entire context range, which is not representative behavior.
When we changed the insertion pattern and stopped using the simple `a|` repetition, DeepSeek models showed drastically different results. This suggests that the model may have special optimizations for specific repetitive patterns.
Future Plans: We will create a revised version of the test set to more accurately evaluate models with special attention mechanisms, including DeepSeek.
Additionally, we tested the GPT-5 model. However, GPT-5 uses an internal model routing mechanism that causes severe instability issues:
- In repeated tests at the same context length (50k tokens), the accuracy distribution is extremely unstable, ranging wildly from 5% to 100%
- 100% accuracy likely indicates routing to the best model in the ensemble, but this cannot be reliably reproduced
- We attempted various methods to stabilize routing results, including:
  - Modifying prompt content
  - Adjusting thinking intensity settings (even when set to `high`)
  - Various other optimization strategies
Test Conclusion: Due to the inability to obtain stable and reproducible test results, GPT-5 model test sets will not be publicly released at this time. We will publish the test set once better testing methodologies are developed.
If this project helps your research, citations are welcome!
Last Updated: 2025.11.01
