Description
Pandas 3.0.0 Compatibility Issue in hedtools KeyMap
Executive Summary
Issue: hedtools 0.9.0's KeyMap class fails with pandas 3.0.0 when creating a pd.Series from a dictionary with certain large integer hash values.
Location: hed/tools/analysis/key_map.py, line 149:
map_series = pd.Series(self.map_dict)
Root Cause: Pandas 3.0.0 has a bug in pd.Series._init_dict() that causes it to incorrectly handle dictionaries with large negative integer keys that trigger RangeIndex optimization attempts, leading to integer overflow and index length mismatches.
Workaround: Create the Index separately before creating the Series.
Detailed Analysis
The Failing Code
In key_map.py line 149, KeyMap tries to create a Series from its map_dict:
def _remap(self, df):
    # ... earlier code ...
    map_series = pd.Series(self.map_dict)  # <-- FAILS HERE with pandas 3.0
The Problem Dict
The issue occurs with hash dictionaries like:
{-4186896901282141619: 0, -8311529505453501279: 1}
These are legitimate hash values generated by KeyMap for string keys like '6' and '2'.
Pandas 3.0 Bug Details
When pd.Series(dict) is called, pandas 3.0's _init_dict method:
- Extracts keys and values from the dict
- Attempts to optimize by creating a RangeIndex if keys appear sequential
- With large negative integers, this calculation overflows: keys[0] + diff falls outside the int64 range
- The overflow corrupts the index creation, resulting in an empty index (length 0)
- When trying to create the Series with values (length 2) and empty index (length 0), it raises:
ValueError: Length of values (2) does not match length of index (0)
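The arithmetic behind the overflow step can be checked with plain Python integers. This is only a sketch of the calculation described above, not pandas's internal code: the helper name fits_int64 is hypothetical, and treating the range as extending one step past the last key is an assumption made for illustration. For the example keys, keys[0] + diff is simply the second key, so the sketch looks at the following step, which is where a value first leaves the int64 range.

```python
# Illustrative arithmetic only; this does not call pandas.
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def fits_int64(value: int) -> bool:
    """Return True if value is representable as a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


keys = [-4186896901282141619, -8311529505453501279]
diff = keys[1] - keys[0]  # constant step a range-like index would use

# Both keys and the step fit in int64 on their own ...
assert all(fits_int64(k) for k in keys) and fits_int64(diff)

# ... but one more step past the last key (assumed exclusive stop) does not.
stop = keys[-1] + diff
print(fits_int64(stop))  # prints False
```

Python integers are arbitrary precision, so the check itself cannot overflow. Fixed-width int64 arithmetic on the same values would wrap or emit an overflow warning instead.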
Reproduction
Fails:
import pandas as pd # version 3.0.0
d = {-4186896901282141619: 0, -8311529505453501279: 1}
s = pd.Series(d)  # ValueError: Length of values (2) does not match length of index (0)
Works:
import pandas as pd # version 3.0.0
d = {-4186896901282141619: 0, -8311529505453501279: 1}
idx = pd.Index(list(d.keys()))
s = pd.Series(list(d.values()), index=idx)  # SUCCESS
Works with pandas 2.x:
import pandas as pd # version 2.3.3
d = {-4186896901282141619: 0, -8311529505453501279: 1}
s = pd.Series(d)  # SUCCESS
Why It Doesn't Always Fail
The issue is triggered by:
- Specific hash values that have large magnitudes and specific differences
- Hash randomization - Python randomizes str hashes per interpreter session unless PYTHONHASHSEED is set, so the keys stored in the dict change from run to run
- Pandas optimization heuristics - RangeIndex optimization only triggers for certain patterns
This explains why:
- Simple test cases often work (hash values differ)
- First operation may succeed but second fails (different hash values)
- Behavior varies between Python sessions (hash randomization)
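The session dependence can be observed directly by hashing the same string in fresh interpreters with different seeds. The sketch below assumes, as described above, that the KeyMap hashes ultimately derive from Python's built-in hash() on strings. The helper name string_hash is hypothetical.

```python
import os
import subprocess
import sys


def string_hash(text: str, seed: str) -> int:
    """Hash `text` in a fresh interpreter started with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# Same seed: the hash is reproducible across interpreter sessions.
same_seed = string_hash("6", "1") == string_hash("6", "1")
# Different seed: the hash (and so the dict key) changes.
other_seed = string_hash("6", "1") == string_hash("6", "2")
print(same_seed, other_seed)
```

Pinning PYTHONHASHSEED to a fixed value while debugging should make a failing run repeatable, because the string hashes then stay the same between sessions.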
Recommended Fixes
Option 1: Fix in hedtools (Recommended)
File: hed/tools/analysis/key_map.py, line 149
Before:
map_series = pd.Series(self.map_dict)
After:
# Workaround for pandas 3.0 bug with large integer dict keys
if self.map_dict:
    idx = pd.Index(list(self.map_dict.keys()))
    map_series = pd.Series(list(self.map_dict.values()), index=idx)
else:
    map_series = pd.Series(self.map_dict)
Option 2: Constrain pandas version (Temporary)
In consuming packages (like table-remodeler):
pyproject.toml:
dependencies = [
    "pandas>=2.2.3,<3.0",
]
Impact Assessment
Affected Code
- hedtools: KeyMap class in hed/tools/analysis/key_map.py
- table-remodeler: Any operation using RemapColumnsOp with integer sources
- Potential: Any hedtools code path that uses KeyMap with hash-based lookups
Severity
- High: Causes complete operation failure
- Intermittent: Depends on hash values (session-dependent)
- Hard to detect: May pass in testing but fail in production
When It Occurs
- Multiple cascading remap operations
- Large datasets with many unique values
- String columns converted to integer types
- Operations that create intermediate columns used as sources for subsequent operations
Testing
Minimal Reproduction Test
import pandas as pd

assert pd.__version__ == '3.0.0', "Test requires pandas 3.0.0"

# This specific dict triggers the bug
problem_dict = {-4186896901282141619: 0, -8311529505453501279: 1}
try:
    series = pd.Series(problem_dict)
    print("UNEXPECTED: Series created successfully - bug may be fixed")
except ValueError as e:
    if "Length of values" in str(e) and "does not match length of index" in str(e):
        print("BUG CONFIRMED: Pandas 3.0 dict-to-Series bug reproduced")
    else:
        print(f"DIFFERENT ERROR: {e}")
Full Integration Test
See: table-remodeler/.status/test_exact_scenario.py
Status
- Reported: January 27, 2026
- Pandas Version Affected: 3.0.0
- Pandas Versions Working: 2.3.3 and earlier
- hedtools Version: 0.9.0
- Workaround Implemented: pandas version constraint in table-remodeler
- Permanent Fix Needed: In hedtools KeyMap class
References
- hedtools repository: https://github.com/hed-standard/hed-python
- Pandas issue tracker: https://github.com/pandas-dev/pandas/issues
- Related warnings: RuntimeWarning: overflow encountered in scalar add/subtract in pandas/core/indexes/base.py
Recommendations for hedtools Maintainers
- Immediate: Apply the workaround in KeyMap._remap()
- Short-term: Add test coverage for large integer hash dictionaries
- Report: File bug report with pandas team with minimal reproduction case
- Version constraint: Consider adding a pandas<3.0 constraint until pandas fixes the issue or hedtools applies the workaround
- Documentation: Add note about pandas 3.0 compatibility in changelog
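A regression test for the short-term item might look like the sketch below. It assumes pandas is installed. The helper series_from_dict is a hypothetical stand-in for the index-first construction proposed for KeyMap._remap(), not an existing hedtools function, and the test names are illustrative.

```python
import pandas as pd


def series_from_dict(mapping: dict) -> pd.Series:
    """Hypothetical helper: build the Index first, then the Series."""
    if not mapping:
        return pd.Series(dtype="object")
    index = pd.Index(list(mapping.keys()))
    return pd.Series(list(mapping.values()), index=index)


def test_large_negative_hash_keys():
    # Hash-like keys taken from the failing case in this report.
    mapping = {-4186896901282141619: 0, -8311529505453501279: 1}
    series = series_from_dict(mapping)
    assert len(series) == len(mapping)
    for key, value in mapping.items():
        assert series.loc[key] == value


def test_empty_mapping():
    assert len(series_from_dict({})) == 0


# Run directly here; under pytest the functions are collected automatically.
test_large_negative_hash_keys()
test_empty_mapping()
```

Using fixed keys from the failing case keeps the test deterministic, so it does not depend on the hash values of a particular session.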
Recommendations for table-remodeler
- Current: pandas<3.0 constraint is appropriate workaround ✓
- Monitor: Watch for hedtools updates with pandas 3.0 support
- Future: Remove constraint once hedtools addresses the issue
- Documentation: Note the pandas version requirement in user-facing docs