[BUG] KeyMap fails with large keys and Pandas 3.0.0 #1197

@VisLab

Pandas 3.0.0 Compatibility Issue in hedtools KeyMap

Executive Summary

Issue: hedtools 0.9.0's KeyMap class fails with pandas 3.0.0 when it creates a pd.Series from a dictionary whose keys are certain large integer hash values.

Location: hed/tools/analysis/key_map.py, line 149:

map_series = pd.Series(self.map_dict)

Root Cause: Pandas 3.0.0 has a bug in pd.Series._init_dict(). For dictionaries with large-magnitude integer keys (here, large negative values), it attempts a RangeIndex optimization whose arithmetic overflows int64, which produces an index whose length does not match the values.

Workaround: Create the Index separately before creating the Series.

Detailed Analysis

The Failing Code

In key_map.py line 149, KeyMap tries to create a Series from its map_dict:

def _remap(self, df):
    # ... earlier code ...
    map_series = pd.Series(self.map_dict)  # <-- FAILS HERE with pandas 3.0

The Problem Dict

The issue occurs with dictionaries keyed by hash values, such as:

{-4186896901282141619: 0, -8311529505453501279: 1}

These are legitimate hash values generated by KeyMap for string keys like '6' and '2'.

Pandas 3.0 Bug Details

When pd.Series(dict) is called, pandas 3.0's _init_dict method:

  1. Extracts keys and values from the dict
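  2. Attempts to optimize by creating a RangeIndex if the keys appear to be evenly spaced
  3. With large-magnitude integers, this calculation overflows: the range endpoint, one step (diff) past the last key, falls outside the int64 range (below int64 min for the keys above)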
  4. The overflow corrupts the index creation, resulting in an empty index (length 0)
  5. When trying to create the Series with values (length 2) and empty index (length 0), it raises:
    ValueError: Length of values (2) does not match length of index (0)
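
The overflow in steps 3 and 4 can be illustrated with plain Python integers. This is a sketch under stated assumptions, not a copy of the pandas code: it assumes the optimization builds `range(first, last + step, step)` and that the addition is done in wrapping signed 64-bit arithmetic. The helper `wrap_int64` is hypothetical and exists only to emulate that wrap-around.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def wrap_int64(value: int) -> int:
    """Emulate two's-complement wrap-around of a signed 64-bit integer."""
    return (value + 2**63) % 2**64 - 2**63


# Keys from the failing dictionary in this report
keys = [-4186896901282141619, -8311529505453501279]

step = keys[1] - keys[0]        # spacing between the two keys
exact_stop = keys[-1] + step    # one step past the last key, exact arithmetic
wrapped_stop = wrap_int64(exact_stop)

# Assumed shape of the optimization: range(first, last + step, step)
candidate = range(keys[0], wrapped_stop, step)

print("exact stop fits in int64:", INT64_MIN <= exact_stop <= INT64_MAX)
print("length of candidate range:", len(candidate))
```

Under these assumptions the exact endpoint lies below int64 min, the wrapped endpoint lands above the first key while the step is negative, and the resulting range is empty. That is consistent with the reported index length of 0.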
    

Reproduction

Fails:

import pandas as pd  # version 3.0.0
d = {-4186896901282141619: 0, -8311529505453501279: 1}
s = pd.Series(d)  # ValueError: Length of values (2) does not match length of index (0)

Works:

import pandas as pd  # version 3.0.0
d = {-4186896901282141619: 0, -8311529505453501279: 1}
idx = pd.Index(list(d.keys()))
s = pd.Series(list(d.values()), index=idx)  # SUCCESS

Works with pandas 2.x:

import pandas as pd  # version 2.3.3
d = {-4186896901282141619: 0, -8311529505453501279: 1}
s = pd.Series(d)  # SUCCESS

Why It Doesn't Always Fail

The issue is triggered by:

  1. Specific hash values whose magnitudes and spacing push the RangeIndex arithmetic past the int64 range
  2. Hash randomization - Python randomizes string hashes per interpreter process (unless PYTHONHASHSEED is fixed), so the keys differ between sessions
  3. Pandas optimization heuristics - RangeIndex optimization only triggers for certain key patterns

This explains why:

  • Simple test cases often work (hash values differ)
  • First operation may succeed but second fails (different hash values)
  • Behavior varies between Python sessions (hash randomization)
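
The session dependence can be seen without pandas. The sketch below assumes the KeyMap keys ultimately derive from Python's built-in `hash()` on strings, which this report implies but does not show. `string_hash_with_seed` is a hypothetical helper that starts a fresh interpreter with a chosen `PYTHONHASHSEED` and reports the hash it computes.

```python
import os
import subprocess
import sys


def string_hash_with_seed(text: str, seed: int) -> int:
    """Return hash(text) as computed by a fresh interpreter with the given seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# The same strings hash to different integers under different seeds,
# so the dictionary keys seen by pd.Series change from session to session.
for seed in (1, 2):
    print(seed, string_hash_with_seed("6", seed), string_hash_with_seed("2", seed))
```

Pinning `PYTHONHASHSEED` in a test run makes the hash values repeatable, which may help when trying to reproduce a failure that was seen only once.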

Recommended Fixes

Option 1: Fix in hedtools (Recommended)

File: hed/tools/analysis/key_map.py, line 149

Before:

map_series = pd.Series(self.map_dict)

After:

# Workaround for pandas 3.0 bug with large integer dict keys
if self.map_dict:
    idx = pd.Index(list(self.map_dict.keys()))
    map_series = pd.Series(list(self.map_dict.values()), index=idx)
else:
    map_series = pd.Series(self.map_dict)

Option 2: Constrain pandas version (Temporary)

In consuming packages (like table-remodeler):

pyproject.toml:

dependencies = [
    "pandas>=2.2.3,<3.0",
]

Impact Assessment

Affected Code

  • hedtools: KeyMap class in hed/tools/analysis/key_map.py
  • table-remodeler: Any operation using RemapColumnsOp with integer sources
  • Potential: Any hedtools code path that uses KeyMap with hash-based lookups

Severity

  • High: Causes complete operation failure
  • Intermittent: Depends on hash values (session-dependent)
  • Easy to miss: May pass in testing but fail in production

When It Occurs

  • Multiple cascading remap operations
  • Large datasets with many unique values
  • String columns converted to integer types
  • Operations that create intermediate columns used as sources for subsequent operations

Testing

Minimal Reproduction Test

import pandas as pd
assert pd.__version__ == '3.0.0', "Test requires pandas 3.0.0"

# This specific dict triggers the bug
problem_dict = {-4186896901282141619: 0, -8311529505453501279: 1}

try:
    series = pd.Series(problem_dict)
    print("UNEXPECTED: Series created successfully - bug may be fixed")
except ValueError as e:
    if "Length of values" in str(e) and "does not match length of index" in str(e):
        print("BUG CONFIRMED: Pandas 3.0 dict-to-Series bug reproduced")
    else:
        print(f"DIFFERENT ERROR: {e}")

Full Integration Test

See: table-remodeler/.status/test_exact_scenario.py

Status

  • Reported: January 27, 2026
  • Pandas Version Affected: 3.0.0
  • Pandas Versions Working: 2.3.3 and earlier
  • hedtools Version: 0.9.0
  • Workaround Implemented: pandas version constraint in table-remodeler
  • Permanent Fix Needed: In hedtools KeyMap class

Recommendations for hedtools Maintainers

  1. Immediate: Apply the workaround in KeyMap._remap()
  2. Short-term: Add test coverage for large integer hash dictionaries
  3. Report: File a bug report with the pandas team, including a minimal reproduction case
  4. Version constraint: Consider adding a pandas<3.0 constraint until pandas fixes the issue or hedtools applies the workaround
  5. Documentation: Add note about pandas 3.0 compatibility in changelog
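
For the test coverage item, a regression test could look roughly like the following. It is a sketch only: `series_from_dict` is a hypothetical helper that stands in for the workaround proposed for KeyMap._remap() and is not an existing hedtools function. The test data is the dictionary from this report.

```python
import pandas as pd


def series_from_dict(mapping):
    """Hypothetical helper: build a Series from a dict via an explicit Index.

    Mirrors the workaround proposed in this report; not part of hedtools.
    """
    index = pd.Index(list(mapping.keys()))
    return pd.Series(list(mapping.values()), index=index)


def test_large_hash_keys_round_trip():
    # Dictionary reported to fail with pd.Series(dict) on pandas 3.0.0
    problem_dict = {-4186896901282141619: 0, -8311529505453501279: 1}

    series = series_from_dict(problem_dict)

    # The index must keep every key, in order, and map to the same values
    assert len(series) == len(problem_dict)
    assert list(series.index) == list(problem_dict.keys())
    assert series.to_dict() == problem_dict


test_large_hash_keys_round_trip()
```

Because the helper never passes the dict itself to pd.Series, the same test should pass on pandas 2.x and 3.0.0, so it can run in CI across both versions.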

Recommendations for table-remodeler

  1. Current: The pandas<3.0 constraint is an appropriate workaround ✓
  2. Monitor: Watch for hedtools updates with pandas 3.0 support
  3. Future: Remove constraint once hedtools addresses the issue
  4. Documentation: Note the pandas version requirement in user-facing docs
