⚡️ Speed up method `DataIndexableCol.get_atom_data` by 12% #108

codeflash-ai · 2025-10-29T15:00:36Z

📄 12% (0.12x) speedup for `DataIndexableCol.get_atom_data` in `pandas/io/pytables.py`

⏱️ Runtime : 2.42 milliseconds → 2.16 milliseconds (best of 24 runs)

📝 Explanation and details

The optimization adds @lru_cache(maxsize=64) to the get_atom_coltype method, which provides an 11% overall speedup by caching expensive attribute lookups.

Key optimization:

The getattr(_tables(), col_name) call accounts for 97.3% of the original runtime according to profiling
@lru_cache caches the result of get_atom_coltype for each unique kind string, eliminating redundant getattr lookups on subsequent calls with the same kind
Added from functools import lru_cache import to enable this caching

Why this works:

The _tables() function returns a module object, and getattr() on modules involves attribute resolution and potential string processing overhead
Column types for the same kind (e.g., "int64", "float32") are immutable and frequently reused in data processing workflows
The cache eliminates this expensive lookup after the first call for each kind

Test results show consistent improvements:

Basic data types see 5-15% speedups (e.g., int64: 14.5% faster, float32: 8.97% faster)
Repeated calls benefit most (large scale test: 12.4% faster for 900 calls)
Cache hits on subsequent calls with the same kind provide near-instant returns
The 64-entry cache size is appropriate for typical PyTables column type variety without excessive memory usage

This optimization is particularly effective for workloads that repeatedly create columns of the same data types, which is common in data processing pipelines.

✅ Correctness verification report:

Test	Status
⚙️ Existing Unit Tests	🔘 None Found
🌀 Generated Regression Tests	✅ 936 Passed
⏪ Replay Tests	🔘 None Found
🔎 Concolic Coverage Tests	🔘 None Found
📊 Tests Coverage	100.0%

🌀 Generated Regression Tests and Runtime

import pytest
from pandas.io.pytables import DataIndexableCol


# function to test
class Col:
    """Mock PyTables Col class for testing."""
    def __init__(self):
        pass

class UInt8Col(Col): pass
class UInt16Col(Col): pass
class UInt32Col(Col): pass
class UInt64Col(Col): pass
class Int64Col(Col): pass
class FloatCol(Col): pass
class StringCol(Col): pass
class BoolCol(Col): pass
class ComplexCol(Col): pass

def get_atom_data(shape, kind: str):
    # shape is ignored in this implementation (as in pandas)
    return get_atom_coltype(kind=kind)()

# unit tests

# 1. Basic Test Cases






















#------------------------------------------------
import pytest
from pandas.io.pytables import DataIndexableCol

# Alias for test
get_atom_data = DataIndexableCol.get_atom_data

# unit tests

# ---------------------------
# Basic Test Cases
# ---------------------------

def test_get_atom_data_int8():
    """Test basic int8 kind returns Int8Col instance."""
    codeflash_output = get_atom_data((10,), "int8"); result = codeflash_output # 19.4μs -> 18.4μs (5.27% faster)

def test_get_atom_data_int16():
    """Test int16 kind returns Int16Col instance."""
    codeflash_output = get_atom_data((5, 2), "int16"); result = codeflash_output # 16.0μs -> 14.7μs (8.95% faster)

def test_get_atom_data_int32():
    """Test int32 kind returns Int32Col instance."""
    codeflash_output = get_atom_data((1,), "int32"); result = codeflash_output # 13.9μs -> 13.2μs (5.07% faster)

def test_get_atom_data_int64():
    """Test int64 kind returns Int64Col instance."""
    codeflash_output = get_atom_data((100,), "int64"); result = codeflash_output # 15.0μs -> 13.1μs (14.5% faster)





def test_get_atom_data_float32():
    """Test float32 kind returns Float32Col instance."""
    codeflash_output = get_atom_data((15,), "float32"); result = codeflash_output # 14.4μs -> 13.2μs (8.97% faster)

def test_get_atom_data_float64():
    """Test float64 kind returns Float64Col instance."""
    codeflash_output = get_atom_data((20,), "float64"); result = codeflash_output # 13.8μs -> 12.8μs (7.29% faster)



def test_get_atom_data_kind_case_insensitivity():
    """Test that kind is case-insensitive."""
    codeflash_output = get_atom_data((10,), "InT32"); result1 = codeflash_output # 14.8μs -> 14.5μs (1.60% faster)
    codeflash_output = get_atom_data((10,), "INT32"); result2 = codeflash_output # 4.50μs -> 4.21μs (6.87% faster)
    codeflash_output = get_atom_data((10,), "int32"); result3 = codeflash_output # 2.90μs -> 2.63μs (10.5% faster)

def test_get_atom_data_period_kind():
    """Test that period kind maps to Int64Col."""
    codeflash_output = get_atom_data((10,), "period[D]"); result = codeflash_output # 13.2μs -> 11.9μs (11.0% faster)

def test_get_atom_data_shape_variations():
    """Test that different shapes do not affect the result."""
    codeflash_output = get_atom_data((1,), "int8"); result1 = codeflash_output # 13.7μs -> 13.1μs (4.00% faster)
    codeflash_output = get_atom_data((100, 100), "int8"); result2 = codeflash_output # 4.42μs -> 4.02μs (9.97% faster)
    codeflash_output = get_atom_data((), "int8"); result3 = codeflash_output # 3.02μs -> 2.59μs (16.8% faster)




def test_get_atom_data_none_kind_raises():
    """Test that None as kind raises AttributeError (since lower() is called)."""
    with pytest.raises(AttributeError):
        get_atom_data((10,), None) # 2.18μs -> 3.34μs (34.7% slower)

def test_get_atom_data_shape_none():
    """Test that shape=None does not affect the result."""
    codeflash_output = get_atom_data(None, "int8"); result = codeflash_output # 19.6μs -> 17.9μs (9.58% faster)

def test_get_atom_data_shape_strange_types():
    """Test that shape as a list or other iterable does not affect the result."""
    codeflash_output = get_atom_data([10, 10], "float64"); result = codeflash_output # 15.1μs -> 14.3μs (4.96% faster)
    codeflash_output = get_atom_data("ignored", "float32"); result2 = codeflash_output # 6.19μs -> 5.36μs (15.6% faster)

# ---------------------------
# Large Scale Test Cases
# ---------------------------


def test_get_atom_data_many_kinds():
    """Test many kinds in a loop (stress test, <1000 iterations)."""
    kinds = [
        "int8", "int16", "int32", "int64",
        "uint8", "uint16", "uint32", "uint64",
        "float32", "float64", "string", "bool"
    ]
    for kind in kinds * 80:  # 12*80=960 < 1000
        codeflash_output = get_atom_data((10,), kind); result = codeflash_output
        expected = kind.capitalize() + "Col" if not kind.startswith("uint") else "UInt" + kind[4:] + "Col"
        if kind.startswith("period"):
            expected = "Int64Col"
        if kind == "string":
            expected = "StringCol"
        if kind == "bool":
            expected = "BoolCol"

def test_get_atom_data_large_number_of_calls():
    """Test function stability with many calls (scalability, <1000)."""
    for i in range(900):
        codeflash_output = get_atom_data((i+1,), "int64"); result = codeflash_output # 2.11ms -> 1.87ms (12.4% faster)

def test_get_atom_data_large_shape_and_kind():
    """Test with large shape and kind string."""
    shape = tuple(range(100))  # shape with 100 dimensions
    codeflash_output = get_atom_data(shape, "float64"); result = codeflash_output # 15.9μs -> 15.0μs (6.05% faster)

def test_get_atom_data_period_kind_large():
    """Test period kind with large shape."""
    codeflash_output = get_atom_data((999, 999), "period[M]"); result = codeflash_output # 13.7μs -> 13.0μs (6.11% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-DataIndexableCol.get_atom_data-mhc4hbjy and push.

The optimization adds `@lru_cache(maxsize=64)` to the `get_atom_coltype` method, which provides an **11% overall speedup** by caching expensive attribute lookups. **Key optimization:** - The `getattr(_tables(), col_name)` call accounts for 97.3% of the original runtime according to profiling - `@lru_cache` caches the result of `get_atom_coltype` for each unique `kind` string, eliminating redundant `getattr` lookups on subsequent calls with the same kind - Added `from functools import lru_cache` import to enable this caching **Why this works:** - The `_tables()` function returns a module object, and `getattr()` on modules involves attribute resolution and potential string processing overhead - Column types for the same `kind` (e.g., "int64", "float32") are immutable and frequently reused in data processing workflows - The cache eliminates this expensive lookup after the first call for each kind **Test results show consistent improvements:** - Basic data types see 5-15% speedups (e.g., int64: 14.5% faster, float32: 8.97% faster) - Repeated calls benefit most (large scale test: 12.4% faster for 900 calls) - Cache hits on subsequent calls with the same kind provide near-instant returns - The 64-entry cache size is appropriate for typical PyTables column type variety without excessive memory usage This optimization is particularly effective for workloads that repeatedly create columns of the same data types, which is common in data processing pipelines.

codeflash-ai bot requested a review from mashraf-222 October 29, 2025 15:00

codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Oct 29, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

⚡️ Speed up method `DataIndexableCol.get_atom_data` by 12% #108

⚡️ Speed up method `DataIndexableCol.get_atom_data` by 12% #108

Uh oh!

codeflash-ai bot commented Oct 29, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

⚡️ Speed up method DataIndexableCol.get_atom_data by 12% #108

Are you sure you want to change the base?

⚡️ Speed up method DataIndexableCol.get_atom_data by 12% #108

Uh oh!

Conversation

codeflash-ai bot commented Oct 29, 2025

📄 12% (0.12x) speedup for DataIndexableCol.get_atom_data in pandas/io/pytables.py

📝 Explanation and details

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

⚡️ Speed up method `DataIndexableCol.get_atom_data` by 12% #108

⚡️ Speed up method `DataIndexableCol.get_atom_data` by 12% #108

📄 12% (0.12x) speedup for `DataIndexableCol.get_atom_data` in `pandas/io/pytables.py`