⚡️ Speed up function `_get_converter` by 5% #111

codeflash-ai · 2025-10-29T16:44:10Z

📄 5% (0.05x) speedup for `_get_converter` in `pandas/io/pytables.py`

⏱️ Runtime : 13.7 microseconds → 13.0 microseconds (best of 8 runs)

📝 Explanation and details

The optimization achieves a 5% speedup through two key changes:

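**Primary optimization: Replace substring search with prefix check**

- Changed `"datetime64" in kind` to `kind.startswith("datetime64")`.
- The `in` operator performs a general substring search and may examine every position in `kind`, while `startswith()` compares only the leading characters and stops at the first mismatch. For strings this short, the saving is tens of nanoseconds per call.
- This suits datetime64 dtype strings such as `"datetime64[D]"` and `"datetime64[ms]"`, which always start with `"datetime64"`.
- The two checks differ for a `kind` that contains `"datetime64"` anywhere other than at the start; the change assumes no such value reaches `_get_converter`.
- Test results show 10-11% improvements for `"datetime64[D]"` (1.04μs → 944ns and 1.06μs → 956ns); the `"datetime64[ms]"` test improved by 1.73% (1.18μs → 1.16μs).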
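The two checks can be compared in isolation with a small sketch. The helper names below (`matches_substring`, `matches_prefix`, `time_check`) are hypothetical and not part of pandas; the snippet only illustrates the membership test versus the prefix test and gives a rough way to time them. Absolute timings depend on the machine and Python version, so no expected numbers are shown.

```python
import timeit

PREFIX = "datetime64"


def matches_substring(kind: str) -> bool:
    # Original-style check: True if PREFIX occurs anywhere in `kind`.
    return PREFIX in kind


def matches_prefix(kind: str) -> bool:
    # Optimized-style check: True only if `kind` begins with PREFIX.
    return kind.startswith(PREFIX)


def time_check(check, kind: str, number: int = 200_000) -> float:
    # Total seconds taken by `number` calls of check(kind).
    return timeit.timeit(lambda: check(kind), number=number)


if __name__ == "__main__":
    for kind in ("datetime64[D]", "datetime64[ms]", "string", "not-datetime64"):
        print(
            f"{kind!r}: in={matches_substring(kind)} "
            f"startswith={matches_prefix(kind)} "
            f"t_in={time_check(matches_substring, kind):.4f}s "
            f"t_startswith={time_check(matches_prefix, kind):.4f}s"
        )
```

Running the script prints both results and both timings per input, which makes it easy to see whether the difference exceeds run-to-run noise on a given machine.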

**Secondary optimization: Reduce lambda overhead**

- Moved the `nan_rep = None` assignment outside the lambda for string converters.
- This is intended to reduce per-call overhead inside the lambda.
- It is a minor optimization, and none of the generated tests shown below use a string `kind`, so its contribution to the overall speedup is not measured separately.
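As a rough illustration of the two closure shapes involved, the sketch below builds a converter with the value written inline and another with the value bound once in the enclosing scope. `decode_items`, `make_inline`, and `make_hoisted` are hypothetical names, and `decode_items` is only a stand-in for the real string decoder; this is not the code from the PR, and it makes no claim about which form is faster.

```python
def decode_items(items, nan_rep=None, encoding="utf-8", errors="strict"):
    # Hypothetical stand-in for the string decoder: decode bytes and
    # substitute `nan_rep` for missing (None) entries.
    return [
        nan_rep if item is None else item.decode(encoding, errors)
        for item in items
    ]


def make_inline(encoding: str, errors: str):
    # `nan_rep=None` is written as a literal inside the lambda.
    return lambda x: decode_items(x, nan_rep=None, encoding=encoding, errors=errors)


def make_hoisted(encoding: str, errors: str):
    # `nan_rep` is bound once here and captured by the lambda as a closure variable.
    nan_rep = None
    return lambda x: decode_items(x, nan_rep=nan_rep, encoding=encoding, errors=errors)
```

Both factories return converters that produce the same output for the same input, so any difference between them is purely in per-call cost, which would need to be measured on the string path to confirm.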

The optimizations are most effective for:

- Custom datetime64 types with units (10%+ speedup for `"datetime64[D]"` in tests)
- Cases involving the datetime64 prefix check path
- Repeated converter usage where lambda overhead matters

The changes are intended to preserve behavior (all 13 generated regression tests pass) while relying on Python's built-in string methods and reducing per-call overhead.
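One way to check the behavior-preservation claim is to run both forms of the branch condition over the `kind` values used in the tests. The dispatchers below are a hypothetical sketch: their branch order and return values are assumptions made for illustration (they return descriptive strings rather than real converters) and are not copied from `pandas/io/pytables.py`.

```python
def dispatch_original(kind: str) -> str:
    # Hypothetical dispatcher using the substring check.
    if kind == "datetime64":
        return "datetime64[ns] converter"
    elif "datetime64" in kind:
        return f"{kind} converter"
    elif kind == "string":
        return "string converter"
    raise ValueError(f"invalid kind {kind}")


def dispatch_optimized(kind: str) -> str:
    # Same hypothetical dispatcher using the prefix check.
    if kind == "datetime64":
        return "datetime64[ns] converter"
    elif kind.startswith("datetime64"):
        return f"{kind} converter"
    elif kind == "string":
        return "string converter"
    raise ValueError(f"invalid kind {kind}")


def outcome(dispatch, kind: str) -> str:
    # Normalize a result or a ValueError into a comparable string.
    try:
        return dispatch(kind)
    except ValueError:
        return "ValueError"


def disagreements(kinds):
    # Return the kinds for which the two dispatchers behave differently.
    return [
        k for k in kinds
        if outcome(dispatch_original, k) != outcome(dispatch_optimized, k)
    ]
```

Calling `disagreements` with the values exercised by the generated tests (`"datetime64"`, `"datetime64[D]"`, `"datetime64[ms]"`, `"foobar"`, `"int64"`) plus `"string"` returns an empty list under these assumptions; a value such as `"xdatetime64"` would be reported, since only the substring check accepts it.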

✅ Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 13 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime

import numpy as np
# imports
import pytest  # used for our unit tests
from pandas.io.pytables import _get_converter


# function to test
def _unconvert_string_array(x, nan_rep=None, encoding="utf-8", errors="strict"):
    # Dummy implementation for testing purposes
    # Decodes bytes to string, handles None as nan_rep
    if nan_rep is not None:
        return [item.decode(encoding, errors) if item is not None else nan_rep for item in x]
    else:
        return [item.decode(encoding, errors) if item is not None else None for item in x]

# unit tests

# -------- BASIC TEST CASES --------

def test_datetime64_basic():
    # Test with kind == "datetime64"
    codeflash_output = _get_converter("datetime64", "utf-8", "strict"); converter = codeflash_output # 917ns -> 899ns (2.00% faster)
    arr = ["2020-01-01", "2021-01-01"]
    result = converter(arr)

def test_datetime64_custom_unit():
    # Test with kind containing "datetime64" and a custom unit
    codeflash_output = _get_converter("datetime64[D]", "utf-8", "strict"); converter = codeflash_output # 1.04μs -> 944ns (10.3% faster)
    arr = ["2020-01-01", "2021-01-01"]
    result = converter(arr)



def test_datetime64_empty_list():
    # Test with empty list for datetime64
    codeflash_output = _get_converter("datetime64", "utf-8", "strict"); converter = codeflash_output # 862ns -> 881ns (2.16% slower)
    arr = []
    result = converter(arr)





def test_invalid_kind_raises():
    # Test with invalid kind
    with pytest.raises(ValueError) as excinfo:
        _get_converter("foobar", "utf-8", "strict") # 1.84μs -> 1.68μs (9.90% faster)

def test_datetime64_with_nan():
    # Test with NaT for datetime64
    codeflash_output = _get_converter("datetime64", "utf-8", "strict"); converter = codeflash_output # 839ns -> 820ns (2.32% faster)
    arr = [np.datetime64("NaT"), "2020-01-01"]
    result = converter(arr)

def test_datetime64_with_mixed_types():
    # Test with mixed types: string and np.datetime64
    codeflash_output = _get_converter("datetime64", "utf-8", "strict"); converter = codeflash_output # 702ns -> 703ns (0.142% slower)
    arr = ["2020-01-01", np.datetime64("2021-01-01")]
    result = converter(arr)





def test_datetime64_large_scale_nat():
    # Test with large list including NaT values
    codeflash_output = _get_converter("datetime64", "utf-8", "strict"); converter = codeflash_output # 896ns -> 898ns (0.223% slower)
    arr = ["2020-01-01"] * 995 + [np.datetime64("NaT")] * 5
    result = converter(arr)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from __future__ import annotations

import numpy as np
# imports
import pytest  # used for our unit tests
from pandas.io.pytables import _get_converter


def _unconvert_string_array(x, nan_rep=None, encoding="utf-8", errors="strict"):
    # For testing purposes, a minimal implementation:
    # Decodes bytes to string if needed, leaves strings as is.
    # If nan_rep is specified, replaces nan_rep with None.
    result = []
    for item in x:
        if isinstance(item, bytes):
            try:
                s = item.decode(encoding, errors)
            except Exception:
                s = None
        else:
            s = item
        if nan_rep is not None and s == nan_rep:
            s = None
        result.append(s)
    return result

# unit tests

# ------------------------
# BASIC TEST CASES
# ------------------------

def test_datetime64_basic_conversion():
    # Test basic conversion for kind='datetime64'
    codeflash_output = _get_converter("datetime64", "utf-8", "strict"); converter = codeflash_output # 870ns -> 818ns (6.36% faster)
    # Should convert list of datetime64 strings to numpy datetime64[ns]
    arr = ["2023-01-01", "2024-06-01"]
    result = converter(arr)

def test_datetime64_custom_unit():
    # Test conversion for kind='datetime64[D]' (days)
    codeflash_output = _get_converter("datetime64[D]", "utf-8", "strict"); converter = codeflash_output # 1.06μs -> 956ns (11.1% faster)
    arr = ["2023-01-01", "2024-06-01"]
    result = converter(arr)




def test_empty_list_datetime64():
    # Test empty input for datetime64
    codeflash_output = _get_converter("datetime64", "utf-8", "strict"); converter = codeflash_output # 888ns -> 907ns (2.09% slower)
    arr = []
    result = converter(arr)


def test_invalid_kind_raises():
    # Test that invalid kind raises ValueError
    with pytest.raises(ValueError) as excinfo:
        _get_converter("int64", "utf-8", "strict") # 1.67μs -> 1.64μs (1.71% faster)

def test_datetime64_invalid_date():
    # Test conversion of invalid date string for datetime64
    codeflash_output = _get_converter("datetime64", "utf-8", "strict"); converter = codeflash_output # 898ns -> 695ns (29.2% faster)
    arr = ["not-a-date"]
    # numpy will raise ValueError on invalid date string
    with pytest.raises(ValueError):
        converter(arr)




def test_datetime64_with_milliseconds():
    # Test conversion for kind='datetime64[ms]'
    codeflash_output = _get_converter("datetime64[ms]", "utf-8", "strict"); converter = codeflash_output # 1.18μs -> 1.16μs (1.73% faster)
    arr = ["2023-01-01T12:34:56.789"]
    result = converter(arr)

To edit these changes, run `git checkout codeflash/optimize-_get_converter-mhc86ij9` and push.


codeflash-ai bot requested a review from mashraf-222 October 29, 2025 16:44

codeflash-ai bot added the `⚡️ codeflash` (Optimization PR opened by Codeflash AI) and `🎯 Quality: Medium` (Optimization Quality according to Codeflash) labels Oct 29, 2025
