@codeflash-ai codeflash-ai bot commented Oct 29, 2025

📄 13% (0.13x) speedup for can_set_locale in pandas/_config/localization.py

⏱️ Runtime: 50.3 milliseconds → 44.5 milliseconds (best of 62 runs)
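As a quick check on how the headline figure relates to the two runtimes, the sketch below computes both a speedup ratio and a runtime reduction. The helper names are hypothetical, and the formula `original / optimized - 1` is an assumption that happens to match the reported 13%; it is not taken from Codeflash's documentation.

```python
def speedup_pct(original_ms: float, optimized_ms: float) -> float:
    """Speedup as a percentage: how much more work fits in the same time."""
    return (original_ms / optimized_ms - 1.0) * 100.0


def runtime_reduction_pct(original_ms: float, optimized_ms: float) -> float:
    """Share of the original runtime that was removed, as a percentage."""
    return (1.0 - optimized_ms / original_ms) * 100.0


# Runtimes reported in this comment (best of 62 runs).
original_ms, optimized_ms = 50.3, 44.5
print(f"speedup:           {speedup_pct(original_ms, optimized_ms):.1f}%")
print(f"runtime reduction: {runtime_reduction_pct(original_ms, optimized_ms):.1f}%")
```

Under that assumption the two metrics round to 13% and 12% respectively, which may explain why both numbers appear in this comment.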

📝 Explanation and details

The optimization achieves a 13% speedup (roughly 12% less total runtime) through two key changes:

1. String concatenation optimization in set_locale:
Replaced f-string formatting f"{normalized_code}.{normalized_encoding}" with direct concatenation normalized_code + '.' + normalized_encoding. Because f-strings are compiled to bytecode rather than parsed at run time, any gain from this change is likely small and worth confirming with a micro-benchmark on the target Python version.
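One way to check the string-building claim in isolation is a small `timeit` comparison. This is a sketch only: the function names and sample values are hypothetical, and the result depends on the Python version and machine, so no expected timing is shown.

```python
import timeit

# Sample values standing in for what locale.getlocale() might return.
normalized_code = "en_US"
normalized_encoding = "UTF-8"


def build_with_fstring() -> str:
    return f"{normalized_code}.{normalized_encoding}"


def build_with_concat() -> str:
    return normalized_code + "." + normalized_encoding


def compare(number: int = 1_000_000) -> dict[str, float]:
    """Return the best-of-5 total time in seconds for each variant."""
    return {
        "fstring": min(timeit.repeat(build_with_fstring, number=number, repeat=5)),
        "concat": min(timeit.repeat(build_with_concat, number=number, repeat=5)),
    }


# Both variants must produce the same string before timing means anything.
assert build_with_fstring() == build_with_concat()
print(compare(number=100_000))
```

Taking the minimum of several repeats reduces noise from other processes, which matters when the difference being measured is only a few nanoseconds per call.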

2. Context manager elimination in can_set_locale:
The most significant optimization replaces the with set_locale(lc, lc_var=lc_var): context manager call with direct try/finally logic. This eliminates:

  • Function call overhead to set_locale
  • Generator setup and teardown costs from the @contextmanager decorator
  • Additional stack frame creation
  • Unnecessary locale normalization logic (the function only needs to test if locale setting succeeds, not return the normalized result)
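The optimized function body is not reproduced in this comment, so the following is only a minimal sketch of the direct try/finally pattern the list above describes. The name `can_set_locale_direct` is hypothetical, and keeping `ValueError` in the `except` clause alongside `locale.Error` is an assumption, not a statement about the code in the PR.

```python
import locale


def can_set_locale_direct(lc, lc_var: int = locale.LC_ALL) -> bool:
    """Return True if `lc` can be set for category `lc_var`, else False.

    The previous locale is always restored, whether or not setting succeeds.
    """
    try:
        # Calling setlocale with no second argument only queries the current value.
        saved = locale.setlocale(lc_var)
        try:
            locale.setlocale(lc_var, lc)
        finally:
            # Restore the original locale without a generator-based context manager.
            locale.setlocale(lc_var, saved)
    except (ValueError, locale.Error):
        return False
    return True


print(can_set_locale_direct("C"))
print(can_set_locale_direct("not_a_real_locale"))
```

As in the tests quoted in this comment, a `TypeError` from a non-string, non-tuple argument is deliberately not caught and propagates to the caller.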

Performance characteristics:
The optimization is most effective for scenarios with frequent can_set_locale calls, showing 15-58% improvements on invalid locales and 19-45% on valid ones. The direct approach is particularly beneficial when testing many locales in batch operations, as seen in the stress tests showing 15-36% improvements across 1000+ iterations.
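To see how much of a per-call gain could come from the context manager machinery alone, the overhead can be measured without touching the locale at all. The sketch below compares a no-op `@contextmanager` with a bare try/finally; all names are hypothetical and the numbers will vary by interpreter and machine.

```python
import timeit
from contextlib import contextmanager


@contextmanager
def noop_cm():
    # Mirrors the try/yield/finally shape of a generator-based context manager.
    try:
        yield
    finally:
        pass


def via_context_manager() -> bool:
    with noop_cm():
        pass
    return True


def via_try_finally() -> bool:
    try:
        pass
    finally:
        pass
    return True


def per_call_ns(fn, number: int = 100_000) -> float:
    """Best-of-5 average cost of one call to `fn`, in nanoseconds."""
    best = min(timeit.repeat(fn, number=number, repeat=5))
    return best / number * 1e9


print(f"context manager: {per_call_ns(via_context_manager):.0f} ns/call")
print(f"try/finally:     {per_call_ns(via_try_finally):.0f} ns/call")
```

Comparing the difference against the microsecond-scale timings in the tables gives a rough sense of how much of each improvement is fixed per-call overhead rather than locale work.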

The changes maintain identical behavior and error handling while reducing the call stack depth and computational overhead for this boolean validation function.

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 116 Passed |
| 🌀 Generated Regression Tests | 5628 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
⚙️ Existing Unit Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---|---:|---:|---:|
| config/test_localization.py::test_can_set_current_locale | 27.2μs | 18.8μs | 44.5%✅ |
| config/test_localization.py::test_can_set_locale_invalid_get | 29.7μs | 27.2μs | 9.04%✅ |
| config/test_localization.py::test_can_set_locale_invalid_set | 32.0μs | 23.5μs | 36.6%✅ |
| config/test_localization.py::test_can_set_locale_no_leak | 517μs | 487μs | 6.15%✅ |
| config/test_localization.py::test_can_set_locale_valid_set | 153μs | 146μs | 5.37%✅ |
| io/json/test_ujson.py::TestUltraJSONTests.test_encode_non_c_locale | 11.9μs | 7.75μs | 54.0%✅ |
🌀 Generated Regression Tests and Runtime
from __future__ import annotations

import locale
from collections.abc import Generator
from contextlib import contextmanager

# imports
import pytest
from pandas._config.localization import can_set_locale

# unit tests

# ----------- Basic Test Cases ------------

def test_can_set_locale_valid_utf8():
    # Most systems support 'C' locale
    codeflash_output = can_set_locale('C') # 13.1μs -> 10.6μs (24.5% faster)
    # Most systems support 'en_US.UTF-8' or 'en_US.utf8'
    # But not all systems have this locale, so we use a fallback
    # Try to get a valid locale from the system
    valid_locale = None
    for loc in locale.locale_alias.values():
        if "utf" in loc and "." in loc:
            if can_set_locale(loc):
                valid_locale = loc
                break
    if valid_locale:
        codeflash_output = can_set_locale(valid_locale)

def test_can_set_locale_invalid_locale():
    # Should return False for a clearly invalid locale
    codeflash_output = can_set_locale('not_a_real_locale') # 13.2μs -> 8.30μs (58.9% faster)
    # An empty string selects the user's default locale, so this normally succeeds
    codeflash_output = can_set_locale('') # 111μs -> 112μs (0.397% slower)

def test_can_set_locale_valid_tuple():
    # Accepts tuple as locale
    # Try to find a valid tuple locale from the system
    loc = locale.getlocale()
    if all(loc) and can_set_locale(loc[0] + '.' + loc[1]):
        codeflash_output = can_set_locale((loc[0], loc[1])) # 13.3μs -> 11.1μs (19.5% faster)

def test_can_set_locale_valid_lc_var():
    # Should work for other locale categories, e.g. LC_NUMERIC
    codeflash_output = can_set_locale('C', lc_var=locale.LC_NUMERIC) # 8.76μs -> 6.02μs (45.6% faster)

# ----------- Edge Test Cases ------------

def test_can_set_locale_case_sensitivity():
    # Locale names are case-sensitive on some systems
    # Try lower/upper case variants
    codeflash_output = can_set_locale('C') # 12.5μs -> 10.1μs (23.4% faster)
    codeflash_output = can_set_locale('c') # 95.1μs -> 87.8μs (8.34% faster)

def test_can_set_locale_partial_locale_string():
    # Incomplete locale strings may or may not be settable, depending on the system
    codeflash_output = can_set_locale('en_US') # 22.8μs -> 17.5μs (29.7% faster)
    codeflash_output = can_set_locale('UTF-8') # 57.3μs -> 51.9μs (10.4% faster)

def test_can_set_locale_non_string_input():
    # Should raise TypeError for non-string, non-tuple input
    with pytest.raises(TypeError):
        can_set_locale(12345) # 8.43μs -> 6.28μs (34.3% faster)

def test_can_set_locale_tuple_with_invalid_values():
    # Tuple with invalid values should return False
    codeflash_output = can_set_locale(('not_a_lang', 'not_an_encoding')) # 26.8μs -> 25.8μs (3.92% faster)


def test_can_set_locale_valid_locale_with_space():
    # Locale strings with spaces are invalid
    codeflash_output = can_set_locale('en US.UTF-8') # 103μs -> 102μs (0.713% faster)

def test_can_set_locale_valid_locale_with_special_chars():
    # Locale strings with special characters are invalid
    codeflash_output = can_set_locale('en_US@UTF-8') # 43.7μs -> 41.1μs (6.38% faster)

def test_can_set_locale_valid_locale_with_dash():
    # Some systems may use dash instead of underscore, but it's usually invalid
    codeflash_output = can_set_locale('en-US.UTF-8') # 88.5μs -> 85.2μs (3.92% faster)

def test_can_set_locale_valid_locale_with_extra_dot():
    # Extra dot in locale string should be invalid
    codeflash_output = can_set_locale('en_US..UTF-8') # 18.0μs -> 14.6μs (23.2% faster)

# ----------- Large Scale Test Cases ------------

def test_can_set_locale_many_locales():
    # Test can_set_locale on a large number of locale strings
    # Use locale.locale_alias for a diverse set
    aliases = list(locale.locale_alias.values())
    tested = 0
    for loc in aliases[:1000]:  # limit to 1000 for performance
        # Only test string locales
        if isinstance(loc, str):
            codeflash_output = can_set_locale(loc); result = codeflash_output
            tested += 1

def test_can_set_locale_stress_valid_locale():
    # Stress test with repeated valid locale setting
    # Use 'C' locale, which is always present
    for _ in range(1000):
        codeflash_output = can_set_locale('C') # 3.78ms -> 2.77ms (36.3% faster)

def test_can_set_locale_stress_invalid_locale():
    # Stress test with repeated invalid locale setting
    for _ in range(1000):
        codeflash_output = can_set_locale('not_a_real_locale') # 6.39ms -> 5.54ms (15.4% faster)

def test_can_set_locale_large_tuple_list():
    # Stress test with a large list of tuple locales
    tuples = [('en_US', 'UTF-8')] * 1000
    for tup in tuples:
        # Should return bool, True only if locale exists
        codeflash_output = can_set_locale(tup); result = codeflash_output # 7.96ms -> 6.88ms (15.7% faster)

def test_can_set_locale_mixed_types_large_scale():
    # Mix valid and invalid, string and tuple
    locales = ['C', 'not_a_real_locale', ('en_US', 'UTF-8'), ('bad', 'bad')] * 250
    for loc in locales:
        codeflash_output = can_set_locale(loc); result = codeflash_output # 10.9ms -> 9.87ms (10.6% faster)
        # Check that valid locales return True, invalid return False
        if loc == 'C':
            pass
        elif loc == 'not_a_real_locale':
            pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from __future__ import annotations

import locale
from collections.abc import Generator
from contextlib import contextmanager

# imports
import pytest
from pandas._config.localization import can_set_locale

# unit tests

# ---------- Basic Test Cases ----------

def test_valid_locale_string():
    # Most systems support 'C' locale
    codeflash_output = can_set_locale('C') # 14.4μs -> 11.8μs (22.1% faster)


def test_invalid_locale_string():
    # Gibberish locale should not be settable
    codeflash_output = can_set_locale('not_a_real_locale') # 20.1μs -> 16.8μs (19.8% faster)

def test_valid_locale_tuple():
    # Locale can be set with tuple (language code, encoding)
    # 'en_US' and 'UTF-8' is common, but may not be available everywhere
    codeflash_output = can_set_locale(('en_US', 'UTF-8')); result = codeflash_output # 28.6μs -> 25.5μs (11.9% faster)

def test_empty_string_locale():
    # Empty string means default locale, should not raise
    codeflash_output = can_set_locale('') # 27.4μs -> 25.3μs (8.60% faster)

def test_numeric_locale():
    # Locale as a number should raise TypeError
    with pytest.raises(TypeError):
        can_set_locale(1234) # 9.23μs -> 6.68μs (38.3% faster)

# ---------- Edge Test Cases ----------


def test_partial_locale_string():
    # 'en_US' without encoding may or may not be valid
    codeflash_output = can_set_locale('en_US'); result = codeflash_output # 32.8μs -> 16.7μs (96.7% faster)

def test_long_gibberish_locale():
    # Excessively long invalid locale string
    codeflash_output = can_set_locale('x' * 500) # 8.34μs -> 5.42μs (54.0% faster)

def test_special_characters_locale():
    # Locale string with special characters should fail
    codeflash_output = can_set_locale('en_US@#$.UTF-8') # 17.0μs -> 14.2μs (19.4% faster)

def test_unicode_locale():
    # Unicode characters in locale string
    codeflash_output = can_set_locale('日本語.UTF-8') # 10.1μs -> 7.58μs (33.9% faster)

def test_valid_locale_with_different_lc_var():
    # Try LC_TIME, which is usually settable
    codeflash_output = can_set_locale('C', lc_var=locale.LC_TIME) # 10.4μs -> 7.98μs (30.6% faster)

def test_invalid_lc_var():
    # Pass an invalid lc_var (not an int)
    with pytest.raises(TypeError):
        can_set_locale('C', lc_var='not_an_int') # 4.74μs -> 2.16μs (120% faster)

def test_tuple_with_invalid_encoding():
    # Tuple with invalid encoding
    codeflash_output = can_set_locale(('en_US', 'not_an_encoding')) # 34.1μs -> 31.3μs (8.82% faster)

def test_tuple_with_empty_encoding():
    # Tuple with empty encoding
    codeflash_output = can_set_locale(('en_US', '')); result = codeflash_output # 25.0μs -> 21.0μs (19.1% faster)

def test_tuple_with_special_characters():
    # Tuple with special characters in encoding
    codeflash_output = can_set_locale(('en_US', '@#')) # 35.7μs -> 27.8μs (28.7% faster)

def test_case_sensitivity():
    # Locale strings are case-sensitive; 'EN_us.UTF-8' likely fails
    codeflash_output = can_set_locale('EN_us.UTF-8') # 26.0μs -> 19.6μs (32.7% faster)

# ---------- Large Scale Test Cases ----------

def test_many_invalid_locales():
    # Try 100 invalid locales, all should return False
    for i in range(100):
        codeflash_output = can_set_locale(f'invalid_locale_{i}') # 515μs -> 404μs (27.5% faster)

def test_many_valid_like_locales():
    # Try 100 locales that look valid but probably aren't
    for i in range(100):
        codeflash_output = can_set_locale(f'en_US.UTF-{i}'); result = codeflash_output # 1.58ms -> 1.50ms (5.38% faster)


def test_stress_many_empty_strings():
    # 500 times with empty string (should always succeed)
    for _ in range(500):
        codeflash_output = can_set_locale('') # 7.60ms -> 6.95ms (9.36% faster)

def test_stress_many_special_char_locales():
    # 200 special character locale strings
    for i in range(200):
        codeflash_output = can_set_locale(f'en_US@{i}!$%') # 2.44ms -> 2.29ms (6.64% faster)

def test_stress_many_tuple_locales():
    # 100 tuples with valid language, random encoding
    for i in range(100):
        codeflash_output = can_set_locale(('en_US', f'UTF-{i}')); result = codeflash_output # 1.60ms -> 1.52ms (5.25% faster)

def test_stress_different_lc_vars():
    # Try setting locale with all LC_* constants
    lc_vars = [getattr(locale, name) for name in dir(locale) if name.startswith('LC_') and isinstance(getattr(locale, name), int)]
    for lc_var in lc_vars:
        codeflash_output = can_set_locale('C', lc_var=lc_var) # 36.1μs -> 25.5μs (41.6% faster)

# ---------- Determinism and Mutation Safety ----------

def test_mutation_safety_false_positive():
    # If can_set_locale always returns True, this test should fail
    codeflash_output = can_set_locale('not_a_real_locale') # 10.8μs -> 9.40μs (14.5% faster)

def test_mutation_safety_false_negative():
    # If can_set_locale always returns False, this test should fail
    codeflash_output = can_set_locale('C') # 12.5μs -> 9.60μs (30.2% faster)

def test_mutation_safety_exception_handling():
    # If exception handling is removed, this should raise
    codeflash_output = can_set_locale('not_a_real_locale') # 11.9μs -> 6.89μs (73.0% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run git checkout codeflash/optimize-can_set_locale-mhbmmfkj and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 29, 2025 06:40
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels Oct 29, 2025