@codeflash-ai codeflash-ai bot commented Nov 20, 2025

📄 31% (0.31x) speedup for sanitize_filename in skyvern/forge/sdk/api/files.py

⏱️ Runtime : 539 microseconds → 411 microseconds (best of 250 runs)
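
The headline figure is consistent with computing speedup as the original runtime divided by the optimized runtime, minus one. A quick check, assuming that definition (the report does not state the formula explicitly):

```python
# Runtimes reported above, in microseconds.
original_us = 539
optimized_us = 411

# Assumed definition: speedup = original / optimized - 1.
speedup = original_us / optimized_us - 1

print(f"{speedup:.2f}x")  # 0.31x
print(f"{speedup:.0%}")   # 31%
```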

📝 Explanation and details

The optimization achieves a 31% speedup by making two key changes that improve both membership checking and string construction performance:

What was optimized:

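  1. Set-based membership checking: Replaced the inline list ["-", "_", ".", "%", " "] with a pre-created set {"-", "_", ".", "%", " "}, so each membership test is a hash lookup (O(1) on average) instead of a linear scan of the list.
  2. List comprehension over generator expression: Changed the argument of str.join() from a generator expression to a list comprehension. In CPython, str.join() first materializes any iterable that is not already a list or tuple into a list, so building the list directly skips the per-item cost of resuming a generator.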
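
The diff itself is not reproduced in this comment, so the following is a minimal sketch of the two versions as described, not the actual code in skyvern/forge/sdk/api/files.py. The names `sanitize_filename_before`, `sanitize_filename_after`, and `_ALLOWED_CHARS` are hypothetical; the filtering rule (keep `c` when `c.isalnum()` or `c` is one of the five symbols) is taken from the expected-value expressions in the generated tests.

```python
# Hypothetical reconstruction; the real function bodies may differ.


def sanitize_filename_before(filename: str) -> str:
    # Generator expression, with a list literal in the membership test.
    return "".join(c for c in filename if c.isalnum() or c in ["-", "_", ".", "%", " "])


# Assumed placement: built once at import time, not on every call.
_ALLOWED_CHARS = {"-", "_", ".", "%", " "}


def sanitize_filename_after(filename: str) -> str:
    # List comprehension, with set membership.
    return "".join([c for c in filename if c.isalnum() or c in _ALLOWED_CHARS])


sample = "safe_file@!#name.txt"
print(sanitize_filename_before(sample))  # safe_filename.txt
print(sanitize_filename_after(sample))   # safe_filename.txt
```

Both versions apply the same predicate in the same order, so the output should be identical for any input; only the lookup structure and the way the sequence is handed to `str.join()` change.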

Why this leads to speedup:

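  • Set lookup efficiency: When checking c in allowed, Python must compare against each element of a list in turn (5 comparisons in the worst case, which is every rejected character), but a set uses a hash-based lookup (constant time on average). Assuming the same c.isalnum() or c in allowed ordering used in the tests' expected-value expressions, the check only runs for characters that fail c.isalnum(), so the saving accumulates on filenames with many symbols.
  • Reduced generator overhead: str.join() needs the full sequence before it can size the result, so in CPython a generator argument is first drained into a list, resuming the generator once per item. A list comprehension builds that list directly, providing a small but consistent performance gain.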
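
A rough way to check the combined effect locally is a `timeit` micro-benchmark. This is a sketch, not the harness Codeflash used; the function names, input string, and repetition counts are arbitrary choices, and absolute numbers will vary by machine and Python version.

```python
import timeit

ALLOWED_LIST = ["-", "_", ".", "%", " "]
ALLOWED_SET = set(ALLOWED_LIST)

# Arbitrary sample: a mix of kept and rejected characters.
text = "file@name#2024!.txt" * 50


def with_list_genexpr(s: str) -> str:
    # List membership, generator expression passed to join.
    return "".join(c for c in s if c.isalnum() or c in ALLOWED_LIST)


def with_set_listcomp(s: str) -> str:
    # Set membership, list comprehension passed to join.
    return "".join([c for c in s if c.isalnum() or c in ALLOWED_SET])


# Both variants must agree before timing means anything.
assert with_list_genexpr(text) == with_set_listcomp(text)

for fn in (with_list_genexpr, with_set_listcomp):
    # Best of 5 repeats of 200 calls each, reported per call.
    best = min(timeit.repeat(lambda: fn(text), number=200, repeat=5)) / 200
    print(f"{fn.__name__}: {best * 1e6:.1f} µs per call")
```

Taking the minimum over several repeats reduces the influence of scheduler noise, which matters at microsecond scale.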

Performance patterns from tests:

  • Best gains on files with many disallowed characters (34-84% faster) - a rejected character is the worst case for the list scan, since every element is compared before the test fails, so the set lookup helps most here
  • Largest scale benefits on long filenames (24-56% faster) - the saving applies per character, so it grows with input length
  • Minimal overhead on simple cases (some 1-13% slower on very short inputs) - on inputs of a few characters, fixed per-call costs dominate and the absolute difference is roughly 0.2μs or less

Impact on workloads:
Based on the function references, sanitize_filename is called in file download/upload workflows (download_file, rename_file, create_named_temporary_file). These workflows are likely I/O-bound, so a saving of microseconds per call is small next to network and disk time; the gain matters most when processing many files or long names containing many special characters.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 82 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import string  # used for generating large-scale test cases

# imports
import pytest  # used for our unit tests
from skyvern.forge.sdk.api.files import sanitize_filename

# unit tests

# -------------------------
# 1. Basic Test Cases
# -------------------------

def test_basic_alphanumeric():
    # Alphanumeric-only input should be returned unchanged
    codeflash_output = sanitize_filename("abc123") # 1.33μs -> 1.46μs (9.03% slower)
    codeflash_output = sanitize_filename("FileName2024") # 980ns -> 1.05μs (7.02% slower)

def test_basic_allowed_symbols():
    # Should keep allowed symbols: -, _, ., %, and space
    codeflash_output = sanitize_filename("file-name_1.0% completed.txt") # 2.47μs -> 2.28μs (8.41% faster)
    codeflash_output = sanitize_filename("a_b-c.d%e f") # 1.12μs -> 982ns (13.8% faster)

def test_basic_mixed_allowed_and_disallowed():
    # Should remove disallowed symbols but keep allowed ones
    codeflash_output = sanitize_filename("doc$%^&*()[]{};:'\",<>/?|\\") # 2.58μs -> 2.13μs (21.0% faster)
    codeflash_output = sanitize_filename("safe_file@!#name.txt") # 1.47μs -> 1.30μs (12.9% faster)

def test_basic_empty_string():
    # Empty string should remain empty
    codeflash_output = sanitize_filename("") # 926ns -> 781ns (18.6% faster)

def test_basic_only_disallowed():
    # Only disallowed characters should return empty string
    codeflash_output = sanitize_filename("!@#$^&*()[]{};:'\",<>/?|\\") # 2.49μs -> 1.85μs (34.0% faster)

def test_basic_only_allowed_symbols():
    # Only allowed symbols should be preserved
    codeflash_output = sanitize_filename("-_.% ") # 1.40μs -> 1.48μs (5.55% slower)

# -------------------------
# 2. Edge Test Cases
# -------------------------

def test_edge_unicode_and_non_ascii():
    # Unicode letters satisfy str.isalnum(), so these names are preserved
    codeflash_output = sanitize_filename("ファイル名.txt") # 2.10μs -> 2.13μs (1.55% slower)
    codeflash_output = sanitize_filename("fílè_nâmé.txt") # 1.37μs -> 1.36μs (0.587% faster)
    # Emoji are not alphanumeric and should be removed
    codeflash_output = sanitize_filename("file😀name.txt") # 1.36μs -> 1.27μs (7.08% faster)
    codeflash_output = sanitize_filename("file💾name%20.txt") # 1.15μs -> 1.01μs (14.0% faster)

def test_edge_control_and_whitespace_characters():
    # Control characters (like \n, \t, etc.) should be removed
    codeflash_output = sanitize_filename("file\nname\t.txt") # 1.72μs -> 1.62μs (6.24% faster)
    # Multiple spaces should be preserved as spaces
    codeflash_output = sanitize_filename("file   name.txt") # 1.27μs -> 1.09μs (16.7% faster)
    # Leading/trailing spaces should be preserved
    codeflash_output = sanitize_filename("  file.txt  ") # 1.02μs -> 892ns (14.8% faster)

def test_edge_mixed_case_and_symbols():
    # Case sensitivity: both upper and lower should be preserved
    codeflash_output = sanitize_filename("FiLe-Name_2024.txt") # 1.80μs -> 1.83μs (1.69% slower)
    # Mixed allowed/disallowed symbols
    codeflash_output = sanitize_filename("F!i@l#e$%_N^a&m*e(2)0_2+4.txt") # 2.07μs -> 1.73μs (20.2% faster)

def test_edge_only_spaces():
    # Only spaces should be preserved
    codeflash_output = sanitize_filename("     ") # 1.32μs -> 1.37μs (4.08% slower)

def test_edge_filename_with_dot_variations():
    # Multiple dots and leading/trailing dots should be preserved
    codeflash_output = sanitize_filename("...file...name...") # 1.96μs -> 1.93μs (1.60% faster)

def test_edge_filename_with_percent():
    # Percent symbol is allowed
    codeflash_output = sanitize_filename("file%20name.txt") # 1.75μs -> 1.69μs (4.15% faster)

def test_edge_filename_with_dash_and_underscore():
    # Dashes and underscores are allowed
    codeflash_output = sanitize_filename("_-file-_name-.txt") # 1.88μs -> 1.91μs (1.31% slower)

def test_edge_filename_with_only_one_char():
    # Single allowed character
    codeflash_output = sanitize_filename("a") # 1.05μs -> 1.12μs (6.09% slower)
    # Single disallowed character
    codeflash_output = sanitize_filename("?") # 807ns -> 517ns (56.1% faster)

def test_edge_filename_with_all_ascii_printable():
    # Should keep only allowed ones from all printable ASCII
    all_ascii = "".join(chr(i) for i in range(32, 127))
    expected = "".join(c for c in all_ascii if c.isalnum() or c in ["-", "_", ".", "%", " "])
    codeflash_output = sanitize_filename(all_ascii) # 4.41μs -> 4.20μs (5.02% faster)

# -------------------------
# 3. Large Scale Test Cases
# -------------------------

def test_large_scale_long_filename():
    # Very long filename with mixed allowed/disallowed characters
    base = "file_name-2024.% " * 50  # 850 chars, all allowed
    noise = "!@#$^&*()[]{};:'\",<>/?|\n\t" * 30  # 750 chars, none allowed
    long_input = base + noise + base
    expected = base + base
    codeflash_output = sanitize_filename(long_input) # 96.2μs -> 63.0μs (52.7% faster)

def test_large_scale_only_disallowed():
    # Long string of only disallowed characters
    disallowed = "".join(chr(i) for i in range(33, 127) if not (chr(i).isalnum() or chr(i) in ["-", "_", ".", "%", " "]))
    long_disallowed = disallowed * 10  # <1000 chars
    codeflash_output = sanitize_filename(long_disallowed) # 16.0μs -> 8.69μs (83.8% faster)

def test_large_scale_all_allowed():
    # Long string of only allowed characters
    allowed = string.ascii_letters + string.digits + "-_.% "
    long_allowed = allowed * 10  # <1000 chars
    codeflash_output = sanitize_filename(long_allowed) # 20.3μs -> 16.4μs (24.2% faster)

def test_large_scale_mixed_unicode_ascii():
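    # Mix of allowed ASCII and a large amount of Unicode
    allowed = "file_name-2024.% "
    unicode_noise = "ファイル名😀💾" * 100  # 700 chars
    large_input = allowed * 40 + unicode_noise
    # The kana and kanji satisfy str.isalnum() and are kept; only the emoji are removed
    expected = allowed * 40 + "ファイル名" * 100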
    codeflash_output = sanitize_filename(large_input) # 61.2μs -> 50.9μs (20.2% faster)

def test_large_scale_performance():
    # Performance: should not take too long for near-1000 char string
    long_input = ("abc-_.% " * 100) + ("💩" * 100)  # 800 + 100 = 900 chars
    expected = "abc-_.% " * 100
    codeflash_output = sanitize_filename(long_input) # 39.0μs -> 29.4μs (32.8% faster)

# -------------------------
# 4. Mutation-sensitive (robustness) Test Cases
# -------------------------

def test_mutation_sensitive_removes_only_disallowed():
    # If function does not remove all disallowed, this will fail
    codeflash_output = sanitize_filename("file@name#2024!.txt") # 2.06μs -> 2.09μs (1.34% slower)

def test_mutation_sensitive_keeps_all_allowed():
    # If function removes any allowed, this will fail
    allowed = "abc-_.% 123"
    codeflash_output = sanitize_filename(allowed) # 1.84μs -> 1.79μs (2.80% faster)

def test_mutation_sensitive_does_not_add_characters():
    # Function should not add any characters
    codeflash_output = sanitize_filename("file.txt") # 1.43μs -> 1.50μs (4.21% slower)
    codeflash_output = sanitize_filename("file.txt!") # 938ns -> 793ns (18.3% faster)

def test_mutation_sensitive_order_preserved():
    # Output order must match input order of kept characters
    codeflash_output = sanitize_filename("1a2b3c-_.% ") # 1.79μs -> 1.79μs (0.000% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import string

# imports
import pytest
from skyvern.forge.sdk.api.files import sanitize_filename

# unit tests

# --- Basic Test Cases ---

def test_basic_alpha():
    # Only alphabetic characters should remain unchanged
    codeflash_output = sanitize_filename("filename") # 1.51μs -> 1.70μs (11.6% slower)
    codeflash_output = sanitize_filename("FileName") # 687ns -> 651ns (5.53% faster)

def test_basic_numeric():
    # Numeric characters should remain unchanged
    codeflash_output = sanitize_filename("file123") # 1.29μs -> 1.50μs (13.5% slower)
    codeflash_output = sanitize_filename("1234567890") # 895ns -> 827ns (8.22% faster)

def test_basic_mixed():
    # Alphanumeric and allowed symbols should remain
    codeflash_output = sanitize_filename("file-1_2.3% 4") # 1.91μs -> 1.82μs (5.23% faster)

def test_basic_disallowed_symbols():
    # Disallowed characters should be removed
    codeflash_output = sanitize_filename("file@name.txt") # 1.74μs -> 1.67μs (3.83% faster)
    codeflash_output = sanitize_filename("file!name#$.txt") # 1.12μs -> 1.00μs (11.6% faster)
    codeflash_output = sanitize_filename("file:name|.txt") # 816ns -> 674ns (21.1% faster)

def test_basic_spaces():
    # Spaces are allowed
    codeflash_output = sanitize_filename("file name.txt") # 1.61μs -> 1.64μs (1.77% slower)
    codeflash_output = sanitize_filename("   leading and trailing   ") # 1.84μs -> 1.61μs (14.1% faster)

# --- Edge Test Cases ---

def test_empty_string():
    # Empty string should return empty string
    codeflash_output = sanitize_filename("") # 970ns -> 813ns (19.3% faster)

def test_all_disallowed():
    # String of only disallowed characters should return empty string
    codeflash_output = sanitize_filename("@#$^&*()+=[]{};:'\",<>?/\\|~`") # 2.74μs -> 2.06μs (32.9% faster)

def test_only_allowed_symbols():
    # Only allowed symbols should remain
    codeflash_output = sanitize_filename("-_ .%") # 1.41μs -> 1.46μs (3.50% slower)

def test_unicode_characters():
    # str.isalnum() returns True for Unicode letters and digits, not just ASCII,
    # so these characters should be preserved
    codeflash_output = sanitize_filename("文件名") # 1.62μs -> 1.79μs (9.33% slower)
    codeflash_output = sanitize_filename("файл123") # 1.06μs -> 1.14μs (6.70% slower)
    codeflash_output = sanitize_filename("résumé.txt") # 1.18μs -> 1.06μs (10.3% faster)

def test_control_characters():
    # Control characters should be removed
    codeflash_output = sanitize_filename("file\nname\t.txt") # 1.78μs -> 1.67μs (6.69% faster)

def test_leading_trailing_spaces():
    # Spaces are preserved, including leading/trailing
    codeflash_output = sanitize_filename("  file name  ") # 1.76μs -> 1.73μs (1.73% faster)

def test_mixed_allowed_and_disallowed():
    # Mix of allowed and disallowed, ensure only allowed remain
    codeflash_output = sanitize_filename("my_file@2023!.txt") # 1.99μs -> 1.98μs (0.555% faster)

def test_dot_handling():
    # Dots are allowed anywhere
    codeflash_output = sanitize_filename("...file...name...") # 2.15μs -> 1.95μs (9.94% faster)

def test_percent_handling():
    # Percent is allowed
    codeflash_output = sanitize_filename("100%_real.txt") # 1.78μs -> 1.70μs (4.72% faster)

def test_dash_and_underscore():
    # Dashes and underscores are allowed
    codeflash_output = sanitize_filename("file-name_with-dash_and_underscore") # 2.42μs -> 2.40μs (0.793% faster)

def test_filename_with_multiple_spaces():
    # Multiple consecutive spaces are preserved
    codeflash_output = sanitize_filename("file   name  with   spaces") # 2.46μs -> 2.22μs (11.2% faster)

def test_filename_with_emoji():
    # Emojis are not alnum, so should be removed
    codeflash_output = sanitize_filename("file😀name.txt") # 2.15μs -> 2.19μs (1.60% slower)

def test_filename_with_surrogate_pairs():
    # Code points outside the BMP (such as emoji) should be removed if not alnum
    codeflash_output = sanitize_filename("file\U0001F600name.txt") # 1.99μs -> 1.89μs (5.17% faster)

def test_filename_with_newlines_and_tabs():
    # Newlines and tabs are not allowed
    codeflash_output = sanitize_filename("file\nname\t.txt") # 1.82μs -> 1.70μs (6.82% faster)

def test_filename_with_slashes():
    # Slashes are not allowed
    codeflash_output = sanitize_filename("file/name\\test.txt") # 1.96μs -> 1.74μs (12.5% faster)

def test_filename_with_quotes():
    # Quotes are not allowed
    codeflash_output = sanitize_filename('file"name\'test.txt') # 1.91μs -> 1.81μs (5.76% faster)

def test_filename_with_mixed_scripts():
    # Mixed scripts, only alnum and allowed symbols remain
    codeflash_output = sanitize_filename("文件-file_123.txt") # 2.47μs -> 2.42μs (2.11% faster)

def test_filename_with_tricky_unicode():
    # Combining marks are not alnum, so should be removed
    codeflash_output = sanitize_filename("file\u0301name.txt") # 2.06μs -> 1.97μs (4.31% faster)

# --- Large Scale Test Cases ---

def test_very_long_filename():
    # Very long filename with only allowed characters
    long_name = "a" * 1000 + ".txt"
    codeflash_output = sanitize_filename(long_name) # 26.6μs -> 21.6μs (23.5% faster)

def test_very_long_filename_with_disallowed():
    # Very long filename with interleaved disallowed characters
    allowed = "a" * 500 + ".txt"
    disallowed = "@#$%^&*" * 125
    mixed = "".join(a + b for a, b in zip(allowed, disallowed)) + allowed
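    # "%" is in the allowed set, so it survives; every other noise character is removed
    expected = "".join(a + (b if b == "%" else "") for a, b in zip(allowed, disallowed)) + allowed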
    codeflash_output = sanitize_filename(mixed) # 52.9μs -> 34.0μs (55.5% faster)

def test_large_variety_of_characters():
    # Filename with all printable ASCII chars
    all_chars = "".join(chr(i) for i in range(32, 127))
    expected = "".join(c for c in all_chars if c.isalnum() or c in ["-", "_", ".", "%", " "])
    codeflash_output = sanitize_filename(all_chars) # 4.46μs -> 4.33μs (3.12% faster)

def test_large_unicode_filename():
    # Large filename with a mix of unicode alnum and disallowed
    unicode_allowed = "文件" * 250  # 500 chars
    disallowed = "@#$%^&*" * 125  # 875 chars
    mixed = "".join(a + b for a, b in zip(unicode_allowed, disallowed))
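    # "%" is in the allowed set and is kept alongside the Unicode letters
    expected = "".join(a + (b if b == "%" else "") for a, b in zip(unicode_allowed, disallowed))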
    codeflash_output = sanitize_filename(mixed) # 46.8μs -> 30.3μs (54.2% faster)

def test_filename_with_all_ascii_letters_and_digits():
    # All ascii letters and digits should be preserved
    s = string.ascii_letters + string.digits
    codeflash_output = sanitize_filename(s) # 2.97μs -> 2.82μs (5.28% faster)

def test_filename_with_repeated_allowed_and_disallowed():
    # Mix of allowed and disallowed, repeated many times
    pattern = "abc-_.% 123@#$"
    expected = "abc-_.% 123" * 50
    test_str = pattern * 50
    codeflash_output = sanitize_filename(test_str) # 29.9μs -> 21.0μs (42.1% faster)

def test_filename_with_spaces_and_specials_large():
    # Large filename with spaces and special chars
    s = ("file name!@# " * 50).strip()
    expected = ("file name " * 50).strip()
    codeflash_output = sanitize_filename(s) # 26.7μs -> 17.8μs (50.5% faster)

# --- Defensive/Mutation Tests ---

def test_mutation_removal_of_dot():
    # If "." is removed from allowed, extension would be lost
    codeflash_output = sanitize_filename("file.name.txt") # 1.66μs -> 1.73μs (3.82% slower)

def test_mutation_removal_of_dash():
    # If "-" is removed from allowed, dash would be lost
    codeflash_output = sanitize_filename("file-name.txt") # 1.65μs -> 1.66μs (0.841% slower)

def test_mutation_removal_of_underscore():
    # If "_" is removed from allowed, underscore would be lost
    codeflash_output = sanitize_filename("file_name.txt") # 1.67μs -> 1.64μs (2.20% faster)

def test_mutation_removal_of_percent():
    # If "%" is removed from allowed, percent would be lost
    codeflash_output = sanitize_filename("file%name.txt") # 1.66μs -> 1.62μs (2.46% faster)

def test_mutation_removal_of_space():
    # If space is removed from allowed, spaces would be lost
    codeflash_output = sanitize_filename("file name.txt") # 1.67μs -> 1.63μs (2.02% faster)

def test_mutation_incorrectly_allowing_disallowed():
    # If a disallowed char is incorrectly allowed, it would appear in output
    codeflash_output = sanitize_filename("file@name.txt") # 1.76μs -> 1.65μs (6.85% faster)
    codeflash_output = sanitize_filename("file#name.txt") # 995ns -> 890ns (11.8% faster)

def test_mutation_incorrectly_removing_unicode_alnum():
    # If unicode alnum is incorrectly removed, output would be wrong
    codeflash_output = sanitize_filename("文件.txt") # 1.79μs -> 1.83μs (2.18% slower)

def test_mutation_incorrectly_allowing_control_chars():
    # If control chars are allowed, output would be wrong
    codeflash_output = sanitize_filename("file\nname.txt") # 1.65μs -> 1.61μs (2.30% faster)
    codeflash_output = sanitize_filename("file\tname.txt") # 973ns -> 838ns (16.1% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
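
Several of the Unicode cases above hinge on how `str.isalnum()` classifies characters, and a few inputs contain `%`, which is in the allowed set. The snippet below is a small standard-library illustration; `keep` is a hypothetical helper that mirrors the filtering rule in the tests' expected-value expressions, not the project's function.

```python
ALLOWED = {"-", "_", ".", "%", " "}


def keep(c: str) -> bool:
    # Hypothetical helper mirroring the rule in the tests' `expected` expressions.
    return c.isalnum() or c in ALLOWED


print("\u6587".isalnum())      # True: CJK ideograph
print("\u00e9".isalnum())      # True: precomposed accented letter
print("\u0301".isalnum())      # False: combining acute accent
print("\U0001F600".isalnum())  # False: emoji

print("".join(c for c in "\u6587\u4ef6-file_123.txt" if keep(c)))  # 文件-file_123.txt
print("".join(c for c in "100%_real?.txt" if keep(c)))             # 100%_real.txt
```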

To edit these changes `git checkout codeflash/optimize-sanitize_filename-mi7cfqef` and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 20, 2025 11:24
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Nov 20, 2025