Skip to content

Conversation

@codeflash-ai
Copy link

@codeflash-ai codeflash-ai bot commented Nov 20, 2025

📄 125% (1.25x) speedup for get_download_dir in skyvern/forge/sdk/api/files.py

⏱️ Runtime : 1.62 milliseconds 721 microseconds (best of 250 runs)

📝 Explanation and details

The optimized code achieves a 124% speedup through two key improvements:

1. Path Construction Optimization

  • Replaced f-string interpolation f"{REPO_ROOT_DIR}/downloads/{run_id}" with os.path.join(REPO_ROOT_DIR, "downloads", str(run_id))
  • While os.path.join appears slower in the profiler (5488ns vs 1077ns per hit), it provides cross-platform compatibility and proper path handling
  • The str(run_id) conversion ensures consistent string handling for None values

2. Directory Existence Check

  • Added if not os.path.isdir(download_dir): before os.makedirs()
  • This is the primary performance gain: os.makedirs() was taking 88.8% of execution time (4.07ms) in the original code
  • When directories already exist (common in repeated calls), the optimization completely skips the expensive makedirs syscall
  • The os.isdir() check takes only 33.1% of the optimized execution time (1.1ms) and is much faster than makedirs

Impact on Hot Path Usage
Based on the function references, get_download_dir is called from:

  • File download operations (get_downloaded_files)
  • S3 upload workflows (save_downloaded_files)
  • Workflow directory management

These are likely called multiple times with the same run_id during file processing workflows, making the directory existence check especially beneficial.

Test Case Performance
The optimization shows consistent 80-170% speedups across all test cases, with the largest gains when directories already exist (up to 169% for empty run_id). The "repeated calls" tests demonstrate the real-world benefit: second calls are 120-129% faster due to avoiding makedirs.

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 🔘 None Found
🌀 Generated Regression Tests 386 Passed
⏪ Replay Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage 100.0%
🌀 Generated Regression Tests and Runtime
import os
import shutil
# --- Setup for test: Patch REPO_ROOT_DIR for isolation ---
import sys
import tempfile
import types
from pathlib import Path

# imports
import pytest
from skyvern.constants import REPO_ROOT_DIR
from skyvern.forge.sdk.api.files import get_download_dir

# 1. Basic Test Cases

def test_basic_valid_run_id():
    """Test with a standard alphanumeric run_id."""
    run_id = "abc123"
    expected_path = os.path.join(REPO_ROOT_DIR, "downloads", run_id)
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 7.54μs -> 3.28μs (130% faster)

def test_basic_numeric_run_id():
    """Test with a numeric run_id."""
    run_id = "9876543210"
    expected_path = os.path.join(REPO_ROOT_DIR, "downloads", run_id)
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 7.93μs -> 3.38μs (135% faster)

def test_basic_run_id_with_dash_underscore():
    """Test with run_id containing dashes and underscores."""
    run_id = "run-2024_06"
    expected_path = os.path.join(REPO_ROOT_DIR, "downloads", run_id)
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 7.98μs -> 3.38μs (136% faster)

def test_basic_run_id_directory_already_exists():
    """Test where the directory already exists (should not raise)."""
    run_id = "existingdir"
    codeflash_output = get_download_dir(run_id); path = codeflash_output # 8.04μs -> 4.11μs (95.6% faster)
    # Create a file in the directory
    file_path = os.path.join(path, "test.txt")
    with open(file_path, "w") as f:
        f.write("test")
    # Call again, should not remove or overwrite
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 6.00μs -> 2.70μs (122% faster)

# 2. Edge Test Cases

def test_edge_empty_run_id():
    """Test with empty string as run_id."""
    run_id = ""
    expected_path = os.path.join(REPO_ROOT_DIR, "downloads", "")
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 8.99μs -> 3.34μs (169% faster)

def test_edge_none_run_id():
    """Test with None as run_id (should create .../downloads/None)."""
    run_id = None
    expected_path = os.path.join(REPO_ROOT_DIR, "downloads", "None")
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 8.33μs -> 3.58μs (133% faster)

def test_edge_special_characters_run_id():
    """Test with run_id containing special characters."""
    run_id = "!@# $%^&*()[]{};:,.<>?"
    expected_path = os.path.join(REPO_ROOT_DIR, "downloads", run_id)
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 8.26μs -> 3.58μs (131% faster)

def test_edge_run_id_with_slash():
    """Test with run_id containing slashes (should create nested dirs)."""
    run_id = "foo/bar/baz"
    expected_path = os.path.join(REPO_ROOT_DIR, "downloads", "foo", "bar", "baz")
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 8.45μs -> 3.79μs (123% faster)

def test_edge_run_id_with_dot_dot():
    """Test with run_id containing '..' (directory traversal attempt)."""
    run_id = "../traversal"
    expected_path = os.path.join(REPO_ROOT_DIR, "downloads", "../traversal")
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 8.48μs -> 3.63μs (134% faster)

def test_edge_run_id_extremely_long():
    """Test with a very long run_id (close to OS limit, but <255 chars)."""
    run_id = "a" * 200
    expected_path = os.path.join(REPO_ROOT_DIR, "downloads", run_id)
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 8.81μs -> 3.54μs (149% faster)

def test_large_scale_many_unique_run_ids():
    """Test creating many unique download directories (scalability)."""
    num_dirs = 200  # Reasonable for unit test, but large enough to check scalability
    run_ids = [f"run_{i:04d}" for i in range(num_dirs)]
    created_paths = set()
    for run_id in run_ids:
        codeflash_output = get_download_dir(run_id); path = codeflash_output # 753μs -> 331μs (127% faster)
        created_paths.add(path)
    for path in created_paths:
        pass

def test_large_scale_repeated_calls_same_run_id():
    """Test repeated calls with the same run_id (should not error, idempotent)."""
    run_id = "repeat_test"
    codeflash_output = get_download_dir(run_id); path1 = codeflash_output # 9.27μs -> 4.10μs (126% faster)
    codeflash_output = get_download_dir(run_id); path2 = codeflash_output # 4.20μs -> 1.91μs (120% faster)
    codeflash_output = get_download_dir(run_id); path3 = codeflash_output # 3.75μs -> 1.64μs (129% faster)

def test_large_scale_longest_legal_path():
    """Test with a run_id that makes the total path close to OS max (255 chars)."""
    # Compute max run_id length: 255 - len(REPO_ROOT_DIR/downloads/) - 1 for slash
    base_path = os.path.join(REPO_ROOT_DIR, "downloads")
    max_total = 240  # Conservative for most filesystems
    run_id_len = max_total - len(base_path)
    run_id = "r" * max(1, run_id_len)
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 8.39μs -> 3.68μs (128% faster)

def test_large_scale_nested_run_ids():
    """Test with deeply nested run_id (using slashes)."""
    run_id = "/".join([f"level{i}" for i in range(10)])
    expected_path = os.path.join(REPO_ROOT_DIR, "downloads", *[f"level{i}" for i in range(10)])
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 9.09μs -> 4.11μs (121% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import os
import shutil
import tempfile
from pathlib import Path

# imports
import pytest  # used for our unit tests
from skyvern.forge.sdk.api.files import get_download_dir

# For the purpose of testing, we define REPO_ROOT_DIR here.
# In real usage, this would be imported from skyvern.constants.
REPO_ROOT_DIR = tempfile.gettempdir() + "/skyvern_repo_root"
from skyvern.forge.sdk.api.files import get_download_dir

# ---------------- BASIC TEST CASES ----------------

def test_basic_valid_run_id_str():
    """Test with a typical run_id string."""
    run_id = "123abc"
    expected_path = f"{REPO_ROOT_DIR}/downloads/{run_id}"
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 8.17μs -> 3.83μs (113% faster)

def test_basic_valid_run_id_numeric():
    """Test with a numeric run_id."""
    run_id = "456789"
    expected_path = f"{REPO_ROOT_DIR}/downloads/{run_id}"
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 8.67μs -> 4.16μs (108% faster)

def test_basic_valid_run_id_special_chars():
    """Test with a run_id containing special characters."""
    run_id = "run_!@#"
    expected_path = f"{REPO_ROOT_DIR}/downloads/{run_id}"
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 8.59μs -> 4.25μs (102% faster)

def test_basic_run_id_none():
    """Test with run_id as None (should use string 'None')."""
    run_id = None
    expected_path = f"{REPO_ROOT_DIR}/downloads/{run_id}"
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 8.65μs -> 4.39μs (97.0% faster)

def test_basic_run_id_empty_string():
    """Test with empty string as run_id."""
    run_id = ""
    expected_path = f"{REPO_ROOT_DIR}/downloads/{run_id}"
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 9.32μs -> 4.25μs (119% faster)

# ---------------- EDGE TEST CASES ----------------

def test_edge_run_id_long_string():
    """Test with a very long run_id string."""
    run_id = "a" * 255  # Typical max filename length
    expected_path = f"{REPO_ROOT_DIR}/downloads/{run_id}"
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 9.16μs -> 4.49μs (104% faster)

def test_edge_run_id_unicode():
    """Test with Unicode characters in run_id."""
    run_id = "下载_测试_🌟"
    expected_path = f"{REPO_ROOT_DIR}/downloads/{run_id}"
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 9.64μs -> 5.22μs (84.7% faster)

def test_edge_run_id_path_traversal():
    """Test with run_id containing path traversal characters."""
    run_id = "../outside"
    expected_path = f"{REPO_ROOT_DIR}/downloads/{run_id}"
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 8.90μs -> 4.69μs (89.8% faster)

def test_edge_run_id_reserved_names():
    """Test with reserved names (Windows) as run_id."""
    run_id = "CON"
    expected_path = f"{REPO_ROOT_DIR}/downloads/{run_id}"
    try:
        codeflash_output = get_download_dir(run_id); result = codeflash_output
    except OSError:
        # On Windows, creating 'CON' as a folder may fail
        pytest.skip("Reserved name on Windows, cannot create directory.")

def test_edge_run_id_with_slashes():
    """Test with slashes in run_id (creates nested directories)."""
    run_id = "foo/bar/baz"
    expected_path = f"{REPO_ROOT_DIR}/downloads/{run_id}"
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 8.78μs -> 4.54μs (93.2% faster)

def test_edge_run_id_dot():
    """Test with run_id as '.' (current directory)."""
    run_id = "."
    expected_path = f"{REPO_ROOT_DIR}/downloads/{run_id}"
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 8.45μs -> 4.46μs (89.7% faster)

def test_edge_run_id_dotdot():
    """Test with run_id as '..' (parent directory)."""
    run_id = ".."
    expected_path = f"{REPO_ROOT_DIR}/downloads/{run_id}"
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 8.38μs -> 4.46μs (87.8% faster)

def test_edge_run_id_whitespace():
    """Test with run_id containing whitespace."""
    run_id = "   "
    expected_path = f"{REPO_ROOT_DIR}/downloads/{run_id}"
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 8.50μs -> 4.40μs (93.1% faster)

def test_edge_run_id_newline():
    """Test with run_id containing newline characters."""
    run_id = "runid\nnextline"
    expected_path = f"{REPO_ROOT_DIR}/downloads/{run_id}"
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 8.59μs -> 4.26μs (102% faster)

def test_edge_run_id_tab():
    """Test with run_id containing tab characters."""
    run_id = "runid\twithtab"
    expected_path = f"{REPO_ROOT_DIR}/downloads/{run_id}"
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 8.54μs -> 4.28μs (99.6% faster)

def test_edge_run_id_none_string():
    """Test with run_id as the string 'None'."""
    run_id = "None"
    expected_path = f"{REPO_ROOT_DIR}/downloads/{run_id}"
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 8.49μs -> 4.28μs (98.6% faster)

# ---------------- LARGE SCALE TEST CASES ----------------

def test_large_scale_many_run_ids():
    """Test creating many download directories for different run_ids."""
    run_ids = [f"run_{i}" for i in range(100)]  # 100 unique run_ids
    created_dirs = []
    for run_id in run_ids:
        codeflash_output = get_download_dir(run_id); path = codeflash_output # 380μs -> 167μs (128% faster)
        created_dirs.append(path)
    for path in created_dirs:
        pass

def test_large_scale_long_nested_run_id():
    """Test with a run_id that creates a deeply nested directory structure."""
    # 20 levels deep, each level named 'deep'
    run_id = "/".join(["deep"] * 20)
    expected_path = f"{REPO_ROOT_DIR}/downloads/{run_id}"
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 10.2μs -> 5.36μs (90.1% faster)
    # Check all intermediate directories exist
    parent = f"{REPO_ROOT_DIR}/downloads"
    for i in range(1, 21):
        parent = os.path.join(parent, "deep")

def test_large_scale_repeated_calls_same_run_id():
    """Test that repeated calls with the same run_id do not raise errors and return the same path."""
    run_id = "repeat_test"
    codeflash_output = get_download_dir(run_id); path1 = codeflash_output # 7.80μs -> 4.16μs (87.3% faster)
    codeflash_output = get_download_dir(run_id); path2 = codeflash_output # 4.27μs -> 1.93μs (121% faster)

def test_large_scale_repeated_calls_different_run_ids():
    """Test that repeated calls with different run_ids create all directories."""
    run_ids = [f"run_{i}" for i in range(50)]
    for run_id in run_ids:
        codeflash_output = get_download_dir(run_id); path = codeflash_output # 192μs -> 83.5μs (130% faster)
    # All directories should exist
    for run_id in run_ids:
        path = f"{REPO_ROOT_DIR}/downloads/{run_id}"

def test_large_scale_run_id_max_path_length():
    """Test with a run_id that brings the full path close to OS limits."""
    # Most OSes have a max path length of 255-4096, but we keep it safe
    base_length = len(f"{REPO_ROOT_DIR}/downloads/")
    max_run_id_length = 200  # Should be safe for most systems
    run_id = "x" * max_run_id_length
    expected_path = f"{REPO_ROOT_DIR}/downloads/{run_id}"
    codeflash_output = get_download_dir(run_id); result = codeflash_output # 8.58μs -> 4.17μs (106% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-get_download_dir-mi7c8b33 and push.

Codeflash

The optimized code achieves a **124% speedup** through two key improvements:

**1. Path Construction Optimization**
- Replaced f-string interpolation `f"{REPO_ROOT_DIR}/downloads/{run_id}"` with `os.path.join(REPO_ROOT_DIR, "downloads", str(run_id))`
- While `os.path.join` appears slower in the profiler (5488ns vs 1077ns per hit), it provides cross-platform compatibility and proper path handling
- The `str(run_id)` conversion ensures consistent string handling for None values

**2. Directory Existence Check**
- Added `if not os.path.isdir(download_dir):` before `os.makedirs()` 
- This is the **primary performance gain**: `os.makedirs()` was taking 88.8% of execution time (4.07ms) in the original code
- When directories already exist (common in repeated calls), the optimization completely skips the expensive `makedirs` syscall
- The `os.isdir()` check takes only 33.1% of the optimized execution time (1.1ms) and is much faster than `makedirs`

**Impact on Hot Path Usage**
Based on the function references, `get_download_dir` is called from:
- File download operations (`get_downloaded_files`)
- S3 upload workflows (`save_downloaded_files`) 
- Workflow directory management

These are likely called multiple times with the same `run_id` during file processing workflows, making the directory existence check especially beneficial.

**Test Case Performance**
The optimization shows consistent 80-170% speedups across all test cases, with the largest gains when directories already exist (up to 169% for empty run_id). The "repeated calls" tests demonstrate the real-world benefit: second calls are 120-129% faster due to avoiding `makedirs`.
@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 20, 2025 11:18
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Nov 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

⚡️ codeflash Optimization PR opened by Codeflash AI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant