
Conversation

@Cunxiao2002 Cunxiao2002 commented Oct 3, 2025

Summary by CodeRabbit

  • New Features

    • GEMM example now runs an integrated benchmark and prints measured latency.
    • Profiler benchmarking adds selectable backends (including CUPTI), quieter profiling output, configurable quantiles, and selectable return modes (min/max/mean/median).
  • Tests

    • Added end-to-end profiler tests that run and validate GEMM profiling across backends.
  • Documentation

    • Adapter description updated to indicate the adapter uses cython rather than ctypes.

coderabbitai bot commented Oct 3, 2025

Walkthrough

Adds runtime benchmarking to the GEMM example and extends the profiler with two backends ("event", "cupti"), new options (backend, quantiles, return_mode), CUPTI stdout/stderr suppression, corresponding tests, and a minor docstring wording change.

Changes

  • Example benchmarking usage (examples/gemm/example_gemm.py): adds runtime benchmarking by obtaining the kernel profiler, calling do_bench on gemm_kernel, and printing the measured latency (keeps an alternative commented call).
  • Profiler public surface (tilelang/profiler/__init__.py): adds a Literal import and expands the do_bench/Profiler.do_bench signatures to accept and propagate backend, quantiles, and return_mode.
  • Benchmark engines & utilities (tilelang/profiler/bench.py): adds a suppress_stdout_stderr utility, implements backend dispatch for do_bench with "event" and "cupti" paths, adds _bench_with_cuda_events and _bench_with_cupti, introduces an L2-flush buffer and quantiles/return-mode handling, and updates returns/errors accordingly.
  • Tests: profiler integration (testing/python/profiler/test_tilelang_profiler.py): a new test module that defines a tiled GEMM kernel, validates correctness against PyTorch, and exercises both the cupti and event backends via the kernel's profiler, printing latency.
  • Docs / wording (tilelang/jit/adapter/cython/adapter.py): minor docstring wording update replacing "ctypes" with "cython".

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User as Example Script
  participant K as gemm_kernel
  participant P as Profiler
  participant E as EventBench
  participant C as CUPTIBench
  participant CUDA as CUDA / NSYS Runtime

  User->>K: build kernel
  User->>P: profiler = kernel.get_profiler()
  User->>P: profiler.do_bench(func=K, backend=...)
  alt backend == "event"
    P->>E: _bench_with_cuda_events(func=K, ...)
    E->>CUDA: warmup + timed iterations (CUDA events)
    CUDA-->>E: event timings
    E-->>P: aggregated latency (quantiles/return_mode)
  else backend == "cupti"
    P->>C: _bench_with_cupti(func=K, ...)
    C->>CUDA: run iterations (stdout/stderr suppressed)
    CUDA-->>C: per-kernel records
    C-->>P: per-kernel latency aggregated (quantiles/return_mode)
  end
  P-->>User: latency value(s)
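A minimal usage sketch of the flow above, based on the API surface described in this PR (kernel.get_profiler() and do_bench with backend, quantiles, and return_mode); the kernel object is assumed to be an already compiled TileLang kernel such as the GEMM in examples/gemm/example_gemm.py, and the argument names follow the walkthrough and tests rather than a verified signature:

# `kernel` is assumed: a compiled TileLang kernel (e.g. the GEMM from examples/gemm/example_gemm.py)
profiler = kernel.get_profiler()

# CUDA-event timing (default backend), aggregated with the requested return_mode
event_ms = profiler.do_bench(backend="event", return_mode="mean")

# CUPTI/Kineto timing: sums per-kernel device time, excluding launch overhead
cupti_ms = profiler.do_bench(backend="cupti")

# The event backend can also report quantiles instead of a single aggregate
p50, p95 = profiler.do_bench(backend="event", quantiles=[0.5, 0.95])

print(f"event: {event_ms:.4f} ms, cupti: {cupti_ms:.4f} ms, p50/p95: {p50:.4f}/{p95:.4f} ms")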

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Suggested reviewers

  • LeiWang1999
  • tzj-fxz

Poem

I hopped through kernels, soft and spry,
Timers ticking as the flops fly by.
Event or cupti, quiet and keen,
I gather latencies, tidy and clean.
Carrot-count printed — a small performance scene. 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title clearly and concisely conveys the primary change (adding CUPTI support to the profiler) without unnecessary detail or vague wording.
✨ Finishing touches
  • 📝 Generate docstrings

github-actions bot commented Oct 3, 2025

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run bash format.sh in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work!

🚀

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tilelang/profiler/__init__.py (1)

218-233: Fix API typing + defaults: avoid overriding dynamic warmup/repeat.

  • input_tensors should be Optional.
  • kernel_names should be Optional[Union[str, Tuple[str, ...]]].
  • trace_path should be Optional[str].
  • Return type isn’t always float (event quantiles => List[float]; kineto multi-kernel => Tuple[float, ...]).
  • n_warmup/n_repeat should default to 0 to enable dynamic estimation (current 1 forces single-iteration timing).

Apply:

-from typing import List, Optional, Callable, Any, Literal
+from typing import List, Optional, Callable, Any, Literal, Union, Tuple
@@
     def do_bench(
         self,
         func: Optional[Callable] = None,
         warmup: int = 25,
         rep: int = 100,
-        n_warmup: int = 1,
-        n_repeat: int = 1,
-        input_tensors: List[torch.Tensor] = None,
-        kernel_names=None,
+        n_warmup: int = 0,
+        n_repeat: int = 0,
+        input_tensors: Optional[List[torch.Tensor]] = None,
+        kernel_names: Optional[Union[str, Tuple[str, ...]]] = None,
         grad_to_none: Optional[List[torch.Tensor]] = None,
         quantiles: Optional[List[float]] = None,
         return_mode: Literal["min", "max", "mean", "median"] = "mean",
-        trace_path: str = None,
+        trace_path: Optional[str] = None,
         suppress_kineto_output: bool = False,
         with_multiple_kernels: bool = False,
-    ) -> float:
+    ) -> Union[float, List[float], Tuple[float, ...]]:

Also update the docstring “Returns” accordingly. [Based on learnings]

🧹 Nitpick comments (6)
tilelang/profiler/__init__.py (1)

255-275: Forward parity and doc correctness when using Kineto path.

  • do_bench may return a tuple when kernel_names is a tuple; update doc/typing as suggested above.
  • If you add grad clearing to Kineto (recommended for backward benches), forward grad_to_none to do_bench_kineto.

If Kineto supports grad clearing:

-            ) if kernel_names is None else do_bench_kineto(
+            ) if kernel_names is None else do_bench_kineto(
                 bench_func,
                 kernel_names=kernel_names,
                 warmup=warmup,
                 rep=rep,
                 _n_warmup=n_warmup,
                 _n_repeat=n_repeat,
                 return_mode=return_mode,
                 trace_path=trace_path,
                 suppress_kineto_output=suppress_kineto_output,
                 with_multiple_kernels=with_multiple_kernels,
+                grad_to_none=grad_to_none,
             )
tilelang/profiler/bench.py (4)

114-148: Make stdout/stderr suppression robust on non-filelike streams.

sys.stdout/err may lack fileno() (e.g., Jupyter), raising exceptions. Fall back gracefully instead of crashing.

+import io
@@
 class suppress_stdout_stderr:
 
     def __enter__(self):
-        self.outnull_file = open(os.devnull, 'w')
-        self.errnull_file = open(os.devnull, 'w')
-
-        self.old_stdout_fileno_undup = sys.stdout.fileno()
-        self.old_stderr_fileno_undup = sys.stderr.fileno()
-
-        self.old_stdout_fileno = os.dup(sys.stdout.fileno())
-        self.old_stderr_fileno = os.dup(sys.stderr.fileno())
-
-        self.old_stdout = sys.stdout
-        self.old_stderr = sys.stderr
-
-        os.dup2(self.outnull_file.fileno(), self.old_stdout_fileno_undup)
-        os.dup2(self.errnull_file.fileno(), self.old_stderr_fileno_undup)
-
-        sys.stdout = self.outnull_file
-        sys.stderr = self.errnull_file
-        return self
+        self._disabled = False
+        try:
+            self.outnull_file = open(os.devnull, 'w')
+            self.errnull_file = open(os.devnull, 'w')
+
+            self.old_stdout_fileno_undup = sys.stdout.fileno()
+            self.old_stderr_fileno_undup = sys.stderr.fileno()
+
+            self.old_stdout_fileno = os.dup(sys.stdout.fileno())
+            self.old_stderr_fileno = os.dup(sys.stderr.fileno())
+
+            self.old_stdout = sys.stdout
+            self.old_stderr = sys.stderr
+
+            os.dup2(self.outnull_file.fileno(), self.old_stdout_fileno_undup)
+            os.dup2(self.errnull_file.fileno(), self.old_stderr_fileno_undup)
+
+            sys.stdout = self.outnull_file
+            sys.stderr = self.errnull_file
+        except (AttributeError, io.UnsupportedOperation, OSError):
+            # Fall back gracefully when fileno() is unavailable
+            self._disabled = True
+        return self
 
     def __exit__(self, *_):
-        sys.stdout = self.old_stdout
-        sys.stderr = self.old_stderr
-
-        os.dup2(self.old_stdout_fileno, self.old_stdout_fileno_undup)
-        os.dup2(self.old_stderr_fileno, self.old_stderr_fileno_undup)
-
-        os.close(self.old_stdout_fileno)
-        os.close(self.old_stderr_fileno)
-
-        self.outnull_file.close()
-        self.errnull_file.close()
+        if getattr(self, "_disabled", False):
+            return
+        sys.stdout = self.old_stdout
+        sys.stderr = self.old_stderr
+        os.dup2(self.old_stdout_fileno, self.old_stdout_fileno_undup)
+        os.dup2(self.old_stderr_fileno, self.old_stderr_fileno_undup)
+        os.close(self.old_stdout_fileno)
+        os.close(self.old_stderr_fileno)
+        self.outnull_file.close()
+        self.errnull_file.close()

200-209: Support gradient clearing in Kineto path (parity with event timing).

Avoid gradient accumulation for backward benches.

-        with profiler:
+        with profiler:
             for _i in range(2):
                 for _ in range(n_repeat):
                     cache.zero_()
+                    # clear grads if provided (backward compatibility with event path)
+                    # grad_to_none: Optional[List[torch.Tensor]]
+                    try:
+                        grad_list = grad_to_none  # if parameter is added
+                    except NameError:
+                        grad_list = None
+                    if grad_list is not None:
+                        for x in grad_list:
+                            x.grad = None
                     fn()

Note: If you adopt this, add grad_to_none: Optional[List[torch.Tensor]] = None to the function signature and forward it from the caller.


221-229: Remove unused noqa and harden kernel name validation.

  • The # noqa: SIM101 is flagged as unused by Ruff. Remove it.
  • Consider accepting Sequence[str] instead of only tuple for ergonomics (optional).
-    assert isinstance(kernel_names, str) or isinstance(kernel_names, tuple)  # noqa: SIM101
+    assert isinstance(kernel_names, str) or isinstance(kernel_names, tuple)

236-251: Expand unit parsing to include seconds and nanoseconds.

Covers more profiler formats and avoids silent mis-scaling.

-    units = {'ms': 1e3, 'us': 1e6}
+    units = {'s': 1, 'ms': 1e3, 'us': 1e6, 'ns': 1e9}
examples/gemm/example_gemm.py (1)

54-59: Suppress profiler chatter in example output.

Default Kineto may print noisy lines in some setups. Suppress for a cleaner example.

-    latency = profiler.do_bench(kernel_names="gemm_kernel")
+    latency = profiler.do_bench(kernel_names="gemm_kernel", suppress_kineto_output=True)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5ccac4f and e5e77eb.

📒 Files selected for processing (3)
  • examples/gemm/example_gemm.py (1 hunks)
  • tilelang/profiler/__init__.py (4 hunks)
  • tilelang/profiler/bench.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
examples/gemm/example_gemm.py (2)
tilelang/jit/kernel.py (1)
  • get_profiler (360-376)
tilelang/profiler/__init__.py (1)
  • do_bench (218-296)
tilelang/profiler/__init__.py (1)
tilelang/profiler/bench.py (2)
  • do_bench_event (9-102)
  • do_bench_kineto (151-253)
tilelang/profiler/bench.py (1)
tilelang/env.py (1)
  • get (136-139)
🪛 Ruff (0.13.2)
tilelang/profiler/__init__.py

230-230: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)

tilelang/profiler/bench.py

160-160: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


221-221: Unused noqa directive (non-enabled: SIM101)

Remove unused noqa directive

(RUF100)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-test-amd
  • GitHub Check: bot-task

Comment on lines 150 to 164
# from https://github.com/deepseek-ai/DeepGEMM/blob/main/deep_gemm/testing/bench.py
def do_bench_kineto(
    fn: Callable,
    kernel_names,
    warmup: float = 25,
    rep: float = 100,
    _n_warmup: int = 0,
    _n_repeat: int = 0,
    fast_flush: bool = True,
    return_mode: Literal["min", "max", "mean", "median"] = "mean",
    trace_path: str = None,
    suppress_kineto_output: bool = False,
    with_multiple_kernels: bool = False,
) -> Union[float, List[float]]:
    """Benchmarks the runtime of a PyTorch function using Kineto."""

🛠️ Refactor suggestion | 🟠 Major

Correct Kineto API typing and Optional usage (RUF013).

  • kernel_names supports str or tuple; return can be float or tuple.
  • trace_path should be Optional[str].
-from typing import Callable, List, Literal, Optional, Union
+from typing import Callable, List, Literal, Optional, Union, Tuple
@@
-def do_bench_kineto(
-    fn: Callable,
-    kernel_names,
+def do_bench_kineto(
+    fn: Callable,
+    kernel_names: Union[str, Tuple[str, ...]],
@@
-    return_mode: Literal["min", "max", "mean", "median"] = "mean",
-    trace_path: str = None,
+    return_mode: Literal["min", "max", "mean", "median"] = "mean",
+    trace_path: Optional[str] = None,
     suppress_kineto_output: bool = False,
     with_multiple_kernels: bool = False,
-) -> Union[float, List[float]]:
+) -> Union[float, Tuple[float, ...]]:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# from https://github.com/deepseek-ai/DeepGEMM/blob/main/deep_gemm/testing/bench.py
def do_bench_kineto(
    fn: Callable,
    kernel_names,
    warmup: float = 25,
    rep: float = 100,
    _n_warmup: int = 0,
    _n_repeat: int = 0,
    fast_flush: bool = True,
    return_mode: Literal["min", "max", "mean", "median"] = "mean",
    trace_path: str = None,
    suppress_kineto_output: bool = False,
    with_multiple_kernels: bool = False,
) -> Union[float, List[float]]:
    """Benchmarks the runtime of a PyTorch function using Kineto."""

from typing import Callable, List, Literal, Optional, Union, Tuple
# from https://github.com/deepseek-ai/DeepGEMM/blob/main/deep_gemm/testing/bench.py
def do_bench_kineto(
    fn: Callable,
    kernel_names: Union[str, Tuple[str, ...]],
    warmup: float = 25,
    rep: float = 100,
    _n_warmup: int = 0,
    _n_repeat: int = 0,
    fast_flush: bool = True,
    return_mode: Literal["min", "max", "mean", "median"] = "mean",
    trace_path: Optional[str] = None,
    suppress_kineto_output: bool = False,
    with_multiple_kernels: bool = False,
) -> Union[float, Tuple[float, ...]]:
    """Benchmarks the runtime of a PyTorch function using Kineto."""
🧰 Tools
🪛 Ruff (0.13.2)

160-160: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)

🤖 Prompt for AI Agents
In tilelang/profiler/bench.py around lines 150 to 164, the function signature
uses overly narrow typing: update kernel_names to accept either a string or a
tuple of strings (e.g., Union[str, Tuple[str, ...]]), change the return
annotation to reflect that it may return a float or a tuple of floats (e.g.,
Union[float, Tuple[float, ...], List[float]]), and make trace_path
Optional[str]; also add any missing typing imports (Optional, Tuple, Union) at
top of the file. Ensure no runtime behavior changes, only update the type hints
and imports.

@LeiWang1999
Member

It would be better if we could combine do_bench_kineto and do_bench_event into a single do_bench API.
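A rough sketch of what such a unified entry point could look like, dispatching on a backend argument to the existing helpers; do_bench_event and do_bench_kineto are the helpers from tilelang/profiler/bench.py, and the exact signatures here are illustrative rather than the final implementation:

from typing import Callable, List, Literal, Optional, Union


def do_bench(
    fn: Callable,
    warmup: float = 25,
    rep: float = 100,
    quantiles: Optional[List[float]] = None,
    return_mode: Literal["min", "max", "mean", "median"] = "mean",
    backend: Literal["event", "cupti"] = "event",
    **backend_kwargs,
) -> Union[float, List[float]]:
    # One public API; `backend` selects the timing strategy.
    if backend == "event":
        # CUDA-event timing: supports quantiles and return_mode aggregation.
        return do_bench_event(
            fn, warmup=warmup, rep=rep, quantiles=quantiles,
            return_mode=return_mode, **backend_kwargs)
    if backend == "cupti":
        # Kineto/CUPTI timing: per-kernel device time (kernel_names etc. via kwargs).
        return do_bench_kineto(
            fn, warmup=warmup, rep=rep, return_mode=return_mode, **backend_kwargs)
    raise ValueError(f"Unknown profiler backend: {backend}")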

@Cunxiao2002 Cunxiao2002 closed this Oct 3, 2025
@Cunxiao2002 Cunxiao2002 reopened this Oct 3, 2025
@Cunxiao2002 Cunxiao2002 marked this pull request as draft October 3, 2025 17:38
@Cunxiao2002 Cunxiao2002 marked this pull request as ready for review October 4, 2025 09:29
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
tilelang/profiler/bench.py (2)

159-160: Fix Optional typing for trace_path.

trace_path defaults to None, so the annotation must be Optional[str] to satisfy PEP 484 and Ruff (RUF013). Update the signature accordingly.


270-270: Restore trailing newline.

CI is failing with Ruff W292 because the file lacks a trailing newline. Please add the terminal newline so formatting passes.

tilelang/profiler/__init__.py (1)

230-231: Align trace_path annotation with default.

trace_path defaults to None; annotate it as Optional[str] to silence Ruff (RUF013) and accurately describe the API.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e5e77eb and 17bdaba.

📒 Files selected for processing (4)
  • examples/gemm/example_gemm.py (1 hunks)
  • tilelang/jit/adapter/cython/adapter.py (1 hunks)
  • tilelang/profiler/__init__.py (3 hunks)
  • tilelang/profiler/bench.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tilelang/jit/adapter/cython/adapter.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/gemm/example_gemm.py
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/profiler/bench.py (1)
tilelang/profiler/__init__.py (1)
  • do_bench (218-285)
🪛 GitHub Actions: CI Test on AMD
tilelang/profiler/bench.py

[error] 270-270: Ruff: W292 No newline at end of file. 1 fixable with the --fix option.

🪛 Ruff (0.13.3)
tilelang/profiler/bench.py

159-159: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


247-247: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)

tilelang/profiler/__init__.py

227-227: Unused method argument: grad_to_none

(ARG002)


228-228: Unused method argument: quantiles

(ARG002)


230-230: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: format-check

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
tilelang/profiler/bench.py (1)

159-159: Fix implicit Optional type hint.

The trace_path parameter should be explicitly typed as Optional[str] to comply with PEP 484.

Apply this diff:

-    trace_path: str = None,
+    trace_path: Optional[str] = None,
🧹 Nitpick comments (2)
tilelang/profiler/bench.py (2)

151-162: Enhance docstring to document Kineto-specific behavior.

The docstring is minimal. Consider documenting the Kineto/CUPTI-specific behavior, the purpose of trace_path, suppress_kineto_output, and the handling of the DG_NSYS_PROFILING environment variable.


236-270: Add docstring to document dispatcher behavior.

The do_bench function now acts as a dispatcher but lacks a docstring. Consider adding documentation that explains:

  • The backend parameter and its effect on profiling method
  • Which parameters are specific to each backend (e.g., grad_to_none and quantiles only apply to "event" backend, trace_path and suppress_kineto_output only apply to "cupti" backend)
  • The return value and how it differs between backends

Example docstring structure:

def do_bench(
    fn: Callable,
    warmup: float = 25,
    rep: float = 100,
    _n_warmup: int = 0,
    _n_repeat: int = 0,
    grad_to_none: Optional[List[torch.Tensor]] = None,
    quantiles: Optional[List[float]] = None,
    fast_flush: bool = True,
    backend: Literal["event", "cupti"] = "event",
    return_mode: Literal["min", "max", "mean", "median"] = "mean",
    trace_path: Optional[str] = None,
    suppress_kineto_output: bool = False,
) -> Union[float, List[float]]:
    """Benchmarks the runtime of a PyTorch function.
    
    Dispatches to either event-based or Kineto/CUPTI-based profiling
    depending on the backend parameter.
    
    Args:
        fn: Function to benchmark
        warmup: Target warmup time in milliseconds
        rep: Target number of repetitions
        _n_warmup: Override for number of warmup iterations
        _n_repeat: Override for number of timing iterations
        grad_to_none: Tensors whose gradients should be cleared (event backend only)
        quantiles: Performance percentiles to compute (event backend only)
        fast_flush: Whether to use faster L2 cache flushing
        backend: Profiling backend ("event" for CUDA events, "cupti" for Kineto/CUPTI)
        return_mode: How to aggregate timing results
        trace_path: Path to save chrome trace (cupti backend only)
        suppress_kineto_output: Whether to suppress Kineto output (cupti backend only)
        
    Returns:
        Aggregated runtime in milliseconds, or list of quantiles if requested
    """
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 17bdaba and 5b721cc.

📒 Files selected for processing (1)
  • tilelang/profiler/bench.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/profiler/bench.py (1)
tilelang/profiler/__init__.py (1)
  • do_bench (218-285)
🪛 Ruff (0.13.3)
tilelang/profiler/bench.py

159-159: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


247-247: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-test-amd
  • GitHub Check: format-check
  • GitHub Check: bot-task
🔇 Additional comments (4)
tilelang/profiler/bench.py (4)

5-6: LGTM!

The new imports are necessary for the output suppression functionality.


9-102: LGTM!

The refactoring to make this an internal function is appropriate. The enhanced docstring provides clear documentation of the function's behavior and parameters.


105-147: LGTM!

The context managers are well-implemented. The suppress_stdout_stderr class properly manages file descriptors with duplication and restoration. The empty_suppress no-op context manager is a clean pattern for conditional suppression.


215-216: Clarify NSYS profiling return value. Returning a hardcoded 1 when using_nsys is enabled is ambiguous; document this sentinel value or replace it with a meaningful indicator (e.g., float('nan')) or raise NotImplementedError.
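If the sentinel route is taken, a small sketch of the guard, written as an excerpt in the style of the surrounding do_bench code; the choice between NaN and an exception, and the message text, are illustrative:

    if using_nsys:
        # Timing is collected externally by nsys, so no in-process latency exists.
        # Returning float('nan') would keep the float return type; raising makes the
        # limitation explicit instead of handing back a misleading hardcoded 1.
        raise NotImplementedError(
            "backend='cupti' under nsys profiling does not produce an in-process latency")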

fast_flush: bool = True,
backend: Literal["event", "cupti"] = "event",
return_mode: Literal["min", "max", "mean", "median"] = "mean",
trace_path: str = None,

⚠️ Potential issue | 🟡 Minor

Fix implicit Optional type hint.

The trace_path parameter should be explicitly typed as Optional[str] to comply with PEP 484.

Apply this diff:

-    trace_path: str = None,
+    trace_path: Optional[str] = None,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
trace_path: str = None,
trace_path: Optional[str] = None,
🧰 Tools
🪛 Ruff (0.13.3)

247-247: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)

🤖 Prompt for AI Agents
In tilelang/profiler/bench.py around line 247, the parameter trace_path is
currently declared as `trace_path: str = None`, which creates an implicit
Optional; change its annotation to `trace_path: Optional[str] = None` and ensure
`from typing import Optional` is imported at the top of the module (add the
import if missing). This makes the optional nature explicit and compliant with
PEP 484.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
tilelang/profiler/bench.py (2)

9-17: Consider using contextlib.nullcontext for the no-op case.

While the custom empty_suppress implementation is correct, Python's standard library provides contextlib.nullcontext for this exact use case, which would eliminate the need for custom code.

Apply this diff to use the standard library:

-# from https://github.com/deepseek-ai/DeepGEMM/blob/main/deep_gemm/testing/bench.py
-class empty_suppress:
-
-    def __enter__(self):
-        return self
-
-    def __exit__(self, *_):
-        pass
+from contextlib import nullcontext as empty_suppress

Then update lines 157 and 163 to use it:

-        suppress = suppress_stdout_stderr if not using_nsys else empty_suppress
+        suppress = suppress_stdout_stderr if not using_nsys else empty_suppress
         with suppress():
             schedule = torch.profiler.schedule(
-                wait=1, warmup=0, active=1, repeat=1) if not using_nsys else None
+                wait=1, warmup=0, active=1, repeat=1) if not using_nsys else None
             profiler = torch.profiler.profile(
                 activities=[torch.profiler.ProfilerActivity.CUDA],
-                schedule=schedule) if not using_nsys else empty_suppress()
+                schedule=schedule) if not using_nsys else empty_suppress()

189-190: Consider extracting error message to comply with TRY003.

While the error handling is functionally correct, the static analysis tool flags the long error message inline. This is a minor style issue.

Apply this diff to comply with TRY003:

+    UNKNOWN_BACKEND_MSG = "Unknown profiler backend: {}"
+
     else:
-        raise ValueError(f"Unknown profiler backend: {backend}")
+        raise ValueError(UNKNOWN_BACKEND_MSG.format(backend))
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5b721cc and 4cfa04d.

📒 Files selected for processing (2)
  • tilelang/profiler/__init__.py (3 hunks)
  • tilelang/profiler/bench.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/profiler/bench.py (1)
tilelang/env.py (1)
  • get (136-139)
🪛 Ruff (0.13.3)
tilelang/profiler/bench.py

190-190: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: format-check
🔇 Additional comments (7)
tilelang/profiler/bench.py (4)

5-6: LGTM!

The os and sys imports are necessary for the stdout/stderr suppression utilities introduced in this PR.


19-52: LGTM!

The suppress_stdout_stderr implementation correctly redirects stdout and stderr to /dev/null and properly restores the original state. File descriptor management and resource cleanup are handled correctly.


64-64: LGTM!

The backend parameter addition with Literal["event", "cupti"] provides clear type safety for backend selection.


126-153: LGTM!

The event backend preserves the original timing logic with CUDA events. The implementation correctly handles gradient clearing, L2 cache flushing, and quantile computation.

tilelang/profiler/__init__.py (3)

3-3: LGTM!

The Literal import addition supports the new backend and return_mode parameter typing.


226-229: LGTM!

The new parameter additions (backend, grad_to_none, quantiles, return_mode) are properly typed and align with the updated do_bench signature in bench.py.


252-262: LGTM! Past review feedback addressed.

All parameters (grad_to_none, quantiles, backend, return_mode) are now properly forwarded to the underlying do_bench call, resolving the previous review concern about missing parameter propagation.

n_repeat: int = 1,
input_tensors: List[torch.Tensor] = None,
backend: Literal["event", "cupti"] = "event",
grad_to_none: Optional[List[torch.Tensor]] = None,
Member

Do we really need the grad_to_none parameter?
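For context, grad_to_none mainly matters when the benchmarked function runs a backward pass: clearing the listed gradients between timed iterations keeps every run identical instead of accumulating into .grad. A small illustrative sketch; the layer, the loss, and the commented do_bench call are examples, not code from this PR:

import torch

layer = torch.nn.Linear(1024, 1024).cuda().half()
x = torch.randn(64, 1024, device="cuda", dtype=torch.half)


def train_step():
    loss = layer(x).square().mean()
    loss.backward()


# Without grad_to_none, each timed iteration accumulates into layer.weight.grad,
# adding an in-place add to later iterations and skewing the measurement.
# latency = do_bench(train_step, grad_to_none=[layer.weight, layer.bias], backend="event")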

        return getattr(torch, return_mode)(times).item()

    elif backend == "cupti":
        using_nsys = int(os.environ.get('DG_NSYS_PROFILING', 0))
Member

DG means DeepGEMM; it would be better to remove this env var and always set using_nsys = False.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4cfa04d and f0509ab.

📒 Files selected for processing (1)
  • tilelang/profiler/bench.py (4 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
tilelang/profiler/bench.py

175-175: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: format-check

Comment on lines 117 to 172
if backend == "event":
start_event = [torch.cuda.Event(enable_timing=True) for i in range(n_repeat)]
end_event = [torch.cuda.Event(enable_timing=True) for i in range(n_repeat)]
for i in range(n_repeat):
# we don't want `fn` to accumulate gradient values
# if it contains a backward pass. So we clear the
# provided gradients
if grad_to_none is not None:
for x in grad_to_none:
x.grad = None
# we clear the L2 cache before each run
cache.zero_()
# record time of `fn`
start_event[i].record()
fn()
end_event[i].record()
# Record clocks
torch.cuda.synchronize()
times = torch.tensor(
[s.elapsed_time(e) for s, e in zip(start_event, end_event)],
dtype=torch.float,
)
if quantiles is not None:
ret = torch.quantile(times, torch.tensor(quantiles, dtype=torch.float)).tolist()
if len(ret) == 1:
ret = ret[0]
return ret
return getattr(torch, return_mode)(times).item()

elif backend == "cupti":
with suppress_stdout_stderr():
schedule = torch.profiler.schedule(
wait=1, warmup=0, active=1, repeat=1)
profiler = torch.profiler.profile(
activities=[torch.profiler.ProfilerActivity.CUDA],
schedule=schedule)
with profiler:
for _i in range(2):
for _ in range(n_repeat):
cache.zero_()
fn()

profiler.step()

# Return average kernel times
total_cuda_time = 0.0
excluded_time = 0.0
excluded_kernels = "at::native::vectorized_elementwise"
for e in profiler.key_averages():
total_cuda_time += e.self_device_time_total
if excluded_kernels in e.key:
excluded_time += e.self_device_time_total

kernel_time = (total_cuda_time - excluded_time) / n_repeat

return kernel_time * 1e-3

⚠️ Potential issue | 🟠 Major

Honor grad/aggregation options for the CUPTI path.

When backend="cupti", we drop the gradient reset (Line 124), quantile handling (Line 139), and return_mode aggregation (Line 144). Any caller supplying grad_to_none or return_mode="min"/"max"/"median" still gets the CUPTI branch, but those settings are silently ignored, so backward benches will accumulate grads and the returned statistic is always the mean. This diverges from the API contract established by the event backend and will yield wrong results for existing call sites. Please either implement the equivalent handling in the CUPTI branch or reject unsupported combinations up front (e.g. raise with a clear message when grad_to_none/quantiles/non-mean return modes are requested).

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
tilelang/profiler/bench.py (2)

155-155: Consider making the excluded kernel configurable.

The hardcoded exclusion of "at::native::vectorized_elementwise" kernels should be documented or made into a parameter. Users may want to:

  • Include these kernels in their measurements
  • Exclude different kernels

Option 1: Document why this specific kernel is excluded:

-        excluded_kernels = "at::native::vectorized_elementwise"
+        # Exclude vectorized_elementwise as it's an overhead kernel not part of the measured computation
+        excluded_kernels = "at::native::vectorized_elementwise"

Option 2: Add a parameter for excluded kernels:

def do_bench(
    fn: Callable,
    ...
    backend: Literal["event", "cupti"] = "event",
    excluded_kernels: Optional[List[str]] = None,
    ...
):

166-166: Minor: Consider extracting the error message.

Static analysis suggests avoiding long messages inside exception constructors (TRY003). While this is a minor style issue, you could extract it to a constant if the message grows.

+UNKNOWN_BACKEND_ERROR = "Unknown profiler backend: {backend}. Supported backends: 'event', 'cupti'"
 else:
-    raise ValueError(f"Unknown profiler backend: {backend}")
+    raise ValueError(UNKNOWN_BACKEND_ERROR.format(backend=backend))
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0509ab and 7e64b8e.

📒 Files selected for processing (2)
  • tilelang/profiler/__init__.py (3 hunks)
  • tilelang/profiler/bench.py (4 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
tilelang/profiler/bench.py

166-166: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (2)
tilelang/profiler/bench.py (1)

9-43: LGTM on the suppression utility.

The suppress_stdout_stderr context manager correctly duplicates file descriptors, redirects stdout/stderr to /dev/null, and restores them on exit. This is a clean approach for silencing profiler output.

tilelang/profiler/__init__.py (1)

226-260: LGTM! Parameters are correctly propagated.

The new backend, quantiles, and return_mode parameters are properly:

  1. Added to the function signature with appropriate types
  2. Propagated to the underlying do_bench call
  3. Aligned with the implementation in bench.py

The parameter forwarding is complete and correct.

Comment on lines 139 to 163
elif backend == "cupti":
with suppress_stdout_stderr():
schedule = torch.profiler.schedule(wait=1, warmup=0, active=1, repeat=1)
profiler = torch.profiler.profile(
activities=[torch.profiler.ProfilerActivity.CUDA], schedule=schedule)
with profiler:
for _i in range(2):
for _ in range(n_repeat):
cache.zero_()
fn()

profiler.step()

# Return average kernel times
total_cuda_time = 0.0
excluded_time = 0.0
excluded_kernels = "at::native::vectorized_elementwise"
for e in profiler.key_averages():
total_cuda_time += e.self_device_time_total
if excluded_kernels in e.key:
excluded_time += e.self_device_time_total

kernel_time = (total_cuda_time - excluded_time) / n_repeat

return kernel_time * 1e-3

⚠️ Potential issue | 🔴 Critical

Critical: Honor ignored parameters and make the unit conversion explicit in the CUPTI backend.

The CUPTI backend has several issues:

  1. Implicit unit conversion (Line 163): self_device_time_total is reported in microseconds, while the event backend returns milliseconds via elapsed_time. kernel_time * 1e-3 performs the required microseconds-to-milliseconds conversion, but nothing documents it, so it is easy to misread as a scaling bug.
  2. Ignored quantiles parameter: Callers requesting specific quantiles will get the mean instead, breaking the API contract.
  3. Ignored return_mode parameter: Callers requesting "min", "max", or "median" will always get "mean", again breaking the contract.
  4. Undocumented loop (Line 145): The outer loop running twice corresponds to the profiler schedule's wait and active phases but lacks explanation.

Make the unit conversion explicit:

-        return kernel_time * 1e-3
+        return kernel_time * 1e-3  # self_device_time_total is in microseconds; convert to milliseconds

Honor quantiles and return_mode or reject them up front:

Option 1: Validate and reject unsupported combinations at the start of the function:

if backend == "cupti":
    if quantiles is not None:
        raise ValueError("quantiles parameter is not supported with cupti backend")
    if return_mode != "mean":
        raise ValueError(f"cupti backend only supports return_mode='mean', got '{return_mode}'")

Option 2: Implement support by collecting per-iteration times in the CUPTI path (more complex but provides feature parity).

Document the profiler schedule loop:

         with profiler:
+            # Run twice: once for profiler wait phase, once for active measurement
             for _i in range(2):

Based on learnings: The reviewer asked if do_bench_kineto and do_bench_event could be combined into a single API. This PR achieves that by adding a backend parameter to do_bench.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tilelang/profiler/bench.py (1)

57-129: API contract violation: CUPTI backend silently ignores quantiles and return_mode parameters.

When backend="cupti" is selected, the function accepts quantiles and return_mode parameters but silently ignores them (Line 126 calls _bench_with_cupti which doesn't accept these parameters). This breaks the API contract:

  • Callers requesting quantiles=[0.5, 0.95] will get a single mean value
  • Callers requesting return_mode="min" will get the mean instead

This creates silent behavioral divergence from the event backend.

Solution 1 (recommended): Validate parameters up front and reject unsupported combinations:

     assert return_mode in ["min", "max", "mean", "median"], \
         f"Invalid return_mode: {return_mode}"
+    
+    # Validate CUPTI backend constraints
+    if backend == "cupti":
+        if quantiles is not None:
+            raise ValueError("quantiles parameter is not supported with backend='cupti'")
+        if return_mode != "mean":
+            raise ValueError(f"backend='cupti' only supports return_mode='mean', got '{return_mode}'")

Solution 2 (ideal but more complex): Implement full parameter support in _bench_with_cupti by collecting per-iteration times and computing quantiles/aggregations.

Based on learnings: This addresses the past review comment about honoring quantiles and return_mode for the CUPTI path.

🧹 Nitpick comments (1)
tilelang/profiler/bench.py (1)

180-184: Document the profiler schedule loop.

The outer loop running twice corresponds to the profiler's wait (1) and active (1) phases from the schedule, but this isn't documented.

Add a comment:

         with profiler:
+            # Loop twice: once for wait phase, once for active measurement phase
             for _ in range(2):
                 for _ in range(n_repeat):
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e64b8e and 9f55d3b.

📒 Files selected for processing (1)
  • tilelang/profiler/bench.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/profiler/bench.py (1)
tilelang/profiler/__init__.py (1)
  • do_bench (218-281)
🪛 Ruff (0.13.3)
tilelang/profiler/bench.py

128-128: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-test-amd
  • GitHub Check: bot-task
🔇 Additional comments (3)
tilelang/profiler/bench.py (3)

10-54: LGTM! Clean implementation of stdout/stderr suppression.

The context manager correctly:

  • Saves original file descriptors using os.dup
  • Redirects streams to /dev/null
  • Restores everything in __exit__

This is a standard pattern for fd redirection.


131-163: LGTM! Solid CUDA events implementation.

The function correctly:

  • Creates timing events per iteration
  • Clears L2 cache between runs
  • Synchronizes before collecting measurements
  • Handles both quantile and aggregation modes

189-194: Make excluded kernel pattern configurable and use regex

  • Replace the hardcoded "at::native::vectorized_elementwise" with a user-provided list of regex patterns (e.g. via a CLI flag --exclude-kernel-regex, defaulting to r"vectorized_elementwise")
  • Match only on the substring vectorized_elementwise (or other future names) rather than the full namespace to tolerate PyTorch renames
  • Optionally document how to verify the regex against profiler.key_averages() output across PyTorch releases
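A sketch of the configurable exclusion; excluded_kernel_regexes is a hypothetical parameter name, while e.key and e.self_device_time_total are the fields already used in the CUPTI branch:

import re
from typing import List, Optional


def _included_device_time_us(events, excluded_kernel_regexes: Optional[List[str]] = None) -> float:
    # `events` is the result of profiler.key_averages(); returns the summed
    # self device time (microseconds) of kernels that match no exclusion pattern.
    patterns = [re.compile(p) for p in (excluded_kernel_regexes or [r"vectorized_elementwise"])]
    total_us = 0.0
    for e in events:
        if any(p.search(e.key) for p in patterns):
            continue  # e.g. the elementwise fill used for the L2-flush buffer
        total_us += e.self_device_time_total
    return total_us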

@Cunxiao2002
Contributor Author

Some benchmarks comparing the two profiler backends:

  1. example/gemm/example_gemm.py
     • event Latency: 0.027775999158620834ms
     • cupti Latency: 0.023777000000000045ms
  2. example/flash_attention/example_mha_fwd_bhsd.py
     • event Latency: 0.01020800042897462845ms
     • cupti Latency: 0.00630400000000008731ms
  3. example/grouped_gemm/example_grouped_gemm_fwd.py
     • event Latency: 0.4819520115852356ms
     • cupti Latency: 0.4754229999999989ms

LeiWang1999 previously approved these changes Oct 6, 2025
@LeiWang1999 LeiWang1999 merged commit 91d5ef5 into tile-ai:main Oct 6, 2025
4 of 5 checks passed
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9f55d3b and aa70ebd.

📒 Files selected for processing (1)
  • testing/python/profiler/test_tilelang_profiler.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
testing/python/profiler/test_tilelang_profiler.py (7)
src/tl_templates/cuda/reduce.h (1)
  • T (75-147)
tilelang/jit/__init__.py (1)
  • jit (240-313)
tilelang/jit/kernel.py (2)
  • out_idx (446-447)
  • get_profiler (360-376)
tilelang/language/allocate.py (2)
  • alloc_shared (21-36)
  • alloc_fragment (53-64)
tilelang/language/fill.py (1)
  • clear (24-48)
tilelang/language/pipeline.py (1)
  • Pipelined (9-46)
tilelang/language/copy.py (1)
  • copy (84-152)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-test-amd
  • GitHub Check: bot-task

Comment on lines +35 to +50
    a = torch.randn(1024, 1024).cuda().half()
    b = torch.randn(1024, 1024).cuda().half()

    c = kernel(a, b)
    ref_c = a @ b
    torch.testing.assert_close(c, ref_c, rtol=1e-2, atol=1e-2)

    # benchmark
    profiler = kernel.get_profiler()

    # use cupti backend
    cupti_latency = profiler.do_bench(backend="cupti")

    # use event backend
    event_latency = profiler.do_bench(backend="event")
    print(f"cupti Latency: {cupti_latency}ms")

⚠️ Potential issue | 🟠 Major

Guard the test when CUDA is unavailable.

torch.randn(...).cuda() will raise immediately on CPU-only hosts, so CI without GPUs will fail before exercising the profiler. Add a CUDA-availability check (e.g., skip via pytest.skip / pytest.mark.skipif) before moving tensors to CUDA.

Apply this diff (pytest is commonly available in the test suite):

+    import pytest
+    if not torch.cuda.is_available():
+        pytest.skip("Profiler test requires CUDA.")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    a = torch.randn(1024, 1024).cuda().half()
    b = torch.randn(1024, 1024).cuda().half()
    c = kernel(a, b)
    ref_c = a @ b
    torch.testing.assert_close(c, ref_c, rtol=1e-2, atol=1e-2)
    # benchmark
    profiler = kernel.get_profiler()
    # use cupti backend
    cupti_latency = profiler.do_bench(backend="cupti")
    # use event backend
    event_latency = profiler.do_bench(backend="event")
    print(f"cupti Latency: {cupti_latency}ms")

    import pytest
    if not torch.cuda.is_available():
        pytest.skip("Profiler test requires CUDA.")
    a = torch.randn(1024, 1024).cuda().half()
    b = torch.randn(1024, 1024).cuda().half()
    c = kernel(a, b)
    ref_c = a @ b
    torch.testing.assert_close(c, ref_c, rtol=1e-2, atol=1e-2)
    # benchmark
    profiler = kernel.get_profiler()
    # use cupti backend
    cupti_latency = profiler.do_bench(backend="cupti")
    # use event backend
    event_latency = profiler.do_bench(backend="event")
    print(f"cupti Latency: {cupti_latency}ms")
🤖 Prompt for AI Agents
testing/python/profiler/test_tilelang_profiler.py lines 35-50: the test
unconditionally calls .cuda() which fails on CPU-only hosts; add a guard at the
top of this block that checks torch.cuda.is_available() and skips the test when
False (use pytest.skip or @pytest.mark.skipif) before creating tensors or
calling kernel.get_profiler(), so no tensors are moved to CUDA on CI without
GPUs.
