[Profiler]Adds CUPTI profiler support #936
Conversation
Walkthrough
Adds runtime benchmarking to the GEMM example and extends the profiler with two backends ("event", "cupti") and new options.
Changes
Sequence Diagram(s)
sequenceDiagram
  autonumber
  actor User as Example Script
  participant K as gemm_kernel
  participant P as Profiler
  participant E as EventBench
  participant C as CUPTIBench
  participant CUDA as CUDA / NSYS Runtime
  User->>K: build kernel
  User->>P: profiler = kernel.get_profiler()
  User->>P: profiler.do_bench(func=K, backend=...)
  alt backend == "event"
    P->>E: _bench_with_cuda_events(func=K, ...)
    E->>CUDA: warmup + timed iterations (CUDA events)
    CUDA-->>E: event timings
    E-->>P: aggregated latency (quantiles/return_mode)
  else backend == "cupti"
    P->>C: _bench_with_cupti(func=K, ...)
    C->>CUDA: run iterations (stdout/stderr suppressed)
    CUDA-->>C: per-kernel records
    C-->>P: per-kernel latency aggregated (quantiles/return_mode)
  end
  P-->>User: latency value(s)
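For orientation, a minimal usage sketch of the flow above. The matmul builder name is a stand-in for the example's kernel constructor; get_profiler() and do_bench(backend=...) are the APIs exercised in this PR.

```python
# Illustrative only: "matmul" stands in for the example's GEMM kernel builder.
kernel = matmul(1024, 1024, 1024, block_M=128, block_N=128, block_K=32)
profiler = kernel.get_profiler()

event_ms = profiler.do_bench(backend="event")   # CUDA-event timing of the whole call
cupti_ms = profiler.do_bench(backend="cupti")   # per-kernel device time via torch.profiler
print(f"event: {event_ms:.3f} ms, cupti: {cupti_ms:.3f} ms")
```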
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related issues
 Suggested reviewers
 Poem
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
 ✅ Passed checks (2 passed)
 ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run  We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️  Outside diff range comments (1)
tilelang/profiler/__init__.py (1)
218-233: Fix API typing + defaults: avoid overriding dynamic warmup/repeat.
- input_tensors should be Optional.
- kernel_names should be Optional[Union[str, Tuple[str, ...]]].
- trace_path should be Optional[str].
- Return type isn’t always float (event quantiles => List[float]; kineto multi-kernel => Tuple[float, ...]).
- n_warmup/n_repeat should default to 0 to enable dynamic estimation (current 1 forces single-iteration timing).
Apply:
-from typing import List, Optional, Callable, Any, Literal
+from typing import List, Optional, Callable, Any, Literal, Union, Tuple
@@
 def do_bench(
     self,
     func: Optional[Callable] = None,
     warmup: int = 25,
     rep: int = 100,
-    n_warmup: int = 1,
-    n_repeat: int = 1,
-    input_tensors: List[torch.Tensor] = None,
-    kernel_names=None,
+    n_warmup: int = 0,
+    n_repeat: int = 0,
+    input_tensors: Optional[List[torch.Tensor]] = None,
+    kernel_names: Optional[Union[str, Tuple[str, ...]]] = None,
     grad_to_none: Optional[List[torch.Tensor]] = None,
     quantiles: Optional[List[float]] = None,
     return_mode: Literal["min", "max", "mean", "median"] = "mean",
-    trace_path: str = None,
+    trace_path: Optional[str] = None,
     suppress_kineto_output: bool = False,
     with_multiple_kernels: bool = False,
-) -> float:
+) -> Union[float, List[float], Tuple[float, ...]]:
Also update the docstring "Returns" accordingly. [Based on learnings]
🧹 Nitpick comments (6)
tilelang/profiler/__init__.py (1)
255-275: Forward parity and doc correctness when using Kineto path.
- do_bench may return a tuple when kernel_names is a tuple; update doc/typing as suggested above.
- If you add grad clearing to Kineto (recommended for backward benches), forward grad_to_none to do_bench_kineto.
If Kineto supports grad clearing:
 ) if kernel_names is None else do_bench_kineto(
     bench_func,
     kernel_names=kernel_names,
     warmup=warmup,
     rep=rep,
     _n_warmup=n_warmup,
     _n_repeat=n_repeat,
     return_mode=return_mode,
     trace_path=trace_path,
     suppress_kineto_output=suppress_kineto_output,
     with_multiple_kernels=with_multiple_kernels,
+    grad_to_none=grad_to_none,
 )
tilelang/profiler/bench.py (4)
114-148: Make stdout/stderr suppression robust on non-filelike streams. sys.stdout/stderr may lack fileno() (e.g., in Jupyter), raising exceptions. Fall back gracefully instead of crashing.
+import io
@@
 class suppress_stdout_stderr:
     def __enter__(self):
-        self.outnull_file = open(os.devnull, 'w')
-        self.errnull_file = open(os.devnull, 'w')
-
-        self.old_stdout_fileno_undup = sys.stdout.fileno()
-        self.old_stderr_fileno_undup = sys.stderr.fileno()
-
-        self.old_stdout_fileno = os.dup(sys.stdout.fileno())
-        self.old_stderr_fileno = os.dup(sys.stderr.fileno())
-
-        self.old_stdout = sys.stdout
-        self.old_stderr = sys.stderr
-
-        os.dup2(self.outnull_file.fileno(), self.old_stdout_fileno_undup)
-        os.dup2(self.errnull_file.fileno(), self.old_stderr_fileno_undup)
-
-        sys.stdout = self.outnull_file
-        sys.stderr = self.errnull_file
-        return self
+        self._disabled = False
+        try:
+            self.outnull_file = open(os.devnull, 'w')
+            self.errnull_file = open(os.devnull, 'w')
+
+            self.old_stdout_fileno_undup = sys.stdout.fileno()
+            self.old_stderr_fileno_undup = sys.stderr.fileno()
+
+            self.old_stdout_fileno = os.dup(sys.stdout.fileno())
+            self.old_stderr_fileno = os.dup(sys.stderr.fileno())
+
+            self.old_stdout = sys.stdout
+            self.old_stderr = sys.stderr
+
+            os.dup2(self.outnull_file.fileno(), self.old_stdout_fileno_undup)
+            os.dup2(self.errnull_file.fileno(), self.old_stderr_fileno_undup)
+
+            sys.stdout = self.outnull_file
+            sys.stderr = self.errnull_file
+        except (AttributeError, io.UnsupportedOperation, OSError):
+            # Fall back gracefully when fileno() is unavailable
+            self._disabled = True
+        return self

     def __exit__(self, *_):
-        sys.stdout = self.old_stdout
-        sys.stderr = self.old_stderr
-
-        os.dup2(self.old_stdout_fileno, self.old_stdout_fileno_undup)
-        os.dup2(self.old_stderr_fileno, self.old_stderr_fileno_undup)
-
-        os.close(self.old_stdout_fileno)
-        os.close(self.old_stderr_fileno)
-
-        self.outnull_file.close()
-        self.errnull_file.close()
+        if getattr(self, "_disabled", False):
+            return
+        sys.stdout = self.old_stdout
+        sys.stderr = self.old_stderr
+        os.dup2(self.old_stdout_fileno, self.old_stdout_fileno_undup)
+        os.dup2(self.old_stderr_fileno, self.old_stderr_fileno_undup)
+        os.close(self.old_stdout_fileno)
+        os.close(self.old_stderr_fileno)
+        self.outnull_file.close()
+        self.errnull_file.close()
200-209: Support gradient clearing in Kineto path (parity with event timing). Avoid gradient accumulation for backward benches.
 with profiler:
     for _i in range(2):
         for _ in range(n_repeat):
             cache.zero_()
+            # clear grads if provided (backward compatibility with event path)
+            # grad_to_none: Optional[List[torch.Tensor]]
+            try:
+                grad_list = grad_to_none  # if parameter is added
+            except NameError:
+                grad_list = None
+            if grad_list is not None:
+                for x in grad_list:
+                    x.grad = None
             fn()
Note: If you adopt this, add grad_to_none: Optional[List[torch.Tensor]] = None to the function signature and forward it from the caller.
221-229: Remove unused noqa and harden kernel name validation.
- The # noqa: SIM101 directive is flagged as unused by Ruff. Remove it.
- Consider accepting Sequence[str] instead of only tuple for ergonomics (optional).
-    assert isinstance(kernel_names, str) or isinstance(kernel_names, tuple)  # noqa: SIM101
+    assert isinstance(kernel_names, str) or isinstance(kernel_names, tuple)
236-251: Expand unit parsing to include seconds and nanoseconds. Covers more profiler formats and avoids silent mis-scaling (see the sketch after the diff below).
-    units = {'ms': 1e3, 'us': 1e6}
+    units = {'s': 1, 'ms': 1e3, 'us': 1e6, 'ns': 1e9}
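A small sketch of how such a units table could be applied when scaling a parsed profiler time into milliseconds. The helper name and inputs are illustrative, not the actual bench.py parsing code.

```python
def parse_time_ms(value: float, unit: str) -> float:
    # Scale factors relative to one second; value / units[unit] gives seconds,
    # multiplying by 1e3 converts to milliseconds.
    units = {'s': 1, 'ms': 1e3, 'us': 1e6, 'ns': 1e9}
    return value / units[unit] * 1e3

print(parse_time_ms(250.0, 'us'))  # -> 0.25 (ms)
```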
examples/gemm/example_gemm.py (1)
54-59: Suppress profiler chatter in example output. Default Kineto may print noisy lines in some setups. Suppress for a cleaner example.
-    latency = profiler.do_bench(kernel_names="gemm_kernel")
+    latency = profiler.do_bench(kernel_names="gemm_kernel", suppress_kineto_output=True)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- examples/gemm/example_gemm.py (1 hunks)
- tilelang/profiler/__init__.py (4 hunks)
- tilelang/profiler/bench.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
examples/gemm/example_gemm.py (2)
tilelang/jit/kernel.py (1)
get_profiler(360-376)
tilelang/profiler/__init__.py (1)
do_bench(218-296)
tilelang/profiler/__init__.py (1)
tilelang/profiler/bench.py (2)
do_bench_event(9-102)
do_bench_kineto(151-253)
tilelang/profiler/bench.py (1)
tilelang/env.py (1)
get(136-139)
🪛 Ruff (0.13.2)
tilelang/profiler/__init__.py
230-230: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
tilelang/profiler/bench.py
160-160: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
221-221: Unused noqa directive (non-enabled: SIM101)
Remove unused noqa directive
(RUF100)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-test-amd
- GitHub Check: bot-task
tilelang/profiler/bench.py (Outdated)
# from https://github.com/deepseek-ai/DeepGEMM/blob/main/deep_gemm/testing/bench.py
def do_bench_kineto(
    fn: Callable,
    kernel_names,
    warmup: float = 25,
    rep: float = 100,
    _n_warmup: int = 0,
    _n_repeat: int = 0,
    fast_flush: bool = True,
    return_mode: Literal["min", "max", "mean", "median"] = "mean",
    trace_path: str = None,
    suppress_kineto_output: bool = False,
    with_multiple_kernels: bool = False,
) -> Union[float, List[float]]:
    """Benchmarks the runtime of a PyTorch function using Kineto."""
🛠️ Refactor suggestion | 🟠 Major
Correct Kineto API typing and Optional usage (RUF013).
- kernel_names supports str or tuple; return can be float or tuple.
- trace_path should be Optional[str].
-from typing import Callable, List, Literal, Optional, Union
+from typing import Callable, List, Literal, Optional, Union, Tuple
@@
-def do_bench_kineto(
-    fn: Callable,
-    kernel_names,
+def do_bench_kineto(
+    fn: Callable,
+    kernel_names: Union[str, Tuple[str, ...]],
@@
-    return_mode: Literal["min", "max", "mean", "median"] = "mean",
-    trace_path: str = None,
+    return_mode: Literal["min", "max", "mean", "median"] = "mean",
+    trace_path: Optional[str] = None,
     suppress_kineto_output: bool = False,
     with_multiple_kernels: bool = False,
-) -> Union[float, List[float]]:
+) -> Union[float, Tuple[float, ...]]:
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested change (resulting code):
from typing import Callable, List, Literal, Optional, Union, Tuple

# from https://github.com/deepseek-ai/DeepGEMM/blob/main/deep_gemm/testing/bench.py
def do_bench_kineto(
    fn: Callable,
    kernel_names: Union[str, Tuple[str, ...]],
    warmup: float = 25,
    rep: float = 100,
    _n_warmup: int = 0,
    _n_repeat: int = 0,
    fast_flush: bool = True,
    return_mode: Literal["min", "max", "mean", "median"] = "mean",
    trace_path: Optional[str] = None,
    suppress_kineto_output: bool = False,
    with_multiple_kernels: bool = False,
) -> Union[float, Tuple[float, ...]]:
    """Benchmarks the runtime of a PyTorch function using Kineto."""
🧰 Tools
🪛 Ruff (0.13.2)
160-160: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
🤖 Prompt for AI Agents
In tilelang/profiler/bench.py around lines 150 to 164, the function signature
uses overly narrow typing: update kernel_names to accept either a string or a
tuple of strings (e.g., Union[str, Tuple[str, ...]]), change the return
annotation to reflect that it may return a float or a tuple of floats (e.g.,
Union[float, Tuple[float, ...], List[float]]), and make trace_path
Optional[str]; also add any missing typing imports (Optional, Tuple, Union) at
top of the file. Ensure no runtime behavior changes, only update the type hints
and imports.
It would be better if we could combine do_bench_kineto and do_bench_event into the same API.
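A rough sketch of the combined entry point that comment asks for. The _bench_with_cuda_events and _bench_with_cupti helpers are the backend paths named in the walkthrough; treat this as an illustration, not the merged implementation.

```python
from typing import Callable, Literal

def do_bench(fn: Callable,
             backend: Literal["event", "cupti"] = "event",
             **kwargs):
    # Single public entry point; dispatch to the existing backend-specific paths.
    if backend == "event":
        return _bench_with_cuda_events(fn, **kwargs)
    if backend == "cupti":
        return _bench_with_cupti(fn, **kwargs)
    raise ValueError(f"Unknown profiler backend: {backend}")
```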
Actionable comments posted: 1
🧹 Nitpick comments (3)
tilelang/profiler/bench.py (2)
159-160: Fix Optional typing for trace_path.
trace_path defaults to None, so the annotation must be Optional[str] to satisfy PEP 484 and Ruff (RUF013). Update the signature accordingly.
270-270: Restore trailing newline. CI is failing with Ruff W292 because the file lacks a trailing newline. Please add the terminal newline so formatting passes.
tilelang/profiler/__init__.py (1)
230-231: Align the trace_path annotation with its default.
trace_path defaults to None; annotate it as Optional[str] to silence Ruff (RUF013) and accurately describe the API.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- examples/gemm/example_gemm.py (1 hunks)
- tilelang/jit/adapter/cython/adapter.py (1 hunks)
- tilelang/profiler/__init__.py (3 hunks)
- tilelang/profiler/bench.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- tilelang/jit/adapter/cython/adapter.py
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/gemm/example_gemm.py
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/profiler/bench.py (1)
tilelang/profiler/__init__.py (1)
do_bench(218-285)
🪛 GitHub Actions: CI Test on AMD
tilelang/profiler/bench.py
[error] 270-270: Ruff: W292 No newline at end of file. 1 fixable with the --fix option.
🪛 Ruff (0.13.3)
tilelang/profiler/bench.py
159-159: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
247-247: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
tilelang/profiler/__init__.py
227-227: Unused method argument: grad_to_none
(ARG002)
228-228: Unused method argument: quantiles
(ARG002)
230-230: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: format-check
Actionable comments posted: 1
♻️ Duplicate comments (1)
tilelang/profiler/bench.py (1)
159-159: Fix implicit Optional type hint. The trace_path parameter should be explicitly typed as Optional[str] to comply with PEP 484. Apply this diff:
-    trace_path: str = None,
+    trace_path: Optional[str] = None,
🧹 Nitpick comments (2)
tilelang/profiler/bench.py (2)
151-162: Enhance the docstring to document Kineto-specific behavior. The docstring is minimal. Consider documenting the Kineto/CUPTI-specific behavior, the purpose of trace_path and suppress_kineto_output, and the handling of the DG_NSYS_PROFILING environment variable.
236-270: Add a docstring to document dispatcher behavior. The do_bench function now acts as a dispatcher but lacks a docstring. Consider adding documentation that explains:
- The backend parameter and its effect on the profiling method
- Which parameters are specific to each backend (e.g., grad_to_none and quantiles only apply to the "event" backend; trace_path and suppress_kineto_output only apply to the "cupti" backend)
- The return value and how it differs between backends
Example docstring structure:
def do_bench(
    fn: Callable,
    warmup: float = 25,
    rep: float = 100,
    _n_warmup: int = 0,
    _n_repeat: int = 0,
    grad_to_none: Optional[List[torch.Tensor]] = None,
    quantiles: Optional[List[float]] = None,
    fast_flush: bool = True,
    backend: Literal["event", "cupti"] = "event",
    return_mode: Literal["min", "max", "mean", "median"] = "mean",
    trace_path: Optional[str] = None,
    suppress_kineto_output: bool = False,
) -> Union[float, List[float]]:
    """Benchmarks the runtime of a PyTorch function.

    Dispatches to either event-based or Kineto/CUPTI-based profiling
    depending on the backend parameter.

    Args:
        fn: Function to benchmark
        warmup: Target warmup time in milliseconds
        rep: Target number of repetitions
        _n_warmup: Override for number of warmup iterations
        _n_repeat: Override for number of timing iterations
        grad_to_none: Tensors whose gradients should be cleared (event backend only)
        quantiles: Performance percentiles to compute (event backend only)
        fast_flush: Whether to use faster L2 cache flushing
        backend: Profiling backend ("event" for CUDA events, "cupti" for Kineto/CUPTI)
        return_mode: How to aggregate timing results
        trace_path: Path to save chrome trace (cupti backend only)
        suppress_kineto_output: Whether to suppress Kineto output (cupti backend only)

    Returns:
        Aggregated runtime in milliseconds, or list of quantiles if requested
    """
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tilelang/profiler/bench.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/profiler/bench.py (1)
tilelang/profiler/__init__.py (1)
do_bench(218-285)
🪛 Ruff (0.13.3)
tilelang/profiler/bench.py
159-159: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
247-247: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-test-amd
- GitHub Check: format-check
- GitHub Check: bot-task
🔇 Additional comments (4)
tilelang/profiler/bench.py (4)
5-6: LGTM! The new imports are necessary for the output suppression functionality.
9-102: LGTM! The refactoring to make this an internal function is appropriate. The enhanced docstring provides clear documentation of the function's behavior and parameters.
105-147: LGTM! The context managers are well-implemented. The suppress_stdout_stderr class properly manages file descriptors with duplication and restoration. The empty_suppress no-op context manager is a clean pattern for conditional suppression.
215-216: Clarify the NSYS profiling return value. Returning a hardcoded 1 when using_nsys is enabled is ambiguous; document this sentinel value or replace it with a meaningful indicator (e.g., float('nan')) or raise NotImplementedError.
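One possible shape for that clarification, sketched as a hypothetical helper and assuming the using_nsys branch is kept (a later comment suggests removing it entirely).

```python
import math

def _nsys_placeholder_latency() -> float:
    # Hypothetical helper: there is no in-process timing to report when nsys
    # drives the run, so return NaN (or raise) instead of a hardcoded 1.
    return math.nan
    # alternatively: raise NotImplementedError("timings are collected externally by nsys")
```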
tilelang/profiler/bench.py (Outdated)
    fast_flush: bool = True,
    backend: Literal["event", "cupti"] = "event",
    return_mode: Literal["min", "max", "mean", "median"] = "mean",
    trace_path: str = None,
Fix implicit Optional type hint.
The trace_path parameter should be explicitly typed as Optional[str] to comply with PEP 484.
Apply this diff:
-    trace_path: str = None,
+    trace_path: Optional[str] = None,
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested change (resulting code):
    trace_path: Optional[str] = None,
🧰 Tools
🪛 Ruff (0.13.3)
247-247: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
🤖 Prompt for AI Agents
In tilelang/profiler/bench.py around line 247, the parameter trace_path is
currently declared as `trace_path: str = None`, which creates an implicit
Optional; change its annotation to `trace_path: Optional[str] = None` and ensure
`from typing import Optional` is imported at the top of the module (add the
import if missing). This makes the optional nature explicit and compliant with
PEP 484.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tilelang/profiler/bench.py (2)
9-17: Consider using contextlib.nullcontext for the no-op case. While the custom empty_suppress implementation is correct, Python's standard library provides contextlib.nullcontext for this exact use case, which would eliminate the need for custom code. Apply this diff to use the standard library:
-# from https://github.com/deepseek-ai/DeepGEMM/blob/main/deep_gemm/testing/bench.py
-class empty_suppress:
-
-    def __enter__(self):
-        return self
-
-    def __exit__(self, *_):
-        pass
+from contextlib import nullcontext as empty_suppress
Then update lines 157 and 163 to use it:
     suppress = suppress_stdout_stderr if not using_nsys else empty_suppress
     with suppress():
         schedule = torch.profiler.schedule(
             wait=1, warmup=0, active=1, repeat=1) if not using_nsys else None
         profiler = torch.profiler.profile(
             activities=[torch.profiler.ProfilerActivity.CUDA],
             schedule=schedule) if not using_nsys else empty_suppress()
189-190: Consider extracting the error message to comply with TRY003. While the error handling is functionally correct, the static analysis tool flags the long error message inline. This is a minor style issue.
Apply this diff to comply with TRY003:
+UNKNOWN_BACKEND_MSG = "Unknown profiler backend: {}"
+
 else:
-    raise ValueError(f"Unknown profiler backend: {backend}")
+    raise ValueError(UNKNOWN_BACKEND_MSG.format(backend))
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tilelang/profiler/__init__.py (3 hunks)
- tilelang/profiler/bench.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/profiler/bench.py (1)
tilelang/env.py (1)
get(136-139)
🪛 Ruff (0.13.3)
tilelang/profiler/bench.py
190-190: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: format-check
🔇 Additional comments (7)
tilelang/profiler/bench.py (4)
5-6: LGTM! The os and sys imports are necessary for the stdout/stderr suppression utilities introduced in this PR.
19-52: LGTM! The suppress_stdout_stderr implementation correctly redirects stdout and stderr to /dev/null and properly restores the original state. File descriptor management and resource cleanup are handled correctly.
64-64: LGTM! The backend parameter addition with Literal["event", "cupti"] provides clear type safety for backend selection.
126-153: LGTM! The event backend preserves the original timing logic with CUDA events. The implementation correctly handles gradient clearing, L2 cache flushing, and quantile computation.
tilelang/profiler/__init__.py (3)
3-3: LGTM! The Literal import addition supports the new backend and return_mode parameter typing.
226-229: LGTM! The new parameter additions (backend, grad_to_none, quantiles, return_mode) are properly typed and align with the updated do_bench signature in bench.py.
252-262: LGTM! Past review feedback addressed. All parameters (grad_to_none, quantiles, backend, return_mode) are now properly forwarded to the underlying do_bench call, resolving the previous review concern about missing parameter propagation.
tilelang/profiler/__init__.py (Outdated)
    n_repeat: int = 1,
    input_tensors: List[torch.Tensor] = None,
    backend: Literal["event", "cupti"] = "event",
    grad_to_none: Optional[List[torch.Tensor]] = None,
Do we really need the grad_to_none parameter?
tilelang/profiler/bench.py (Outdated)
        return getattr(torch, return_mode)(times).item()

    elif backend == "cupti":
        using_nsys = int(os.environ.get('DG_NSYS_PROFILING', 0))
DG means DeepGEMM; it would be better to remove this env and always use using_nsys = False.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tilelang/profiler/bench.py (4 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
tilelang/profiler/bench.py
175-175: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: format-check
tilelang/profiler/bench.py (Outdated)
    if backend == "event":
        start_event = [torch.cuda.Event(enable_timing=True) for i in range(n_repeat)]
        end_event = [torch.cuda.Event(enable_timing=True) for i in range(n_repeat)]
        for i in range(n_repeat):
            # we don't want `fn` to accumulate gradient values
            # if it contains a backward pass. So we clear the
            # provided gradients
            if grad_to_none is not None:
                for x in grad_to_none:
                    x.grad = None
            # we clear the L2 cache before each run
            cache.zero_()
            # record time of `fn`
            start_event[i].record()
            fn()
            end_event[i].record()
        # Record clocks
        torch.cuda.synchronize()
        times = torch.tensor(
            [s.elapsed_time(e) for s, e in zip(start_event, end_event)],
            dtype=torch.float,
        )
        if quantiles is not None:
            ret = torch.quantile(times, torch.tensor(quantiles, dtype=torch.float)).tolist()
            if len(ret) == 1:
                ret = ret[0]
            return ret
        return getattr(torch, return_mode)(times).item()

    elif backend == "cupti":
        with suppress_stdout_stderr():
            schedule = torch.profiler.schedule(
                wait=1, warmup=0, active=1, repeat=1)
            profiler = torch.profiler.profile(
                activities=[torch.profiler.ProfilerActivity.CUDA],
                schedule=schedule)
            with profiler:
                for _i in range(2):
                    for _ in range(n_repeat):
                        cache.zero_()
                        fn()

                    profiler.step()

        # Return average kernel times
        total_cuda_time = 0.0
        excluded_time = 0.0
        excluded_kernels = "at::native::vectorized_elementwise"
        for e in profiler.key_averages():
            total_cuda_time += e.self_device_time_total
            if excluded_kernels in e.key:
                excluded_time += e.self_device_time_total

        kernel_time = (total_cuda_time - excluded_time) / n_repeat

        return kernel_time * 1e-3
Honor grad/aggregation options for the CUPTI path.
When backend="cupti", we drop the gradient reset (Line 124), quantile handling (Line 139), and return_mode aggregation (Line 144). Any caller supplying grad_to_none or return_mode="min"/"max"/"median" still gets the CUPTI branch, but those settings are silently ignored, so backward benches will accumulate grads and the returned statistic is always the mean. This diverges from the API contract established by the event backend and will yield wrong results for existing call sites. Please either implement the equivalent handling in the CUPTI branch or reject unsupported combinations up front (e.g. raise with a clear message when grad_to_none/quantiles/non-mean return modes are requested).
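A minimal up-front guard along those lines, sketched as a hypothetical helper. Parameter names follow the do_bench signature quoted earlier; the grad_to_none check is an addition beyond what later comments in this thread sketch.

```python
def _validate_cupti_options(grad_to_none, quantiles, return_mode) -> None:
    # Hypothetical guard, called before entering the CUPTI branch, so that
    # unsupported options fail loudly instead of being silently ignored.
    if grad_to_none is not None:
        raise ValueError("grad_to_none is not honored by the cupti backend")
    if quantiles is not None:
        raise ValueError("quantiles are not supported with backend='cupti'")
    if return_mode != "mean":
        raise ValueError(f"cupti backend currently reports only the mean, got '{return_mode}'")
```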
Actionable comments posted: 1
🧹 Nitpick comments (2)
tilelang/profiler/bench.py (2)
155-155: Consider making the excluded kernel configurable. The hardcoded exclusion of "at::native::vectorized_elementwise" kernels should be documented or made into a parameter. Users may want to:
- Include these kernels in their measurements
- Exclude different kernels
Option 1: Document why this specific kernel is excluded:
-    excluded_kernels = "at::native::vectorized_elementwise"
+    # Exclude vectorized_elementwise as it's an overhead kernel not part of the measured computation
+    excluded_kernels = "at::native::vectorized_elementwise"
Option 2: Add a parameter for excluded kernels:
def do_bench(
    fn: Callable,
    ...
    backend: Literal["event", "cupti"] = "event",
    excluded_kernels: Optional[List[str]] = None,
    ...
):
166-166: Minor: Consider extracting the error message. Static analysis suggests avoiding long messages inside exception constructors (TRY003). While this is a minor style issue, you could extract it to a constant if the message grows.
+UNKNOWN_BACKEND_ERROR = "Unknown profiler backend: {backend}. Supported backends: 'event', 'cupti'"
 else:
-    raise ValueError(f"Unknown profiler backend: {backend}")
+    raise ValueError(UNKNOWN_BACKEND_ERROR.format(backend=backend))
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tilelang/profiler/__init__.py (3 hunks)
- tilelang/profiler/bench.py (4 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
tilelang/profiler/bench.py
166-166: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (2)
tilelang/profiler/bench.py (1)
9-43: LGTM on the suppression utility. The suppress_stdout_stderr context manager correctly duplicates file descriptors, redirects stdout/stderr to /dev/null, and restores them on exit. This is a clean approach for silencing profiler output.
tilelang/profiler/__init__.py (1)
226-260: LGTM! Parameters are correctly propagated. The new backend, quantiles, and return_mode parameters are properly:
- Added to the function signature with appropriate types
- Propagated to the underlying do_bench call
- Aligned with the implementation in bench.py
The parameter forwarding is complete and correct.
tilelang/profiler/bench.py (Outdated)
    elif backend == "cupti":
        with suppress_stdout_stderr():
            schedule = torch.profiler.schedule(wait=1, warmup=0, active=1, repeat=1)
            profiler = torch.profiler.profile(
                activities=[torch.profiler.ProfilerActivity.CUDA], schedule=schedule)
            with profiler:
                for _i in range(2):
                    for _ in range(n_repeat):
                        cache.zero_()
                        fn()

                    profiler.step()

        # Return average kernel times
        total_cuda_time = 0.0
        excluded_time = 0.0
        excluded_kernels = "at::native::vectorized_elementwise"
        for e in profiler.key_averages():
            total_cuda_time += e.self_device_time_total
            if excluded_kernels in e.key:
                excluded_time += e.self_device_time_total

        kernel_time = (total_cuda_time - excluded_time) / n_repeat

        return kernel_time * 1e-3
Critical: Fix unit conversion and honor ignored parameters in the CUPTI backend.
The CUPTI backend has several critical issues:
- Unit mismatch (Line 163): Returns kernel_time * 1e-3 (converts microseconds to milliseconds), but the event backend returns milliseconds directly. This creates an inconsistency where the CUPTI backend returns values 1000× smaller than expected.
- Ignored quantiles parameter: Callers requesting specific quantiles will get the mean instead, breaking the API contract.
- Ignored return_mode parameter: Callers requesting "min", "max", or "median" will always get "mean", again breaking the contract.
- Undocumented loop (Line 145): The outer loop running twice corresponds to the profiler schedule's wait and active phases but lacks explanation.
Fix the unit conversion:
-        return kernel_time * 1e-3
+        return kernel_time * 1e-3  # Convert microseconds to milliseconds
Wait, let me recalculate: self_device_time_total is in microseconds. The event backend returns milliseconds via elapsed_time(end). So kernel_time (in µs) should be divided by 1000, not multiplied:
-        return kernel_time * 1e-3
+        return kernel_time / 1000  # Convert microseconds to milliseconds
Honor quantiles and return_mode or reject them up front:
Option 1: Validate and reject unsupported combinations at the start of the function:
if backend == "cupti":
    if quantiles is not None:
        raise ValueError("quantiles parameter is not supported with cupti backend")
    if return_mode != "mean":
        raise ValueError(f"cupti backend only supports return_mode='mean', got '{return_mode}'")Option 2: Implement support by collecting per-iteration times in the CUPTI path (more complex but provides feature parity).
Document the profiler schedule loop:
         with profiler:
+            # Run twice: once for profiler wait phase, once for active measurement
             for _i in range(2):
Based on learnings: The reviewer asked if do_bench_kineto and do_bench_event could be combined into a single API. This PR achieves that by adding a backend parameter to do_bench.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️  Outside diff range comments (1)
tilelang/profiler/bench.py (1)
57-129: API contract violation: the CUPTI backend silently ignores the quantiles and return_mode parameters. When backend="cupti" is selected, the function accepts quantiles and return_mode but silently ignores them (Line 126 calls _bench_with_cupti, which doesn't accept these parameters). This breaks the API contract:
- Callers requesting quantiles=[0.5, 0.95] will get a single mean value
- Callers requesting return_mode="min" will get the mean instead
This creates silent behavioral divergence from the event backend.
Solution 1 (recommended): Validate parameters up front and reject unsupported combinations:
    assert return_mode in ["min", "max", "mean", "median"], \
        f"Invalid return_mode: {return_mode}"
+
+    # Validate CUPTI backend constraints
+    if backend == "cupti":
+        if quantiles is not None:
+            raise ValueError("quantiles parameter is not supported with backend='cupti'")
+        if return_mode != "mean":
+            raise ValueError(f"backend='cupti' only supports return_mode='mean', got '{return_mode}'")
_bench_with_cuptiby collecting per-iteration times and computing quantiles/aggregations.Based on learnings: This addresses the past review comment about honoring quantiles and return_mode for the CUPTI path.
🧹 Nitpick comments (1)
tilelang/profiler/bench.py (1)
180-184: Document the profiler schedule loop. The outer loop running twice corresponds to the profiler's wait (1) and active (1) phases from the schedule, but this isn't documented.
Add a comment:
     with profiler:
+        # Loop twice: once for wait phase, once for active measurement phase
         for _ in range(2):
             for _ in range(n_repeat):
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tilelang/profiler/bench.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/profiler/bench.py (1)
tilelang/profiler/__init__.py (1)
do_bench(218-281)
🪛 Ruff (0.13.3)
tilelang/profiler/bench.py
128-128: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-test-amd
- GitHub Check: bot-task
🔇 Additional comments (3)
tilelang/profiler/bench.py (3)
10-54: LGTM! Clean implementation of stdout/stderr suppression. The context manager correctly:
- Saves original file descriptors using os.dup
- Redirects streams to /dev/null
- Restores everything in __exit__
This is a standard pattern for fd redirection.
131-163: LGTM! Solid CUDA events implementation. The function correctly:
- Creates timing events per iteration
- Clears L2 cache between runs
- Synchronizes before collecting measurements
- Handles both quantile and aggregation modes
189-194: Make the excluded kernel pattern configurable and use regex (see the sketch after this list)
- Replace the hardcoded "at::native::vectorized_elementwise" with a user-provided list of regex patterns (e.g. via a CLI flag --exclude-kernel-regex, defaulting to r"vectorized_elementwise")
- Match only on the substring vectorized_elementwise (or other future names) rather than the full namespace to tolerate PyTorch renames
- Optionally document how to verify the regex against profiler.key_averages() output across PyTorch releases
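A sketch of the regex-based filtering described above. The exclude_patterns parameter and helper name are hypothetical; key_averages() and self_device_time_total match the accessors already used in bench.py.

```python
import re
from typing import Sequence

def aggregate_kernel_time_us(prof, n_repeat: int,
                             exclude_patterns: Sequence[str] = (r"vectorized_elementwise",)) -> float:
    # Sum device time over all recorded kernels, skipping any whose name
    # matches one of the exclusion regexes.
    total_us = 0.0
    for e in prof.key_averages():
        if any(re.search(p, e.key) for p in exclude_patterns):
            continue
        total_us += e.self_device_time_total
    return total_us / n_repeat
```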
Some benchmarks for the two profilers (screenshots attached in the original comment).
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- testing/python/profiler/test_tilelang_profiler.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
testing/python/profiler/test_tilelang_profiler.py (7)
src/tl_templates/cuda/reduce.h (1)
T(75-147)
tilelang/jit/__init__.py (1)
jit(240-313)
tilelang/jit/kernel.py (2)
out_idx(446-447)
get_profiler(360-376)
tilelang/language/allocate.py (2)
alloc_shared(21-36)
alloc_fragment(53-64)
tilelang/language/fill.py (1)
clear(24-48)
tilelang/language/pipeline.py (1)
Pipelined(9-46)
tilelang/language/copy.py (1)
copy(84-152)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-test-amd
- GitHub Check: bot-task
    a = torch.randn(1024, 1024).cuda().half()
    b = torch.randn(1024, 1024).cuda().half()

    c = kernel(a, b)
    ref_c = a @ b
    torch.testing.assert_close(c, ref_c, rtol=1e-2, atol=1e-2)

    # benchmark
    profiler = kernel.get_profiler()

    # use cupti backend
    cupti_latency = profiler.do_bench(backend="cupti")

    # use event backend
    event_latency = profiler.do_bench(backend="event")
    print(f"cupti Latency: {cupti_latency}ms")
Guard the test when CUDA is unavailable.
torch.randn(...).cuda() will raise immediately on CPU-only hosts, so CI without GPUs will fail before exercising the profiler. Add a CUDA-availability check (e.g., skip via pytest.skip / pytest.mark.skipif) before moving tensors to CUDA.
Apply this diff (pytest is commonly available in the test suite):
+    import pytest
+    if not torch.cuda.is_available():
+        pytest.skip("Profiler test requires CUDA.")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested change (resulting code):
    import pytest
    if not torch.cuda.is_available():
        pytest.skip("Profiler test requires CUDA.")

    a = torch.randn(1024, 1024).cuda().half()
    b = torch.randn(1024, 1024).cuda().half()

    c = kernel(a, b)
    ref_c = a @ b
    torch.testing.assert_close(c, ref_c, rtol=1e-2, atol=1e-2)

    # benchmark
    profiler = kernel.get_profiler()

    # use cupti backend
    cupti_latency = profiler.do_bench(backend="cupti")

    # use event backend
    event_latency = profiler.do_bench(backend="event")
    print(f"cupti Latency: {cupti_latency}ms")
🤖 Prompt for AI Agents
testing/python/profiler/test_tilelang_profiler.py lines 35-50: the test
unconditionally calls .cuda() which fails on CPU-only hosts; add a guard at the
top of this block that checks torch.cuda.is_available() and skips the test when
False (use pytest.skip or @pytest.mark.skipif) before creating tensors or
calling kernel.get_profiler(), so no tensors are moved to CUDA on CI without
GPUs.
Summary by CodeRabbit
New Features
Tests
Documentation