Refactor: Prerequisite for AutoScheduler #362

Merged
merged 57 commits into from
Dec 22, 2024

Conversation

hikettei
Owner

@hikettei hikettei commented Dec 20, 2024

  • CI: matmul gflops test
  • TODO: Create and plot a heatmap of tiling sizes. Tile sizes should be passed as a dynamic shape for compilation speed.
  • Create caten/codegen/auto-scheduler
  • Remove caten/polyhedral
  • create caten/codegen/auto-scheduler
  • Goal: 150 GFlops for gemm on M3 Pro + ArmNeon
  • Compute GFLOPs automatically with PROFILE=1
  • First-class support for OpenMP directives? Only the parallel directive should be used.
  • Implement Loop Collapse (2): fuse the tile loop and the outer loop (effective for batch_size=1)
  • Implement Vectorize (by Caten Layer)
  • Implement Unroll
  • Different data types have different optimal tile sizes
  • Loop Collapse by ISL? https://inria.hal.science/hal-01581081/document
  • New OP: VECTORIZE_EXPR
  • Move Configuration to caten/codegen/polyhedral
  • How to apply unrolling after vectorizing?
  • Dump more instructions into polyhedral space (e.g.: MAX)
  • The number of flops is computable from (getattr item :items) (just count the number of EXPR and multiply the space)
    • Create FLOPS Measurer (compute flop first)
  • Update Memory Planner
  • TODO_1, define a variable name followed by the unroll dims for scalar.
  • TODO_2, create a copy of expr
  • Metal: Use TensorCore
  • Simple Unrolling + Parallelize = 30~80GFlops on Clang without adding VECTORIZE
    • So is 150 GFlops doable without vectorize.lisp (relying only on GCC)?
    • Tiling is slow?
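
The FLOP-counting item above (count the EXPR operations in the loop body, then multiply by the size of the iteration space) can be sketched roughly as follows. This is only an illustration in Python, not Caten's code: `count_flops`, `to_gflops`, and the choice of 2 operations per GEMM iteration are assumptions made for the example.

```python
from math import prod
from typing import Sequence


def count_flops(ops_per_iteration: int, loop_extents: Sequence[int]) -> int:
    """FLOPs = arithmetic ops in the loop body * points in the iteration space."""
    return ops_per_iteration * prod(loop_extents)


def to_gflops(flops: int, seconds: float) -> float:
    """Convert a FLOP count and a measured wall-clock time into GFLOPS."""
    return flops / seconds / 1e9


# Hypothetical GEMM body: one multiply and one add per (i, j, k) point.
M = N = K = 512
gemm_flops = count_flops(2, [M, N, K])
```

Dividing `gemm_flops` by a measured kernel time with `to_gflops` gives the figure a PROFILE=1 style report would print; the FLOP count itself is known before the kernel runs.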

@hikettei
Owner Author

hikettei commented Dec 21, 2024

Note: is there any decent way to compute the remainder part of a piecewise schedule?
When parsing an AstFor marked as UNROLL, repeat and copy the body n_unroll_size times to get a valid unrolled blueprint. The inner loop always has n_unroll iterations if (mod loop_size n_unroll) == 0; otherwise I want to create a remainder part.
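
One common way to handle this, sketched here in Python rather than in Caten's blueprint representation, is to split the loop into a main part whose trip count is a multiple of n_unroll and a scalar remainder loop for the leftover iterations. The names `split_for_unroll` and `saxpy_unrolled_by_4` are hypothetical and exist only for this illustration.

```python
def split_for_unroll(loop_size: int, n_unroll: int):
    """Return (main_blocks, remainder_start, remainder_size) for an unrolled loop."""
    main_blocks = loop_size // n_unroll         # full unrolled blocks
    remainder_start = main_blocks * n_unroll    # first index not covered by a block
    return main_blocks, remainder_start, loop_size - remainder_start


def saxpy_unrolled_by_4(a, x, y):
    """Compute y + a * x with the loop body copied 4 times, plus a remainder loop."""
    n = len(x)
    out = list(y)
    main_blocks, remainder_start, _ = split_for_unroll(n, 4)
    # Main part: every block runs exactly 4 copies of the body.
    for block in range(main_blocks):
        i = block * 4
        out[i] += a * x[i]
        out[i + 1] += a * x[i + 1]
        out[i + 2] += a * x[i + 2]
        out[i + 3] += a * x[i + 3]
    # Remainder part: empty when n % 4 == 0.
    for i in range(remainder_start, n):
        out[i] += a * x[i]
    return out
```

When loop_size is divisible by n_unroll the remainder loop has zero iterations, so a code generator could drop it entirely in that case.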

@hikettei hikettei changed the title Draft: Auto Scheduler Draft: Auto Scheduler (Matmul > 100 GFlops on CPU!) Dec 21, 2024
@hikettei hikettei changed the title Draft: Auto Scheduler (Matmul > 100 GFlops on CPU!) Refactor: Prerequisite for AutoScheduler Dec 22, 2024
@hikettei hikettei marked this pull request as ready for review December 22, 2024 07:42
@hikettei hikettei merged commit fc36a21 into main Dec 22, 2024
8 checks passed