Starred repositories

Quantized Attention on GPU

Python 16 Updated Nov 1, 2024

A Python-embedded modeling language for convex optimization problems.

C++ 5,437 1,066 Updated Nov 1, 2024

MLIR For Beginners tutorial

C++ 804 66 Updated Sep 30, 2024

The most Obsidian-native PDF annotation, viewing & editing tool ever. Comes with optional Vim keybindings.

TypeScript 788 15 Updated Oct 22, 2024

DLRover: An Automatic Distributed Deep Learning System

Python 1,262 163 Updated Nov 1, 2024

Next-Token Prediction is All You Need

Python 1,741 64 Updated Oct 24, 2024

Organize Your GitHub Stars With Ease

PHP 3,213 143 Updated Aug 11, 2024
Cuda 32 13 Updated May 21, 2021

Vulkan/CUDA/HIP/OpenCL/Level Zero/Metal Fast Fourier Transform library

C++ 1,540 92 Updated Sep 29, 2024

compilerbook

48 28 Updated Apr 25, 2021

Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference.

C++ 22 1 Updated Oct 12, 2024

A minimal, header-only modern C++ library for terminal goodies 💄✨

C++ 1,498 143 Updated Jul 23, 2024

FlashInfer: Kernel Library for LLM Serving

Cuda 1,375 126 Updated Nov 2, 2024

Examples of CUDA implementations by Cutlass CuTe

Makefile 80 12 Updated Oct 31, 2024
C++ 282 29 Updated Oct 29, 2024

Examples demonstrating available options to program multiple GPUs in a single node or a cluster

Cuda 551 110 Updated Oct 30, 2024

Quantized Attention that achieves speedups of 2.1x and 2.7x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.

Python 351 14 Updated Oct 24, 2024

Standalone Flash Attention v2 kernel without libtorch dependency

C++ 97 13 Updated Sep 10, 2024

Bloaty: a size profiler for binaries

C++ 4,769 345 Updated Oct 1, 2024

Puzzles for learning Triton

Jupyter Notebook 1,055 72 Updated Sep 25, 2024

my cs notes

Jupyter Notebook 26 2 Updated Oct 14, 2024

PyTorch bindings for CUTLASS grouped GEMM.

Cuda 49 38 Updated Oct 31, 2024

An acceleration library that supports arbitrary bit-width combinatorial quantization operations

C++ 221 23 Updated Sep 30, 2024

FlagGems is an operator library for large language models implemented in Triton Language.

Python 320 33 Updated Nov 2, 2024

A fast communication-overlapping library for tensor parallelism on GPUs.

C++ 211 16 Updated Oct 30, 2024

Open deep learning compiler stack for Kendryte AI accelerators ✨

C# 748 181 Updated Nov 1, 2024

Fast inference from large language models via speculative decoding

Python 553 57 Updated Aug 22, 2024

Tile primitives for speedy kernels

Cuda 1,610 62 Updated Nov 1, 2024