Santa Clara, California - https://yzhaiustc.github.io/
Stars
Fully open reproduction of DeepSeek-R1
Puzzles for learning Triton; play them with minimal environment configuration!
Development repository for the Triton language and compiler
A simple pip-installable Python tool to generate your own HTML citation world map from your Google Scholar ID.
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
A fast communication-overlapping library for tensor parallelism on GPUs.
You like pytorch? You like micrograd? You love tinygrad! ❤️
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
A retargetable MLIR-based machine learning compiler and runtime toolkit.
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
Standalone Flash Attention v2 kernel without libtorch dependency
Fast and memory-efficient exact attention
Making large AI models cheaper, faster and more accessible
Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceleration.
SLATE is a distributed, GPU-accelerated, dense linear algebra library targeting current and upcoming high-performance computing (HPC) systems. It is developed as part of the U.S. Department of Energy's Exascale Computing Project (ECP).
Source code for Twitter's Recommendation Algorithm
C++ and Python support for the CUDA Quantum programming model for heterogeneous quantum-classical workflows
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Code and documentation to train Stanford's Alpaca models, and generate the data.