Build demos of AI models running on the edge. We're looking for creative use of limited hardware resources, whether that's running on a Raspberry Pi or a phone or a laptop. Bonus points for going off the beaten path and working on architectures that are not natively supported on existing frameworks.
We do not provide devices for this challenge.
Here are some resources to get started:
and some interesting smaller open-source models to build demos with and deploy:
- Diffusion (linear attention): Sana
- Language models: Llama 3.2 (your safe bet)
- Language models: StripedHyena (convolution-attention hybrid)
- Distilled reasoning: DeepSeek R1-Distill
- Audio: WhisperX
- Audio: Zonos (available both as a Transformer and as a linear attention-attention hybrid)
- Protein language models: ESM 2
- Diffusion models: Stable Diffusion
- Genome generation: Evo
Talk to us for more ideas.
This challenge tests your ability to think cross-functionally across numerics and hardware. You will be optimizing some core computational primitives used in the design space of Liquid AI models, as described in the opening presentation.
Throughout this challenge, Executorch will be your friend. We will be using Executorch to transform PyTorch and / or C++ kernels into a representation that can be run on our target device: Samsung Galaxy S24 Ultra.
It could be useful to keep the following resources open as you get started:
Any optimization is allowed as long as (a) the results are numerically correct (absolute tolerance of 1e-3 on random inputs) and (b) the custom operator can be run via our profiling script directly on device (one of the Samsung Galaxy S24 Ultra phones provided by the team).
See below for more details on how submissions will be scored.
Our build system supports both Linux and MacOS (no Windows!).
The dependencies for this repository can be split into the following groups. Don't run anything yet; just read.
- The git submodules
- It is mainly the executorch repository + its submodules.
make setup-submodules
pulls the git submodules but does not attempt to sync or update them. Patches are applied to these repositories, which makes automated management complicated. If you see a problem with a submodule, delete it and re-run that target.
- Build tools
make check-tools
verifies that the basic required tools are present and suggests how to obtain them.
- Python dependencies
make setup-python-deps
installs the Python dependencies declared for this repo, including PyTorch but excluding Executorch. The purpose is to have a minimal set of tools for compilation-only needs.
- Dependencies live in `requirements.txt` and `requirements-torch.txt`.
- Executorch as a python dependency
make setup-executorch
installs Executorch and its Python dependencies. This calls `install_executorch.sh` (see the official documentation). You may need to update your compiler versions and fix your paths.
- If you make changes to the Executorch repository (`./vendor/executorch`), you will need to re-run this step to install them.
- The Executorch repo can't be installed as editable.
With your preferred virtual environment active (e.g., with conda: `conda create -n treehacks python=3.11`), install the dependencies and build the project.
make setup
If `make setup-submodules` fails for you, another option is to clone executorch yourself and move it to `vendor/executorch` before calling `make setup` again.
To customize your Executorch build, refer to this page. Our setup step automatically installs Executorch with the XNNPACK backend enabled (which could be useful to you). If you don't need the backend, you can also install the wheels with pip.
The pte is a representation of your code that can be ingested by a runner. We provide pre-compiled pte runner binaries for Android (`runner/android-arm64-v8a`) and macOS (`runner/macos-arm64`). You will need these to profile your code (check example usage in `quickstart.py`). These are stored using `git lfs`.
We also provide pte runner binaries for Linux (`runner/linux`). If you are on a different OS and need to use the pte runner locally, or if you wish to customize it to your needs, you can recompile the Executorch runner following the official instructions.
We provide a simple mathematical description of the primitives under consideration (`references/math_ref.pdf`). There are multiple approaches to improve your implementation, including purely algorithmic tweaks (e.g., switching from FFT to direct convolution, or Winograd methods) and hardware-specific optimizations (e.g., minimizing overheads, using different data types). Convolutions and recurrences are a cornerstone of numerical computing and have been implemented and optimized over the years on many different platforms. You can find excellent resources online to get ideas.
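To make the FFT-versus-direct trade-off concrete, here is a minimal NumPy sketch (an illustrative toy, not our reference implementation) showing that both approaches compute the same causal convolution; the crossover point between them depends on filter length and hardware:

```python
import numpy as np

def direct_conv(x, h):
    """Causal 1D convolution by definition: y[t] = sum_k h[k] * x[t - k]."""
    y = np.zeros(len(x))
    for t in range(len(x)):
        for k in range(len(h)):
            if t - k >= 0:
                y[t] += h[k] * x[t - k]
    return y

def fft_conv(x, h):
    """Same convolution via FFT; pad to full length to avoid circular wrap-around."""
    n = len(x) + len(h) - 1
    y = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)
    return y[: len(x)]  # keep only the causal part

rng = np.random.default_rng(0)
x, h = rng.standard_normal(1024), rng.standard_normal(32)
assert np.allclose(direct_conv(x, h), fft_conv(x, h), atol=1e-3)
```

The direct form costs O(L·K) while the FFT form costs O(L log L), so short filters often favor the direct (or blocked) variant on CPU.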
See `examples/fftconv` for an end-to-end example of how to register a custom operator with Executorch and produce the corresponding pte file. To run the notebook, run `python setup.py install` in the ops folder. To export, you will need to adapt your operator to the "out" convention (see the Executorch kernel registration docs, linked above).
An example of convolution is provided under `examples/direct_conv`. A convolution implementation using FFT and custom kernel registration is provided under `examples/fftconv`.
Some examples using linear scan algorithms are provided under `examples/linear_scan`.
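As a reminder of what a linear scan computes, here is a minimal NumPy sketch of the first-order recurrence y[t] = a[t]·y[t-1] + x[t] (an illustrative toy, not the code in `examples/linear_scan`):

```python
import numpy as np

def linear_scan(a, x):
    """Sequential first-order linear recurrence: y[t] = a[t] * y[t-1] + x[t], y[-1] = 0."""
    y = np.zeros_like(x)
    carry = 0.0
    for t in range(len(x)):
        carry = a[t] * carry + x[t]
        y[t] = carry
    return y

a = np.array([0.5, 0.5, 0.5])
x = np.array([1.0, 1.0, 1.0])
print(linear_scan(a, x))  # values: 1.0, 1.5, 1.75
```

Because the recurrence is associative in a suitable operator, it can also be evaluated with a parallel (Blelloch-style) scan, which is one of the algorithmic directions worth exploring.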
You can gain a lot of insight by measuring the latency and memory usage of your implementation locally (on your laptop, using CPUs). If you want accurate numbers, we also provide phones.
To measure latency on your machine (assuming you have the right pte runner!), use `PYTHONPATH=profiling PTE_RUNNER_PATH=runner/macos-arm64/pte_runner python quickstart.py`. You should see something like this:
ProfilingResults(raw=[0.035606, 0.001095, 0.000502, 0.00056], p10=0.0005193999999999999, p50=0.0008275, p90=0.025252700000000003, min=0.000502, avg=0.00944075, max=0.035606)
These are the latency measurements.
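The summary statistics can be reproduced from `raw`; a quick sanity check (assuming the percentiles use NumPy's default linear interpolation, which matches the sample output above):

```python
import numpy as np

# Raw latencies from the sample ProfilingResults (units as reported by the script).
raw = [0.035606, 0.001095, 0.000502, 0.00056]

p10, p50, p90 = np.percentile(raw, [10, 50, 90])
assert abs(p10 - 0.0005194) < 1e-9   # matches p10 in the sample output
assert abs(p50 - 0.0008275) < 1e-9   # matches p50
assert abs(p90 - 0.0252527) < 1e-9   # matches p90
assert abs(np.mean(raw) - 0.00944075) < 1e-9  # matches avg
```

Note that the first measurement is much slower than the rest (warm-up), which is why median and low percentiles are more informative than the average.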
`adb` is required to run on the phone. See the docs and follow the steps. The phones already have USB debugging enabled.
To measure on the phone, connect your machine to one of the provided Samsung devices and run `PYTHONPATH=profiling python pte_android.py --pte your_pte_file.pte --runner_path runner/android-arm64-v8a/pte_runner`
The objective is to minimize latency across pre-specified input shapes, with random inputs (we will measure and average over 100 random inputs, 10 runs, for each setting).
Target input shapes (batch size, channels, sequence length) for both recurrences and convolutions:
- 1, 512, 64
- 1, 512, 256
- 1, 512, 1024
- 1, 512, 2048
- 1, 512, 4096
- 1, 512, 16384
- 1, 512, 65536
- 1, 2048, 64
- 1, 2048, 256
- 1, 2048, 1024
- 1, 2048, 2048
- 1, 2048, 4096
- 1, 2048, 16384
- 1, 2048, 65536
In addition, for convolutions we will measure filter sizes 4, 32, 128, and as long as the input sequence (max 65536), resulting in 4 times the number of measurements (we will still average across filter-size settings). You can provide a single pte that accepts dynamic shapes, or a different pte for each shape, allowing you to optimize for the best performance.
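For planning, the full convolution measurement grid can be enumerated as follows (a sketch based on the shapes listed above; the official harness may differ in details):

```python
# Target shapes: (batch, channels, seq_len) plus four filter-size settings per shape.
batch = 1
channels = [512, 2048]
seq_lens = [64, 256, 1024, 2048, 4096, 16384, 65536]
filter_sizes = [4, 32, 128, "seq_len"]  # "seq_len" = filter as long as the input (max 65536)

configs = []
for c in channels:
    for L in seq_lens:
        for k in filter_sizes:
            k_eff = L if k == "seq_len" else k
            configs.append((batch, c, L, k_eff))

print(len(configs))  # 14 shapes x 4 filter sizes = 56 convolution settings
```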
We will check numerical correctness with an absolute tolerance of 1e-3 on both random and structured inputs (e.g., a tensor of ones).
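A local pre-check along these lines can catch correctness issues before you submit (a sketch: `my_op` and `reference_op` are placeholders for your optimized kernel and a trusted reference, not functions provided by the repo):

```python
import numpy as np

def reference_op(x):
    # Placeholder reference implementation: swap in e.g. a direct convolution.
    return np.cumsum(x, axis=-1)

def my_op(x):
    # Placeholder optimized kernel: swap in your implementation.
    return np.cumsum(x, axis=-1)

rng = np.random.default_rng(0)
test_inputs = [
    rng.standard_normal((1, 512, 64)),  # random input, one of the target shapes
    np.ones((1, 512, 64)),              # structured input, as used in grading
]
for x in test_inputs:
    assert np.allclose(my_op(x), reference_op(x), atol=1e-3)
```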
These are generally resolved by making sure `gcc` is up-to-date, or by switching to a different toolchain, for example `llvm` on arm64.
OSError: [Errno 8] Exec format error
This is generally due to a mismatch between the pte runner and the architecture of the machine you are running on. To measure latency locally, you can compile and use Executorch's native runner.