---
layout: review
title: "KAN: Kolmogorov–Arnold Networks"
tags: interpretable-ai neural-scaling-laws
author: " Nathan Hutin"
cite:
authors: "Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y. Hou, Max Tegmark"
title: "KAN: Kolmogorov–Arnold Networks"
venue: "ICLR 2025"
pdf: "https://arxiv.org/pdf/2404.19756"
---

## Highlights

* Kolmogorov–Arnold Networks (KANs) are proposed as a **promising alternative to Multi-Layer Perceptrons (MLPs)**.
* Every weight parameter in a KAN is replaced by a **univariate function** parameterized as a spline, meaning activation functions are **learnable** and placed on the edges ("weights") instead of fixed on the nodes ("neurons").
* KANs offer **superior interpretability** compared to MLPs, particularly for small-scale AI + Science tasks, facilitating the (re)discovery of mathematical and physical laws.
* They demonstrate **better accuracy** and possess **faster neural scaling laws** (NSLs) than MLPs.
* The corresponding code is available on the official [GitHub repository](https://github.com/KindXiaoming/pykan).

# Introduction

## Introduction and MLP Reminder

### Multi-Layer Perceptrons (MLPs)
- Positive points:
  - MLPs are the fundamental building blocks of most modern deep learning models, and their expressive power is guaranteed by the [**Universal Approximation Theorem**](https://en.wikipedia.org/wiki/Universal_approximation_theorem)
- Negative points:
  - less interpretable
  - MLPs must be retrained from scratch if their architecture turns out not to be suited to the dataset

### Kolmogorov-Arnold Network
- Positive points:
  - based on the [**Kolmogorov-Arnold representation theorem**](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Arnold_representation_theorem): any multivariate continuous function $$f$$ can be written as a finite composition of continuous univariate functions and the binary operation of addition
  - the fineness of the spline grid can be changed after training (grid extension)
  - typically require a much smaller computational graph (fewer neurons and layers) than MLPs

## Motivation: Overcoming MLP and Spline Limitations

KANs are designed to integrate the best qualities of both splines and MLPs.

[**Splines**](https://fr.wikipedia.org/wiki/B-spline) are highly accurate for low-dimensional functions and offer local adjustability, but they suffer severely from the [**Curse of Dimensionality (COD)**](https://en.wikipedia.org/wiki/Curse_of_dimensionality) because they cannot exploit compositional structures. Conversely, **MLPs** are less affected by COD due to their feature learning capabilities, but they are often less accurate than splines when approximating simple univariate functions in low dimensions.

The standard Universal Approximation Theorem, which justifies MLPs, itself struggles with COD, suggesting that the number of required neurons can grow exponentially with input dimension $$d$$. The authors show that KANs combine MLPs on the exterior (to learn compositional structure) and splines on the interior (to accurately approximate univariate functions). Theoretically, KANs can **beat the COD** if the target function admits a smooth Kolmogorov-Arnold representation.

## B-Spline
A B-spline has 3 hyperparameters:
- $$n$$: the polynomial degree
- $$m+1$$: the number of knots $$(t_0, \dots, t_m)$$ with $$0 \leq t_0 \leq t_1 \leq \dots \leq t_m \leq 1$$ (called the grid in KAN)
- $$\mathbf{P}_i$$: the control points; there are $$m-n$$ of them

The B-spline is a curve $$\mathbf{S} : [0, 1] \to \mathbb{R}^d$$ defined by

$$\mathbf{S}(t) = \sum_{i=0}^{m-n-1} b_{i,n}(t) \, \mathbf{P}_i, \quad t \in [t_n, t_{m-n}]$$

The degree-$$n$$ basis functions $$b_{j,n}$$ are defined by the Cox-de Boor recurrence on the lower degrees:
$$b_{j,0}(t) := \begin{cases}
1 & \text{if } t_j \leq t < t_{j+1} \\
0 & \text{else}
\end{cases}
$$

$$b_{j,n}(t) := \frac{t - t_j}{t_{j+n} - t_j} b_{j,n-1}(t) + \frac{t_{j+n+1} - t}{t_{j+n+1} - t_{j+1}} b_{j+1,n-1}(t)$$

![](/collections/images/Kolmogorov-Arnold-Networks/BSpline1D_illustration.png)
<p style="text-align: center;font-style:italic">Figure 1. 1D B-spline illustration.</p>

[Small example of a 2D B-spline](https://www.bibmath.net/dico/index.php?action=affiche&quoi=./b/bspline.html)
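
To make the Cox-de Boor recurrence above concrete, here is a minimal NumPy sketch (not the paper's code) that evaluates the basis functions and the resulting curve; the knot vector and control points are arbitrary illustrative values.

```python
import numpy as np

def bspline_basis(j, n, t, knots):
    """Degree-n B-spline basis function b_{j,n}(t) via the Cox-de Boor recurrence."""
    if n == 0:
        return 1.0 if knots[j] <= t < knots[j + 1] else 0.0
    # A zero denominator means the corresponding term vanishes.
    d1 = knots[j + n] - knots[j]
    d2 = knots[j + n + 1] - knots[j + 1]
    term1 = 0.0 if d1 == 0 else (t - knots[j]) / d1 * bspline_basis(j, n - 1, t, knots)
    term2 = 0.0 if d2 == 0 else (knots[j + n + 1] - t) / d2 * bspline_basis(j + 1, n - 1, t, knots)
    return term1 + term2

def bspline_curve(t, control_points, knots, n):
    """S(t) = sum_i b_{i,n}(t) P_i, with m - n control points and m + 1 knots."""
    m = len(knots) - 1
    return sum(bspline_basis(i, n, t, knots) * p for i, p in enumerate(control_points[: m - n]))

# Example: cubic spline (n = 3) on a uniform grid of m + 1 = 8 knots -> m - n = 4 control points.
knots = np.linspace(0.0, 1.0, 8)
control_points = [0.0, 1.0, -0.5, 0.3]
print(bspline_curve(0.5, control_points, knots, n=3))  # t must lie in [t_n, t_{m-n}]
```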

# Kolmogorov–Arnold Networks (KAN)

The KAN architecture generalizes the original Kolmogorov–Arnold representation (a fixed depth-2, width-(2n+1) structure) to **arbitrary widths and depths**.

![](/collections/images/Kolmogorov-Arnold-Networks/mlp_vs_kan.png)
<p style="text-align: center;font-style:italic">Figure 2. Multi-Layer Perceptrons (MLPs) vs. Kolmogorov-Arnold Networks (KANs).</p>

### KAN Architecture

In a KAN, nodes perform a simple **summation of incoming signals** without applying any non-linearity. The activation of the $$j$$-th neuron in layer $$l+1$$, $$x_{l+1,j}$$, is defined as the sum of the post-activations of the univariate functions $$\phi_{l,j,i}$$ applied to the inputs $$x_{l,i}$$:
$$x_{l+1,j} = \sum_{i=1}^{n_l} \phi_{l,j,i}(x_{l,i})$$


Each activation function $$\phi(x)$$ is parameterized as a sum of a basis function $$b(x)$$ and a [B-spline](https://fr.wikipedia.org/wiki/B-spline) function:
$$\phi(x) = w_b b(x) + w_s \text{spline}(x)$$
where $$b(x)$$ is typically the SiLU function ($$b(x) = x / (1 + e^{-x})$$).
$$w_b$$, $$w_s$$, and the B-spline control points are the **parameters learned during training**.

$$
\mathbf{x}_{l+1} =
\underbrace{
\begin{pmatrix}
\phi_{l,1,1}(\cdot) & \phi_{l,1,2}(\cdot) & \cdots & \phi_{l,1,n_l}(\cdot) \\
\phi_{l,2,1}(\cdot) & \phi_{l,2,2}(\cdot) & \cdots & \phi_{l,2,n_l}(\cdot) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_{l,n_{l+1},1}(\cdot) & \phi_{l,n_{l+1},2}(\cdot) & \cdots & \phi_{l,n_{l+1},n_l}(\cdot)
\end{pmatrix}
}_{\Phi_l}
\,\,
\mathbf{x}_l
$$


$$
\text{KAN}(\mathbf{x}) = (\Phi_{L-1} \circ \Phi_{L-2} \circ \cdots \circ \Phi_1 \circ \Phi_0) \mathbf{x}.
$$
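
To connect these formulas to an implementation, here is a toy NumPy/SciPy sketch of a KAN layer's forward pass, assuming a fixed spline grid on $$[-1, 1]$$ and randomly initialized coefficients. It mirrors the equations above but is only an illustration, not the pykan implementation (where $$w_b$$, $$w_s$$ and the spline coefficients are trained by backpropagation).

```python
import numpy as np
from scipy.interpolate import BSpline

def silu(x):
    return x / (1.0 + np.exp(-x))

class ToyKANLayer:
    """Toy KAN layer: each edge (i -> j) carries phi_{l,j,i}(x) = w_b*silu(x) + w_s*spline(x);
    node j simply sums its incoming edge outputs (no extra non-linearity)."""

    def __init__(self, n_in, n_out, grid_size=5, k=3, x_range=(-1.0, 1.0), seed=0):
        rng = np.random.default_rng(seed)
        self.n_in, self.n_out, self.k = n_in, n_out, k
        # Grid of G+1 points, padded by k knots on each side (standard B-spline padding).
        grid = np.linspace(*x_range, grid_size + 1)
        h = grid[1] - grid[0]
        self.knots = np.concatenate([grid[0] - h * np.arange(k, 0, -1), grid,
                                     grid[-1] + h * np.arange(1, k + 1)])
        n_coef = len(self.knots) - k - 1                         # spline coefficients per edge
        self.coef = rng.normal(0.0, 0.1, (n_out, n_in, n_coef))  # learnable spline coefficients
        self.w_b = np.ones((n_out, n_in))                        # learnable residual weights
        self.w_s = np.ones((n_out, n_in))                        # learnable spline weights

    def forward(self, x):
        """x: array of shape (n_in,) -> array of shape (n_out,)."""
        out = np.zeros(self.n_out)
        for j in range(self.n_out):
            for i in range(self.n_in):
                spline = BSpline(self.knots, self.coef[j, i], self.k, extrapolate=True)
                out[j] += self.w_b[j, i] * silu(x[i]) + self.w_s[j, i] * spline(x[i])
        return out

# A [2, 5, 1] KAN as the composition of two layers: KAN(x) = (Phi_1 o Phi_0)(x).
layers = [ToyKANLayer(2, 5, seed=0), ToyKANLayer(5, 1, seed=1)]
x = np.array([0.3, -0.7])
for layer in layers:
    x = layer.forward(x)
print(x)
```

Each node only sums its incoming edge outputs, exactly as in the matrix of univariate functions $$\Phi_l$$ above.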




## Approximation Capabilities and Scaling Laws


**Theorem (Approximation theory, Kolmogorov-Arnold Theorem).**
Let $$\mathbf{x} = (x_1, x_2, \dots, x_n)$$.
Suppose that a function $$f(\mathbf{x})$$ admits a representation

$$f = (\Phi_{L-1} \circ \Phi_{L-2} \circ \cdots \circ \Phi_1 \circ \Phi_0) \mathbf{x}$$

as above, where each of the $$\Phi_{l,i,j}$$ is $$(k+1)$$-times continuously differentiable.
Then there exists a constant $$C$$ depending on $$f$$ and its representation, such that we have the following approximation bound in terms of the grid size $$G$$:
there exist $$k$$-th order B-spline functions $$\Phi_{l,i,j}^G$$ such that for any $$0 \leq m \leq k$$, we have the bound

$$
\| f - (\Phi_{L-1}^G \circ \Phi_{L-2}^G \circ \cdots \circ \Phi_1^G \circ \Phi_0^G) \mathbf{x} \|_{C^m} \leq C G^{-k-1+m} $$
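
For instance, for cubic splines ($$k = 3$$) and the sup norm ($$m = 0$$), the bound becomes

$$\| f - (\Phi_{L-1}^G \circ \cdots \circ \Phi_0^G) \mathbf{x} \|_{C^0} \leq C G^{-4},$$

i.e. a scaling exponent of $$k + 1 = 4$$, which matches the $$\text{loss} \propto N^{-4}$$ neural scaling law claimed by the authors (the number of parameters $$N$$ grows linearly with the grid size $$G$$).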


## Accuracy: Grid Extension
- Grid extension refines the spline grid from $$\{t_0, t_1, \dots, t_{G_1}\}$$ (augmented with $$k$$ boundary knots on each side to $$\{t_{-k}, \dots, t_{-1}, t_0, \dots, t_{G_1}, t_{G_1+1}, \dots, t_{G_1+k}\}$$) to a finer grid; the new spline coefficients are initialized by least squares so that the fine spline matches the coarse one (see the sketch after Figure 3)

- a KAN can therefore start training with few parameters and be extended to a finer grid later

- small KANs generalize better


![](/collections/images/Kolmogorov-Arnold-Networks/resultats.png)
<p style="text-align: center;font-style:italic">Figure 3. We can make KANs more accurate by grid extension (fine-graining spline grids). Top left (right):
training dynamics of a [2, 5, 1] ([2, 1, 1]) KAN. Both models display staircases in their loss curves, i.e., loss
suddenly drops then plateaus after grid extension. Bottom left: test RMSE follows scaling laws against grid
size G. Bottom right: training time scales favorably with grid size G.</p>
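
Below is a minimal SciPy sketch of grid extension (an illustration, not the pykan implementation): the coefficients of the finer spline are fitted by least squares so that it reproduces the coarse spline, after which training can continue with the extra degrees of freedom. Grid sizes, the input range and the coefficients are illustrative.

```python
import numpy as np
from scipy.interpolate import BSpline

def make_knots(G, k, x_range=(-1.0, 1.0)):
    """Uniform grid with G+1 points on x_range, padded by k knots on each side."""
    grid = np.linspace(*x_range, G + 1)
    h = grid[1] - grid[0]
    return np.concatenate([grid[0] - h * np.arange(k, 0, -1), grid,
                           grid[-1] + h * np.arange(1, k + 1)])

def extend_grid(old_knots, old_coef, G_new, k, n_samples=200):
    """Fit the coefficients of a finer spline so that it matches the coarse spline
    on the base interval (least squares), mimicking KAN grid extension."""
    new_knots = make_knots(G_new, k)
    xs = np.linspace(old_knots[k], old_knots[-k - 1], n_samples)
    y_old = BSpline(old_knots, old_coef, k)(xs)
    n_new = len(new_knots) - k - 1
    # Design matrix: column j = j-th basis function of the fine grid evaluated at xs.
    B = np.stack([BSpline(new_knots, np.eye(n_new)[j], k)(xs) for j in range(n_new)], axis=1)
    new_coef, *_ = np.linalg.lstsq(B, y_old, rcond=None)
    return new_knots, new_coef

# Example: extend a cubic spline (k = 3) from G = 5 to G = 20 intervals.
k = 3
old_knots = make_knots(5, k)
old_coef = np.random.default_rng(0).normal(size=len(old_knots) - k - 1)
new_knots, new_coef = extend_grid(old_knots, old_coef, 20, k)
xs = np.linspace(-1, 1, 11)
print(np.abs(BSpline(old_knots, old_coef, k)(xs) - BSpline(new_knots, new_coef, k)(xs)).max())
```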


## Simplifying KANs and Making them interactive


![](/collections/images/Kolmogorov-Arnold-Networks/simplification.png)
<p style="text-align: center;font-style:italic">Figure 4. An example of how to do symbolic regression with KAN.</p>

1. Visualize: inspect the magnitude of each activation function, $$\left| \phi \right|_{1} \equiv \frac{1}{N_p} \sum_{s=1}^{N_p} \left| \phi(x^{(s)}) \right|$$
2. Prune: remove the activation functions (and nodes) with low importance
3. Symbolify: if a learned activation resembles a known function $$f$$, replace it by the symbolic form $$y = c f(ax+b) + d$$ with fitted affine parameters $$(a, b, c, d)$$
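
Here is a small NumPy sketch of steps 1 and 2; the array names and shapes are assumptions for illustration. Edge importance is the mean absolute post-activation $$\left|\phi\right|_1$$, and a hidden node is flagged for pruning when both its largest incoming and largest outgoing edge importance fall below a threshold, roughly following the paper's criterion ($$\theta = 10^{-2}$$).

```python
import numpy as np

def edge_importance(acts):
    """L1 'importance' of each activation function: mean |phi(x^(s))| over the N_p samples
    (the |phi|_1 of step 1). acts has shape (..., N_p)."""
    return np.abs(acts).mean(axis=-1)

def prunable_nodes(act_in, act_out, threshold=1e-2):
    """act_in[j, i, s]:  post-activation of edge (input i -> hidden node j) on sample s.
       act_out[k, j, s]: post-activation of edge (hidden node j -> output k) on sample s.
       A hidden node is flagged when both its largest incoming and largest outgoing
       edge importance are below the threshold."""
    incoming = edge_importance(act_in).max(axis=1)    # shape (n_hidden,)
    outgoing = edge_importance(act_out).max(axis=0)   # shape (n_hidden,)
    return (incoming < threshold) & (outgoing < threshold)

# Toy example: 2 inputs, 5 hidden nodes, 1 output, 100 samples; node 3 is nearly dead.
rng = np.random.default_rng(0)
act_in = rng.normal(0.0, 1.0, (5, 2, 100)); act_in[3] *= 1e-4
act_out = rng.normal(0.0, 1.0, (1, 5, 100)); act_out[:, 3] *= 1e-4
print(prunable_nodes(act_in, act_out))   # only node 3 should be flagged
```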



# Discussion
![](/collections/images/Kolmogorov-Arnold-Networks/shouldIUseKan.png)
<p style="text-align: center;font-style:italic">Figure 5. Should I use KANs or MLPs?</p>


- B-splines are defined only on a bounded interval (e.g., $$[0, 1]$$). How do the authors handle inputs that fall outside this range?


[comment]: <> (parts not covered: KAN accuracy, 6 pages; KAN interpretability, 10 pages)