BallTree KDE is a performance-oriented implementation of kernel density estimation.
While predicting probabilities on a large number of entries,
I realized the classical implementation does not scale well with the number of points.
This implementation takes advantage of the sparsity of the distribution to address this problem.
This implementation uses the BallTree structure to 'fuzzy index' the distribution with a custom metric, clustering nearby points together.
This reduces the number of queries and weights the result by the size of each cluster.
This approximation is valid only if the points are dense enough relative to the kernel bandwidth.
It is especially suited to prediction on sparse distributions.
Only the Gaussian kernel is implemented at the moment.
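For reference, here is a minimal sketch of the classical Gaussian KDE that the approximation is compared against. The function name and shapes are illustrative, not the project's actual API:

```python
import numpy as np

def naive_gaussian_kde(points, queries, bandwidth):
    """Classical 1D KDE: every query is evaluated against every point, O(n*m)."""
    # Pairwise squared distances between queries and distribution points.
    d2 = (queries[:, None] - points[None, :]) ** 2
    # Normalized 1D Gaussian kernel, averaged over all points.
    k = np.exp(-0.5 * d2 / bandwidth**2) / (bandwidth * np.sqrt(2 * np.pi))
    return k.mean(axis=1)

points = np.random.default_rng(0).normal(size=1000)
density = naive_gaussian_kde(points, np.array([0.0, 1.0]), bandwidth=0.2)
```

Every query pays the full cost of all n points, which is the scaling problem the clustering approximation addresses.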
Performing a KDE is equivalent to convolving the distribution with the kernel.
When two points of the distribution are close enough relative to the kernel bandwidth,
we can compute a single weighted kernel and assume the result is almost equal to the sum of the two kernels.
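This claim can be checked numerically. The sketch below (Gaussian kernel, illustrative values) replaces two kernels centered on nearby points with one kernel of weight 2 at their midpoint, and measures the absolute error:

```python
import numpy as np

def gaussian(x, mu, h):
    return np.exp(-0.5 * ((x - mu) / h) ** 2) / (h * np.sqrt(2 * np.pi))

h = 1.0                       # kernel bandwidth
x = np.linspace(-5, 5, 1001)  # evaluation grid
p1, p2 = -0.05, 0.05          # two points close relative to h

exact = gaussian(x, p1, h) + gaussian(x, p2, h)  # sum of the 2 kernels
approx = 2 * gaussian(x, (p1 + p2) / 2, h)       # one weighted kernel
max_abs_error = np.max(np.abs(exact - approx))
```

With a spacing of 0.1 against a bandwidth of 1.0, the error stays well below 1% of the peak density; it grows quickly as the spacing approaches the bandwidth.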
- Green: exact curve
- Blue: approximated curve
- Red: absolute error
Note: this is a 1D example; in N dimensions, the closeness of points is evaluated with an ellipse equation.
This allows a different bandwidth on each dimension.
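A minimal sketch of such an ellipse test (function name and tolerance are illustrative): two offsets with the same Euclidean length are judged differently once the distance is scaled per dimension.

```python
import numpy as np

def are_close(a, b, bandwidths, tol=1.0):
    """Ellipse test: sum_i ((a_i - b_i) / h_i)^2 <= tol^2.
    With equal bandwidths this reduces to a scaled Euclidean ball."""
    return np.sum(((a - b) / bandwidths) ** 2) <= tol**2

h = np.array([1.0, 0.1])  # loose bandwidth on dim 1, tight on dim 2
a = np.array([0.0, 0.0])

close1 = are_close(a, np.array([0.5, 0.0]), h)  # True: inside the ellipse
close2 = are_close(a, np.array([0.0, 0.5]), h)  # False: dim 2 bandwidth is tight
```

Both candidate points lie 0.5 away from `a` in Euclidean distance, but only the offset along the loose-bandwidth dimension counts as close.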
This also holds for the aggregation of N points; in that case, the kernel is weighted by N.
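Putting the two ideas together, here is a hedged end-to-end sketch in 1D (the greedy `cluster_1d` helper is a stand-in for the BallTree fuzzy indexing, not the project's code): nearby points are merged into weighted clusters, and the KDE is evaluated with one kernel per cluster instead of one per point.

```python
import numpy as np

def gaussian(x, mu, h):
    return np.exp(-0.5 * ((x - mu) / h) ** 2) / (h * np.sqrt(2 * np.pi))

def cluster_1d(points, h, tol=0.1):
    """Greedy aggregation: a point within tol*h of the current cluster
    center joins that cluster; otherwise it starts a new one."""
    centers, counts = [], []
    for p in np.sort(points):
        if centers and abs(p - centers[-1]) <= tol * h:
            counts[-1] += 1
        else:
            centers.append(p)
            counts.append(1)
    return np.array(centers), np.array(counts)

rng = np.random.default_rng(0)
points = rng.normal(size=2000)
h = 0.5
x = np.linspace(-3, 3, 301)

# Exact KDE: one kernel per point.
exact = sum(gaussian(x, p, h) for p in points) / len(points)
# Approximate KDE: one kernel per cluster, weighted by cluster size N.
centers, counts = cluster_1d(points, h)
approx = sum(n * gaussian(x, c, h) for c, n in zip(centers, counts)) / len(points)
max_abs_error = np.max(np.abs(exact - approx))
```

The number of kernel evaluations drops from the number of points to the number of clusters, while the error stays small as long as each cluster's radius is small relative to the bandwidth.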
As always, the tradeoff we are making here depends on the nature of the data,
but a few concentrated clusters make this approximation suitable.
Depending on the precision needed, it is also possible to tweak the metric so it aggregates smaller clusters.
At that cost, performance drops by a factor of ~20 (more details in the demo notebook).
This experiment highlights the versatility of BallTree, which enables many applications through custom metrics. (PS: it is awesome!)