Implementation of an algorithm to quickly calculate string similarity Gram Kernel matrices.
Given a set of strings
Where
Then the Gram Kernel Matrix is defined as
This gets computationally very expensive as
for i1, s1 in enumerate(S):
for i2, s2 in enumerate(S):
for w in length_m_words:
kernel[i1, i2] += count(s1, w)*count(s2, w)
This repository hosts an alogirthm for computing the exact matrix
It makes use of the fact that for larger
Using these details, the algorithm can perform depth first search for every
While iterating through the input strings and fetching the
Now the algorithm can be described as follows:
for i, s in enumerate(input_strings):
for a in alphabet:
x_ia = depth_first_search(s, a, alphabet, m)
for (word, countS) in x_ia:
for (j, countJ) in lookup_cache[word]:
K[i, j] += countS*countJ
Where x_ia
is lookup_cache
is K
is the upper triangular form for
Finally
This works by taking advantage of the fact that
A less complicated algorithm makes use of matrix multiplication instead. Since
The problem with this approach, is that as a matrix
This issue is what has led to the algortihm in this repository.
The algorithm detailed above is implemented in src/kernelmatrix.cpp
in highly (un)optimized C++ with python bindings created as well.
In bench/
there are two python implementations, one in python_dfs.py
is the same algorithm explained above jus tin Python. The other in python_matrix.py
is a more traditional algorithm using matrix multiplication - it uses sparse matrices and also uses DFS to find non zero entries in the matrix
Using a dataset of 4198 molecules (found in bench.words.txt), with an alphabet of 33 characters, the FSGM algorithm was compared with the Python implementation of FSGM and the standard matrix multiplication method. This was carried out on an Intel Core i7 CPU (4 cores) with 8GB of RAM.
For values of
It requires the header files for Eigen to be present in the include path.
It also requires pybind11 to be installed in order to create the Python wrapper function.
It can be built locally using
c++ -O3 -Wall -shared -std=c++17 -fPIC $(python3 -m pybind11 --includes) src/*.cpp -o fsgm$(python3-config --extension-suffix)
Then in the same directory as the created .so
file, it should be possible to open up python3 and import the function.
$ python3
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from fsgm import compute_kernel_matrix
The python function accepts two lists of strings (inputs, and the alphabet) and an integer for numpy.ndarray
.