where $w$ is a word over $A$ of length $m$ and $count(s_i, w)$ is the number of occurrences of the substring $w$ in $s_i$, counted with overlaps.
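As a concrete reference, counting occurrences with overlaps can be sketched in Python (the function name `count` simply mirrors the notation above):

```python
def count(s: str, w: str) -> int:
    """Number of occurrences of w in s, overlaps included."""
    total, start = 0, s.find(w)
    while start != -1:
        total += 1
        start = s.find(w, start + 1)  # advance by one so overlapping matches are found
    return total
```

For example, `count("aaaa", "aa")` is 3, because the three occurrences overlap.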
Then the Gram Kernel Matrix is defined as

$$G^m_{ij} = K_m(s_i, s_j)$$
This gets computationally very expensive as $m$ grows, since the search space of all length-$m$ words over $A$ has $|A|^m$ elements, so a naive algorithm quickly exhausts compute resources.
This repository hosts an algorithm for computing the exact matrix $G^m_{ij}$ in far fewer steps; crucially, its compute requirements do not grow exponentially as $m$ increases.
It makes use of the fact that, for larger $m$, the probability that a given word $w$ satisfies $count(s_i, w) > 0$ is roughly at most $len(s_i)/|A|^m$, which shrinks rapidly as $m$ increases.
Secondly, it is clear that if $count(s, w) = 0$, then for any $a \in A$ we also have $count(s, w+a) = 0$: a word cannot occur if one of its prefixes does not. Thirdly, $G$ is symmetric, so computing only the upper triangular form is sufficient to recover all of $G$.
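The first observation can be made concrete with a quick back-of-envelope calculation; the alphabet size 33 and string length 100 below are illustrative values, not taken from the repository:

```python
# A string of length L has L - m + 1 windows of length m. Under a uniform
# model, a fixed word of length m matches a given window with probability
# |A|**-m, so the expected count of that word is:
def expected_count(L: int, A: int, m: int) -> float:
    return (L - m + 1) / A**m

for m in range(2, 7):
    print(m, expected_count(100, 33, m))
```

The expected count (and hence the chance of a non-zero count) drops by a factor of $|A|$ with every unit increase in $m$, which is exactly what makes aggressive pruning worthwhile.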
Using these details, the algorithm can perform depth first search for every $s_i \in S$ for every individual character $a \in A$. That is, it creates the set
$$X_{i}^{a} = \lbrace (w, count(s_i, w))\ |\ count(s_i, w) > 0,\ w \text{ starts with } a,\ |w| = m \rbrace$$
With DFS, words are built up starting from $a$ to words of length $m$. If at any point $count(s_i, w') = 0$ for $|w'| < m$, then all words with $w'$ as a prefix are ignored.
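The pruned depth-first search can be sketched in Python as follows (a minimal sketch; the function names are illustrative, not the repository's API):

```python
def count(s: str, w: str) -> int:
    """Number of occurrences of w in s, overlaps included."""
    total, start = 0, s.find(w)
    while start != -1:
        total += 1
        start = s.find(w, start + 1)
    return total

def depth_first_search(s: str, a: str, alphabet: list, m: int) -> list:
    """All (w, count(s, w)) with |w| = m, w starting with a and count > 0."""
    results, stack = [], [a]
    while stack:
        w = stack.pop()
        c = count(s, w)
        if c == 0:
            continue  # prune: no extension of w can occur in s either
        if len(w) == m:
            results.append((w, c))
        else:
            stack.extend(w + b for b in alphabet)  # grow w by one character
    return results
```

For example, on $s = $ `"abab"` with $m = 2$, the branch starting at `"a"` prunes `"aa"` immediately and returns only `("ab", 2)`.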
While iterating through the input strings and building the $X$ sets above, the algorithm also maintains a lookup map $V$ of visited length-$m$ substrings $w$ for which a non-zero count was found, storing both the string index and the count, i.e.
```python
for i, s in enumerate(input_strings):
    for a in alphabet:
        x_ia = depth_first_search(s, a, alphabet, m)
        for (word, countS) in x_ia:
            lookup_cache[word].append((i, countS))  # record in V before the lookup
            for (j, countJ) in lookup_cache[word]:
                K[i, j] += countS * countJ
```
where `x_ia` is $X_{i}^{a}$, `lookup_cache` is $V$, and `K` is the upper triangular form of $G$.
Finally $G = K + K^T - Diag(K)$.
This works by taking advantage of the fact that $K$ is triangular: it only ever needs to compute $K[i, j]$ where $j \le i$. Because of the ordering of the outer loop, the lookup cache $V$ is guaranteed to hold every non-zero $count(s_j, w)$ with $j < i$ by the time $s_i$ is processed, so it contains all the information required at each outer iteration.
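Putting the pieces together, the whole procedure can be sketched as a minimal, unoptimized Python version (helper and function names are illustrative, not the repository's API):

```python
from collections import defaultdict
import numpy as np

def count(s, w):
    """Occurrences of w in s, overlaps included."""
    total, start = 0, s.find(w)
    while start != -1:
        total += 1
        start = s.find(w, start + 1)
    return total

def depth_first_search(s, a, alphabet, m):
    """All (w, count(s, w)) with |w| = m, w starting with a and count > 0."""
    results, stack = [], [a]
    while stack:
        w = stack.pop()
        c = count(s, w)
        if c == 0:
            continue  # prune the whole subtree rooted at w
        if len(w) == m:
            results.append((w, c))
        else:
            stack.extend(w + b for b in alphabet)
    return results

def gram_matrix(input_strings, alphabet, m):
    n = len(input_strings)
    K = np.zeros((n, n))                  # triangular accumulator
    lookup_cache = defaultdict(list)      # V: word -> [(index, count), ...]
    for i, s in enumerate(input_strings):
        for a in alphabet:
            for word, count_i in depth_first_search(s, a, alphabet, m):
                lookup_cache[word].append((i, count_i))
                for j, count_j in lookup_cache[word]:
                    K[i, j] += count_i * count_j
    return K + K.T - np.diag(np.diag(K))  # G = K + K^T - Diag(K)
```

For example, with `input_strings = ["abab", "ab"]`, alphabet `["a", "b"]` and $m = 2$, this yields `[[5, 2], [2, 1]]`: $count(\texttt{abab}, \texttt{ab}) = 2$ and $count(\texttt{abab}, \texttt{ba}) = 1$, so $G_{00} = 2^2 + 1^2 = 5$ and $G_{01} = 2 \cdot 1 = 2$.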
Other implementations
A less complicated algorithm uses matrix multiplication instead. Since $G$ is symmetric positive semi-definite, it can be written as $G = XX^T$, where $X_{ij} = count(s_i, w_j)$. A similar depth-first search can be used to find each entry, which also makes this algorithm very quick.
The problem with this approach is that, as a matrix, $X$ can be enormous: it has size $(|S|, |A|^m)$, so the memory requirements quickly become an obstacle even though $G$ itself is only of size $(|S|, |S|)$. Sparse matrices help, but there is still a point where the matrix multiplication is no longer feasible because $|A|^m$ is simply too large; most libraries need to be able to address an index range of length $|A|^m$, which gets extremely large.
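One way to push the matrix-multiplication variant further is to assign column indices only to words that actually occur, so $X$ never needs $|A|^m$ columns explicitly. A hedged sketch using SciPy's sparse matrices (helper names are illustrative, and this reindexing trick is an assumption on my part, not necessarily what python_matrix.py does):

```python
import numpy as np
from scipy.sparse import csr_matrix

def count(s, w):
    total, start = 0, s.find(w)
    while start != -1:
        total += 1
        start = s.find(w, start + 1)
    return total

def depth_first_search(s, a, alphabet, m):
    results, stack = [], [a]
    while stack:
        w = stack.pop()
        c = count(s, w)
        if c == 0:
            continue
        if len(w) == m:
            results.append((w, c))
        else:
            stack.extend(w + b for b in alphabet)
    return results

def gram_via_sparse(input_strings, alphabet, m):
    rows, cols, data = [], [], []
    col_of = {}  # dense column id only for words that actually occur
    for i, s in enumerate(input_strings):
        for a in alphabet:
            for word, c in depth_first_search(s, a, alphabet, m):
                j = col_of.setdefault(word, len(col_of))
                rows.append(i)
                cols.append(j)
                data.append(c)
    n_cols = max(len(col_of), 1)
    X = csr_matrix((data, (rows, cols)), shape=(len(input_strings), n_cols))
    return (X @ X.T).toarray()
```

The number of columns is then bounded by the number of distinct occurring words rather than $|A|^m$, though the intermediate triple lists can still grow large.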
This issue is what led to the algorithm in this repository.
Implementation
The algorithm detailed above is implemented in src/kernelmatrix.cpp in highly (un)optimized C++, with Python bindings provided as well.
Benchmarks
In bench/ there are two Python implementations: python_dfs.py is the same algorithm explained above, just in Python; python_matrix.py is a more traditional algorithm using matrix multiplication - it uses sparse matrices and also uses DFS to find the non-zero entries of the matrix $X_{ij} = count(s_i, w_j)$.
Using a dataset of 4198 molecules (found in bench.words.txt), with an alphabet of 33 characters, the FSGM algorithm was compared with the Python implementation of FSGM and the standard matrix multiplication method. This was carried out on an Intel Core i7 CPU (4 cores) with 8GB of RAM.
For values of $m$ larger than 5, the matrix multiplication method requested too much memory to complete, whereas both the C++ and Python FSGM implementations could still compute the Kernel Matrix.
The C++ implementation of FSGM was faster than both other algorithms; the bar chart below shows the compute times for increasing values of $m$.
Dependencies:
- The header files for Eigen must be present in the include path.
- pybind11 must be installed in order to create the Python wrapper function.
Then in the same directory as the created .so file, it should be possible to open up python3 and import the function.
```
$ python3
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from fsgm import compute_kernel_matrix
```
The Python function accepts two lists of strings (the inputs and the alphabet) and an integer $m$, and returns a numpy.ndarray.