Experiments in diff algorithms

(currently only Levenshtein distance is computed, no edit scripts)

Here n denotes the total length of the inputs and d the Levenshtein distance between them.

Three algorithms are implemented:

the classic O(n^2) dynamic programming algorithm
Myers' O(nd) greedy search, this one is used in practice (e.g. in GNU diff)
An O(n+d^2) algorithm which improves on the previous one by skipping common substrings in constant time. This relies on the suffix and LCP arrays and an RMQ structure on the latter. The constant factors are terrible. This idea appears in (Myers, 1986) as well as (Landau and Vishkin, 1988).

See: An O(ND) Difference Algorithm and its Variations, Eugene W. Myers

Possible future work:

The naive LCP computation in the O(nd) algorithm could most likely be micro-optimized using techniques you would use for strcmp and the like: SIMD, rep instructions, unrolling, eliding bounds checks (sentinel characters at the end...).
Building the suffix array and subsequent structures takes a long time. We don't need all the suffixes: We check the first few common characters naively anyway, as it's faster than an RMQ. Therefore, we could proceed naively until we encounter two suffixes that are in the partial suffix array. Difference cover sampling should work.
UPDATE: See paper "Practical Performance of Space Efficient Data Structures for Longest Common Extensions".

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
data		data
libsais		libsais
scripts		scripts
src		src
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md
build.rs		build.rs

Provide feedback