Having a bunch of strings, can I print some substrings which appear K times ? Can I know which is longest substrings ?
The project is built as PyPI package: https://pypi.org/project/commonstrings/ . Code for PyPI package is moved to: https://github.com/phamthivan2996/commonstrings
Make sure Cython
is installed properly !
Building from source:
cd python
python setup.py install
from common_multiple_strings import PyCommon_multiple_strings
tree = PyCommon_multiple_strings() # init
Build from list of str:
tree.from_strings(list_str)
or build from file:
tree.from_path(<path_to_file>)
It is noted that sentences in file are presentened line-by-line. Each line is a sentence.
Sample code:
>>> from common_multiple_strings import PyCommon_multiple_strings
>>> tree = PyCommon_multiple_strings() # init
>>> list_str = [
'abc',
'abcxa',
'xamnb',
'yamnc',
'abcd'
]
>>> tree.from_strings(list_str)
This library introduces 4 types of query:
tree.query(times=TIMES)
Ouput is a dictionary with key is an integer K, value is a list of some substrings which appear exactly K times.
Sample code:
>>> print(tree.query(times=(2, None)))
{2: ['amn', 'xa', 'n', 'mn'], 3: ['bc', 'abc'], 4: ['c', 'b'], 5: ['a']}
>>> print(tree.query(times=(3, 3)))
{3: ['bc', 'abc']}
tree.length_longest_substring(times=TIMES)
Ouput is an integer.
Sample code:
>>> print(tree.length_longest_substring(times=(2, None)))
3
tree.lengths_longest_substrings(times=TIMES)
Ouput is a dictionary with key is an integer K, value is the length of the longest substring which appear exactly K times.
Sample code:
>>> print(tree.lengths_longest_substrings(times=(2, None)))
{2: 3, 3: 3, 4: 1, 5: 1}
tree.filter_substrings_by_length(length_input=L, times=TIMES)
Ouput is a dictionary with key is an integer K, value is a list of some substrings which appear exactly K times and have length of L.
Sample code:
>>> print(tree.filter_substrings_by_length(length_input=2, times=(2, None)))
{2: ['xa', 'mn'], 3: ['bc']}
>>> print(tree.filter_substrings_by_length(length_input=10, times=(2, None)))
{}
times
, default (None, None)
is a tuple represents the minimum and maximum appearances of the desired output.
Query substrings appear exactly N times, then times=(N, N)
.
Query substrings appear more than N times, then times=(N, None)
.
Query substrings appear less than N times, then times=(None, N)
.
Query substrings appear less than N times and more than M times, then times=(M, N)
.
length_input
is an integer, represents the length of substrings
- This library accepts various utf8 characters including punctuations, numbers, upper-case characters, ... For more details, checkout alphabet.cpp
file. Thanks coccoc-tokenizer for providing these available lists
Data structure suffix tree is the core of this library. It encourages fast query and efficient storing.
If the total length of all input strings are L
, then the average complexity should be L * log(L)
.
Algorithm from the lecture of String Algorithms and Algrorithms in Computational Biology - Gusfield