Implements various non-exact searching and matching features

* docs: update Changelog
* docs: document the new options
* docs: add new settings to example config
* docs: remove short-hands
* docs: document the new settings
* docs: document the new command-line options in the man-page
* ci: mark missing coverage on optional library checks
  This code is in fact covered by the `no-optionals` CI run but this is not picked up by the report of the `coverage` job.
* refactor: extract ListCommand._RESERVED_FIELDS
* fix: remove order dependence for search command tests
* refactor: reorder test functions again
  Due to the configuration reset, this order is relevant!
* test: --fuzziness in the list command
* test: --decode-latex and --decode-unicode in the list command
* refactor: reorder unittest methods
* refactor: new argument short-hands
  - replaces `-f` with `-z` as the short-hand for `--fuzziness`
    - the idea here is, that `-f` is more likely to come in handy in the future (think of `formatting` or `file`-related arguments)
  - removes the short-hands for `--(no-)decode-latex` and `--(no-)decode-unicode` in the `list` command
    - I think these will be less commonly used (compared to the `search` command, where they are more relevant) and this avoids conflicts with `-l` already taken up by `--limit`
* feat: expose non-exact filter matching via list command
* test: unittest the new Entry.matches arguments
* feat: extend non-exact matching to Entry.matches
* refactor: make extra Entry.search arguments keyword-only
* test: more timeout exception handling in the ISBNParser tests
* meta: properly test optional dependencies in CI
* [wip] fix: add optional dependency into tox
  The unittests should run at least once without it installed. There must also be a better way of linking to the optional dependencies listed in the pyproject.toml.
* feat: basic fuzzy searching
  This is achieved via an alternate `regex` package which is a new optional dependency of coBib.
* Lint
* feat: permit LaTeX decoding during search
* feat: permit Unicode decoding during search
* refactor: inline internal Entry._search method
  Turns out, we don't need to re-use the code for the file grep highlights because we don't want to post-process them any further since grep already returns them in chunks with the correct context.
* fix: re-enable query highlight for file matches
* refactor: support multiple Span inside Match
* refactor: loop merging
* fix: mypy
* refactor: move match module to cobib.utils
* fix: Entry.search unittests
* fix: search command unittests
* refactor: extract internal regex searching method
* refactor: extract Match.stylize
* refactor: track spans from re.Match objects during search
  This refactors the handling of search results inside of `Entry.search`. In the near future, I plan to add the `regex` library as an optional dependency to support fuzzy regex matching. This will result in the current word highlighting to fail. In fact, the current approach already fails to highlight properly for regex searches. Instead, in this new approach, we avoid multiple repetitions of identical regex searches and, instead, parse the matches from the first search to extract all the relevant spanning data we may need.
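
The span-tracking idea from the last refactor above can be sketched with the standard-library `re` module. This is only an illustration under stated assumptions, not the code in this commit: `Span`, `find_spans` and `stylize` are hypothetical names, and the bracket markers stand in for whatever highlighting the real `Match.stylize` applies. The point it shows is that the regex runs once, and highlighting later reuses the recorded offsets instead of searching again.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    """Half-open character range [start, end) of one match (hypothetical helper)."""

    start: int
    end: int


def find_spans(query: str, text: str, *, ignore_case: bool = False) -> list[Span]:
    """Run the regex once and record where each match starts and ends."""
    flags = re.IGNORECASE if ignore_case else 0
    return [Span(*m.span()) for m in re.finditer(query, text, flags)]


def stylize(text: str, spans: list[Span], pre: str = "[", post: str = "]") -> str:
    """Wrap every recorded span in markers without searching the text again."""
    pieces = []
    cursor = 0
    for span in spans:
        pieces.append(text[cursor:span.start])  # untouched text before the match
        pieces.append(pre + text[span.start:span.end] + post)
        cursor = span.end
    pieces.append(text[cursor:])  # remainder after the last match
    return "".join(pieces)


line = "Colour and color are both matched."
spans = find_spans(r"colou?r", line, ignore_case=True)
print(stylize(line, spans))
```

Because the offsets come from the match objects themselves, the same approach should keep working when the pattern is a real regex rather than a literal word. `ignore_case` is keyword-only here, mirroring the keyword-only change described in the commit message.
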
mrossinek · May 25, 2024 · 6596170
1 parent 2229497
commit 6596170
Showing 16 changed files with 1,165 additions and 136 deletions.
diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
@@ -60,6 +60,18 @@ test:
         reports:
             junit: tests/report-py$PYTHON_VERSION.xml
 
+no-optionals:
+    stage: test
+    script:
+        - tox -e no-optionals
+    artifacts:
+        when: always
+        expire_in: 30 days
+        paths:
+            - tests/report-no-optionals.xml
+        reports:
+            junit: tests/report-no-optionals.xml
+
 plugin:
     stage: test
     script:

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,6 +6,29 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Added
+- non-exact (or fuzzy) filter matching and search functionality (#107,#130,!177)
+  - the `list` and `search` commands now support the following features to
+    perform non-exact filter matching and searching, respectively:
+    - LaTeX sequences can be decoded to Unicode characters:
+      - using `--decode-latex` from the command-line
+      - setting `config.commands.list_.decode_latex = True`
+      - setting `config.commands.search.decode_latex = True`
+    - Unicode characters can be converted to a close ASCII equivalent:
+      - using `--decode-unicode` from the command-line
+      - setting `config.commands.list_.decode_unicode = True`
+      - setting `config.commands.search.decode_unicode = True`
+    - a number of fuzzy errors can be set (this requires the optional dependency
+      [`regex`](https://pypi.org/project/regex/) to be installed):
+      - using `--fuzziness <int>` from the command-line
+      - setting `config.commands.list_.fuzziness` to some integer
+      - setting `config.commands.search.fuzziness` to some integer
+- (DEV) the following method arguments have been converted to be accepted only
+  as keyword arguments:
+  - in `cobib.database.Entry.matches`: `ignore_case`
+  - in `cobib.database.Entry.search`: `context`, `ignore_case`, and `skip_files`
+- (DEV) the return-type of `cobib.database.Entry.search` has been changed
+
 
 ## [5.0.1] - 2024-05-01
 

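The changelog entry above describes converting Unicode characters to a close ASCII equivalent before matching. A rough standard-library approximation of that behaviour is sketched below. It is an assumption for illustration only: `to_ascii` and `matches` are hypothetical helpers, and NFKD decomposition is not necessarily what coBib uses internally. Characters without a decomposition (for example `ß`) are simply dropped by this sketch.

```python
import re
import unicodedata


def to_ascii(text: str) -> str:
    """Strip diacritics by decomposing characters and dropping non-ASCII parts."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def matches(
    query: str,
    value: str,
    *,
    ignore_case: bool = False,
    decode_unicode: bool = False,
) -> bool:
    """Return True if the query regex is found in the (optionally decoded) value."""
    if decode_unicode:
        value = to_ascii(value)
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(query, value, flags) is not None


author = "Schrödinger, Erwin"
print(matches("Schrodinger", author))                       # exact matching
print(matches("Schrodinger", author, decode_unicode=True))  # non-exact matching
```

With decoding switched on, a plain-ASCII query can match an entry whose field contains accented characters, which is the use case the `--decode-unicode` option targets.
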
diff --git a/cobib.1 b/cobib.1
@@ -409,6 +409,39 @@ Makes the entry matching case-sensitive.
 This takes precedence over the \fIconfig.commands.list_.ignore_case\fR setting.
 .PP
 .in +8n
+.BR \-\-decode\-latex
+.in +4n
+Makes the entry matching decode all LaTeX sequences.
+This takes precedence over the \fIconfig.commands.list_.decode_latex\fR setting.
+.PP
+.in +8n
+.BR \-\-no\-decode\-latex
+.in +4n
+Makes the entry matching preserve all LaTeX sequences.
+This takes precedence over the \fIconfig.commands.list_.decode_latex\fR setting.
+.PP
+.in +8n
+.BR \-\-decode\-unicode
+.in +4n
+Makes the entry matching decode all Unicode characters.
+This takes precedence over the \fIconfig.commands.list_.decode_unicode\fR
+setting.
+.PP
+.in +8n
+.BR \-\-no\-decode\-unicode
+.in +4n
+Makes the entry matching preserve all Unicode characters.
+This takes precedence over the \fIconfig.commands.list_.decode_unicode\fR
+setting.
+.PP
+.in +8n
+.BR \-z ", " \-\-fuzziness " " \fI<int>\fI
+.in +4n
+Specifies how many fuzzy errors to allow during entry matching.
+The default value is 0 but can be configured via
+\fIconfig.commands.list_.fuzziness\fR.
+.PP
+.in +8n
 .BR \-x ", " \-\-or
 .in +4n
 Concatenate the filters using logical \fIOR\fR rather than the default
@@ -438,7 +471,42 @@ This takes precedence over the \fIconfig.commands.search.ignore_case\fR setting.
 .BR \-I ", " \-\-no\-ignore\-case
 .in +4n
 Makes the search case-sensitive.
-This takes precedence over the \fIconfig.commands.list_.ignore_case\fR setting.
+This takes precedence over the \fIconfig.commands.search.ignore_case\fR setting.
+.PP
+.in +8n
+.BR \-l ", " \-\-decode\-latex
+.in +4n
+Makes the search decode all LaTeX sequences.
+This takes precedence over the \fIconfig.commands.search.decode_latex\fR
+setting.
+.PP
+.in +8n
+.BR \-L ", " \-\-no\-decode\-latex
+.in +4n
+Makes the search preserve all LaTeX sequences.
+This takes precedence over the \fIconfig.commands.search.decode_latex\fR
+setting.
+.PP
+.in +8n
+.BR \-u ", " \-\-decode\-unicode
+.in +4n
+Makes the search decode all Unicode characters.
+This takes precedence over the \fIconfig.commands.search.decode_unicode\fR
+setting.
+.PP
+.in +8n
+.BR \-U ", " \-\-no\-decode\-unicode
+.in +4n
+Makes the search preserve all Unicode characters.
+This takes precedence over the \fIconfig.commands.search.decode_unicode\fR
+setting.
+.PP
+.in +8n
+.BR \-z ", " \-\-fuzziness " " \fI<int>\fI
+.in +4n
+Specifies how many fuzzy errors to allow during search.
+The default value is 0 but can be configured via
+\fIconfig.commands.search.fuzziness\fR.
 .PP
 .in +8n
 .BR \-\-skip\-files
@@ -746,6 +814,16 @@ Specifies the default columns displayed during the \fIlist\fR command.
 .IR config.commands.list_.ignore_case = False
 Specifies whether filter matching should be performed case-insensitive.
 .TP
+.IR config.commands.list_.decode_unicode = False
+Specifies whether filter matching should decode all Unicode characters.
+.TP
+.IR config.commands.list_.decode_latex = False
+Specifies whether filter matching should decode all LaTeX sequences.
+.TP
+.IR config.commands.list_.fuzziness = 0
+Specifies the amount of fuzzy errors to allow for filter matching. Using this
+feature requires the optional \fIregex\fR dependency to be installed.
+.TP
 .IR config.commands.modify.preserve_files = False
 Specifies whether associated files should be preserved during renaming.
 .TP
@@ -769,6 +847,16 @@ Allows the specification of additional arguments for the \fIgrep\fR command.
 .IR config.commands.search.ignore_case = False
 This boolean setting indicates whether search defaults to be case-insensitive.
 .TP
+.IR config.commands.search.decode_unicode = False
+Specifies whether searches should decode all Unicode characters.
+.TP
+.IR config.commands.search.decode_latex = False
+Specifies whether searches should decode all LaTeX sequences.
+.TP
+.IR config.commands.search.fuzziness = 0
+Specifies the amount of fuzzy errors to allow for searches. Using this feature
+requires the optional \fIregex\fR dependency to be installed.
+.TP
 .IR config.commands.show.encode_latex = True
 This boolean setting indicates whether non-ASCII characters should be encoded
 using LaTeX sequences during rendering via the \fIshow\fR command.

diff --git a/dev-requirements.txt b/dev-requirements.txt
@@ -10,3 +10,4 @@ ruff==0.4.5
 typos==1.21.0
 types-beautifulsoup4==4.12.0.20240511
 types-requests==2.32.0.20240523
+types-regex==2024.4.28.20240430
diff --git a/pyproject.toml b/pyproject.toml
@@ -64,6 +64,10 @@ yaml = "cobib.parsers.yaml:YAMLParser"
 [project.scripts]
 cobib = "cobib.__main__:_main"
 
+[project.optional-dependencies]
+all = ["cobib[fuzzy]"]
+fuzzy = ["regex"]
+
 [project.urls]
 Homepage = "https://gitlab.com/cobib/cobib"
 Documentation = "https://cobib.gitlab.io/cobib/cobib.html"