Skip to content

Add search by metadata #284

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 32 commits into from
Closed

Add search by metadata #284

wants to merge 32 commits into from

Conversation

lunaro-4
Copy link

Context

Related to issue #272

Features / Changes

  • Major refactoring over core.library.Library.search_library method
  • Add method for parsing keys for file metadata from search query
  • Add support for negative query

Usage example

In a new version, specifying all tags with whitespace as usual will return result
from AND search type, regardless of one selected in UI:

tag1 tag2 - return all entries, tagged with 'tag1' AND 'tag2'

If you want to use OR, you need to specify OR search type in UI, and
provide tags, separated by '|' symbol

tag1 | tag2 - (with OR search type) return all entries, tagged with 'tag1' OR 'tag2'

tag1 tag2 | tag3 - (with OR search type) return all entries, tagged with 'tag1' and 'tag2' OR with 'tag3'

Search by non-tag metadata is also available by 'key: value' syntax:

description: desc - return all entries, which description contains 'desc'

tag1 ; description: desc - return all entries, tagged with 'tag1' AND which description contains 'desc'

description: desc | description: smth:

  • (with OR search type) return all entries, which description contains either 'desc' or 'smth'
  • (with AND search type) return all entries, which description contains both 'desc' and 'smth'

Negative query can be set by specifying lookup value as 'null':

tag1; description: null - return all entries, tagged with 'tag1' and not having a description (or having a null description)

Original negative query tags like empty or no author also supported:

no author tag1; description: desc - return all entries without author, tagged with 'tag1' and description containing 'desc'

Refactoring note

For ease of maintenance, many functions of original code were extracted to new methods in library.Library class.
To avoid clutter, I extracted them even further to a new class 'library.Filter'. This should help maintain the codebase
in future, but due to lack of tests, I decided to extract to 'Filter' only new methods, and import any required variables
and methods, to aviod breaking other functionality of 'library.Library' class:

class Filter:
    def __init__(self, lib: Library) -> None:
        # self.lib = lib
        self.ext_list = lib.ext_list
        self.entries = lib.entries
        self.is_exclude_list = lib.is_exclude_list
        self.get_field_attr = lib.get_field_attr
        self._tag_strings_to_id_map = lib._tag_strings_to_id_map
        self.get_tag_cluster = lib.get_tag_cluster
        self.library_dir = lib.library_dir
        self._field_id_to_name_map = lib._field_id_to_name_map
        self.get_field_obj = lib.get_field_obj
        self.missing_files = lib.missing_files

This class was added in last commit, so this can be reverted with ease.

Other possible side effects

Changed code works fine at a first glance, but it is not clear for me how exactly
original code worked on some edge cases, especially with coalitions.

I can write tests for it, but I would like to have test cases, for deeper understanding on how should it work

@lunaro-4 lunaro-4 changed the title Add search be metadata Add search by metadata Jun 13, 2024
@CyanVoxel CyanVoxel added Type: Enhancement New feature or request TagStudio: Library Relating to the TagStudio library system labels Jun 13, 2024
@lunaro-4 lunaro-4 marked this pull request as ready for review June 14, 2024 13:23
@lunaro-4 lunaro-4 closed this Jun 14, 2024
@lunaro-4 lunaro-4 reopened this Jun 14, 2024
@lunaro-4 lunaro-4 force-pushed the search branch 3 times, most recently from dc3c668 to 576278f Compare June 14, 2024 14:12
@lunaro-4
Copy link
Author

cherry-piked commit with tests
All github actions passed except for ruff, possibly because of files I didn't touch.
On my machine, I get no errors on files I edited:

# at TagStudio/tagstudio
> ruff check --config ../pyproject.toml src/core/library.py
All checks passed!
>  ruff check --config ../pyproject.toml tests/core/
All checks passed!

PR is ready for review

@CyanVoxel CyanVoxel added the Status: Review Needed A review of this is needed label Jun 14, 2024
@yedpodtrzitko
Copy link
Contributor

yedpodtrzitko commented Jun 15, 2024

@lunaro-4 can you please rebase your branch? There are some changes which are already merged in main branch, eg .github/workflows/pytest.yaml

also fix the formatting in your code, so the ruff workflow pass, thanks

@CyanVoxel CyanVoxel added Priority: Medium An issue that shouldn't be be saved for last Status: Changes Requested Changes are quested to this labels Jul 3, 2024
@samuellieberman
Copy link
Contributor

Words with capital letters in text fields cannot be searched.

If the description is Banker goes to the circus then searching description: goes to the circus returns the entry, but description: Banker does not. Only the capitalization in the original field matters, capitalization in the search still does not matter. Searching description: GOES to THE CiRcUs still returns the entry.

I assume this bug applies to the Title, Author, Artist, URL, Description, and Notes fields, though I did not check them all.

@samuellieberman
Copy link
Contributor

samuellieberman commented Jul 14, 2024

A ; no author search does not return an entry unless the entry has at least one tag, even though author: null and artist: null return the entry.

-description: foo does not return any entries no matter what I've tried with my entries. -description: foo; description: bar also does not appear to work.

@CyanVoxel CyanVoxel added this to the Alpha 9.5 milestone Jul 19, 2024
@CyanVoxel CyanVoxel removed the TagStudio: Library Relating to the TagStudio library system label Jul 20, 2024
@lunaro-4 lunaro-4 marked this pull request as draft July 21, 2024 12:01
lunaro-4 added 2 commits July 21, 2024 15:29
nitpicks; switch `lower()` to `casefold()` in `filter_results` method; simplify return in `SearchMode.AND` case; DRYed `preprocess_tag_terms` method;
@lunaro-4
Copy link
Author

this query still doesnt work as I'd expect. Am I doing something wrong?

#284 (review)

Sorry for not adressing this before.
You actually need to choose OR search type in GUI for it to work as you expect. Otherwise, if you leave it as AND, which is default value, you should recive about the same results as if you were using only ;

@lunaro-4
Copy link
Author

@samuellieberman

Thank you for pointing things out, will fix and add additional tests

lunaro-4 added a commit to lunaro-4/TagStudio that referenced this pull request Jul 26, 2024
lunaro-4 added a commit to lunaro-4/TagStudio that referenced this pull request Jul 26, 2024
@lunaro-4 lunaro-4 marked this pull request as ready for review July 27, 2024 18:02
@lunaro-4
Copy link
Author

Changes overview:

  • Multiple minor changes from code review by @yedpodtrzitko
  • New class KeyNameConstants for constant names, replaces hardcoded field names like "UNBOUND" and "EMPTY"
  • Fix incorrect behavior with multi-word special flags like no author
  • Fix negative prompt parsing

I also had to change search test from test_lib.py file, as --nomatch-- case worked incorrectly with negative prompt implementation.

Thank you for patience, will be waiting for new review.

@yedpodtrzitko yedpodtrzitko mentioned this pull request Aug 23, 2024
26 tasks
Copy link

@eivl eivl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is to complex and nested for me to understand what it does. Im happy to approve it once I have had time to understand this PR. @lunaro-4 if you want to explain it to me I can give it another shot later.

Comment on lines +49 to +50
# FILENAME_KEYNAME = "filename"
# TAG_ID_KEYNAME = "tag_id"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused attributes can be deleted.

@CyanVoxel CyanVoxel removed the Priority: Medium An issue that shouldn't be be saved for last label Sep 1, 2024
@lunaro-4
Copy link
Author

@eivl
The intention behind this PR is to give the user the ability to search entries by multiple custom fields. Current version of TagStudio only supports search by tags and filename, with this PR it is possible to filter entries by other fields, such as Description.

At first, there was only one search_libraty method, that did all the work. What I did was creating new method parse_metadata that parses arguments from search query, and extracting functionality from search_library into a separate class, called Filter. On initialization, this class copies all necessary information from Library. After that, Filter.filter_results could be called to get all entries, that fit the parsed search query.

Reading the query

Library.parse_metadata method splits the query into multiple parts, returning a list of dictionaries:

  • list element: query parts, which are separated by | symbol in query
  • dictionary element: query keys, provided inside query parts and separated by ; symbol, and it's value (if : symbol is provided)

Query keys that do not have values added to the 'unbound' dictionary key (name set in KeyNameConstants.UNBOUND_QUERY_ARGUMENTS_KEYNAME constant), and later treated as a whole query in original code.

negative prompt values are extracted from unbound values and put in dictionary key, specified in KeyNameConstants.EMPTY_FIELD_QUERY_KEYNAME. These values filter out entries, containing non-null values of specified query key.

Filtering the entries

Filter.filter_results method contains a nested loop, that iterates over the output from parse_metadata.

It iterates over list elements (for query_part in split_query), then each entry (for entry in self.entries), and then for each entry it iterates over contents of dictionary (for key, value in query_part.items())

If the entry does not break loop, it is considerate valid for query and is added to list filtered_entries.

The filtered_entries list is created for each part of the query, and then added to pre_results list.

The pre_results is a list of lists, and each element pre_results[n] is a representation of parse_metadata element, as if the whole query was just `parse_metadata()[n]'.

pre_results elements are then combined into a single set, depending on SearchMode, selected in GUI.

@lunaro-4
Copy link
Author

lunaro-4 commented Sep 14, 2024

I see a merge conflict, and will try to resolve it this weekend

I also would like to hear some tips to make code more readable. For me, this level of nesting is more or less comfortable and I do not know what to do with if statements in a loop at 2263 (for key, value in query_part.items():)

UPDATE: after further inspection, since the project moved to SQL, I will need a significant amount of work to adapt my changes to new structure, so I will not be able to finish the PR this weekend

@lunaro-4 lunaro-4 marked this pull request as draft September 15, 2024 16:50
@lunaro-4 lunaro-4 closed this Oct 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Changes Requested Changes are quested to this Type: Enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants