Skip to content

Feature/boolean search syntax #310

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

samuellieberman
Copy link
Contributor

Adds Boolean search syntax to the library search bar. For an explanation of how the search syntax works, check out the new "Search Cheat-Sheet" section in README.md.

Some existing features were changed. Tags with spaces now require underscores when being used in searches. Special consideration was made for the upcoming SQLite migration to hopefully reduce necessary refactoring.

I am happy to discuss these details or any other ones if there are any questions or concerns.

@Thesacraft
Copy link
Collaborator

The search feature is really cool but i tested it on a bigger library (about 2.2 million entries) and it makes searching basically unusable, because it took 21 min to search for an entry with a tag. This previously took 1.71s so if you could somehow improve the Performance it would be really nice.

@samuellieberman
Copy link
Contributor Author

The search feature is really cool but i tested it on a bigger library (about 2.2 million entries) and it makes searching basically unusable, because it took 21 min to search for an entry with a tag. This previously took 1.71s so if you could somehow improve the Performance it would be really nice.

Okay, I definitely wasn't expecting that. At it's very worst the syntax parsing should only increase the runtime by a small percentage. I wouldn't have been surprised if it decreased the runtime even.

But I was able to reproduce a 6x slowdown even with only about 1.4k entries, so something has definitely gone wrong. I will try to do some profiling to figure out what is causing this. Thanks.

* improve noticeable performance loss from calling .resolve() in Library.search_library
Remove unimplemented feature search for missing file
@samuellieberman
Copy link
Contributor Author

samuellieberman commented Jul 6, 2024

The search feature is really cool but i tested it on a bigger library (about 2.2 million entries) and it makes searching basically unusable, because it took 21 min to search for an entry with a tag. This previously took 1.71s so if you could somehow improve the Performance it would be really nice.

Okay, I believe I figured out what was causing the slowdown. Do you mind if I ask what you were searching for in your library? I suggest going back to the original TagStudioDev:main branch and trying a search for "missing" or "no file". If my suspicions are correct, then I bet those searches would take about 20 minutes to complete, even in the official project.

The reason for this is that Library.search_library() contains a costly .resolve() operation in order to check an entry's filepath against Library.missing_files. The worst part is that Library tracking missing or displaced files seems to have been broken and unimplemented in for TagStudio a while. The Library.missing_files attribute stays empty, even when I delete files and refresh the library and directories.

My fix is to remove the unimplemented feature from searches for now. I left the rest of the Library.missing_files related code untouched. Please let me know if this resolves your slowdown @Thesacraft. Thanks.

@Thesacraft
Copy link
Collaborator

Thesacraft commented Jul 6, 2024

I searched for a Tag and had one file that was tagged with the tag. I tested it and it made it much faster (34s) but i think if it could be improved it would be really great. And the current search in the main is still much faster with 1.71s so it would be realy nice if you could improve it to like 2-10s. Here is the search and the found file:
grafik

@Qronikarz
Copy link
Contributor

Wasn't the boolean search started in this PR? #284

Also, as far as I know, CyanVoxel didn't decide yet how to handle tags with multiple words - #112

Move str(Path) and os.path.join() operations inside search.py to reduce the chance that they are run
@samuellieberman
Copy link
Contributor Author

I searched for a Tag and had one file that was tagged with the tag. I tested it and it made it much faster (34s) but i think if it could be improved it would be really great. And the current search in the main is still much faster with 1.71s so it would be realy nice if you could improve it to like 2-10s. Here is the search and the found file: grafik

Thank you for trying my project and sharing your searches with me @Thesacraft. I wanted to understand your situation better, so since my last post I have created my own library with 1 million items and performed lots of different searches. Now I have some search results I want to share with you! Have you tried searching for "missing" in the main project like I suggested earlier? I tried that search with the main project, and it took a very similar amount of time to regular searches in my PR before I pushed your performance update. I also wanted to replicate and experiment with your latest result, and I found that I could also get the main project to produce similar times to after my performance update. I suggest also searching "filename: ?" in the main project if you want to replicate this second search for yourself.

I have since pushed another smaller performance update for you to try. I think this will give you another boost to performance, but so far the performance improvements of my updates are all for search dependent reasons, just like in the main project. Right now, the only improvements left that I see involve making the code harder to understand and modify, or starting to sacrifice boolean search syntax entirely. I'd have looked into some of this earlier if I didn't expect non-syntax search performance to change entirely if the project switches to using an SQL database. Non-syntax search performance isn't something I've considered much, so I'm happy to listen to any other performance concerns with my PR.

Where does the syntax come from? If it were up to me, I wouldn't trust a PR without prior convincing me in the issues that the query language is good enough for the target audience to understand and use.

I sense inspiration from some boorus, but I can't say it's a good thing.

Provided readme section seems insufficient for me to grasp it even if project owners consider the syntax good enough to adopt. I can't get through the dense explanation with my ADHD. And slightly lesser issue is improper use of markdown.

Thank you for reading the documentation I wrote for my search syntax and thank you for posting your reply @KillyMXI. I don't want you to feel pressure to trust anything or to be convinced of anything. When I started this feature, I designed it for myself and based on my own needs and preferences. There wasn't a CONTRIBUTING.md and I hadn't even decided I wanted to contribute my code to the main project.

You are discerning to correctly realize that I took inspiration from online image boorus, but my personal search preferences are just a starting point. Now that the feature is working, I have no issue at all with changing the syntax based on feedback. I just see this as an opportunity to help a project that I have already benefited from.

I would also love to collaborate to improve my README cheat-sheet section. Thanks again for your consideration into my syntax and explanation. Does any of this sound like something you would want to help improve?

@Thesacraft
Copy link
Collaborator

Thank you for trying my project and sharing your searches with me @Thesacraft. I wanted to understand your situation better, so since my last post I have created my own library with 1 million items and performed lots of different searches. Now I have some search results I want to share with you! Have you tried searching for "missing" in the main project like I suggested earlier? I tried that search with the main project, and it took a very similar amount of time to regular searches in my PR before I pushed your performance update. I also wanted to replicate and experiment with your latest result, and I found that I could also get the main project to produce similar times to after my performance update. I suggest also searching "filename: ?" in the main project if you want to replicate this second search for yourself.

I have since pushed another smaller performance update for you to try. I think this will give you another boost to performance, but so far the performance improvements of my updates are all for search dependent reasons, just like in the main project. Right now, the only improvements left that I see involve making the code harder to understand and modify, or starting to sacrifice boolean search syntax entirely. I'd have looked into some of this earlier if I didn't expect non-syntax search performance to change entirely if the project switches to using an SQL database. Non-syntax search performance isn't something I've considered much, so I'm happy to listen to any other performance concerns with my PR.

I just tested the searches you suggested and it seems like you are completle right. But with the new peformance improvment a simple Tag search is down to about 5.4s in my library which is pretty good. So i think you shouldn't really sacrifice the boolean syntax to get it faster. I think a better solution is to maybe add a loading bar or some visual indicator (in bigger librarys) and maybe detach the search from the main thread to stop it from blocking the rendering of the application. But this is probably out of scope for this pr and could be added in the future. I really appreciate the work you put in to make it more peformant.

@CyanVoxel CyanVoxel added Type: Enhancement New feature or request TagStudio: Library Relating to the TagStudio library system labels Jul 19, 2024
@CyanVoxel CyanVoxel modified the milestones: Alpha 9.4, Alpha 9.5 Jul 19, 2024
@CyanVoxel CyanVoxel added the TagStudio: Search The TagStudio search engine label Jul 20, 2024
Replacing instances of the term "operand" with "in" and "input" for the sake of brevity and clarity.

Changing _TagNode.match for the sake of brevity and consistency.
@samuellieberman samuellieberman force-pushed the feature/boolean-search-syntax branch from 7d29d10 to b9d9887 Compare July 20, 2024 14:43
…StudioDev#272 and TagStudioDev#325)

Adds ability to check the existence of fields of any type using the following syntax:
```has_<field>```
```has_<field>:<True|False>```
Adds the ability to search the content of text_line and text_box fields using the following syntax:
```<field>:<text>```
(Replace whitespace with underscore _ in `text`)
Updated test_search.py for new behavior.
@CyanVoxel
Copy link
Member

Thank you for the work put into this PR, but unfortunately due to the significant library backend changes in #332 I'll have to close this as it's no longer compatible. I apologize for the long turnaround time and and lack of input on my side about this. This discussions in this PR have greatly helped with the future direction of this feature.

#606 is currently taking up the task of implementing boolean syntax search.

@CyanVoxel CyanVoxel closed this Dec 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
TagStudio: Library Relating to the TagStudio library system TagStudio: Search The TagStudio search engine Type: Enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants