-
-
Notifications
You must be signed in to change notification settings - Fork 394
cleanup the refresh_dir
code, update tests
#494
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cleanup the refresh_dir
code, update tests
#494
Conversation
6b8f479
to
3bcad8f
Compare
3bcad8f
to
c71d63c
Compare
on second thought, I'm not sure if this is the right approach - maybe we should add all files into the library (no matter if they are in the ignore list), and only filter them when querying the library. Because the ignore list can eventually change, and it's not intuitive that you have to rescan the directory once again. |
5d8119e
to
d4f0597
Compare
refresh_dir
code, update tests
d4f0597
to
c9d61b5
Compare
This was something I mulled over during #486 as well. The initial benefit of keeping the filtering was due to the filter membership check performing significantly faster compared to the rest of the checks being performed per file - but after the optimizations put in place it only accounted for a small fraction of the total runtime. In my test for allowing 200k files vs a 100k/100k filter split, it only increased the scan time from an average of 3.8 seconds to an average of 4.0 seconds. This was in contrast to the previous refresh method that was n times long for each file let through the filter. With those optimizations in place, the pros and cons I determined were as follows:
Benefits of NOT filtering:
I personally chose to continue with filtering in #486, but I'm flexible on the decision. These are the tradeoffs I considered, but maybe requirements like rescanning upon changing the extension list have more weight than I give them credit for. I'd like to hear more of your thoughts on this. Either way, I'll go along with whatever behavior we decide to move forward with. |
Thanks for the thorough response. I will keep thinking about it; right now I'm still leaning towards not filtering on import a bit:
Maybe this is just me, but when I have some library directory, it's mostly "pure" data I want to have there, so the amount of excluded files is very tiny. But I can think of a use-case where having the excluded files saved in library could be useful - flipping the library from excluded to included mode can help me to see the "junk" data at glance, so I can actually evaluate if the data are there purposefully or not. |
As the features for file filtering become more sophisticated, a la
I'm actually going to walk back my statement here. I think for the v9.4.x JSON branch it would be better to keep the existing filtering behavior as the pros swing in its favor, while for the v9.5+ SQL versions we should start removing the filtering step here as the tradeoffs are fewer and it will aide in additional near-future functionality. If this works for you, we can go ahead and move forward with these PRs. |
sure, let's merge this so other things can get into motion. |
refresh_dir()
:- now takesedit: it probably shouldntexclude_list
into considerationPath
parameter in preparation for multiple-directory Library eventuallyunrelated changes
use
uv
to install dependencies even inmypy
job to avoid dependencies download failureLibrary in tests now goes by default to
/dev/null/
instead of/tmp/
to detect when some test is trying to touch the filesystemremove some unused properties from
Library
.extension_list
was getting easily out of sync with database, so it's removed as well and it is just in DB now