cleanup the `refresh_dir` code, update tests #494

yedpodtrzitko · 2024-09-11T02:56:37Z

refresh_dir():

~~- now takes exclude_list into consideration~~ edit: it probably shouldnt

it also accepts new Path parameter in preparation for multiple-directory Library eventually

unrelated changes

use uv to install dependencies even in mypy job to avoid dependencies download failure
Library in tests now goes by default to /dev/null/ instead of /tmp/ to detect when some test is trying to touch the filesystem
remove some unused properties from Library
- .extension_list was getting easily out of sync with database, so it's removed as well and it is just in DB now

yedpodtrzitko · 2024-09-11T04:37:35Z

on second thought, I'm not sure if this is the right approach - maybe we should add all files into the library (no matter if they are in the ignore list), and only filter them when querying the library. Because the ignore list can eventually change, and it's not intuitive that you have to rescan the directory once again.

CyanVoxel · 2024-09-11T07:10:27Z

on second thought, I'm not sure if this is the right approach - maybe we should add all files into the library (no matter if they are in the ignore list), and only filter them when querying the library. Because the ignore list can eventually change, and it's not intuitive that you have to rescan the directory once again.

This was something I mulled over during #486 as well. The initial benefit of keeping the filtering was due to the filter membership check performing significantly faster compared to the rest of the checks being performed per file - but after the optimizations put in place it only accounted for a small fraction of the total runtime. In my test for allowing 200k files vs a 100k/100k filter split, it only increased the scan time from an average of 3.8 seconds to an average of 4.0 seconds. This was in contrast to the previous refresh method that was n times long for each file let through the filter.

With those optimizations in place, the pros and cons I determined were as follows:
Benefits of filtering:

Slightly increases typical rescan time if many files are filtered out
Keeps database file size down
Does not accrue "unlinked file debt" in cases where filtered files are "tracked" but eventually become unlinked before the user may ever want them in their library

Benefits of NOT filtering:

Changing filter does not require a library rescan (even if rescans are faster now)
Full awareness of the directory contents for all available features (duplicate searching, unlinked file searching, etc.)

I personally chose to continue with filtering in #486, but I'm flexible on the decision. These are the tradeoffs I considered, but maybe requirements like rescanning upon changing the extension list have more weight than I give them credit for. I'd like to hear more of your thoughts on this. Either way, I'll go along with whatever behavior we decide to move forward with.

yedpodtrzitko · 2024-09-11T07:45:31Z

Thanks for the thorough response. I will keep thinking about it; right now I'm still leaning towards not filtering on import a bit:

it reduces the code complexity. You might end up with some unwanted files in the database anyway (eg. adding more extensions to Exclude List later on; switching from Exclude to Include list...), so having only one place where to take care of the filtering will be more maintainable
if we are discussing querying the JSON database, then I agree having the excluded files en masse there might be suboptimal, so maybe in that case let's keep it as is
if we're discussing the a proper database, then at the scale we're dealing with (< 1M files), having some extra files wont make real difference

Maybe this is just me, but when I have some library directory, it's mostly "pure" data I want to have there, so the amount of excluded files is very tiny. But I can think of a use-case where having the excluded files saved in library could be useful - flipping the library from excluded to included mode can help me to see the "junk" data at glance, so I can actually evaluate if the data are there purposefully or not.

CyanVoxel · 2024-09-12T07:52:15Z

As the features for file filtering become more sophisticated, a la .gitignore style pattern matching (#14), I could see the need to rescan after each change becoming especially irksome. Having data such as how many files are currently being filtered at a glance would also benefit from/rely on not excluding files from the library at the point of scanning.

"Either way, I'll go along with whatever behavior we decide to move forward with."

I'm actually going to walk back my statement here. I think for the v9.4.x JSON branch it would be better to keep the existing filtering behavior as the pros swing in its favor, while for the v9.5+ SQL versions we should start removing the filtering step here as the tradeoffs are fewer and it will aide in additional near-future functionality. If this works for you, we can go ahead and move forward with these PRs.

yedpodtrzitko · 2024-09-12T16:59:50Z

If this works for you, we can go ahead and move forward with these PRs.

sure, let's merge this so other things can get into motion.

yedpodtrzitko force-pushed the yed/refresh-ignore-list branch 4 times, most recently from 6b8f479 to 3bcad8f Compare September 11, 2024 03:06

feat: take Ignore List into consideration when refreshing directory

c71d63c

yedpodtrzitko force-pushed the yed/refresh-ignore-list branch from 3bcad8f to c71d63c Compare September 11, 2024 03:07

yedpodtrzitko mentioned this pull request Sep 11, 2024

feat: increase file scanning performance #486

Merged

yedpodtrzitko force-pushed the yed/refresh-ignore-list branch from 5d8119e to d4f0597 Compare September 11, 2024 06:33

yedpodtrzitko changed the title ~~feat: take Ignore List into consideration when refreshing directory~~ cleanup the refresh_dir code, update tests Sep 11, 2024

undo the extension check in refresh_dir

c9d61b5

yedpodtrzitko force-pushed the yed/refresh-ignore-list branch from d4f0597 to c9d61b5 Compare September 11, 2024 06:35

CyanVoxel added TagStudio: Library Relating to the TagStudio library system Type: File System File system interactions Type: Tests Tests or testing related Type: CI Continuous Integration / workflows labels Sep 11, 2024

CyanVoxel merged commit 4942d16 into TagStudioDev:main Sep 12, 2024
5 checks passed

yedpodtrzitko deleted the yed/refresh-ignore-list branch November 18, 2024 16:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

cleanup the `refresh_dir` code, update tests #494

cleanup the `refresh_dir` code, update tests #494

Uh oh!

yedpodtrzitko commented Sep 11, 2024 •

edited

Loading

Uh oh!

yedpodtrzitko commented Sep 11, 2024 •

edited

Loading

Uh oh!

CyanVoxel commented Sep 11, 2024

Uh oh!

yedpodtrzitko commented Sep 11, 2024

Uh oh!

CyanVoxel commented Sep 12, 2024

Uh oh!

yedpodtrzitko commented Sep 12, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cleanup the refresh_dir code, update tests #494

cleanup the refresh_dir code, update tests #494

Uh oh!

Conversation

yedpodtrzitko commented Sep 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yedpodtrzitko commented Sep 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CyanVoxel commented Sep 11, 2024

Uh oh!

yedpodtrzitko commented Sep 11, 2024

Uh oh!

CyanVoxel commented Sep 12, 2024

Uh oh!

yedpodtrzitko commented Sep 12, 2024

Uh oh!

Uh oh!

Uh oh!

cleanup the `refresh_dir` code, update tests #494

cleanup the `refresh_dir` code, update tests #494

yedpodtrzitko commented Sep 11, 2024 •

edited

Loading

yedpodtrzitko commented Sep 11, 2024 •

edited

Loading