Skip to content

cleanup the refresh_dir code, update tests #494

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Sep 12, 2024

Conversation

yedpodtrzitko
Copy link
Contributor

@yedpodtrzitko yedpodtrzitko commented Sep 11, 2024

refresh_dir():

- now takes exclude_list into consideration edit: it probably shouldnt

  • it also accepts new Path parameter in preparation for multiple-directory Library eventually

unrelated changes

  • use uv to install dependencies even in mypy job to avoid dependencies download failure

  • Library in tests now goes by default to /dev/null/ instead of /tmp/ to detect when some test is trying to touch the filesystem

  • remove some unused properties from Library

    • .extension_list was getting easily out of sync with database, so it's removed as well and it is just in DB now

@yedpodtrzitko yedpodtrzitko force-pushed the yed/refresh-ignore-list branch 4 times, most recently from 6b8f479 to 3bcad8f Compare September 11, 2024 03:06
@yedpodtrzitko yedpodtrzitko force-pushed the yed/refresh-ignore-list branch from 3bcad8f to c71d63c Compare September 11, 2024 03:07
@yedpodtrzitko
Copy link
Contributor Author

yedpodtrzitko commented Sep 11, 2024

on second thought, I'm not sure if this is the right approach - maybe we should add all files into the library (no matter if they are in the ignore list), and only filter them when querying the library. Because the ignore list can eventually change, and it's not intuitive that you have to rescan the directory once again.

@yedpodtrzitko yedpodtrzitko force-pushed the yed/refresh-ignore-list branch from 5d8119e to d4f0597 Compare September 11, 2024 06:33
@yedpodtrzitko yedpodtrzitko changed the title feat: take Ignore List into consideration when refreshing directory cleanup the refresh_dir code, update tests Sep 11, 2024
@yedpodtrzitko yedpodtrzitko force-pushed the yed/refresh-ignore-list branch from d4f0597 to c9d61b5 Compare September 11, 2024 06:35
@CyanVoxel
Copy link
Member

on second thought, I'm not sure if this is the right approach - maybe we should add all files into the library (no matter if they are in the ignore list), and only filter them when querying the library. Because the ignore list can eventually change, and it's not intuitive that you have to rescan the directory once again.

This was something I mulled over during #486 as well. The initial benefit of keeping the filtering was due to the filter membership check performing significantly faster compared to the rest of the checks being performed per file - but after the optimizations put in place it only accounted for a small fraction of the total runtime. In my test for allowing 200k files vs a 100k/100k filter split, it only increased the scan time from an average of 3.8 seconds to an average of 4.0 seconds. This was in contrast to the previous refresh method that was n times long for each file let through the filter.

With those optimizations in place, the pros and cons I determined were as follows:
Benefits of filtering:

  • Slightly increases typical rescan time if many files are filtered out
  • Keeps database file size down
  • Does not accrue "unlinked file debt" in cases where filtered files are "tracked" but eventually become unlinked before the user may ever want them in their library

Benefits of NOT filtering:

  • Changing filter does not require a library rescan (even if rescans are faster now)
  • Full awareness of the directory contents for all available features (duplicate searching, unlinked file searching, etc.)

I personally chose to continue with filtering in #486, but I'm flexible on the decision. These are the tradeoffs I considered, but maybe requirements like rescanning upon changing the extension list have more weight than I give them credit for. I'd like to hear more of your thoughts on this. Either way, I'll go along with whatever behavior we decide to move forward with.

@CyanVoxel CyanVoxel added TagStudio: Library Relating to the TagStudio library system Type: File System File system interactions Type: Tests Tests or testing related Type: CI Continuous Integration / workflows labels Sep 11, 2024
@yedpodtrzitko
Copy link
Contributor Author

Thanks for the thorough response. I will keep thinking about it; right now I'm still leaning towards not filtering on import a bit:

  • it reduces the code complexity. You might end up with some unwanted files in the database anyway (eg. adding more extensions to Exclude List later on; switching from Exclude to Include list...), so having only one place where to take care of the filtering will be more maintainable

  • if we are discussing querying the JSON database, then I agree having the excluded files en masse there might be suboptimal, so maybe in that case let's keep it as is

  • if we're discussing the a proper database, then at the scale we're dealing with (< 1M files), having some extra files wont make real difference

Maybe this is just me, but when I have some library directory, it's mostly "pure" data I want to have there, so the amount of excluded files is very tiny. But I can think of a use-case where having the excluded files saved in library could be useful - flipping the library from excluded to included mode can help me to see the "junk" data at glance, so I can actually evaluate if the data are there purposefully or not.

@CyanVoxel
Copy link
Member

As the features for file filtering become more sophisticated, a la .gitignore style pattern matching (#14), I could see the need to rescan after each change becoming especially irksome. Having data such as how many files are currently being filtered at a glance would also benefit from/rely on not excluding files from the library at the point of scanning.

"Either way, I'll go along with whatever behavior we decide to move forward with."

I'm actually going to walk back my statement here. I think for the v9.4.x JSON branch it would be better to keep the existing filtering behavior as the pros swing in its favor, while for the v9.5+ SQL versions we should start removing the filtering step here as the tradeoffs are fewer and it will aide in additional near-future functionality. If this works for you, we can go ahead and move forward with these PRs.

@yedpodtrzitko
Copy link
Contributor Author

If this works for you, we can go ahead and move forward with these PRs.

sure, let's merge this so other things can get into motion.

@CyanVoxel CyanVoxel merged commit 4942d16 into TagStudioDev:main Sep 12, 2024
5 checks passed
@yedpodtrzitko yedpodtrzitko deleted the yed/refresh-ignore-list branch November 18, 2024 16:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
TagStudio: Library Relating to the TagStudio library system Type: CI Continuous Integration / workflows Type: File System File system interactions Type: Tests Tests or testing related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants