Proposal - Automatic file system watching for stash #1965
It's a great design and convenient for people who add files on a daily basis. I would just suggest also providing an option to turn off auto scanning, for people like me who only add files to their collections occasionally.
Another place we could invoke a scan is at the end of the configuration wizard. That sets people up on the right path of scanning in order to index content.
#2084 Cross reference. Another request in this area.
Automatic File System watch for Stash
Abstract
This document collects information about automatic file system watches. That is, mechanisms in stash for automatically ingesting new data as it is added to a file system.
The current status of this document is in flux. It can be updated over time.
Background
The current way to have stash analyze metadata is through the "Configuration -> Tasks" page. A user can choose to Scan, either all stash paths or a specific subpath through a Selective Scan. Users tend to have stashes where the data continually changes. They add new data, they reorganize data and they delete data.
Users need to press the Scan and Selective Scan buttons periodically to update the stash database with the changes. This quickly gets cumbersome, since they have to do it often.
As a result, a common request is: can we have this scan done automatically? However, it turns out that automatically watching a file system is a rather complex affair, so we need an up-front design on how it is going to work.
Goals
The primary goal is to minimize the use of the buttons Scan and Selective Scan. Changes to the filesystem paths used by stash should be picked up automatically, and (re-)scans should be issued by stash without user intervention. However, not all file systems will support these features, so we cannot hope to eliminate manual scanning entirely.
The secondary goal is to limit file system operations. Scanning a complete subdirectory structure with many files can take a lot of time, and we don't want to flood the underlying disk with high IOPS if we can avoid it.
The final goal is to make the change largely seamless. Users should want this by default because it is a better experience than what we have now.
Non-goals
Operating systems and file systems limit the extent to which automated scans can happen. We are not trying to solve such limitations. Rather, we want to hook into proven technology where it is available, and await availability for missing technology.
In a full solution, it would be natural to have a chain of events like:
That is, a change on the file system induces a Scan, which induces Identify and Scrape operations, followed by a hook to Plugins. We purposefully limit the design to the
FS Event -> Filter -> Scan -> Deep Scan
path for now, while keeping the door open for the latter stages later on.
Duplicates in the file system are a concern, but not part of this design.
UX
A user will want insight into how the automated scanning progresses. This means we need enough instrumentation to tell the user what is going on. Users might want to pause, abort or otherwise manipulate the automation. When a user pushes a button, they are in control of the operation, and decide when it is running. When the operation is automated, the user isn't in direct control. We don't want a user to scratch their head as to why their CPU and disk are loaded at 100%.
This means all operations must be present in the UI.
Design
First, we must implement default scan configuration in the UI. This means that the following commands have a default, chosen by the user.
We implement two new CLI commands for stash:
These new commands work as follows:
They read the configuration file, in order to obtain the configuration stash normally works with, and what port stash is running on.
The calls then act as a stash client, calling into the server. The client invokes a GraphQL call:
In the case of a scan, paths is set to []. In the case of a selective scan, the given paths are checked against the configuration to be valid, and if they are, we invoke a scan with those paths. If the minModTime parameter is given, for instance stash scan --min-mtime=<4h, the scan can use this as an optimization: only consider files which were changed within the last 4 hours. In practice, this allows the scan operation to skip files early and bypass most of the internal scan logic for the majority of files.
Avoiding old files hinges on a key observation: most stashes grow over time, and most files will be older than a relatively small window of a couple of hours or days.
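As a rough illustration of the minModTime idea, the following Python sketch walks a directory tree and drops files older than a cutoff before any further processing. The function name recently_modified and its parameters are hypothetical and not part of stash; the real scan logic is assumed to be more involved.

```python
import os
from typing import Iterator, Optional


def recently_modified(root: str, min_mod_time: Optional[float] = None) -> Iterator[str]:
    """Yield files under root whose mtime is at or after min_mod_time.

    min_mod_time is a Unix timestamp; None disables the filter (hypothetical API).
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished between listing and stat
            if min_mod_time is not None and mtime < min_mod_time:
                continue  # old file: skip before any expensive scan logic runs
            yield path
```

Calling recently_modified(path, time.time() - 4 * 3600) would correspond to the 4 hour window in the example above. Note that the filter still costs one stat call per file; the saving is in everything that would otherwise happen after it.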
We return the job ID when initiating a scan.
The scan job info command lists the current job status as a JSON blob.
The intention of these CLI commands is to ease script writing.
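A script built on top of these commands would typically start a scan, keep the returned job ID, and poll until the job is done. The Python sketch below shows only the polling part. The proposal does not specify the shape of the job status JSON, so the "status" field and the state names used here are assumptions, and fetch_status is a hypothetical callable standing in for whatever invokes the job info command and parses its output.

```python
import time
from typing import Callable, Mapping


def wait_for_job(
    fetch_status: Callable[[int], Mapping[str, object]],
    job_id: int,
    poll_interval: float = 5.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Mapping[str, object]:
    """Poll until the job reports a terminal state, then return its last status."""
    # Hypothetical state names; the real job status format is not defined here.
    terminal = {"FINISHED", "CANCELLED", "FAILED"}
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(job_id)
        if status.get("status") in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
        sleep(poll_interval)
```

Passing the status fetcher and the sleep function as arguments keeps the loop independent of how the CLI is actually invoked, and makes it easy to exercise without a running stash instance.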
Supporting periodic scans
Periodic scans can now be implemented via an entry in a crontab, and the desired periodic scan rate is controlled by a cron implementation. This is far more likely to be useful for users, since it hooks into the operating system tooling. You don't have to separately configure periodic scanning in Stash, but can outsource the problem to a tool dedicated to running things on a schedule.
Supporting watched file system scans
File system watchers are quite intricate to write. There are many concerns which have to be solved:
For these reasons, it is often better to leave this to other tools.
As an example, we use watchman to watch a file system. Watchman uses a client/server architecture. Upon the first invocation, a daemon process is spawned which operates the file system watchers. Further invocations of watchman communicate with this central server.
We can watch some directory paths quite easily:
You then pull a clock for said directory:
And feed that into a since call on a regular basis:
This tells us there is a new file, /fs/path/b/x/foo.mp4. We can easily process this via scripting
$ stash scan selective /fs/path/b/x
{ "jobID": 34 }
which can severely cut down the directories to scan. You then query periodically for the job with id 34, until it is marked as being completed. At this point, call watchman since again with the last returned clock c:1635956272:1970944:3:6. This ensures there's only a single job in flight at any given time, and it will eventually settle.
Improving scan rate
We can improve the existing scan rate by implementing an early filtering strategy. For a batch of potential candidates, if the candidate matches the existing database on filesize and mtime, we don't have to process it. The earlier we can push such a filter in the scan chain, the faster scans will run.
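The early filter described above can be sketched in Python as follows. The known mapping is a hypothetical in-memory stand-in for whatever the stash database records per file; the actual schema and lookup are not specified in this document.

```python
import os
from typing import Dict, Iterable, List, Tuple

# Hypothetical view of the database: path -> (size in bytes, mtime in nanoseconds).
KnownFiles = Dict[str, Tuple[int, int]]


def needs_scan(paths: Iterable[str], known: KnownFiles) -> List[str]:
    """Return the candidates whose size or mtime differs from the recorded values."""
    changed = []
    for path in paths:
        try:
            st = os.stat(path)
        except OSError:
            continue  # unreadable or deleted; handling removals is out of scope here
        if known.get(path) == (st.st_size, st.st_mtime_ns):
            continue  # unchanged: skip before any hashing or metadata extraction
        changed.append(path)
    return changed
```

Files that are new (absent from known) or whose size or mtime changed are returned; everything else drops out after a single stat call.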
The minModTime parameter is intended to help with scanning. In particular, it is intended to enable the file system enumeration process to skip a large set of files, in order to bypass the full scan logic.
Another way to improve scan rate is to write a scanner interested in new files only. This is faster, since one can track the mtime of subdirectories since the last scan. If a directory has an unmodified mtime, no new files have been added to it. The pair of a directory and its mtime can be persisted, but it isn't necessary for a long-running process. The data can be kept as a cache in memory.
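A minimal Python sketch of such a new-files scanner, assuming an in-memory cache keyed by directory path. The function name changed_directories is hypothetical. One caveat the sketch makes explicit: a directory's mtime only reflects entries added, removed, or renamed directly inside it, so every subdirectory still has to be visited, and in-place modifications of existing files are not detected.

```python
import os
from typing import Dict, List


def changed_directories(root: str, cache: Dict[str, int]) -> List[str]:
    """Return directories whose mtime differs from the cached value.

    The cache (directory path -> mtime in nanoseconds) is updated in place,
    so a second call with no file system changes returns an empty list.
    """
    changed = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        try:
            mtime = os.stat(dirpath).st_mtime_ns
        except OSError:
            continue  # directory vanished during the walk
        if cache.get(dirpath) != mtime:
            changed.append(dirpath)
            cache[dirpath] = mtime
    return changed
```

The returned directories could then be handed to a selective scan, so that files in unchanged directories are never examined individually.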
Concurrency & Parallelism
No considerations needed. We use the existing system.
Data storage
No changes needed, apart from storing mtime with any kind of file representation if it is not already there.
Alternatives
An earlier version of this document explored building fsnotify support directly into Stash. This idea has been superseded by the above design. The amount of work required to maintain this on several OS platforms is quite high, and it affects large parts of our existing system as well. Perhaps most important is the relative immaturity of watching large file system directories. Many of the notify systems can break down here.
In addition, large stashes tend to be mounted via a network storage of some kind. These rarely support notifications, so the value of adding notifications seems to be limited.
Kodi
Kodi by default implements our current solution, with one exception: a flag "Initiate a scan on startup." Otherwise, it works like our current solution, with manual scans and so on.
Users whose workflow is to add new data to the stash directory and then boot the stash application would benefit from having such a flag to turn on.
Periodic scanning in Kodi is relegated to a plugin. If we were to support this, a possible path is to let stash support a small CLI for these kinds of operations:
$ stash scan
This will in turn improve crontab(5) ergonomics.
Watching in Kodi is relegated to a plugin.
Cross cutting concerns
A stash instance is used for more than scanning new content: it is actively doing database lookups, serving video streams, and can be transcoding. Thus, the compute- and disk-intensive operations in Deep scan should probably run at a lower priority, so as not to impose too large a hit on the stash instance. In order of importance:
Deep scan is a 4th priority workload.
Relevant issues / feature requests