Proposal - Automatic file system watching for stash #1965

SmallCoccinelle · 2021-11-06T13:23:07Z

Automatic File System watch for Stash

Abstract

This document collects information about automatic file system watches. That is, mechanisms in stash for automatically ingesting new data as it is added to a file system.

The current status of this document is in flux. It can be updated over time.

Background

The current way to have stash analyze metadata is through the "Configuration -> Tasks" page. A user can choose to Scan, either all stash paths or a specific subpath through a Selective Scan. Users tend to have stashes where the data continually changes. They add new data, they reorganize data and they delete data.

Users need to press the Scan and Selective Scan buttons periodically to update the stash database with the changes. This quickly gets cumbersome, since they have to do it often.

As a result, a common request is: can we have this scan done automatically? However, it turns out that automatically watching a file system is a rather complex affair, so we need an up-front design on how it is going to work.

Goals

The primary goal is to minimize the use of the buttons Scan and Selective Scan. Changes to the filesystem paths used by stash should be picked up automatically, and (re-)scans should be issued by stash without user intervention. However, not all file systems will support these features, so we cannot hope to eliminate manual scanning entirely.

The secondary goal is to limit file system operations. Scanning a complete subdirectory structure with many files can take a lot of time, and we don't want to flood the underlying disk with high IOPS if we can avoid it.

The final goal is to make the change largely seamless. Users should want this by default because it is a better experience than what we have now.

Non-goals

Operating systems and file systems limit the extent to which automated scans can happen. We are not trying to solve such limitations. Rather, we want to hook into proven technology where it is available, and wait for missing technology to become available.

In a full solution, it would be natural to have a chain of events like:

FS Event --> Filter --> Scan --> Deep Scan-\
                                           | --> Identify-\
                                           | --> Scrape---|
                                                          |--> Plugin Hook

That is, a change on the file system induces a Scan, which induces Identify and Scrape operations followed by a hook to Plugins. We purposefully limit the design to the FS Event -> Filter -> Scan -> Deep Scan path for now, while keeping the door open for developing the rest later on.

Duplicates in the file system are a concern, but not part of this design.

UX

A user will want insight into how the automated scanning progresses. This means we need enough instrumentation to tell the user what is going on. Users might want to pause, abort or otherwise manipulate the automation. When a user pushes a button, they are in control of the operation, and decide when it is running. When the operation is automated, the user isn't in direct control. We don't want a user to scratch their head as to why their CPU and disk are loaded at 100%.

This means all operations must be present in the UI.

Design

First, we must implement default scan configuration in the UI. This means that the following commands have a default, chosen by the user.

We implement two new CLI commands for stash:

$ stash scan [--max-age=DURATION]
{
  "jobID": 11
}
$ stash scan selective [--max-age=DURATION] /fs/path/a /fs/path/b
{
  "jobID": 12
}
$ stash job info 11
{
  "status": "RUNNING"
  ...
}

These new commands work as follows:

They read the configuration file, in order to obtain the configuration stash normally works with, and what port stash is running on.

The calls then act as a stash client, calling into the server. The client invokes a GraphQL call:

mutation Scan($paths: [String]!, $minModTime: Timestamp) {
  metadataScan(input: {paths: $paths, minModTime: $minModTime})
}

In the case of a scan, paths is set to []. In the case of a selective scan, the given paths are checked against the configuration to be valid, and if they are, we invoke a scan with those paths. If the minModTime parameter is given, for instance via stash scan --max-age=4h, the scan can use this as an optimization: only consider files which were changed within the last 4 hours. In practice, this allows the scan operation to skip older files early and bypass most of the internal scan logic for a majority of files.

Avoiding old files hinges on a key observation: Most stashes grow over time, and most files will be older than a relatively small window of a couple of hours or days.

We return the job ID when initiating a scan.

The job info command lists the current job status as a JSON blob.

The intention of these CLI commands is to ease script writing.
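
As a rough illustration of the client side, the Python sketch below builds the JSON body for the mutation above. The helper names (parse_max_age, build_scan_request), the duration syntax, and the choice of an RFC 3339 string for minModTime are assumptions made for this example, not part of stash.

```python
import re
from datetime import datetime, timedelta, timezone

# The mutation from the design above.
SCAN_MUTATION = """
mutation Scan($paths: [String]!, $minModTime: Timestamp) {
  metadataScan(input: {paths: $paths, minModTime: $minModTime})
}
"""

# Hypothetical duration suffixes; the proposal only says DURATION.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_max_age(text):
    """Parse a duration such as '4h' or '30m' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})


def build_scan_request(paths, max_age=None, now=None):
    """Build the JSON body for the scan mutation.

    An empty path list means a full scan. max_age is converted into an
    absolute minModTime cutoff (assumed here to be an RFC 3339 string).
    """
    variables = {"paths": list(paths), "minModTime": None}
    if max_age is not None:
        now = now or datetime.now(timezone.utc)
        variables["minModTime"] = (now - parse_max_age(max_age)).isoformat()
    return {"query": SCAN_MUTATION, "variables": variables}
```

The resulting dictionary could be serialized with json.dumps and POSTed to the server's GraphQL endpoint, using the port read from the configuration file.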

Supporting periodic scans

Periodic scans can now be implemented via an entry in a crontab, and the desired periodic scan rate is controlled by a cron implementation. This is far more likely to be useful for users, since it hooks into the operating system tooling. You don't have to separately configure periodic scanning in Stash, but can outsource the problem to a tool dedicated to running things on a schedule.

Supporting watched file system scans

File system watchers are quite intricate to write. There are many concerns which have to be solved:

Different operating systems support varying calls for watching files on disk.
The operating systems provide a large variance of functionality in their watchers.
File systems are inherently full of possible race conditions.
Not all file systems support them.
There are often limits on the size of the directory structures one can realistically watch.

For these reasons, it is often better to leave this to other tools.

As an example, we use watchman to watch a file system. Watchman operates as a client/server architecture. Upon the first invocation, a daemon process is spawned which operates the file system watchers. Further invocations of watchman communicate with the central server.

We can watch some directory paths quite easily:

$ watchman watch /fs/path/b
{
    "watch": "/fs/path/b",
    "watcher": "inotify",
    "version": "4.9.0"
}

You then pull a clock for said directory:

$ watchman clock /fs/path/b
{
    "version": "4.9.0",
    "clock": "c:1635956272:1970944:3:3"
}

And feed that into a since call on a regular basis:

$ watchman since /fs/path/b c:1635956272:1970944:3:3
{
    "is_fresh_instance": false,
    "version": "4.9.0",
    "files": [
        {
            "cclock": "c:1635956272:1970944:3:2",
            "nlink": 2,
            "dev": 66310,
            "ctime": 1635956502,
            "new": false,
            "mtime": 1635956502,
            "gid": 1000,
            "mode": 16893,
            "size": 4096,
            "oclock": "c:1635956272:1970944:3:5",
            "ino": 2195471,
            "uid": 1000,
            "exists": true,
            "name": "x"
        },
        {
            "cclock": "c:1635956272:1970944:3:5",
            "nlink": 1,
            "dev": 66310,
            "ctime": 1635956502,
            "new": true,
            "mtime": 1635956502,
            "gid": 1000,
            "mode": 33204,
            "size": 0,
            "oclock": "c:1635956272:1970944:3:5",
            "ino": 2164774,
            "uid": 1000,
            "exists": true,
            "name": "x/foo.mp4"
        }
    ],
    "clock": "c:1635956272:1970944:3:6"
}

This tells us there is a new file, /fs/path/b/x/foo.mp4. We can easily process this via scripting:

$ stash scan selective /fs/path/b/x
{
  "jobID": 34
}

which can significantly cut down the directories to scan. You then periodically query the job with ID 34 until it is marked as completed. At this point, call watchman since again with the last returned clock, c:1635956272:1970944:3:6. This ensures there's only a single job in flight at any given time, and it will eventually settle.
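
The loop described above can be sketched in Python as follows. This is only an illustration: plan_selective_scan and sync_once are hypothetical names, deletions are ignored, and a fresh watchman instance is handled by falling back to a scan of the whole root, which is an assumption rather than something the design specifies.

```python
import posixpath
import stat


def plan_selective_scan(root, since_reply):
    """Return (directories to scan, clock to use for the next since call)."""
    clock = since_reply["clock"]
    if since_reply.get("is_fresh_instance"):
        # Assumption: with no usable history, rescan everything under root.
        return [root], clock
    dirs = set()
    for entry in since_reply.get("files", []):
        if not entry.get("exists", False):
            continue  # deletions are out of scope for this sketch
        if stat.S_ISDIR(entry["mode"]):
            continue  # in the sample output, the files inside are listed too
        dirs.add(posixpath.dirname(posixpath.join(root, entry["name"])))
    return sorted(dirs), clock


def sync_once(root, clock, since, start_scan, wait_for_job):
    """Run one iteration and return the clock for the next one.

    since, start_scan and wait_for_job are injected callables, e.g. thin
    wrappers around `watchman since` and the proposed stash commands.
    """
    dirs, next_clock = plan_selective_scan(root, since(root, clock))
    if dirs:
        # Blocking here keeps a single scan job in flight at any time.
        wait_for_job(start_scan(dirs))
    return next_clock
```

Calling sync_once repeatedly, feeding each returned clock into the next call, gives the settle-eventually behaviour described above.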

Improving scan rate

We can improve the existing scan rate by implementing an early filtering strategy. For a batch of potential candidates, if the candidate matches the existing database on filesize and mtime, we don't have to process it. The earlier we can push such a filter in the scan chain, the faster scans will run.
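
A minimal sketch of such an early filter, in Python, might look like the following. The FileStat type and the needs_processing function are made up for this example, and the known mapping stands in for whatever the database lookup would return.

```python
from typing import NamedTuple


class FileStat(NamedTuple):
    size: int   # file size in bytes
    mtime: int  # modification time, seconds since the epoch


def needs_processing(candidates, known):
    """Return the paths that are new or whose size/mtime changed.

    candidates maps path -> FileStat as seen on disk; known maps
    path -> FileStat as recorded in the database (hypothetical shape).
    """
    return [path for path, st in candidates.items() if known.get(path) != st]
```

Only the returned paths would continue into the rest of the scan chain; everything else is dropped before any expensive work happens.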

The minModTime parameter is intended to help with scanning. In particular, it is intended to enable the file system enumeration process to skip a large set of files, in order to bypass the full scan logic.

Another way to improve scan rate is to write a scanner interested in new files only. This is faster, since one can track the mtime of subdirectories since the last scan. If a directory has an unmodified mtime, no new files have been added directly to it. The pair of a directory and its mtime can be persisted, but that isn't necessary for a long-running process. The data can be kept as a cache in memory.

Concurrency & Parallelism

No considerations needed. We use the existing system.

Data storage

No changes needed, apart from storing mtime with any kind of file representation if it is not already there.

Alternatives

An earlier version of this document explored building fsnotify support directly into Stash. This idea has been superseded by the above design. The amount of work it requires to maintain this on several OS platforms is quite high, and it affects large parts of our existing system as well. Perhaps most important is the relative immaturity of watching large file system directories. Many of the notify systems can break down here.

In addition, large stashes tend to be mounted via a network storage of some kind. These rarely support notifications, so the value of adding notifications seems to be limited.

Kodi

Kodi by default implements our current solution, with one exception: a flag "Initiate a scan on startup." Otherwise, it works like our current solution, with manual scans and so on.

Users whose workflow is to add new data to the stash directory, then boot the stash application would benefit from having such a flag to turn on.

Periodic scanning in Kodi is relegated to a plugin. If we were to support this, a possible path is to let stash support a small CLI for these kinds of operations: $ stash scan. This will in turn improve crontab(5) ergonomics.

Watching in Kodi is relegated to a plugin.

Cross cutting concerns

A stash instance is used for more than scanning new content: it is actively doing database lookups, serving video streams, and can be transcoding. Thus, the compute- and disk-intensive operations in Deep scan should probably run at a lower priority so as not to impose too large a hit on the stash instance. In order of importance:

User facing workload + Important: 1st priority.
User facing workload: 2nd priority.
Background workload + Important: 3rd priority.
Background workload: 4th priority.

Deep scan is a 4th priority workload.
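
One way to picture this ordering is a priority queue with four levels, sketched below in Python. This is purely illustrative: the Priority and WorkQueue names are invented for the example, and nothing here reflects how stash actually schedules its jobs.

```python
import heapq
import itertools
from enum import IntEnum


class Priority(IntEnum):
    USER_IMPORTANT = 1
    USER = 2
    BACKGROUND_IMPORTANT = 3
    BACKGROUND = 4  # Deep scan would sit here


class WorkQueue:
    """Pops the most important task first, FIFO within one priority."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal priorities

    def push(self, priority, task):
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def pop(self):
        return heapq.heappop(self._heap)[2]
```

With this arrangement, a queued deep scan is only picked up once no user-facing work is waiting.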

Relevant issues / feature requests

philpw99 · 2021-11-07T17:50:07Z

It's a great design and convenient for people who keep adding files on a daily basis. I would just suggest also providing the option of "not doing the auto scanning" for people like me who only add files to their collections occasionally.
I am using an "archive" hard drive from Seagate. 99% of the time it's sleeping; it is slow to spin up and I actually don't want it to spin up all the time. Stash provides a great way for me to know what I have without spinning up the hard drive, but "auto scanning" will definitely disrupt that.

1 reply

SmallCoccinelle Nov 7, 2021
Author

The above idea won't by itself automate any kind of scan. You would have to set that up yourself. It acknowledges that people probably already have tooling for doing periodic tasks in their systems and provides enough QoL to hook into that tooling. But the defaults aren't going to change. In fact, the defaults we have now are the sensible setup.

In practice, I think most people drop stuff into their stash, then issue a (selective) scan. Some of the QoL that's on the test bench right now is to filter a scan by recently changed files. That is, you scan any file that's newer than, say, 24h. This speeds up the scanner because it doesn't have to consider the vast majority of files.

SmallCoccinelle · 2021-11-11T13:46:32Z

Another place we could invoke a scan is at the end of the configuration wizard. That sets people up on the right path of scanning in order to index content.

SmallCoccinelle · 2021-11-30T15:57:43Z

#2084 Cross reference. Another request in this area.
