-
-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FileChange added/deleted events for one file are returned out of order, or are missing #148
Comments
After digging around within Line 224 in acea996
Line 38 in acea996
especially given the prevalence of the phrase "arbitrary order" in Rust's If a set data structure is really desired then maybe a Suppose a scenario where, within one "cycle", a file is rapidly:
Whether or not this situation is actually possible, I would want to be told that either:
What I really would not want in this situation is for one of the creation events to be treated as a duplicate, and filtered out with the deletion remaining in the result. I suspect that currently, this is exactly what would occur. Perhaps a |
Well python sets don't maintain order, so it doesn't really matter what type we use in rust. I assume that offer doesn't matter, but de-duplication does, hence the use of sets. |
Why is deduplication important? |
Because depending on the OS there can easily be duplicates in the file changes. For example two "file modified" events because the data is modified and also the timestamp. E.g. look at your example above. |
Yes, use of a set in either space will destroy the ordering, so even if the Rust was using a data type that maintained the raw event order, the Python would be losing it. But the data is losing its order before being emitted from the Rust side of things, so I'm reasonably convinced that the data type being used in the Rust space does matter here also.
Okay, I can understand the utility in deduplicating modification events. I'm still persuaded to believe that naïvely deduplicating creation/deletion events is dangerous though. To paraphrase my earlier example, would you agree that given three consecutive raw events (create → delete → create), removing one of the create events (and nothing else—a "naïve" deduplication) paints a false picture? A more appropriate deduplication strategy for these events (if deduping is even really needed) might be to remove pairs of creation/deletion events, leaving the single creation event.
In my example I see two raw events of kind Are they truly duplicate events though? They have different "sub-kinds" and one of them seems to actually concern file deletion. |
What you miss is that filesystem changes are often delayed by seconds or even tens of seconds, and can arrive or of order. The delay issue is particularly prevalent on MacOS, but can also happen on Windows. The other thing I think you miss is that most users of watchfiles (and watchgod before that), don't actually care much about exact order, many of them don't even care about which files changed. I'm therefore 👎 on changing the interface and adding lots of extra logic for a minority of users. However I'd be willing to accept a pr to add a new interface to return details of every change so they can be used in python. Would you like to have a go at that? |
This remark makes me feel that maybe I've again missed the mark in describing the core problem, so I hope you don't mind my attempting one final rephrase:
Understood and accepted. In case anyone else lands on this bug report with the same problem I've faced, I've implemented this Python workaround: def without_multifile_events(changes: set[watchfiles.main.FileChange]):
"""
Prevent simultaneous reports of creation and deletion of one file.
`watchfiles` might report creation and deletion of the same file at once,
and since results are returned in an unordered set, the latest event can't
be deduced from the results alone. This function explicitly tests for the
existence of paths in this ambiguous situation and emits only the events
that represent the current path state.
Specifically, for each unique path in `changes`:
- Emit `Change.modified` events as is.
- Emit one `Change.added` event if and only if:
- there are no `Change.deleted` events for this path; or
- there are also `Change.deleted` events, but the path exists.
- Emit one `Change.deleted` event if and only if:
- there are no `Change.added` events for this path; or
- there are also `Change.added` events, but the path doesn't exist.
See the related GitHub issue for further details:
https://github.com/samuelcolvin/watchfiles/issues/148
```
from collections import defaultdict
from pathlib import Path
import watchfiles
for changes in watchfiles.watch("/path"):
for change, path_str in without_multifile_events(changes):
...
```
"""
changes_by_path = defaultdict[str, set[watchfiles.Change]](set)
for change_tuple in changes:
change, path = change_tuple
if change is watchfiles.Change.modified:
yield change_tuple
else:
changes_by_path[path].add(change)
for path, changes_only in changes_by_path.items():
if len(changes_only) == 1:
(single_change,) = changes_only
else:
single_change = (
watchfiles.Change.added
if Path(path).exists()
else watchfiles.Change.deleted
)
yield single_change, path
This would involve making changes to the Rust code and I have no Rust experience, so I respectfully decline. |
Ok, the rust wouldn't be that hard. I'll leave this open as a python interface to the underlying changes would be useful. |
In my experience your OS also can't reliably tell you this. And it may have changed in between your notification and you using that file. So checking is definitely the right way. That said I can understand wanting to correctly handle the case where it does actually work as expected. |
@samuelcolvin - you obviously have a much closer relationship to the typical use cases of end users utilizing this package, but in my case (monitoring many files simultaneously on a shared network drive for updates which then trigger downstream processes) @Lx's assessment and workaround has been critical in accurately identifying inflection points relative to a particular file. It seems to me (and I only offer this since the topic is still open) that a package-native solve would be something a substantive percentage of users would find utility in having, rather than having to run this out to pathlib and defaultdict to manage externally. As you mentioned above, I don't believe the request to be outside the scope of the Notify implementation this leans on, so it should be fairly simple to implement. |
Maybe @Lx is right and this is an issue with the use of Maybe we could use https://docs.rs/indexmap/latest/indexmap/set/struct.IndexSet.html and thereby maintain order. Pr welcome to implement this, we should be able to tell just from creating the pr if it fixes the issue. |
Description
I'm trying to monitor file creations and deletions using
awatch
. I'm expecting that if files are rapidly created and then deleted,watchfiles
would either:I'm getting a
Set[FileChange]
back with two entries for the same file as expected, but when I enumerate over the set, the deletion is being reported before the addition.I've tried setting
debounce=0
andstep=0
in myawatch
call in the hope that this will prevent the events from being bundled within the same set, but this doesn't work.Debug output suggests that the raw events are being emitted in the correct order. Possibly
awatch
should be returning aSequence
rather than a set?Example Code
Example Code Output
Operating System
macOS 12.4 (21F79)
Environment
No response
Watchfiles Version
0.14.1
Python Version
Python 3.10.4
The text was updated successfully, but these errors were encountered: