-
Notifications
You must be signed in to change notification settings - Fork 207
Feat!: Ensure metadata snapshots with modified audits are still audited #4341
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
7c40682
to
eebae76
Compare
eebae76
to
b48005a
Compare
@@ -1069,6 +1069,34 @@ def _data_hash_values(self) -> t.List[str]: | |||
|
|||
return data # type: ignore | |||
|
|||
def _audit_metadata(self) -> t.List[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_audit_metadata_hash_values
to be consistent with _data_hash_values
start, end = interval | ||
|
||
try: | ||
scheduler._audit_snapshot( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I like this API. Can we somehow reuse the entire scheduler machinery as is by supporting audit_only
mode in run
API? With audit_only
we'll bypass evaluate and interval recording but do everything else. We can also use the restatements
argument to remove relevant intervals so that they can be reprocessed again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that was my initial approach I didn't find it better, we would end up hacking run()
to switch off all lines except of the audit. This will get messy with if not audit_only
everywhere imo, but happy to give it a try.
We can also use the restatements argument to remove relevant intervals so that they can be reprocessed again.
Sorry if I'm not understanding, but won't this add perf regressions to this PR? Currently, we retrieve the intervals from the latest snapshot so we can run the audits without having to reevaluate or compute anything.
Afaict, with restatements we can simply pass in [model_start, today]
for each snapshot and get this "computation" for free, but in exchange we have to remove + recompute + restore the intervals all over again only to get the same interval set back, right?
previous_snapshot_id | ||
] | ||
|
||
new_audits = snapshot.model._audit_metadata() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest having a proper top level method for this audit_metadata_hash
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you'd like a top-level method in the model definition that returns both the list and it's hash, right?
# core.model.definition
def audit_metadata_hash(self):
audits = self._audit_metadata()
return audits, hash_data(audits)
So then the highlighted section would be:
# core.plan.evaluator
_, previous_audits_hash = previous_snapshot.model.audit_metadata_hash()
new_audits, new_audits_hash = snapshot.model.audit_metadata_hash()
if (previous_audits_hash != new_audits_hash) and new_audits:
...
Fixes #3970