Add metrics and audit logging for async operations #10033
base: master
Conversation
Introduce TaskObserver interface to enable monitoring of async task lifecycle events (submit, start, complete, expire). This provides a hook point for Enterprise to collect metrics without modifying core async operation logic. OSS uses a no-op implementation; Enterprise can override via BuildTaskObserver factory.
itaigilo
left a comment
Thanks @zubron for handling this -
I assume that any change that also involves the Enterprise repo can be tricky,
And this one is handled pretty nicely.
Some comments, mainly about the interface.
modules/catalog/factory/build.go
Outdated
// BuildTaskObserver returns the task observer for async operation lifecycle events.
// OSS returns a no-op observer. Enterprise overrides this to provide metrics.
Suggested change:
- // BuildTaskObserver returns the task observer for async operation lifecycle events.
- // OSS returns a no-op observer. Enterprise overrides this to provide metrics.
+ // BuildTaskObserver returns a no-op task observer for Tasks lifecycle events.
Not specific to async ops, but every Task created by the catalog.
Plus, I'd avoid mentioning Enterprise, since it's pretty much implicit here.
After a discussion with @guy-har earlier, I ended up moving all the metrics into this change so there is no need for the factory function here. Removed.
pkg/catalog/catalog.go
Outdated
}

// Notify observer that execution has begun
notifyObserver(taskID, func() {
Why not notify the observer as part of UpdateTaskStatus(),
And let the observer decide about the logic?
pkg/catalog/catalog.go
Outdated
func (c *Catalog) executeTaskSteps(ctx context.Context, log logging.Logger, repository *graveler.RepositoryRecord, taskID string, task *Task, taskStatus protoreflect.ProtoMessage, steps []TaskStep) {
	// Mark task as started
	task.UpdatedAt = timestamppb.Now()
	if err := UpdateTaskStatus(ctx, c.KVStore, repository, taskID, taskStatus); err != nil {
This is risky, because it's not only about adding metrics, but also adds a task update.
Also - why does this mark a task as "started"? Isn't it the same call as in line 2329?
pkg/catalog/catalog.go
Outdated
if err := UpdateTaskStatus(ctx, c.KVStore, repository, taskID, taskStatus); err != nil {
	log.WithError(err).Error("Catalog failed to update task status")
}
// Notify observer of completion (success with no steps)
I know that such comments are both useful for agents and agents like to add them,
But we should decide -
Either add these comments to all the code in a file, or not to add them at all.
Having these comments sporadically makes me think that there's something worth noting there, in these cases of self-explanatory code.
Yep, I had gone through and removed these self-explanatory comments on the other change but not this one. Thanks for pointing it out. Definitely something to think about moving forward, and should be part of the agent context discussion 👍
pkg/catalog/catalog.go
Outdated
// and the provided expiry duration. If expired, marks the task as done with timeout error
// and notifies the observer. The observer is notified exactly once when the task
// transitions to expired state (already-done tasks are skipped).
func checkAndMarkTaskExpired(statusMsg protoreflect.ProtoMessage, expiryDuration time.Duration, observer TaskObserver) {
Any reason not to make this function a "member" of Catalog and use c.observer instead?
Thanks for pointing this out - turns out I misunderstood this function and realised that it doesn't actually persist the expiry to the KV. This is called every time the commit/merge status is queried so we can't make any guarantee about only counting an expired task once. I ended up moving the expiry notification elsewhere. The change to this function has been reverted.
pkg/catalog/catalog.go
Outdated
// and the provided expiry duration. If expired, marks the task as done with timeout error
// and notifies the observer. The observer is notified exactly once when the task
// transitions to expired state (already-done tasks are skipped).
func checkAndMarkTaskExpired(statusMsg protoreflect.ProtoMessage, expiryDuration time.Duration, observer TaskObserver) {
Also, notifying the observer is detailed in the comment but not in the func name.
Should it be reflected in the func name?
pkg/catalog/task_observer.go
Outdated
// TaskObserver receives notifications about task lifecycle events.
// Implementations can use these for metrics, logging, or audit trails.
// All callbacks are synchronous and should be fast.
type TaskObserver interface {
Again, I believe that a single OnTaskUpdated() method would be cleaner and more agnostic, for future use.
Keep all task metric collection in this repository. Remove the use of the factory function to create the observer and instead use the Async metrics observer. Move the expiry notification from the heartbeat check to cleanup of expired tasks. Expiry status isn't persisted in the KV so it would be counted during every task status check.
Thanks for the review, @itaigilo! Really helpful comments 👍 Re: your suggestions about using a single method - UpdateTaskStatus is a simple write function that persists task state to KV without lifecycle knowledge. Adding observer notifications there would muddy its responsibility (it becomes both persistence layer and event dispatcher) and require lifecycle inference - it doesn't know if this write is "submitted", "started", "completed", or just a heartbeat. The caller has that context. With a single OnTaskUpdated(taskID, task) method, every observer implementation would need to track previous state internally, diff new state against old, and infer the lifecycle phase (start? completion? failure?). This pushes complexity into every observer rather than keeping it at the call site where the lifecycle phase is already known. The current interface is more explicit and I believe less error-prone. Open to discussing more though if you have other concerns!
Well, my main point was not setting the task's lifecycle in the notifier, but having the observer decide about the lifecycle based on the task's content. This is more agnostic and more reusable for future use-cases. I guess the goal is to keep the OSS as lean as possible in such cases, and expose very little in these interfaces. You are right about actually not overloading
pkg/catalog/async_metrics.go
Outdated
}

var (
	asyncOperationsPending = promauto.NewGaugeVec(
Why these appear now both here and on Enterprise?
After discussing with @guy-har, he recommended moving the metric definitions into this repo. The Enterprise PR has been closed.
Ah - I see what you mean. Sorry, I think I misunderstood your previous comment! I'll add that change 👍
Sounds good! I think there are some broader changes to make here after your comments and comments from @guy-har so I'm going to put this PR into draft for now. I'll re-request review when it's ready.
guy-har
left a comment
Commenting here on what we talked about F2F.
I think the observer here is implemented nicely, but I believe that in our case which is adding metrics, the observer is much more than required and adds a bit of complexity here.
IMO if we decide to have metrics for the AsyncOperator with labels for the specific operations we will:
- Reduce the need for the observer
- Have metrics for all our async operations (e.g. refs dump)
It should look something like:
var taskDurationHistograms = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "...",
		Help: "...",
	},
	[]string{"task_type", "success", "status_code"})

// executeTaskSteps runs each step sequentially, updating task status after each step.
func (c *Catalog) executeTaskSteps(ctx context.Context, log logging.Logger, repository *graveler.RepositoryRecord, taskID string, task *Task, taskStatus protoreflect.ProtoMessage, steps []TaskStep) {
	// Mark task as started
	start := time.Now()
	taskName := "willknowthis"
	// Wrap in a closure so the labels and duration are evaluated when the
	// function returns, not when the defer statement executes.
	defer func() {
		taskDurationHistograms.
			WithLabelValues(taskName, strconv.FormatBool(task.ErrorMsg == ""), strconv.Itoa(int(task.StatusCode))).
			Observe(time.Since(start).Seconds())
	}()
	// ...
}

Add a metric (in this example taskDurationHistograms). At the beginning of executeTaskSteps, defer a call to the metric with Observe.
Introduces TaskMonitor to track async task lifecycle (commit, merge, dump_refs, restore_refs, gc_prepare_commits) with Prometheus metrics and structured logging.
Metrics:
- lakefs_async_operations_running (gauge): currently executing operations
- lakefs_async_operations_total (counter): completed operations by status
- lakefs_async_operation_duration_seconds (histogram): operation duration
Status labels distinguish outcomes:
- success: completed successfully
- failure: completed with error
- expired: completed but exceeded client-facing deadline
- orphaned: stopped heartbeating, cleaned up by background job
itaigilo
left a comment
Thanks @zubron, this implementation looks much cleaner.
Overall looks good,
Adding some small comments,
And leaving the approval for others with a bit more context.
pkg/catalog/task_monitor.go
Outdated
	return
}

asyncOperationsRunning.WithLabelValues(ts.operation).Dec()
If End() is accidentally called twice, the gauge decrements twice but was only incremented once, causing negative values. Consider:
type TaskSpan struct {
// ...
ended bool // or use sync/atomic
}
And:
func (ts *TaskSpan) End() {
if ts.operation == "" || ts.task == nil || ts.ended {
return
}
ts.ended = true
// ...
}
pkg/catalog/task_monitor.go
Outdated
status := statusSuccess
if ts.task.StatusCode == http.StatusRequestTimeout {
	status = statusExpired
} else if ts.task.ErrorMsg != "" {
Suggesting to prioritize error over expired -
Not realistic in the current flow, but this aligns with the way we write go code.
pkg/catalog/task_monitor.go
Outdated
commitAsyncTaskIDPrefix = "CA"
mergeAsyncTaskIDPrefix = "MA"
Tasks are async anyway. I suggest "CMT" and "MRG" as prefixes.
Not blocking, letting others with more context approve.
guy-har
left a comment
Great change!
Thank you!
Requesting changes mainly due to the taskIDToOperation, I prefer we don't infer it from the prefix.
pkg/catalog/task_monitor.go
Outdated
func (m *TaskMonitor) taskIDToOperation(taskID string) string {
	switch {
	case strings.HasPrefix(taskID, commitAsyncTaskIDPrefix):
		return opCommit
	case strings.HasPrefix(taskID, mergeAsyncTaskIDPrefix):
		return opMerge
	case strings.HasPrefix(taskID, DumpRefsTaskIDPrefix):
		return opDumpRefs
	case strings.HasPrefix(taskID, RestoreRefsTaskIDPrefix):
		return opRestoreRefs
	case strings.HasPrefix(taskID, GarbageCollectionPrepareCommitsPrefix):
		return opGCPrepareCommits
	default:
		m.logger.WithField(taskIDFieldKey, taskID).Debug("Unknown task ID prefix, skipping metrics")
		return ""
	}
}
Inferring the operation from the prefix requires a maintenance overhead of:
- Updating this function for each new task type.
- Inserting external information, such as the Async commit prefix
I suggest adding the operation type to the Protobuf of the task; each task will be submitted with its type, and that will be used as the metric operation.
pkg/catalog/catalog.go
Outdated
span := c.taskMonitor.StartSpan(ctx, task, string(repository.RepositoryID))
defer span.End() // reads status from task.StatusCode/ErrorMsg
Why did we choose the task monitor to be a member of the catalog?
I think it would be cleaner to call the metric...observer directly from the catalog. We can have a helper function if we don't want to do it directly. But I'm not sure I understand why this is a catalog member
You're right, it does not need to be a member of the catalog. I've updated it so it's just calling the start and end functions directly with the available logger.
pkg/catalog/task_monitor.go
Outdated
if ctx != nil {
	if user, err := auth.GetUser(ctx); err == nil && user != nil {
		fields[userIDFieldKey] = user.Username
	}
	if reqID := httputil.RequestIDFromContext(ctx); reqID != nil {
		fields[logging.RequestIDFieldKey] = *reqID
	}
How can this happen?
pkg/catalog/task_monitor.go
Outdated
// NewTaskMonitor creates a new TaskMonitor with the given logger and audit log level.
func NewTaskMonitor(logger logging.Logger, auditLogLevel string, isAdvancedAuth bool) *TaskMonitor {
This is in charge of both metrics and logs; if we choose to keep it that way, this should be documented somehow.
@zubron, can we change the title to something more generic - Add metrics for Async operations
- Add 'operation' field to Task proto to explicitly identify operation type instead of inferring from task ID prefix
- Replace TaskMonitor struct with stateless functions (StartTaskSpan, RecordOrphanedTask) that take logger as parameter
- Remove TaskMonitor from Catalog config - use logger from call site
- Add double-call protection to TaskSpan.End()
- Reorder RunBackgroundTaskSteps params: (ctx, repo, operation, taskID, ...)
- Rename task_monitor.go to task_observability.go
Change Description
Adds Prometheus metrics and structured logging for async task lifecycle (commit, merge, dump_refs, restore_refs, gc_prepare_commits). Async operations return immediately so existing middleware only captures initial request/response metrics - this change tracks the complete async lifecycle.
Implementation
- operation field added to the Task proto to explicitly identify the operation type
- StartTaskSpan(log, task) returns a span; defer span.End() records completion

Metrics
- lakefs_async_operations_running (gauge): currently executing operations
- lakefs_async_operations_total (counter): completed operations by status
- lakefs_async_operation_duration_seconds (histogram): operation duration

Status labels distinguish outcomes:
- success: completed successfully
- failure: completed with error
- expired: completed but exceeded client-facing deadline
- orphaned: stopped heartbeating, cleaned up by background job

Testing Details
Unit tests and manual testing locally
Closes #10047 / treeverse/lakeFS-Enterprise#1298