[o365] Simplification of data fetching logic #16024

chrisberkhout · 2025-11-19T15:49:15Z

Proposed commit message

[o365] Simplification of data fetching logic

Flattens the structure of the CEL program.

Simplifications:
- Make only one request per evaluation (or none).
- Expired items are skipped during fetching rather than being filtered
  out in multiple places.
- The non-canonical header name `NextPageUri` is no longer considered,
  as it's always normalized by the HTTP client.
- Assume that items can be fetched in the order they are listed.
- Assume that content items will not be empty.
- Update the `last_for` times once based on the listing range, rather
  than repeatedly (with the same value) for each followed listing link.
- Unify handling of generated listing links (for initial requests) and
  received listing links (for later pages).
- Subscribe once per input start (an alternative to once for the life of
  cursor data, as introduced in #15476).

Other changes:
- Moves some state into `state.cursor`, so that it persists across
  restarts: `state.work.todo_content` → `state.cursor.todo_content`,
  `state.work.next_list` (string) → `state.cursor.todo_links` (array).
- Renames `state.work.todo_type` (array) → `state.todo_types` (plural
  name, array). This stays out of the cursor because it can be
  reconstructed.
- Do all subscriptions first, then rotate types so everything is roughly
  chronological rather than type-by-type.
- Adds `state.subscribed` (map). It's not in the cursor data because we
  want to resubscribe if restarted.
- Keep querying until the time 3 seconds before the start has been
  reached (exclusive). The 3 second buffer avoids requesting times that
  may have unstable results.
- Log an error if no type is configured.
- The `max_executions` limit is raised. Getting up to date means
  hour-long listings for 168 hours of data, possibly over multiple pages
  each, likely for multiple content types, and fetching everything that
  was listed.

A mock server is added and used for system tests.

Notes for the reviewer

I started this before the last 4 PRs, and I checked that there are no conflicts with those changes:

- o365: tolerate changed API next page URI behaviour #15325
  We no longer parse the next page URLs, since we have the endTime from the initial request.
- o365: fix handling of error conditions when requesting work continuation #15380
- o365: fix error propagation within cel program #15445
  This error handling logic isn't relevant in the new flattened structure.
- o365: Fix 429 due to multiple subscription start attempts. #15476
  The new code still limits subscription requests, but to a lesser degree, which I think is better overall.

The new CEL code passes with the original system tests.

The system tests have been updated to use a mock o365 server rather than the stream tool. The mock server has configurability, assertions and logging beyond what is used in the system test, which may be useful for future debugging. The amount of code is not insignificant, and the quality is just okay, so if this is a maintenance concern, it can be moved to a separate PR or removed entirely. Let me know what you think.

Checklist

I have reviewed tips for building integrations and this pull request is aligned with them.
I have verified that all data streams collect metrics or logs.
I have added an entry to my package's changelog.yml file.
I have verified that Kibana version constraints are current according to guidelines.
I have verified that any added dashboard complies with Kibana's Dashboard good practices

How to test this PR locally

You can manually run the mock server like this:

go run ./_dev/deploy/docker/o365mock.go chunks_with_gaps_and_1_expired

In another terminal, run the CEL code in mito like this:

mito \
  -cfg <(echo '
auth:
  oauth2:
    client.id: test-cel-client-id
    client.secret: test-cel-client-secret
    provider: azure
    scopes:
      - "https://manage.office.com/.default"
    endpoint_params:
      grant_type: ["client_credentials"]
    token_url: http://localhost:9999/test-cel-tenant-id/oauth2/v2.0/token
') \
   -data <(echo '
{
	"url": "http://localhost:9999",
	"want_more": false,
	"base": {
		"tenant_id": "test-cel-tenant-id",
		"list_contents_start_time": "15h",
		"batch_interval": "1h",
		"maximum_age": "167h55m",
		"content_types": "Audit.AzureActiveDirectory, Audit.Exchange"
	}
}
') \
  -log_requests \
  <(awk '/^program:/{iscel=1; next} /^\{\{/{iscel=0} iscel' ./data_stream/audit/agent/stream/cel.yml.hbs)

Stop the mock server with Ctrl+C to trigger its shutdown report.

You can remove the mito auth configuration if you disable CheckAccessToken in the mock server's inline configuration.

This version of mito will run faster than the following one, which rate limits to 1 rps:

go install github.com/elastic/mito/cmd/mito@835128

Related issues

Closes o365: flatten structure, one request per evaluation #15066

elasticmachine · 2025-11-19T15:49:21Z

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

packages/o365/_dev/deploy/docker/o365mock.go

packages/o365/data_stream/audit/agent/stream/cel.yml.hbs

efd6 · 2025-11-20T20:18:44Z

packages/o365/data_stream/audit/agent/stream/cel.yml.hbs

Because we lean on repeated input loops more now, I think it may be worth doing the dropping of the retry events in the agent with a beat processor, rather than sending the event to be dropped by the ingest pipeline. An example of this is here corresponding to this alternative way of saying {"retry": true}.

elastic-vault-github-plugin-prod · 2025-11-21T09:12:14Z

🚀 Benchmarks report

To see the full report comment with /test benchmark fullreport

elasticmachine · 2025-11-21T16:24:10Z

💚 Build Succeeded

Buildkite Build
Commit: f2827ac

History

💚 Build #34466 succeeded 12244da
💔 Build #34464 failed d0e4387
💔 Build #34461 failed 3645b09
💔 Build #34432 failed aa3b3f0
💔 Build #34430 failed 74f863d
💔 Build #34388 failed e620dd5

cc @chrisberkhout

chrisberkhout self-assigned this Nov 19, 2025

chrisberkhout requested a review from a team as a code owner November 19, 2025 15:49

chrisberkhout added enhancement New feature or request Integration:o365 Microsoft Office 365 Team:Security-Service Integrations Security Service Integrations team [elastic/security-service-integrations] labels Nov 19, 2025

efd6 reviewed Nov 19, 2025

View reviewed changes

chrisberkhout requested a review from efd6 November 20, 2025 15:59

efd6 reviewed Nov 20, 2025

View reviewed changes

chrisberkhout added 18 commits November 21, 2025 08:37

celfmt.

9a6dffb

New version.

aa4ebcf

celfmt.

6fe28a6

Work around CEL types error.

23ae05f

Run system tests against o365mock.go instead of stream.

62857a7

Raise the max_executions limit.

57d9832

Version bump, changelog update.

dad7803

Fix typo.

2f73879

Min args check can go at the start because it doesn't need anything done later.

b10431a

Simplify sort function.

3bb1e0f

Concatenate strings the easy way.

35e9d91

Simplify time range tracking.

36a54a0

Simplify expiry check.

0bb02a6

Minimize exports.

f2aca23

Reorder sections to have entry point at the top.

80dca3f

Add copyright notice to o365mock.go.

c682c0c

Remove a duplicate event from o365mock fetchItemPool (the Id override means it won't be a duplicate during ingest anyway).

c863803

In the fetchItemPool, add ExtendedProperties to an event, like was done in elastic#15981.

d0e4387

chrisberkhout force-pushed the o365-flatten branch from 3645b09 to d0e4387 Compare November 21, 2025 08:01

chrisberkhout added 2 commits November 21, 2025 09:45

Update policy tests for new CEL code.

3a2c2dc

Update policy tests for new setting max_executions: 10000.

12244da

Add overview.

f2827ac
