
e2e: refactor metrics test to use NSD and WI #19022

Merged
merged 7 commits into main on Nov 9, 2023

Conversation

shoenig
Contributor

@shoenig shoenig commented Nov 7, 2023

This PR overhauls the metrics e2e suite, which has been failing for about a year. It swaps out the use of Consul in favor of Nomad native service discovery (NSD) and workload identity (WI).

Basically it runs a handful of assorted jobs that produce metrics, then uses Prometheus to gather those metrics. Caddy is used to expose the Prometheus API so it is reachable from the test runner. The little nomad-holepunch utility runs as a sidecar to let Prometheus access the Nomad API using workload identity, and it also runs as a service job to represent each Nomad client in the Nomad service registry.
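
For reference, the NSD + WI pattern at the heart of this looks roughly like the sketch below. This is a minimal illustration, not the actual job file from the suite; the job, service, and port names (and the docker driver) are placeholders.

job "prometheus" {
  group "metrics" {
    network {
      port "http" {}
    }

    # register with Nomad native service discovery instead of Consul
    service {
      provider = "nomad"
      name     = "prometheus"
      port     = "http"
    }

    task "prometheus" {
      driver = "docker" # placeholder; the suite mixes several drivers

      # workload identity: expose a NOMAD_TOKEN so the task can call the Nomad API
      identity {
        env = true
      }

      config {
        image = "prom/prometheus"
        ports = ["http"]
      }
    }
  }
}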

@shoenig
Contributor Author

shoenig commented Nov 8, 2023

Spot check against e2e

nomad/e2e/metrics on e2e-metrics-metrics-metrics
➜ go test -v
=== RUN   TestMetrics
    metrics_test.go:52: tweaking podman registry auth files ...
    metrics_test.go:56: running metrics job cpustress ...
    metrics_test.go:60: running metrics job nomadagent ...
    metrics_test.go:64: running metrics job prometheus ...
    metrics_test.go:68: running metrics job pythonhttp ...
    metrics_test.go:72: running metrics job caddy ...
    metrics_test.go:76: let the metrics collect for a bit (10s) ...
    metrics_test.go:79: measuring alloc metrics ...
    metrics_test.go:153: expose prometheus http address http://3.86.25.98:9999
    metrics_test.go:174: query for metric nomad_client_allocs_memory_usage{exported_job="nomadagent-928"}
    metrics_test.go:185: -> empty vector, will try again in 5 seconds
    metrics_test.go:174: query for metric nomad_client_allocs_memory_usage{exported_job="nomadagent-928"}
    metrics_test.go:185: -> empty vector, will try again in 5 seconds
    metrics_test.go:174: query for metric nomad_client_allocs_memory_usage{exported_job="nomadagent-928"}
    metrics_test.go:185: -> empty vector, will try again in 5 seconds
    metrics_test.go:174: query for metric nomad_client_allocs_memory_usage{exported_job="nomadagent-928"}
    metrics_test.go:185: -> empty vector, will try again in 5 seconds
    metrics_test.go:174: query for metric nomad_client_allocs_memory_usage{exported_job="nomadagent-928"}
    metrics_test.go:185: -> empty vector, will try again in 5 seconds
    metrics_test.go:174: query for metric nomad_client_allocs_memory_usage{exported_job="nomadagent-928"}
    metrics_test.go:174: query for metric nomad_client_allocs_cpu_user{exported_job="cpustress-348"}
    metrics_test.go:174: query for metric nomad_client_allocs_cpu_allocated{exported_job="pythonhttp-071"}
    metrics_test.go:94: measuring client metrics ...
    metrics_test.go:153: expose prometheus http address http://3.86.25.98:9999
    metrics_test.go:174: query for metric nomad_client_allocated_memory{node_id="0169fb1a-8631-a65d-dc26-45c7f47dbf69"}
    metrics_test.go:174: query for metric sum(nomad_client_host_cpu_user{node_id="0169fb1a-8631-a65d-dc26-45c7f47dbf69"})
    metrics_test.go:174: query for metric nomad_client_host_memory_used{node_id="0169fb1a-8631-a65d-dc26-45c7f47dbf69"}
    metrics_test.go:174: query for metric nomad_client_uptime{node_id="0169fb1a-8631-a65d-dc26-45c7f47dbf69"}
    metrics_test.go:174: query for metric nomad_client_allocated_memory{node_id="c60934dc-ca04-dd43-76cb-478b59a7fe56"}
    metrics_test.go:174: query for metric sum(nomad_client_host_cpu_user{node_id="c60934dc-ca04-dd43-76cb-478b59a7fe56"})
    metrics_test.go:174: query for metric nomad_client_host_memory_used{node_id="c60934dc-ca04-dd43-76cb-478b59a7fe56"}
    metrics_test.go:174: query for metric nomad_client_uptime{node_id="c60934dc-ca04-dd43-76cb-478b59a7fe56"}
    metrics_test.go:174: query for metric nomad_client_allocated_memory{node_id="5123e03c-b00d-523a-1c7c-e1f09404c0e5"}
    metrics_test.go:174: query for metric sum(nomad_client_host_cpu_user{node_id="5123e03c-b00d-523a-1c7c-e1f09404c0e5"})
    metrics_test.go:174: query for metric nomad_client_host_memory_used{node_id="5123e03c-b00d-523a-1c7c-e1f09404c0e5"}
    metrics_test.go:174: query for metric nomad_client_uptime{node_id="5123e03c-b00d-523a-1c7c-e1f09404c0e5"}
    metrics_test.go:174: query for metric nomad_client_allocated_memory{node_id="5733b7f2-f742-dc07-18af-73132c844e56"}
    metrics_test.go:174: query for metric sum(nomad_client_host_cpu_user{node_id="5733b7f2-f742-dc07-18af-73132c844e56"})
    metrics_test.go:174: query for metric nomad_client_host_memory_used{node_id="5733b7f2-f742-dc07-18af-73132c844e56"}
    metrics_test.go:174: query for metric nomad_client_uptime{node_id="5733b7f2-f742-dc07-18af-73132c844e56"}
--- PASS: TestMetrics (77.29s)
PASS
ok      github.com/hashicorp/nomad/e2e/metrics  77.299s

@shoenig shoenig marked this pull request as ready for review November 8, 2023 15:08
@shoenig shoenig requested review from lgfa29 and jrasell November 8, 2023 15:12
Contributor

@lgfa29 lgfa29 left a comment


Great work! It seems like a nice mix of task drivers; would it make sense to use Docker for one of the Podman tasks to add even more flavours?

Comment on lines +64 to +69
http:// {
{{ $allocID := env "NOMAD_ALLOC_ID" -}}
{{ range nomadService 1 $allocID "prometheus" }}
reverse_proxy {{ .Address }}:{{ .Port }}
{{ end }}
}
Contributor

Suggested change
http:// {
{{ $allocID := env "NOMAD_ALLOC_ID" -}}
{{ range nomadService 1 $allocID "prometheus" }}
reverse_proxy {{ .Address }}:{{ .Port }}
{{ end }}
}
http:// {
reverse_proxy {{ range nomadService 1 $allocID "prometheus" }}{{ .Address }}:{{ .Port }} {{end}}
}

Not that it matters much in this case, but we could let Caddy handle the load balancing. I'm also not too familiar with Caddyfiles, so I hope this is correct 😅

Contributor Author

I gave this a try and it failed ... probably not going to spend the time to investigate; there are so many other test failures to look into 😓


config {
command = "stress"
args = ["--cpu", "1", ]
Contributor

Suggested change
args = ["--cpu", "1", ]
args = ["--cpu", "1"]

}

task "cpustress" {
driver = "pledge"
Contributor

Nice!

Comment on lines +46 to +47
# run a private holepunch instance in this group network
# so prometheus can access the nomad api for service disco
Contributor

Nice x2
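
As an aside, the sidecar arrangement those comments describe could look something like the sketch below; the task name, port label, driver, and image are assumptions for illustration, not the ones in the actual job.

task "holepunch" {
  driver = "docker" # driver and image here are assumptions

  # keep the proxy running alongside prometheus for the life of the alloc
  lifecycle {
    hook    = "prestart"
    sidecar = true
  }

  # workload identity provides the token the sidecar uses for the Nomad API
  identity {
    env = true
  }

  config {
    image = "example/nomad-holepunch"
    ports = ["holepunch"]
  }
}

With something along these lines in the same group network, Prometheus can point its service-discovery configuration at the sidecar's local address instead of needing its own Nomad token.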

@shoenig
Contributor Author

shoenig commented Nov 9, 2023

would it make sense to use Docker for one of the Podman tasks to add even more flavours?

Yeah, for sure, we can look into expanding this, especially once we get the Windows client back.

@shoenig shoenig merged commit a28e5b6 into main Nov 9, 2023
17 checks passed
@shoenig shoenig deleted the e2e-metrics-metrics-metrics branch November 9, 2023 14:21