
e2e: refactor metrics test to use NSD and WI #19022

Merged
merged 7 commits into main on Nov 9, 2023

Conversation

shoenig
Contributor

@shoenig shoenig commented Nov 7, 2023

This PR overhauls the metrics e2e suite, which has been failing for about a year. It swaps out the use of Consul in favor of Nomad native service discovery (NSD) and workload identity (WI).

Basically it runs a handful of assorted jobs that produce metrics, then uses Prometheus to gather those metrics. Caddy is used to expose the Prometheus API so it is reachable from the test runner. The little nomad-holepunch utility runs as a sidecar to let Prometheus access the Nomad API using workload identity, and it also runs as a service job to represent each Nomad client in the Nomad service registry.
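
For reference, the NSD + WI pattern at the heart of this looks roughly like the sketch below. This is a minimal illustration, not the actual job file from the suite; the job, service, and port names (and the docker driver) are placeholders.

job "prometheus" {
  group "metrics" {
    network {
      port "http" {}
    }

    # register with Nomad native service discovery instead of Consul
    service {
      provider = "nomad"
      name     = "prometheus"
      port     = "http"
    }

    task "prometheus" {
      driver = "docker" # placeholder; the suite mixes several drivers

      # workload identity: expose a NOMAD_TOKEN so the task can call the Nomad API
      identity {
        env = true
      }

      config {
        image = "prom/prometheus"
        ports = ["http"]
      }
    }
  }
}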

@shoenig
Contributor Author

shoenig commented Nov 8, 2023

Spot check against e2e

nomad/e2e/metrics on e2e-metrics-metrics-metrics
➜ go test -v
=== RUN   TestMetrics
    metrics_test.go:52: tweaking podman registry auth files ...
    metrics_test.go:56: running metrics job cpustress ...
    metrics_test.go:60: running metrics job nomadagent ...
    metrics_test.go:64: running metrics job prometheus ...
    metrics_test.go:68: running metrics job pythonhttp ...
    metrics_test.go:72: running metrics job caddy ...
    metrics_test.go:76: let the metrics collect for a bit (10s) ...
    metrics_test.go:79: measuring alloc metrics ...
    metrics_test.go:153: expose prometheus http address http://3.86.25.98:9999
    metrics_test.go:174: query for metric nomad_client_allocs_memory_usage{exported_job="nomadagent-928"}
    metrics_test.go:185: -> empty vector, will try again in 5 seconds
    metrics_test.go:174: query for metric nomad_client_allocs_memory_usage{exported_job="nomadagent-928"}
    metrics_test.go:185: -> empty vector, will try again in 5 seconds
    metrics_test.go:174: query for metric nomad_client_allocs_memory_usage{exported_job="nomadagent-928"}
    metrics_test.go:185: -> empty vector, will try again in 5 seconds
    metrics_test.go:174: query for metric nomad_client_allocs_memory_usage{exported_job="nomadagent-928"}
    metrics_test.go:185: -> empty vector, will try again in 5 seconds
    metrics_test.go:174: query for metric nomad_client_allocs_memory_usage{exported_job="nomadagent-928"}
    metrics_test.go:185: -> empty vector, will try again in 5 seconds
    metrics_test.go:174: query for metric nomad_client_allocs_memory_usage{exported_job="nomadagent-928"}
    metrics_test.go:174: query for metric nomad_client_allocs_cpu_user{exported_job="cpustress-348"}
    metrics_test.go:174: query for metric nomad_client_allocs_cpu_allocated{exported_job="pythonhttp-071"}
    metrics_test.go:94: measuring client metrics ...
    metrics_test.go:153: expose prometheus http address http://3.86.25.98:9999
    metrics_test.go:174: query for metric nomad_client_allocated_memory{node_id="0169fb1a-8631-a65d-dc26-45c7f47dbf69"}
    metrics_test.go:174: query for metric sum(nomad_client_host_cpu_user{node_id="0169fb1a-8631-a65d-dc26-45c7f47dbf69"})
    metrics_test.go:174: query for metric nomad_client_host_memory_used{node_id="0169fb1a-8631-a65d-dc26-45c7f47dbf69"}
    metrics_test.go:174: query for metric nomad_client_uptime{node_id="0169fb1a-8631-a65d-dc26-45c7f47dbf69"}
    metrics_test.go:174: query for metric nomad_client_allocated_memory{node_id="c60934dc-ca04-dd43-76cb-478b59a7fe56"}
    metrics_test.go:174: query for metric sum(nomad_client_host_cpu_user{node_id="c60934dc-ca04-dd43-76cb-478b59a7fe56"})
    metrics_test.go:174: query for metric nomad_client_host_memory_used{node_id="c60934dc-ca04-dd43-76cb-478b59a7fe56"}
    metrics_test.go:174: query for metric nomad_client_uptime{node_id="c60934dc-ca04-dd43-76cb-478b59a7fe56"}
    metrics_test.go:174: query for metric nomad_client_allocated_memory{node_id="5123e03c-b00d-523a-1c7c-e1f09404c0e5"}
    metrics_test.go:174: query for metric sum(nomad_client_host_cpu_user{node_id="5123e03c-b00d-523a-1c7c-e1f09404c0e5"})
    metrics_test.go:174: query for metric nomad_client_host_memory_used{node_id="5123e03c-b00d-523a-1c7c-e1f09404c0e5"}
    metrics_test.go:174: query for metric nomad_client_uptime{node_id="5123e03c-b00d-523a-1c7c-e1f09404c0e5"}
    metrics_test.go:174: query for metric nomad_client_allocated_memory{node_id="5733b7f2-f742-dc07-18af-73132c844e56"}
    metrics_test.go:174: query for metric sum(nomad_client_host_cpu_user{node_id="5733b7f2-f742-dc07-18af-73132c844e56"})
    metrics_test.go:174: query for metric nomad_client_host_memory_used{node_id="5733b7f2-f742-dc07-18af-73132c844e56"}
    metrics_test.go:174: query for metric nomad_client_uptime{node_id="5733b7f2-f742-dc07-18af-73132c844e56"}
--- PASS: TestMetrics (77.29s)
PASS
ok      github.com/hashicorp/nomad/e2e/metrics  77.299s

@shoenig shoenig marked this pull request as ready for review November 8, 2023 15:08
@shoenig shoenig requested review from lgfa29 and jrasell November 8, 2023 15:12
Contributor

@lgfa29 lgfa29 left a comment


Great work! It seems like a nice mix of task drivers; would it make sense to use Docker for one of the Podman tasks to add even more flavours?

Comment on lines +64 to +69
http:// {
{{ $allocID := env "NOMAD_ALLOC_ID" -}}
{{ range nomadService 1 $allocID "prometheus" }}
reverse_proxy {{ .Address }}:{{ .Port }}
{{ end }}
}
Contributor

Suggested change
http:// {
{{ $allocID := env "NOMAD_ALLOC_ID" -}}
{{ range nomadService 1 $allocID "prometheus" }}
reverse_proxy {{ .Address }}:{{ .Port }}
{{ end }}
}
http:// {
reverse_proxy {{ range nomadService 1 $allocID "prometheus" }}{{ .Address }}:{{ .Port }} {{end}}
}

Not that it matters much in this case, but we could let Caddy handle the load balancing. I'm also not too familiar with Caddyfiles, so I hope this is correct 😅

Contributor Author

I gave this a try and it failed ... probably not going to spend the time to investigate; there are so many other test failures to look into 😓


config {
command = "stress"
args = ["--cpu", "1", ]
Contributor

Suggested change
args = ["--cpu", "1", ]
args = ["--cpu", "1"]

}

task "cpustress" {
driver = "pledge"
Contributor

Nice!

Comment on lines +46 to +47
# run a private holepunch instance in this group network
# so prometheus can access the nomad api for service disco
Contributor

Nice x2
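
As an aside, the sidecar arrangement those comments describe could look something like the sketch below; the task name, port label, driver, and image are assumptions for illustration, not the ones in the actual job.

task "holepunch" {
  driver = "docker" # driver and image here are assumptions

  # keep the proxy running alongside prometheus for the life of the alloc
  lifecycle {
    hook    = "prestart"
    sidecar = true
  }

  # workload identity provides the token the sidecar uses for the Nomad API
  identity {
    env = true
  }

  config {
    image = "example/nomad-holepunch"
    ports = ["holepunch"]
  }
}

With something along these lines in the same group network, Prometheus can point its service-discovery configuration at the sidecar's local address instead of needing its own Nomad token.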

@shoenig
Contributor Author

shoenig commented Nov 9, 2023

would it make sense to use Docker for one of the Podman tasks to add even more flavours?

Yeah, for sure, we can look into expanding this, especially once we get the Windows client back.

@shoenig shoenig merged commit a28e5b6 into main Nov 9, 2023
17 checks passed
@shoenig shoenig deleted the e2e-metrics-metrics-metrics branch November 9, 2023 14:21