Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node heartbeats #3709

Merged
merged 28 commits into from
Apr 8, 2024
Merged
Show file tree
Hide file tree
Changes from 21 commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
ebda95f
Implements compute node heartbeats to requester node(s)
rossjones Mar 26, 2024
dfa33b1
Enqueue heartbeats in priority queue for checking
rossjones Mar 26, 2024
ecb53bf
Fix typing in test;
rossjones Mar 26, 2024
2191025
Remove loop var
rossjones Mar 26, 2024
98235b0
Initial heartbeat at startup
rossjones Mar 27, 2024
9bfd366
Warn not error when compute cannot send message
rossjones Mar 27, 2024
066e2f2
Add liveness info to node list
rossjones Mar 27, 2024
ce7e83a
Configuration for frequencies/intervals etc
rossjones Mar 28, 2024
f09e590
Configure heartbeat deadlines
rossjones Mar 28, 2024
ea11796
Document how to add a new nodestate
rossjones Apr 2, 2024
6365301
Close heartbeat client when done
rossjones Apr 2, 2024
bda14a5
Fix use of old errors.Wrap
rossjones Apr 2, 2024
a75b740
Defaults for new state timeout fields
rossjones Apr 2, 2024
4cd989e
Add tests for heartbeat states
rossjones Apr 2, 2024
d93c416
Test for initial state/unknown nodes
rossjones Apr 2, 2024
80a74b4
Make tests idempotent
rossjones Apr 2, 2024
d4f254b
New nodeinfo per test
rossjones Apr 2, 2024
8408aa9
Add node state to CLI `node list`
rossjones Apr 3, 2024
0cc06b6
Adds documentation for node management and the new node subcommands
rossjones Apr 3, 2024
e247467
Add NATS to cspell dictionary
rossjones Apr 3, 2024
63336e1
Fix duplicate docs line
rossjones Apr 3, 2024
a96c1b0
Mention flag in error message
rossjones Apr 8, 2024
0f7ef4d
Log potential error during server shutdown
rossjones Apr 8, 2024
c9f92a8
Change where we log the node manager start message
rossjones Apr 8, 2024
8487070
Merge branch 'main' into heartbeats
rossjones Apr 8, 2024
a50051b
switch to bi-modal status
rossjones Apr 8, 2024
e841e6e
Merge branch 'main' into heartbeats
rossjones Apr 8, 2024
8cc5fe4
Fix heartbeat server + tests
rossjones Apr 8, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .cspell/custom-dictionary.txt
Original file line number Diff line number Diff line change
Expand Up @@ -203,6 +203,7 @@ multierror
multiformats
Muxed
mypy
NATS
nbconvert
nemt
nocheck
Expand Down
6 changes: 5 additions & 1 deletion cmd/cli/node/columns.go
Original file line number Diff line number Diff line change
Expand Up @@ -25,9 +25,13 @@ var alwaysColumns = []output.TableColumn[*models.NodeInfo]{
Value: func(ni *models.NodeInfo) string { return ni.NodeType.String() },
},
{
ColumnConfig: table.ColumnConfig{Name: "status"},
ColumnConfig: table.ColumnConfig{Name: "approval"},
Value: func(ni *models.NodeInfo) string { return ni.Approval.String() },
},
{
ColumnConfig: table.ColumnConfig{Name: "status"},
Value: func(ni *models.NodeInfo) string { return ni.State.String() },
},
}

var toggleColumns = map[string][]output.TableColumn[*models.NodeInfo]{
Expand Down
31 changes: 22 additions & 9 deletions cmd/cli/node/list.go
Original file line number Diff line number Diff line change
Expand Up @@ -16,15 +16,17 @@ import (

var defaultColumnGroups = []string{"labels", "capacity"}
var orderByFields = []string{"id", "type", "available_cpu", "available_memory", "available_disk", "available_gpu", "status"}
var filterStatusValues = []string{"approved", "pending", "rejected"}
var filterApprovalValues = []string{"approved", "pending", "rejected"}
var filterStatusValues = []string{"healthy", "unhealthy", "unknown"}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it should be binary - "connected" or "unconnected" (we could add something more sophisticated like "unhealthy" when we have more health checks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for the filters, or for thenode list as well?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the node list as well - "healthy" and "unhealthy" imply something i don't think we know. We only know if it's connected or disconnected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with Connected/Disconnected for now, discussed with Walid and will add another duration in future to mark the point beyond which we don't think the node is coming back.


// ListOptions is a struct to support node command
type ListOptions struct {
output.OutputOptions
cliflags.ListOptions
ColumnGroups []string
Labels string
FilterByStatus string
ColumnGroups []string
Labels string
FilterByApproval string
FilterByStatus string
}

// NewListOptions returns initialized Options
Expand All @@ -42,22 +44,24 @@ func NewListCmd() *cobra.Command {
Use: "list",
Short: "List info of network nodes. ",
Args: cobra.NoArgs,
Run: o.run,
RunE: o.run,
}
nodeCmd.Flags().StringSliceVar(&o.ColumnGroups, "show", o.ColumnGroups,
fmt.Sprintf("What column groups to show. Zero or more of: %q", maps.Keys(toggleColumns)))
nodeCmd.Flags().StringVar(&o.Labels, "labels", o.Labels,
"Filter nodes by labels. See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ for more information.")
nodeCmd.Flags().AddFlagSet(cliflags.ListFlags(&o.ListOptions))
nodeCmd.Flags().AddFlagSet(cliflags.OutputFormatFlags(&o.OutputOptions))
nodeCmd.Flags().StringVar(&o.FilterByApproval, "filter-approval", o.FilterByApproval,
fmt.Sprintf("Filter nodes by approval. One of: %q", filterApprovalValues))
nodeCmd.Flags().StringVar(&o.FilterByStatus, "filter-status", o.FilterByStatus,
fmt.Sprintf("Filter nodes by status. One of: %q", filterStatusValues))

return nodeCmd
}

// Run executes node command
func (o *ListOptions) run(cmd *cobra.Command, _ []string) {
func (o *ListOptions) run(cmd *cobra.Command, _ []string) error {
Copy link
Member

@frrist frrist Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking: if we are going to return an error from this method can we replace all the util.Fatal calls while we are at it? (be wary this might break some tests that (annoyingly) override the util.Fatal field 😞)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can definitely add a ticket to address this, yeah. There's a mix of approaches atm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added #3764

ctx := cmd.Context()

var err error
Expand All @@ -69,15 +73,22 @@ func (o *ListOptions) run(cmd *cobra.Command, _ []string) {
}
}

if o.FilterByApproval != "" {
if !slices.Contains(filterApprovalValues, o.FilterByApproval) {
return fmt.Errorf("cannot use '%s' as filter approval value, should be one of: %q", o.FilterByApproval, filterApprovalValues)
}
}

if o.FilterByStatus != "" {
if !slices.Contains(filterStatusValues, o.FilterByStatus) {
util.Fatal(cmd, fmt.Errorf("cannot use '%s' as filter status value, should be one of: %q", o.FilterByStatus, filterStatusValues), 1)
return fmt.Errorf("cannot use '%s' as filter status value, should be one of: %q", o.FilterByStatus, filterStatusValues)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider printing the specific flag name in the error message e.g. filter-status and filter-approval

}
}

response, err := util.GetAPIClientV2(cmd).Nodes().List(ctx, &apimodels.ListNodesRequest{
Labels: labelRequirements,
FilterByStatus: o.FilterByStatus,
Labels: labelRequirements,
FilterByApproval: o.FilterByApproval,
FilterByStatus: o.FilterByStatus,
BaseListRequest: apimodels.BaseListRequest{
Limit: o.Limit,
NextToken: o.NextToken,
Expand All @@ -97,4 +108,6 @@ func (o *ListOptions) run(cmd *cobra.Command, _ []string) {
if err = output.Output(cmd, columns, o.OutputOptions, response.Nodes); err != nil {
util.Fatal(cmd, fmt.Errorf("failed to output: %w", err), 1)
}

return nil
}
67 changes: 67 additions & 0 deletions docs/docs/dev/cli-reference/cli/node/approve/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
---
sidebar_label: approve
---

# Command: `node approve`

The `bacalhau node approve` command offers administrations the ability to approve the cluster membership for a node using its unique identifier.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

command offers administrations administrators?
unique identifier being synonymous with name, right?


## Description:

Using the `approve` sub-command under the `bacalhau node` umbrella, users can allow a node in the pending state to join the cluster and receive work. This feature is crucial for system administrators to manage the cluster.

## Usage:

```bash
bacalhau node approve [id] [flags]
```

## Flags:

- `[id]`:

- The unique identifier of the node you wish to describe.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

describe approve


- `-h`, `--help`:

- Displays the help documentation for the `describe` command.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

describe node approve


- `-m message`:

- A message to be attached to the approval action.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes me wonder if we should also include the ClientID of the user issuing the approval, but that could always be included later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think at the moment something being audited just means it was logged somewhere, and so the client id should be in the request.


## Global Flags:

- `--api-host string`:

- Specifies the host for client-server communication through REST. This flag is overridden if the `BACALHAU_API_HOST` environment variable is set.
- Default: `"bootstrap.production.bacalhau.org"`

- `--api-port int`:

- Designates the port for REST-based communication between client and server. This flag is overlooked if the `BACALHAU_API_PORT` environment variable is defined.
- Default: `1234`

- `--log-mode logging-mode`:

- Determines the log format preference.
- Options: `'default','station','json','combined','event'`
- Default: `'default'`

- `--repo string`:
- Points to the bacalhau repository's path.
- Default: `"`$HOME/.bacalhau"`
Comment on lines +33 to +53
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rather not have these repeated here and just document them one, optionally providing a link, otherwise it draw attention away from what this page is documenting imo.


## Examples:

1. Approve a Node with ID `nodeID123`:

```bash
bacalhau node approve nodeID123
```

2. Approve a Node with an audit message:

```bash
bacalhau node approve nodeID123 -m "okay"
```
67 changes: 67 additions & 0 deletions docs/docs/dev/cli-reference/cli/node/delete/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
---
sidebar_label: delete
---

# Command: `node approve`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

node delete


The `bacalhau node delete` command offers administrations the ability to remove a node from the cluster using its unique identifier.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar comment about administrations


## Description:

Using the `delete` sub-command, administrators can remove a node from the list of available compute nodes in the cluster. This feature is necessary for the management of the infrastructure.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for consistency with above: Using the delete sub-command under the bacalhau node umbrella


## Usage:

```bash
bacalhau node delete [id] [flags]
```

## Flags:

- `[id]`:

- The unique identifier of the node you wish to describe.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

describe delete


- `-h`, `--help`:

- Displays the help documentation for the `describe` command.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace describe


- `-m message`:

- A message to be attached to the deletion action.

## Global Flags:

- `--api-host string`:

- Specifies the host for client-server communication through REST. This flag is overridden if the `BACALHAU_API_HOST` environment variable is set.
- Default: `"bootstrap.production.bacalhau.org"`

- `--api-port int`:

- Designates the port for REST-based communication between client and server. This flag is overlooked if the `BACALHAU_API_PORT` environment variable is defined.
- Default: `1234`

- `--log-mode logging-mode`:

- Determines the log format preference.
- Options: `'default','station','json','combined','event'`
- Default: `'default'`

- `--repo string`:
- Points to the bacalhau repository's path.
- Default: `"`$HOME/.bacalhau"`
Comment on lines +33 to +53
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rather not have these repeated here and just document them one, optionally providing a link, otherwise it draw attention away from what this page is documenting imo.


## Examples:

1. Delete the Node with ID `nodeID123`:

```bash
bacalhau node delete nodeID123
```

2. Delete a Node with an audit message:

```bash
bacalhau node delete nodeID123 -m "bad actor"
```
77 changes: 52 additions & 25 deletions docs/docs/dev/cli-reference/cli/node/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,48 +12,75 @@ bacalhau node [command]

## Available Commands

1. **[approve](./approve)**:

- Description: Approves a single node to join the cluster.
- Usage:

```bash
bacalhau node approve
```

1. **[delete](./delete)**:

- Description: Deletes a node from the cluster using its ID.
- Usage:
```bash
bacalhau node delete
```

1. **[describe](./describe)**:
- Description: Retrieves detailed information of a node using its ID.
- Usage:
```bash
bacalhau node describe
```

2. **[list](./list)**:
- Description: Lists the details of all nodes present in the network.
- Usage:
```bash
bacalhau node list
```

- Description: Retrieves detailed information of a node using its ID.
- Usage:
```bash
bacalhau node describe
```

1. **[list](./list)**:

- Description: Lists the details of all nodes present in the network.
- Usage:
```bash
bacalhau node list
```

1. **[reject](./reject)**:

- Description: Reject a specific node's request to join the cluster.
- Usage:
```bash
bacalhau node reject
```

For comprehensive details on any of the sub-commands, run:

```bash
bacalhau node [command] --help
```

## Flags

- `-h`, `--help`:
- Description: Shows the help information for the `node` command.
- Description: Shows the help information for the `node` command.

## Global Flags
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar sentiment around not repeating global flags


- `--api-host string`:
- Description: Specifies the host for RESTful communication between the client and server. The flag will be ignored if the `BACALHAU_API_HOST` environment variable is set.
- Default: `bootstrap.production.bacalhau.org`

- Description: Specifies the host for RESTful communication between the client and server. The flag will be ignored if the `BACALHAU_API_HOST` environment variable is set.
- Default: `bootstrap.production.bacalhau.org`

- `--api-port int`:
- Description: Designates the port for RESTful communication. The flag will be bypassed if the `BACALHAU_API_PORT` environment variable is active.
- Default: `1234`

- `--log-mode logging-mode`:
- Description: Chooses the preferred log format. Available choices are: `default`, `station`, `json`, `combined`, and `event`.
- Default: `default`
- Description: Designates the port for RESTful communication. The flag will be bypassed if the `BACALHAU_API_PORT` environment variable is active.
- Default: `1234`

- `--repo string`:
- Description: Specifies the path to the bacalhau repository.
- Default: `/Users/walid/.bacalhau`
- `--log-mode logging-mode`:

---
- Description: Chooses the preferred log format. Available choices are: `default`, `station`, `json`, `combined`, and `event`.
- Default: `default`

This should provide an organized and structured overview of the `node` command and its functionalities!
- `--repo string`:
- Description: Specifies the path to the bacalhau repository.
- Default: `/Users/walid/.bacalhau`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

who's walid? :trollface:

Loading
Loading