Skip to content

Event driven dra #58

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 71 commits into
base: main
Choose a base branch
from
Draft

Event driven dra #58

wants to merge 71 commits into from

Conversation

heerener
Copy link
Collaborator

@jplanasc @jamesgking could you give this a thorough review please?

My main concern is that it makes the code less clean, but I'd like a sanity check on that :-) If you want to take some time to go through it together, that's definitely an option.

@heerener heerener requested review from jamesgking and jplanasc June 20, 2025 15:07
@heerener heerener temporarily deployed to aws-sandbox-benchmarks June 23, 2025 07:39 — with GitHub Actions Inactive
@heerener heerener temporarily deployed to aws-sandbox-benchmarks June 23, 2025 08:57 — with GitHub Actions Inactive
@heerener heerener temporarily deployed to aws-sandbox-benchmarks June 23, 2025 09:28 — with GitHub Actions Inactive
@heerener heerener temporarily deployed to aws-sandbox-benchmarks June 23, 2025 09:49 — with GitHub Actions Inactive
@heerener heerener temporarily deployed to aws-sandbox-benchmarks June 24, 2025 09:28 — with GitHub Actions Inactive
@heerener heerener temporarily deployed to aws-sandbox-benchmarks June 24, 2025 12:53 — with GitHub Actions Inactive
@heerener heerener temporarily deployed to aws-sandbox-benchmarks June 25, 2025 08:32 — with GitHub Actions Inactive
Copy link
Contributor

@jplanasc jplanasc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll continue after the online review session this afternoon.

from hpc_provisioner.logging_config import LOGGING_CONFIG

TABLE_NAME = "sbo-parallelcluster-subnets"
SUBNET_TABLE_NAME = "sbo-parallelcluster-subnets"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we replace sbo with obi? :)

1. Check which clusters are pending (provisioning_launched=False) creation and have include_lustre=True
2. For each of them:
check whether any DRAs are still pending
if not: call do_cluster_create
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a personal opinion: if I understand correctly, the logic is to create the filesystems first and, once successful, we create the pcluster. I'd swap the order of the operations to promote a "fail fast" approach: if the cluster creation (which in theory is faster/lighter than the filesystem mounting) fails for any reason , then, we fail "fast" and don't have to wait for the slower/heavier FS mounting process.

raise RuntimeError(f"Filesystem {cluster.name} not created when it should have been")
CONFIG_VALUES["fsx"] = {
"Name": "LustreFSX",
"MountDir": "/obi/data",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the new 'obi' path :)
However, I'd still prefix FSx filesystems with /fsx/obi...


def get_fs_bucket(bucket_name: str, cluster: Cluster) -> str:
if bucket_name == "projects":
return get_sbonexusdata_bucket()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not for this PR, but I guess we should get rid of the 'sbo' and 'nexus' keywords. This could be named projects, as it is named somewhere else?

@@ -8,6 +8,7 @@ dependencies = [
"aws-parallelcluster",
"boto3",
"cryptography",
"python_dynamodb_lock",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove again


def get_fs_bucket(bucket_name: str, cluster: Cluster) -> str:
if bucket_name == "projects":
return get_sbonexusdata_bucket()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another SBO reference that should be OBI

@@ -72,3 +95,73 @@ def free_subnet(dynamodb_client, subnet_id: str) -> None:
dynamodb_client.delete_item(
TableName="sbo-parallelcluster-subnets", Key={"subnet_id": {"S": subnet_id}}
)


def get_unclaimed_clusters(dynamodb_resource) -> list:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unclaimed -> provisioning_started

)


def claim_cluster(dynamodb_resource, cluster: Cluster) -> None:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

claim: provisioning_launched?

dynamo = dynamodb_resource()
try:
register_cluster(dynamo, cluster)
except ClusterAlreadyRegistered as e:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably shouldn't be an error

cluster_secrets = get_secrets_for_cluster(
sm_client=sm_client, cluster_name=pcluster["clusterName"]
)
pcluster.update(cluster_secrets)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add status field (similar as cluster status) with something like WAITING_FOR_DRAS

@@ -165,6 +255,8 @@ def pcluster_delete_handler(event, _context=None):

logger.debug(f"delete pcluster {cluster}")
try:
dynamo = dynamodb_resource()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment: why not in reverse order (delete cluster, then from dynamo)

}
if not fs:
raise RuntimeError(f"Filesystem {cluster.name} not created when it should have been")
CONFIG_VALUES["fsx"] = {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No DRA included; maybe don't wait for cluster creation?

@heerener heerener temporarily deployed to aws-sandbox-benchmarks June 25, 2025 14:21 — with GitHub Actions Inactive
@heerener heerener temporarily deployed to aws-sandbox-benchmarks June 26, 2025 08:41 — with GitHub Actions Inactive
@heerener heerener temporarily deployed to aws-sandbox-benchmarks June 26, 2025 09:14 — with GitHub Actions Inactive
@heerener heerener temporarily deployed to aws-sandbox-benchmarks June 26, 2025 13:23 — with GitHub Actions Inactive
@heerener heerener temporarily deployed to aws-sandbox-benchmarks June 26, 2025 13:36 — with GitHub Actions Inactive
@heerener heerener temporarily deployed to aws-sandbox-benchmarks June 27, 2025 09:30 — with GitHub Actions Inactive
@heerener heerener deployed to aws-sandbox-benchmarks June 27, 2025 14:38 — with GitHub Actions Active
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants