Event driven dra #58

heerener · 2025-06-20T15:07:02Z

@jplanasc @jamesgking could you give this a thorough review please?

My main concern is that it makes the code less clean, but I'd like a sanity check on that :-) If you want to take some time to go through it together, that's definitely an option.

jplanasc

I'll continue after the online review session this afternoon.

jplanasc · 2025-06-25T09:02:41Z

hpc_provisioner/src/hpc_provisioner/dynamodb_actions.py

 from hpc_provisioner.logging_config import LOGGING_CONFIG

-TABLE_NAME = "sbo-parallelcluster-subnets"
+SUBNET_TABLE_NAME = "sbo-parallelcluster-subnets"


should we replace sbo with obi? :)

jplanasc · 2025-06-25T09:17:30Z

hpc_provisioner/src/hpc_provisioner/handlers.py

+    1. Check which clusters are pending (provisioning_launched=False) creation and have include_lustre=True
+    2. For each of them:
+        check whether any DRAs are still pending
+        if not: call do_cluster_create


Just a personal opinion: if I understand correctly, the logic is to create the filesystems first and, once that succeeds, create the pcluster. I'd swap the order of the operations to promote a "fail fast" approach: if the cluster creation (which in theory is faster/lighter than the filesystem mounting) fails for any reason, we fail fast and don't have to wait for the slower/heavier FS mounting process.
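
For reference, the flow in the docstring quoted above could be sketched roughly as follows. This is only an illustration under stated assumptions: `PendingCluster`, `clusters_ready_to_create`, `launch_ready_clusters` and the `pending_dras` field are hypothetical names, not the PR's actual data model. Only `provisioning_launched`, `include_lustre` and `do_cluster_create` appear in the diff.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class PendingCluster:
    """Hypothetical stand-in for the cluster record kept in DynamoDB."""

    name: str
    include_lustre: bool = True
    provisioning_launched: bool = False
    # Assumption: names of DRAs that have not reported ready yet.
    pending_dras: List[str] = field(default_factory=list)


def clusters_ready_to_create(clusters: Iterable[PendingCluster]) -> List[PendingCluster]:
    """Step 1 and the check in step 2: pending Lustre clusters with no DRA outstanding."""
    return [
        c
        for c in clusters
        if c.include_lustre and not c.provisioning_launched and not c.pending_dras
    ]


def launch_ready_clusters(
    clusters: Iterable[PendingCluster],
    do_cluster_create: Callable[[PendingCluster], None],
) -> List[str]:
    """Call do_cluster_create for every ready cluster and return the launched names."""
    launched = []
    for cluster in clusters_ready_to_create(clusters):
        do_cluster_create(cluster)
        cluster.provisioning_launched = True
        launched.append(cluster.name)
    return launched
```

Keeping the selection separate from the launch makes the "is anything still pending?" rule easy to unit-test without touching AWS.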

jplanasc · 2025-06-25T09:27:17Z

hpc_provisioner/src/hpc_provisioner/pcluster_manager.py

+            raise RuntimeError(f"Filesystem {cluster.name} not created when it should have been")
+        CONFIG_VALUES["fsx"] = {
+            "Name": "LustreFSX",
+            "MountDir": "/obi/data",


I like the new 'obi' path :)
However, I'd still prefix FSx filesystems with /fsx/obi...

jplanasc · 2025-06-25T09:31:47Z

hpc_provisioner/src/hpc_provisioner/utils.py

+
+def get_fs_bucket(bucket_name: str, cluster: Cluster) -> str:
+    if bucket_name == "projects":
+        return get_sbonexusdata_bucket()


Probably not for this PR, but I guess we should get rid of the 'sbo' and 'nexus' keywords. Could this be named 'projects', as it is elsewhere?

heerener · 2025-06-25T13:05:06Z

hpc_provisioner/pyproject.toml

@@ -8,6 +8,7 @@ dependencies = [
    "aws-parallelcluster",
    "boto3",
    "cryptography",
+    "python_dynamodb_lock",


Remove again

heerener · 2025-06-25T13:16:08Z

hpc_provisioner/src/hpc_provisioner/utils.py

+
+def get_fs_bucket(bucket_name: str, cluster: Cluster) -> str:
+    if bucket_name == "projects":
+        return get_sbonexusdata_bucket()


another SBO reference that should be OBI

heerener · 2025-06-25T13:17:22Z

hpc_provisioner/src/hpc_provisioner/dynamodb_actions.py

@@ -72,3 +95,73 @@ def free_subnet(dynamodb_client, subnet_id: str) -> None:
    dynamodb_client.delete_item(
        TableName="sbo-parallelcluster-subnets", Key={"subnet_id": {"S": subnet_id}}
    )
+
+
+def get_unclaimed_clusters(dynamodb_resource) -> list:


unclaimed -> provisioning_started

heerener · 2025-06-25T13:18:29Z

hpc_provisioner/src/hpc_provisioner/dynamodb_actions.py

+    )
+
+
+def claim_cluster(dynamodb_resource, cluster: Cluster) -> None:


claim: provisioning_launched?
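
One way a "claim" could be expressed as a `provisioning_launched` flag is a conditional write, so that only one invocation wins. The sketch below is an assumption about how that might look, not what the PR does: `mark_provisioning_launched` is a hypothetical helper, and the partition key name `name` is a guess. The keyword arguments are the standard ones accepted by a boto3 DynamoDB `Table.update_item` call.

```python
def mark_provisioning_launched(table, cluster_name: str) -> None:
    """Set provisioning_launched to True only if it is not already set.

    `table` is expected to behave like a boto3 DynamoDB Table resource.
    Assumption: the table's partition key is called "name".
    """
    table.update_item(
        Key={"name": cluster_name},
        UpdateExpression="SET provisioning_launched = :t",
        # The write is rejected if another invocation already claimed the cluster.
        ConditionExpression=(
            "attribute_not_exists(provisioning_launched) "
            "OR provisioning_launched = :f"
        ),
        ExpressionAttributeValues={":t": True, ":f": False},
    )
```

With a real table, a rejected condition surfaces as a `botocore.exceptions.ClientError` whose error code is `ConditionalCheckFailedException`; the caller can treat that as "someone else already launched this cluster".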

heerener · 2025-06-25T13:28:16Z

hpc_provisioner/src/hpc_provisioner/handlers.py

+    dynamo = dynamodb_resource()
+    try:
+        register_cluster(dynamo, cluster)
+    except ClusterAlreadyRegistered as e:


probably shouldn't be an error
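
A minimal sketch of what "not an error" might mean here: treat a repeated registration as an idempotent no-op and carry on. `register_or_reuse` is a hypothetical wrapper, and the `ClusterAlreadyRegistered` class below is a local stand-in for the exception caught in the handler, defined only so the example is self-contained.

```python
import logging

logger = logging.getLogger(__name__)


class ClusterAlreadyRegistered(Exception):
    """Local stand-in for the exception caught in the handler."""


def register_or_reuse(register_cluster, dynamo, cluster) -> bool:
    """Register the cluster; return False if it was already registered."""
    try:
        register_cluster(dynamo, cluster)
        return True
    except ClusterAlreadyRegistered:
        # Treated as a normal, idempotent outcome rather than a failure.
        logger.info("cluster %s already registered, continuing", cluster)
        return False
```

The boolean return lets the caller decide whether a repeated request should skip the filesystem pre-creation or simply report the existing state.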

heerener · 2025-06-25T13:29:43Z

hpc_provisioner/src/hpc_provisioner/handlers.py

+            cluster_secrets = get_secrets_for_cluster(
+                sm_client=sm_client, cluster_name=pcluster["clusterName"]
+            )
+            pcluster.update(cluster_secrets)


add a status field (similar to the cluster status) with something like WAITING_FOR_DRAS
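
As a rough illustration of the suggested field: the helper below adds a `status` key to the description returned for a cluster that has not been launched yet. `describe_pending_cluster` and the `PROVISIONING_LAUNCHED` value are hypothetical; only `WAITING_FOR_DRAS` comes from the comment above.

```python
def describe_pending_cluster(pcluster: dict, dras_pending: bool) -> dict:
    """Return a copy of the cluster description with a status field added."""
    described = dict(pcluster)  # leave the caller's dict untouched
    # PROVISIONING_LAUNCHED is a placeholder name for the "DRAs done" state.
    described["status"] = "WAITING_FOR_DRAS" if dras_pending else "PROVISIONING_LAUNCHED"
    return described
```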

heerener · 2025-06-25T13:31:11Z

hpc_provisioner/src/hpc_provisioner/handlers.py

@@ -165,6 +255,8 @@ def pcluster_delete_handler(event, _context=None):

    logger.debug(f"delete pcluster {cluster}")
    try:
+        dynamo = dynamodb_resource()


comment: why not in reverse order (delete cluster, then from dynamo)

heerener · 2025-06-25T13:34:11Z

hpc_provisioner/src/hpc_provisioner/pcluster_manager.py

-            }
+        if not fs:
+            raise RuntimeError(f"Filesystem {cluster.name} not created when it should have been")
+        CONFIG_VALUES["fsx"] = {


No DRA included; maybe don't wait for cluster creation?

heerener requested review from jamesgking and jplanasc June 20, 2025 15:07

heerener temporarily deployed to aws-sandbox-benchmarks June 23, 2025 07:39 — with GitHub Actions Inactive

heerener force-pushed the event-driven-dra branch from 668f323 to 4a50182 Compare June 23, 2025 08:56

heerener temporarily deployed to aws-sandbox-benchmarks June 23, 2025 08:57 — with GitHub Actions Inactive

heerener and others added 25 commits June 23, 2025 11:27

Call creator lambda with suffix

6c27582

correct ClusterName in the json file

171926d

move to include_lustre flag instead of dev flag

1420e79

Bugfix in pcluster_manager

06e9546

Too much popping; it's not bubble wrap!

8a4a664

Cleaner way than multiple pops

4fa930f

dynamodb: cluster actions pt 1

915cd36

Register and delete cluster in dynamo

a748229

Lock library

192ea9f

release_subnets: debug logging

5ab499b

cluster_name -> cluster.name for 80_cloudwatch_agent_config_prolog.sh arg

132ae90

pcluster_delete: debug logging

a8ef26f

dynamodb actions on Table, not on dynamodb resource

3e20bc4

Register cluster with all parameters

033a559

No waiting for fsx/dra, exit after first fsx precreate

031614f

Log dra_ready event

c2a3bcc

Handle EFS events

ba9a513

get_fsx_name: argument correction

89272ba

test_delete: assert dynamodb also cleared

2b14a61

test_{vlab,project}_id_not_specified: add path to event

f53b904

Change path for DRA callback and cluster comparison

3cbc750

WIP: fix POST tests

1437c65

WIP: create eventbridge timed rule

b9e53da

add benchmarks environment

81fc662

Custom AMI ID from env var

ca6d02d

heerener added 4 commits June 23, 2025 11:27

Make get cluster more robust

bda7dd6

Fix DRA mountpoints

dbe556d

Store sim pubkey after generating it

a93f9a0

Fix some failing tests

c2e6577

heerener force-pushed the event-driven-dra branch from 4a50182 to c2e6577 Compare June 23, 2025 09:28

heerener temporarily deployed to aws-sandbox-benchmarks June 23, 2025 09:28 — with GitHub Actions Inactive

aws-parallelcluster-3.13.1 for benchmarks sandbox

18908db

heerener temporarily deployed to aws-sandbox-benchmarks June 23, 2025 09:49 — with GitHub Actions Inactive

No more suffix

55682ba

heerener temporarily deployed to aws-sandbox-benchmarks June 24, 2025 09:28 — with GitHub Actions Inactive

Shorter DRA creation token

d211380

heerener temporarily deployed to aws-sandbox-benchmarks June 24, 2025 12:53 — with GitHub Actions Inactive

AWS doesn't like long strings

50ac1b8

heerener temporarily deployed to aws-sandbox-benchmarks June 25, 2025 08:32 — with GitHub Actions Inactive

jplanasc reviewed Jun 25, 2025

mount the whole scratch bucket, that's the whole point of the benchmark flag

6a4ee86

heerener commented Jun 25, 2025

heerener temporarily deployed to aws-sandbox-benchmarks June 25, 2025 14:21 — with GitHub Actions Inactive

Track cluster creation time and use it in the DRA create token

926ce39

heerener temporarily deployed to aws-sandbox-benchmarks June 26, 2025 08:41 — with GitHub Actions Inactive

Decimal -> int

bbcd4a8

heerener temporarily deployed to aws-sandbox-benchmarks June 26, 2025 09:14 — with GitHub Actions Inactive

Fix typo

046138e

heerener temporarily deployed to aws-sandbox-benchmarks June 26, 2025 13:23 — with GitHub Actions Inactive

Bugfix

9fe8f5f

heerener temporarily deployed to aws-sandbox-benchmarks June 26, 2025 13:36 — with GitHub Actions Inactive

Add cluster name tag on fsx and dra

916ccf1

heerener temporarily deployed to aws-sandbox-benchmarks June 27, 2025 09:30 — with GitHub Actions Inactive

Try with EfaEnabled=False for FSx

5737c71

heerener deployed to aws-sandbox-benchmarks June 27, 2025 14:38 — with GitHub Actions Active
