Event driven dra #58
Conversation
Force-pushed from 668f323 to 4a50182, then to c2e6577.
I'll continue after the online review session this afternoon.
from hpc_provisioner.logging_config import LOGGING_CONFIG

TABLE_NAME = "sbo-parallelcluster-subnets"
SUBNET_TABLE_NAME = "sbo-parallelcluster-subnets"
should we replace sbo with obi? :)
1. Check which clusters are pending creation (provisioning_launched=False) and have include_lustre=True
2. For each of them:
   check whether any DRAs are still pending
   if not: call do_cluster_create
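The two steps above could be sketched roughly like this. This is a hypothetical, self-contained sketch: the real handler reads rows from DynamoDB and calls do_cluster_create, which plain dicts and a list stand in for here.

```python
def clusters_ready_for_creation(rows: list) -> list:
    """Return pending Lustre clusters whose DRAs have all completed."""
    ready = []
    for row in rows:
        # Step 1: only clusters pending creation with Lustre requested
        if row.get("provisioning_launched") or not row.get("include_lustre"):
            continue
        # Step 2: skip clusters that still have pending DRAs
        if any(dra["status"] != "AVAILABLE" for dra in row.get("dras", [])):
            continue
        ready.append(row)  # the real code would call do_cluster_create here
    return ready

rows = [
    {"name": "a", "provisioning_launched": False, "include_lustre": True,
     "dras": [{"status": "AVAILABLE"}]},
    {"name": "b", "provisioning_launched": False, "include_lustre": True,
     "dras": [{"status": "CREATING"}]},
    {"name": "c", "provisioning_launched": True, "include_lustre": True,
     "dras": []},
]
print([r["name"] for r in clusters_ready_for_creation(rows)])  # → ['a']
```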
Just a personal opinion: if I understand correctly, the logic is to create the filesystems first and, once successful, create the pcluster. I'd swap the order of the operations to promote a "fail fast" approach: if the cluster creation (which in theory is faster/lighter than the filesystem mounting) fails for any reason, we fail fast and don't have to wait for the slower/heavier FS mounting process.
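A minimal sketch of that "fail fast" ordering, using hypothetical stand-in callables rather than the project's actual API:

```python
def provision(cluster_name: str, create_pcluster, create_filesystems) -> None:
    """Run the fast/light cluster creation first, so a cluster-side
    failure surfaces before any slow filesystem mounting starts."""
    create_pcluster(cluster_name)      # fast: fails early if misconfigured
    create_filesystems(cluster_name)   # slow: only reached on success

# Demo: a failing cluster creation short-circuits the filesystem step.
calls = []

def bad_pcluster(name):
    calls.append("pcluster")
    raise RuntimeError("invalid cluster config")

def mount_fs(name):
    calls.append("fs")

try:
    provision("demo", bad_pcluster, mount_fs)
except RuntimeError:
    pass
print(calls)  # → ['pcluster']
```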
raise RuntimeError(f"Filesystem {cluster.name} not created when it should have been")
CONFIG_VALUES["fsx"] = {
    "Name": "LustreFSX",
    "MountDir": "/obi/data",
I like the new 'obi' path :)
However, I'd still prefix FSx filesystems with /fsx/obi...
def get_fs_bucket(bucket_name: str, cluster: Cluster) -> str:
    if bucket_name == "projects":
        return get_sbonexusdata_bucket()
Probably not for this PR, but I guess we should get rid of the 'sbo' and 'nexus' keywords. Could this be named projects, as it is named somewhere else?
@@ -8,6 +8,7 @@ dependencies = [
    "aws-parallelcluster",
    "boto3",
    "cryptography",
    "python_dynamodb_lock",
Remove again
def get_fs_bucket(bucket_name: str, cluster: Cluster) -> str:
    if bucket_name == "projects":
        return get_sbonexusdata_bucket()
another SBO reference that should be OBI
@@ -72,3 +95,73 @@ def free_subnet(dynamodb_client, subnet_id: str) -> None:
    dynamodb_client.delete_item(
        TableName="sbo-parallelcluster-subnets", Key={"subnet_id": {"S": subnet_id}}
    )

def get_unclaimed_clusters(dynamodb_resource) -> list:
unclaimed -> provisioning_started
    )

def claim_cluster(dynamodb_resource, cluster: Cluster) -> None:
claim: provisioning_launched?
dynamo = dynamodb_resource()
try:
    register_cluster(dynamo, cluster)
except ClusterAlreadyRegistered as e:
probably shouldn't be an error
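If it indeed shouldn't be an error, the handler could swallow the exception and log at INFO instead. A minimal sketch; the exception class is redefined locally just to keep the example self-contained, and `register_idempotently` is a hypothetical helper, not the project's API:

```python
import logging

logger = logging.getLogger("hpc_provisioner")

class ClusterAlreadyRegistered(Exception):
    """Local stand-in for the project's exception class."""

def register_idempotently(register, cluster_name: str) -> None:
    # An already-registered cluster is an expected, idempotent outcome,
    # so treat it as informational rather than an error.
    try:
        register(cluster_name)
    except ClusterAlreadyRegistered:
        logger.info("Cluster %s already registered; continuing", cluster_name)
```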
cluster_secrets = get_secrets_for_cluster(
    sm_client=sm_client, cluster_name=pcluster["clusterName"]
)
pcluster.update(cluster_secrets)
add a status field (similar to the cluster status) with something like WAITING_FOR_DRAS
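One way to express that; every name here is an assumption except WAITING_FOR_DRAS, which the comment itself suggests:

```python
from enum import Enum

class ProvisioningStatus(str, Enum):
    # Only WAITING_FOR_DRAS comes from the review comment; the other
    # members are hypothetical placeholders.
    WAITING_FOR_DRAS = "WAITING_FOR_DRAS"
    CREATE_IN_PROGRESS = "CREATE_IN_PROGRESS"
    CREATE_COMPLETE = "CREATE_COMPLETE"

def describe_pcluster(pcluster: dict, cluster_secrets: dict, dras_pending: bool) -> dict:
    """Merge secrets into the response and surface a provisioning status."""
    pcluster.update(cluster_secrets)
    if dras_pending:
        pcluster["provisioningStatus"] = ProvisioningStatus.WAITING_FOR_DRAS.value
    return pcluster
```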
@@ -165,6 +255,8 @@ def pcluster_delete_handler(event, _context=None):

logger.debug(f"delete pcluster {cluster}")
try:
    dynamo = dynamodb_resource()
comment: why not in reverse order (delete the cluster first, then remove it from dynamo)?
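The suggested reversal could look roughly like this (hypothetical stand-in callables, not the actual handler API). Deleting the cluster first means a failed deletion leaves the DynamoDB record in place for a retry, instead of leaving an untracked cluster behind:

```python
def delete_cluster(name: str, delete_pcluster, delete_dynamo_record) -> None:
    delete_pcluster(name)        # may raise; the record is kept for a retry
    delete_dynamo_record(name)   # only reached once the cluster is gone

# Demo: a failed cluster deletion leaves the record untouched.
records_removed = []

def failing_delete(name):
    raise RuntimeError("cluster deletion failed")

def remove_record(name):
    records_removed.append(name)

try:
    delete_cluster("demo", failing_delete, remove_record)
except RuntimeError:
    pass
print(records_removed)  # → [] (record survives for a retry)
```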
}
if not fs:
    raise RuntimeError(f"Filesystem {cluster.name} not created when it should have been")
CONFIG_VALUES["fsx"] = {
No DRA included; maybe don't wait for cluster creation?
@jplanasc @jamesgking could you give this a thorough review please?
My main concern is that it makes the code less clean, but I'd like a sanity check on that :-) If you want to take some time to go through it together, that's definitely an option.