-
Notifications
You must be signed in to change notification settings - Fork 476
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Organization List Feature in Server AWS Node Attester Plugin "aws_iid" #4838
Organization List Feature in Server AWS Node Attester Plugin "aws_iid" #4838
Conversation
Details about the feature are being discussed on : #4770 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for this contribution @rushi47 ❤️
I took a pass on it and I think it will need one or two more. I started with some high level comments on config shape, docs, etc. I'll make another pass over the logic shortly
assume_role = "spire_node_attestor" | ||
account_ids_belong_to_org_validation { | ||
org_account_id = "7891011" | ||
org_account_role = "spire-server-org-role" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency, I suggest:
org_account_role = "spire-server-org-role" | |
assume_org_role = "spire-server-org-role" |
Can it default to the value of assume_role
? Is it something you need right now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This role needs to be created in management account, so using the value of default assume role, might conflict. So I was hoping to keep it configurable.
partition = "aws" | ||
assume_role = "spire_node_attestor" | ||
account_ids_belong_to_org_validation { | ||
org_account_id = "7891011" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Must this always be the management account? Should we name it management_account_id
or similar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Must this always be the management account? Should we name it management_account_id or similar?
From the docs it looks like yes, it always need to management account id as it can list all the organization under. Yeah i guess its better to name it management account id
account_ids_belong_to_org_validation { | ||
org_account_id = "7891011" | ||
org_account_role = "spire-server-org-role" | ||
org_account_region = "us-west-2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it required? There doesn't seem to be any regionality in any of the AWS Organization docs. Can we hardcode a region?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh no sorry i slipped outside this was required for calling internal function, refactored.
org_account_id = "7891011" | ||
org_account_role = "spire-server-org-role" | ||
org_account_region = "us-west-2" | ||
org_account_map_ttl = "3m" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to see this hardcoded on the first pass .. if/when we know more about what values to use in what circumstances, we can make better decisions on the behavior or configurability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way currently its written it already defaults to 3m
if its not explicitly passed. So it's optional param.
I was thinking we can keep the way it is rn as there is support for default and its already written but I will let you make the final call. Update the doc to reflect the defaults.
} | ||
|
||
``` | ||
Using the block `account_ids_belong_to_org_validation` the org validation node attestation method will be enabled. With above configuration spire server will try to assume this role : `arn:aws:iam::7891011:role/spire-server-org-role` and get the list of all accounts in organization. When node attestation request is sent to server, nodes account id will be check against this list. The role, should have permission to make `ListAccounts` request. More on list account request, can be read from aws [docs](https://docs.aws.amazon.com/organizations/latest/APIReference/API_ListAccounts.html). When not used, block ex. `account_ids_belong_to_org_validation = {}` should not be empty, it should be completely removed as its optional or should have all required parameters namely `org_account_id`, `org_account_role`, `org_account_region`. `org_account_map_ttl` is an optional param to configure TTL in hours, least can be `1m` (1 minute). If this param is not defined, default TTL for the account list cache is 3m (3 minutes), format to specify TTL is `XhYm`, `Xm` etc. we internally use [time.parseDuration](https://pkg.go.dev/time#ParseDuration) lib. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My feeling is that we should not capture details about the implementation here (e.g. org list caching, refresh semantics, etc). They are likely to change in the future and we'll forget to update the docs. Folks interested in that level of detail can go look at the code. Instead, we should document any gotchas or requirements on the user. For example, what IAM permissions are needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. removed this and added example of permission policy
// Get the timestamp from config | ||
existingTimestamp := o.orgListAccountMapCreationTime | ||
|
||
currTimeStamp := time.Now().UTC() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have mock clock hooks for testing, and tests to go with it 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted, will come back to this at testing.
Thank you @evan2645 for review ❤️. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @rushi47 for this contribution. As requested, I did another review pass on the logic bits. It was a little hard for me to review due to 1) unsure on how prior comments will change the implementation, and 2) code structure and naming. Most of my latest comments focus on the second point .. there are several other spots where I wanted to leave comments but refrained because I think it is likely to change in your next commit. Happy to jump on a quick zoom if helpful just let me know!
go.mod
Outdated
cloud.google.com/go/security v1.15.2 | ||
cloud.google.com/go/storage v1.35.1 | ||
github.com/Azure/azure-sdk-for-go/sdk/azcore v1.9.0 | ||
cloud.google.com/go/iam v1.1.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are all of these bumps required? Looks like a lot more than what would be needed for the changes in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted. Will come back to this at the end.
return exist | ||
} | ||
|
||
func validateOrganizationConfig(config *IIDAttestorConfig) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this config validation live in iid.go where the config is actually parsed? Either that, or we should call this method from inside our constructor/configure method here. Feels weird to parse in iid.go and then call a private function in this file to validate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this config validation live in iid.go where the config is actually parsed? Either that, or we should call this method from inside our constructor/configure method here. Feels weird to parse in iid.go and then call a private function in this file to validate
Yeah make sense moved it to iid.go
file
func (o *organizationValidation) getRetries() int { | ||
o.mutex.RLock() | ||
defer o.mutex.RUnlock() | ||
return o.retries | ||
} | ||
|
||
func (o *organizationValidation) decrRetries() int { | ||
// Only decr if its greater than 0 | ||
if o.retries > 0 { | ||
o.retries -= 1 | ||
} | ||
|
||
return o.retries | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code here should be factored such that locking is consistent. IMO, the closer to the relevant call site we can take the lock, the better (i.e. getRetries over decrRetries). As is, it's really hard to read and tell if a lock is missing or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done refactored.
AccountListTTL string `hcl:"org_account_map_ttl"` | ||
} | ||
|
||
type organizationValidation struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this struct and its related config would benefit from a different name, e.g. "orgValidator"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 👌
return o.retries | ||
} | ||
|
||
func (o *organizationValidation) configure(config *orgValidationConfig) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function feels like it should be newOrgValidator
instead. No need to have the error free constructor to hook into New
, we can call this from the Configure
method instead and handle the errors as needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return nil | ||
} | ||
|
||
func (o *organizationValidation) setLogger(log hclog.Logger) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this needed? Can we include logger on config struct instead? Seems like a lock is missing, but moot point if we remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was hooking into existing implementation. And that implementation is missing lock, the way I was thinking is if we add the lock in the existing implementation that would protect all the members, that means in turn protecting orgvalidation logger.
Why we needed, this was my thought - I guess spire when invokes the plugin it makes call for setting up logger so i thought we needed this to change the log level or even propagate logger.
But lmk if my understanding isn't correct.
// If it part of the organisation then validation should be succesfull if not attestation should fail, on enabling this verification method. | ||
// This could be alternative for not explictly maintaing allowed list of account ids. | ||
// Method pulls the list of accounts from organization and caches it for certain time, cache time can be configured. | ||
func (o *organizationValidation) ValidateAccountBelongstoOrg(ctx context.Context, orgClient organizations.ListAccountsAPIClient, accoundIDofNode string) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps IsMemberAccount
or IsChildAccount
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to IsMemberAccount
orgAccountList, isStale, err := o.checkIfOrgAccountListIsStale(ctx) | ||
if err != nil { | ||
return false, err | ||
} | ||
// refresh the account map | ||
if isStale { | ||
orgAccountList, err = o.fetchAccountsListFromOrg(ctx, orgClient, false) | ||
if err != nil { | ||
return false, err | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels like this would benefit from some refactoring - for example, we probably only need one call site here that simply returns the latest up-to-date list, whether it be from cache or recently reloaded ... e.g. o.memberAccounts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this lazy loading of the account list an attempt to optimize the total number of steady state calls to these API endpoints? At first I thought it was a simplicity decision, but the more I read the more I wonder if a goroutine with ticker-based refresh would be even simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels like this would benefit from some refactoring - for example, we probably only need one call site here that simply returns the latest up-to-date list, whether it be from cache or recently reloaded ... e.g. o.memberAccounts
Refactored the code and added function which act as middle ware, and handle this part.
Is this lazy loading of the account list an attempt to optimize the total number of steady state calls to these API endpoints? At first I thought it was a simplicity decision, but the more I read the more I wonder if a goroutine with ticker-based refresh would be even simpler.
Well this will not be lazy loading as its main flow of request, and cache is updated before response is returned. With ticker based refresh I am not very confident if it will solve the problem, as it will refresh the cache in particular time but it arises the same issue of having cache with particular ttl.
Maybe we can jump on call to discuss this.
// If this return false we will check `X` number of times in that window, to avoid | ||
// genuine accounts which are newly added | ||
exist := checkIfAccountIdExist(orgAccountList, accoundIDofNode) | ||
if !exist { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should probably be dedicated functions for these kinds of things, like o.reloadAccountList
and o.cacheMiss()
, where the latter can conditionally call the former.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, added new function to handle this.
//Build new org accounts list | ||
orgAccountsMap := make(map[string]bool) | ||
|
||
// Update the org account list cache with ACTIVE accounts & handle pagination |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the other possible states?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be suspended or pending closure. More can be found here. Tried to capture this in test cases as well.
|
||
// Make sure : map is not updated from different go routine. | ||
// * if we make this call before ttl expires, we are doing retry/catchburst that means we need to bypass this validation | ||
if !catchBurst && len(o.orgListAccountMap) != 0 && time.Now().UTC().Sub(o.orgListAccountMapCreationTime) <= time.Duration(o.orgAccountListCacheTTL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
time.Now().UTC().Sub(o.orgListAccountMapCreationTime) <= time.Duration(o.orgAccountListCacheTTL is used in multiple places, may we move to a function?
and can you add some unit tests for this code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, moved to function.
and can you add some unit tests for this code?
Created new file for unit testing organization all together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done moved to function. And added tests for it. As this is private function, let me know if its not acceptable in the way I wrote tests for this as well as other functions in this file.
// Get the list of accounts | ||
listAccountsOp, err := orgClient.ListAccounts(ctx, &organizations.ListAccountsInput{}) | ||
if err != nil { | ||
return nil, fmt.Errorf("issue while getting list of accounts: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return nil, fmt.Errorf("issue while getting list of accounts: %v", err) | |
return nil, fmt.Errorf("issue while getting list of accounts: %w", err) |
NextToken: listAccountsOp.NextToken, | ||
}) | ||
if err != nil { | ||
return nil, fmt.Errorf("issue while getting list of accounts in pagination: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return nil, fmt.Errorf("issue while getting list of accounts in pagination: %v", err) | |
return nil, fmt.Errorf("issue while getting list of accounts in pagination: %w", err) |
Hello @MarcosDY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @MarcosDY Thank you so much for your review again ❤️ . I have replied to comments and made changes with best of my knowledge, I might need your help to confirm few things.
@@ -272,18 +296,35 @@ func (p *IIDAttestorPlugin) Configure(_ context.Context, req *configv1.Configure | |||
return nil, status.Errorf(codes.InvalidArgument, "invalid partition %q, must be one of: %v", config.Partition, partitions) | |||
} | |||
|
|||
// Check if Feature flag for account belongs to organization is enabled. | |||
orgConfig := &orgValidationConfig{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarcosDY so i added new variable so that I dont touch existing code too much.
In this case i didnt want to spoil client config
function. As struct is copied in configure function, we will need to perform empty check against the struct and it will look something like below, please lmk which one you prefer or if there is better way to write this validation. And I will change accordingly.
func (cc *clientsCache) configure(config SessionConfig, orgConfig orgValidationConfig) {
cc.mtx.Lock()
cc.clients = make(map[string]*cacheEntry)
cc.config = &config
orgConf := &orgValidationConfig{}
if orgConf != &orgConfig {
orgConf = &orgConfig
}
cc.orgConfig = orgConf
cc.mtx.Unlock()
}
if len(checkTTL) > 0 { | ||
t, err := time.ParseDuration(checkTTL) | ||
if err != nil { | ||
return status.Errorf(codes.InvalidArgument, "make sure %v if configured, should be in hours and is suffix with required `m` for time duration in minute ex. 5m. or remove the : %v, in the block : %v. Default TTL will be : %v, for feature node attestation using account id verification", orgAccountListTTL, orgAccountListTTL, "verify_organization", orgAccountDefaultListTTL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds way nicer 😅
Refactored with some tweak
"Please ensure that %q if configured, it should be in duration and is suffixed with required 'm' for time duration in minute ex. '5m'. Otherwise, remove the: %q, in the block. Default TTL will be: %q"
|
||
// refresh the account map | ||
if isStale { | ||
_, err = o.reloadAccountList(ctx, orgClient, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm this was confusing. Well i refactored the logic, I think there was no point in using err
in method : checkIfOrgAccountListIsStale
as it is simple comparison and error was always nil.
So now this is how it happens
- First it will check if it is stale
- If stale : it will reload the account list and
return true
- else simple false.
Does this look ok ?
|
||
// If cache miss happens, refresh list with new cache and check if element exist | ||
func (o *orgValidator) refreshCache(ctx context.Context, orgClient organizations.ListAccountsAPIClient) (map[string]bool, error) { | ||
remTries := o.getRetries() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retry is resetted here at the end of function when new accounts are fetched.
|
||
orgAccountList := make(map[string]bool) | ||
if remTries <= 0 { | ||
return orgAccountList, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is it required to return an empty list when there are no retries?
Before retrying we check if account id exist in our cache/map or not. If it doesnt exist we go for retries, now if the retries are exhausted either we can return the cache/map we have already checked before or we can return the empty map. So I went with later one.
can no we get into a situation where we start to return empty maps always?
For this condition to happen we need two scenarios
one: account id is not part of existing cache.
second: our retries are exhausted.
I am thinking we can get into the scenario when there is malicious attestation request trying to validate. That will exhaust the retries and it will return the empty map. But in that case as well we are not overwriting the main cache, we are just returning empty map for that particular request.
Not overwriting the main cache is kind of guard rail, which will reduce blast radius to per request even if it returns the empty map.
Am unable to think of scenario which would bypass this condition, but please lmk if you have anything specific case in mind.
|
||
// Make sure : map is not updated from different go routine. | ||
// * if we make this call before ttl expires, we are doing retry/catchburst that means we need to bypass this validation | ||
if !catchBurst && len(o.orgListAccountMap) != 0 && time.Now().UTC().Sub(o.orgListAccountMapCreationTime) <= time.Duration(o.orgAccountListCacheTTL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, moved to function.
and can you add some unit tests for this code?
Created new file for unit testing organization all together.
|
||
// Make sure : map is not updated from different go routine. | ||
// * if we make this call before ttl expires, we are doing retry/catchburst that means we need to bypass this validation | ||
if !catchBurst && len(o.orgListAccountMap) != 0 && time.Now().UTC().Sub(o.orgListAccountMapCreationTime) <= time.Duration(o.orgAccountListCacheTTL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done moved to function. And added tests for it. As this is private function, let me know if its not acceptable in the way I wrote tests for this as well as other functions in this file.
go.mod
Outdated
@@ -1,6 +1,6 @@ | |||
module github.com/spiffe/spire | |||
|
|||
go 1.22 | |||
go 1.22.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without 0
it fails on mac m1 golang/go#65568
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you replace with 1.22.2?
c4da7c5
to
8989d67
Compare
8989d67
to
f6189a1
Compare
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Co-authored-by: Evan Gilman <evan2645@gmail.com> Signed-off-by: Rushikesh Butley <www.rushi47@gmail.com>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Co-authored-by: Marcos Yacob <marcosyacob@gmail.com> Signed-off-by: Rushikesh Butley <www.rushi47@gmail.com>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Signed-off-by: Rushikesh Butley <rbutley@confluent.io>
Pull Request check list
Affected functionality
This PR adds new feature to AWS Node Attester Plugin. It's optional feature, if its enabled it provides additional check executed at very first step during node attestation. It verifies, if node's
aws account id
is part ofAWS Organization
or not. If it's not part of organization, attestation request will be rejected.More on this feature can be read in the issue
Description of change
Which issue this PR fixes
Fixes : issue