
Selectable C, D, E's #83

Closed
wants to merge 5 commits into from

Conversation

davidkydd
Collaborator

Replace COLLECTOR_LIST config with ENABLED_COLLECTORS, ENABLED_DIAGNOSERS, ENABLED_EXPORTERS.

Basically this works by creating a type registry: instantiate a map of type names -> instances of those types, then filter the map against a []string, so that only types whose names are present in the []string are selected for execution.

There are some alternative (and more comprehensive) ideas for a type registry here: https://stackoverflow.com/questions/23030884/is-there-a-way-to-create-an-instance-of-a-struct-from-a-string

I think "dolmen"s answer looks good, as it would let us get rid of the "GetName()" method and just use the name of the struct directly, whilst also allowing us to only instantiate the types we end up executing. But I feel that could probably be done as a later refactor.

diagnosers := []interfaces.Diagnoser{}

//read list of diagnosers that are enabled
enabledDiagnoserString, found := os.LookupEnv("ENABLED_DIAGNOSERS")
Collaborator

Are ENABLED_XXX new variables? If yes, is it possible to remove them completely?
We already discussed this point, and I think for a first iteration, having
//if not defined, default to all diagnosers enabled
enabledDiagnoserString = "networkconfig networkoutbound"
is what we want.
The user should only have to choose exporters.

What do you think? @Tatsinnit

Collaborator Author

Yes, these are new variables - and yes, we discussed this. I argue it is a useful capability to support across all collectors / diagnosers / exporters. Arc needs collectors to be configurable like this too, and it is only a matter of time until diagnosers need to be as well. There is no harm in defaulting them all to on (so the user doesn't have to choose anything) while also supporting overrides like this.

Member
@Tatsinnit Jul 20, 2021

Thanks for the discussion, guys! Given what I know of this tool's current behaviour, I will +1 not adding new variables without thinking through the whole effect they will have, especially since some of the osm and smi work is using COLLECTOR_LIST to uniquely identify new work.

Where adding a new variable needs clarity in this design:

  • Is this a breaking change to the default working behaviour?
  • Why are we adding these new variables? What is the use case, and which requirement needs this as a selectable option?
  • Will the new ENABLED_COLLECTORS be supplied via the deployment.yaml file, and are we getting rid of COLLECTOR_LIST?

Thanks 🙏

Collaborator Author

Correct, COLLECTOR_LIST is intended to be deprecated. There are no backwards-compatibility issues or breaking changes from removing it as far as I am aware, as long as the az CLI and VS Code extension haven't yet been changed to use it?

The Arc scenario requires being able to run only some collectors, not the full set. Values will be provided via the deployment yaml; example here: https://github.com/davidkydd/aks-periscope/blob/0071201dd41cd159e87e5a574bdfbcc271c122b1/deployment/aks-periscope-arc.yaml#L147

Member

💡 Adding this as a thought: adding new configurable vars will also become a breaking change for az-cli and vscode, because those default vars need to be maintained somehow in the cleanest, least scattered manner (not sure how; a proposal was made here). I will think more about this, but at this point see this discussion: https://github.com/Azure/aks-periscope/pull/75/files#r661261039

  • Essentially, in the current state, removing COLLECTOR_LIST and adding new vars will break things, and any new changes on top will be backwards-breaking changes. Thanks.

Collaborator

I don't see the backcompat issue: if we removed this, the logs would still be collected as before (with some extra ones, but that can't be a problem).

Member
@Tatsinnit Jul 28, 2021

💡 Also, going back and possibly asking again: why are we adding these, please? ENABLED_COLLECTORS, ENABLED_DIAGNOSERS, ENABLED_EXPORTERS.

  • Is an Arc-specific requirement the key scenario here, or is this the refactor? I see what Arnaud is asking: how is collecting the default logs an issue? (If this is for Arc, they already have a connectedCluster flag with which they can differentiate their logic.)
  • Another idea: it might be worth adding the associated YAML files so we can see it all flowing through.

Please correct me: ENABLED_COLLECTORS is the COLLECTOR_LIST equivalent? And ENABLED_DIAGNOSERS is the new equivalent for selectable diagnosers? I'm not sure about the need for ENABLED_EXPORTERS - what is it, or is it directly related to ZIP_AND_EXPORT? (I also wonder why diagnosers need to be selectable.)

Also cc: @arnaud-tincelin in case you know. Thanks.

Collaborator Author

@arnaud-tincelin the backcompat issue has to do with integration with the az-cli and vscode extensions, and addressing it would prevent the issue @Tatsinnit described here from being a breaking change: #83 (comment)

Yes, ENABLED_COLLECTORS is a direct replacement for COLLECTOR_LIST. Arc also has a requirement for a LocalMachineExporter, and we need a way to selectively enable and disable this functionality.

The ability to determine which components (collectors, diagnosers, exporters) are executed via the deployment yaml is a generally useful feature to support, as it allows the user to control the behaviour of periscope at a coarse but clearly defined level. We have already seen from just a few feature requests that it is necessary to be able to control which collectors and exporters are executed, so we should be proactive and enable this capability for diagnosers at the same time.

Therefore:
We should standardize on naming and either add the three configs:

ENABLED_COLLECTORS
ENABLED_EXPORTERS
ENABLED_DIAGNOSERS

or the equivalent:

COLLECTOR_LIST
EXPORTER_LIST
DIAGNOSER_LIST

The ENABLED_XYZ form is more explicit than XYZ_LIST about what it actually does, which is why I am suggesting we use it. Imagine that later on we want to invert this logic for some reason (a very plausible scenario) and add a DISABLED_COLLECTORS config flag, for the requirement that an environment can define which collectors cannot run on it while still automatically including and running any new collectors that are added by default.

In this scenario COLLECTOR_LIST would become ambiguous as to its function (is it the list of collectors that are enabled, or disabled?)
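
To make that concrete, a purely hypothetical sketch (neither DISABLED_COLLECTORS nor this helper exists in this PR) of how an explicit enable/disable pair would resolve, whereas a bare COLLECTOR_LIST name would not tell you which of the two meanings it carries:

package selection

import (
    "os"
    "strings"
)

// isCollectorEnabled is hypothetical: it shows how explicit ENABLED_ /
// DISABLED_ variables keep the semantics unambiguous.
func isCollectorEnabled(name string) bool {
    if v, ok := os.LookupEnv("DISABLED_COLLECTORS"); ok {
        // Deny-list: everything runs except the listed names, so newly
        // added collectors are picked up by default.
        for _, n := range strings.Fields(v) {
            if n == name {
                return false
            }
        }
        return true
    }
    if v, ok := os.LookupEnv("ENABLED_COLLECTORS"); ok {
        // Allow-list: only the listed names run.
        for _, n := range strings.Fields(v) {
            if n == name {
                return true
            }
        }
        return false
    }
    return true // neither variable set: default to all collectors enabled
}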

Collaborator

My position is as follows:

  • exporters should be configurable, as they depend on the ecosystem around periscope
  • diagnosers and collectors don't have the same kind of dependencies and should not be configurable. Periscope provides a bunch of logs; it's up to the user to use them or not, and to filter or sort them later on
  • I see no issue with backcompat; these are env variables. If the code doesn't use them, it won't break anything. Why do we need these vars anyway?

Collaborator Author

We are going around in circles... Arc scenarios need the ability to disable some of the current collectors. The back-compat concern is about the existing CLI implementation, which might already make use of COLLECTOR_LIST (Tats can confirm).

@@ -32,6 +32,10 @@ func NewAzureBlobExporter(creationTime, hostname string) *AzureBlobExporter {
}
}

func (exporter *AzureBlobExporter) GetName() string {
Collaborator

Why do we need GetName for exporters?

Collaborator Author

So we can use it to select them. Currently there is only one exporter (azureblob), but I have a PR in the wings for localmachine as well.

Collaborator

I do hope we are not going to select any structure based on a display name :D

Collaborator Author

Why not? This is fine for the interim - as I mentioned in the initial description, a better idea is to use a type registry and remove the need for GetName to be defined at all, but that is overkill for now.

Using the display name means we are not burdened with the unnecessary complexity of introducing an intermediate concept whose only purpose is to map from one name to another, and which we would have to refactor to remove if and when we move to a type registry anyway.
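
For illustration, a hedged sketch of what that later type-registry refactor could look like, loosely following the registry idea in the linked Stack Overflow answer (none of this is in the PR; the names are placeholders):

package registry

import "reflect"

// Collector stands in for interfaces.Collector; with a registry keyed by
// struct name, it no longer needs a GetName() method.
type Collector interface {
    Collect() error
}

var collectorTypes = map[string]reflect.Type{}

// Register records a collector type under its struct name, without keeping
// an instance around.
func Register(c Collector) {
    t := reflect.TypeOf(c)
    if t.Kind() == reflect.Ptr {
        t = t.Elem()
    }
    collectorTypes[t.Name()] = t
}

// New instantiates a registered collector by name, so only the collectors
// that are actually selected for execution ever get created.
func New(name string) (Collector, bool) {
    t, ok := collectorTypes[name]
    if !ok {
        return nil, false
    }
    return reflect.New(t).Interface().(Collector), true
}

Usage would be something along the lines of Register(&DnsCollector{}) at start-up (DnsCollector being a placeholder name), after which the ENABLED_COLLECTORS filtering only has to call New for the names it keeps.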

Collaborator

There is no need to select exporters for now anyway so I believe this is just creating noise. I'd rather come up with an elegant solution the day we need to do this.

Collaborator Author

Well - the day we need to do this is today :) This PR would have included the motivation for doing it (the localMachineExporter), but on request I split the original PR up into multiple smaller PRs, and I haven't submitted the localMachineExporter one yet as I only want to have to rebase in one direction.

How is it inelegant to do this in the same way as collectors and diagnosers, and without having to introduce any new concepts? The pre-existing COLLECTOR_LIST is an inelegant solution by comparison, as it introduces another layer of indirection in the middle.

func selectCollectors(allCollectorsByName map[string]interfaces.Collector) []interfaces.Collector {
collectors := []interfaces.Collector{}

enabledCollectorString, found := os.LookupEnv("ENABLED_COLLECTORS")
Member
@Tatsinnit Jul 20, 2021

💡 What will the associated deployment.yaml changes look like for this new var ENABLED_COLLECTORS?

Collaborator Author

enabledCollectorString, found := os.LookupEnv("ENABLED_COLLECTORS")
if !found {
//if not defined, default to all collectors enabled
enabledCollectorString = "containerlogs dns helm iptables kubeletcmd kubeobjects networkoutbound nodelogs osm smi systemlogs systemperf"
Member
@Tatsinnit Jul 20, 2021

💡 If this is the default list of collectors, then it is incorrect, because osm and smi are both behind conditional checks, which are based on OSM or SMI being a value in collectorList.

Collaborator Author

A collectorList value of OSM currently selects both the OSM and SMI collectors to run; a value of SMI selects just the SMI collector. The intention is for the default behaviour to be running ALL collectors.

Collaborator Author

For the current behaviour when collectorList = OSM, see:

if contains(collectorList, "OSM") {

Collaborator Author

Well, see the two lines under that one 🙂

Collaborator Author

We should try to remove the need to update this string as a requirement of adding a new collector. That should dovetail well with the change to use a proper type registry mentioned in the initial PR description.
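
For example, a small sketch of removing the hard-coded default (assuming a name -> collector registry like the one used for selection above; illustrative only, not part of this PR):

package selection

import (
    "sort"
    "strings"
)

// defaultEnabledCollectors derives the "all collectors enabled" fallback from
// the registry keys, so adding a new collector never requires editing a
// hard-coded default string.
func defaultEnabledCollectors(registry map[string]Collector) string {
    names := make([]string, 0, len(registry))
    for name := range registry {
        names = append(names, name)
    }
    sort.Strings(names)
    return strings.Join(names, " ")
}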

exporters := []interfaces.Exporter{}

//read list of exporters that are enabled
enabledExportersString, found := os.LookupEnv("ENABLED_EXPORTERS")
Member

💡 What will the associated deployment.yaml changes look like for this new var?

Collaborator Author

runDiagnosers(diagnosers, exporters, diagnoserGrp)
diagnoserGrp.Wait()

zipAndExportString, found := os.LookupEnv("ZIP_AND_EXPORT")
Member

💡 What will the associated deployment.yaml changes look like for this new var ZIP_AND_EXPORT?

Collaborator Author

https://github.com/davidkydd/aks-periscope/blob/0071201dd41cd159e87e5a574bdfbcc271c122b1/deployment/aks-periscope-arc.yaml#L164 - minor nitpick, but if you can find a way that the bool value doesn't have to be wrapped in double quotes, that would be preferable.

Collaborator

I'd suggest not modifying this part, as we should remove this zip function ASAP. What do you think?

Member

💡 +1 to what Arnaud is trying to convey. I think ZIP_AND_EXPORT is a new design (or probably a new use case); as mentioned above, what does it achieve when true?

  • i.e. in the current, already-refactored code, in-memory files get uploaded as logs and are zipped at the end.

This new flag adds new behaviour. Why is ZIP_AND_EXPORT required at all?

Collaborator Author

  1. The benefit of removing the zip function is that there is less processing done on the node running periscope.
  2. The benefit of using the zip function is that fewer bits need to be sent over the network.

I agree that the current usage of the zip functionality is not optimized. However, as there are benefits to both zipping and not zipping, instead of removing it we should make it configurable.

Zipping has less benefit when exporting to AzureBlob, because the traffic will usually all be within the MSFT network and hence size is less critical (though still important).

Arc has asked for a LocalMachineExporter, where the CLI is then used to download periscope output directly from the node. This should be zipped first, as size is likely a more important consideration there.

The current function of ZIP_AND_EXPORT is a minimal change which supports the Arc LocalMachineExporter scenario, and allows other clients to still disable the ZIP functionality as an optimization if they wish to just rely on the current behaviour of the AzureBlobExporter.

Later, we should refactor the AzureBlobExporter so that it also respects the ZIP_AND_EXPORT config (or some equivalent config), which will let the user configure the behaviour and decide which of 1. or 2. is more important for their scenario. In-memory files should either be zipped and exported to AzureBlob, or just exported as files - doing both is redundant. The ZIP_AND_EXPORT setting (or whatever we decide to call it) should be configurable per exporter, so that users can, if they wish, have the AzureBlobExporter export un-zipped files and the LocalMachineExporter export a zipped file. Zipping should only be performed once, with the file shared between exporters for efficiency.
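
A hedged sketch of that arrangement (the Exporter methods and the per-exporter setting are hypothetical, not part of this PR): zip at most once, then give each exporter either the zip or the raw files according to its own setting.

package export

// Exporter is a stand-in for interfaces.Exporter, extended with hypothetical
// methods for this sketch.
type Exporter interface {
    GetName() string
    ExportFiles(paths []string) error
    ExportZip(zipPath string) error
}

// exportAll zips the output at most once and shares the result, while each
// exporter's own setting (wantsZip) decides whether it receives the zip or
// the raw files.
func exportAll(exporters []Exporter, files []string, wantsZip map[string]bool, zip func([]string) (string, error)) error {
    zipPath := ""
    for _, e := range exporters {
        if !wantsZip[e.GetName()] {
            if err := e.ExportFiles(files); err != nil {
                return err
            }
            continue
        }
        if zipPath == "" {
            p, err := zip(files) // zipping is performed only once
            if err != nil {
                return err
            }
            zipPath = p
        }
        if err := e.ExportZip(zipPath); err != nil {
            return err
        }
    }
    return nil
}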

Member
@Tatsinnit Jul 30, 2021

💡 Also, why do we need the LocalMachineExporter? Please help me understand: what benchmarking do we have for the processing cycles you mentioned in the comment above - what it uses right now and what improvement this will make? Re: the ZIP function, we don't want to increase the amount of processing cycles periscope consumes on the customer cluster.

There are many open-ended questions again, like what goals this PR eventually has.

We handle the logs in memory, and the current mechanism is to make less noise and then zip and export at the end. What is the use case for this LocalMachineExporter - in what scenario would it happen? Currently this tool doesn't support it at all, because the pods on each node upload these logs.

Collaborator Author

Why restrict the zip functionality to the localMachineExporter? It is a general feature, and other exporters might wish to make use of it in future - even the AzureBlobExporter.

Yes, we are currently sending more data than necessary - with this config flag, other tools / consumers of periscope can choose to disable the zip feature if they wish.

Also note this PR isn't aimed at changing, and doesn't change, the current behaviour of the zip (the flag value defaults to true).
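
For illustration, a minimal sketch of that default-to-true behaviour (the parsing details are mine; only the ZIP_AND_EXPORT name comes from this PR):

package config

import (
    "os"
    "strconv"
)

// zipAndExportEnabled reads ZIP_AND_EXPORT and preserves the existing
// behaviour (zipping enabled) when the variable is unset or unparsable.
func zipAndExportEnabled() bool {
    value, found := os.LookupEnv("ZIP_AND_EXPORT")
    if !found {
        return true
    }
    enabled, err := strconv.ParseBool(value)
    if err != nil {
        return true
    }
    return enabled
}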

Collaborator Author

Arc needs the LocalMachineExporter for the case where a cluster cannot egress to Azure storage due to a firewall or other egress restrictions.

We don't need to benchmark - it is a general statement that zipping requires CPU and reduces file size.

Member

💡 Cool, so PR #57 is somehow connected to this work, i.e. this is groundwork for achieving the localMachineExporter? Because if it is for the localMachineExporter, we should give some thought to the actual use of a local node file save vs. where_to_in_local_save?

  • Suggestion: I would recommend opening a work item that clearly summarises or details the purpose of this refactor, probably at this repo level, and possibly what the end goal of the refactor is - or is this an all-in-one refactor PR?

  • Regarding data and the other discussion, I would recommend thinking about client-go adoption. I might land something as a first adoption point, but any other PR should first consider how we can use client-go, and whether it is feasible (or needs discussion, etc.).

Thanks 🙏

Collaborator Author
@davidkydd Aug 2, 2021

Not PR #57 exactly but yes - I have an equivalent LocalMachineExporter which is rebased on the latest changes that have since gone into master.

This PR does not change the location of the node file save; if we want to debate or change that, we can do so later, outside of this PR.

This is a refactor to enable the current set of requirements, which we have discussed many times - I don't see the value in opening new work items for work that is already completed.

The client-go work is orthogonal to this PR, as this PR does not involve kubectl commands.

@davidkydd davidkydd closed this Aug 4, 2021