
Selectable C, D, E's #83

Closed
wants to merge 5 commits into from

Conversation

davidkydd
Collaborator

Replace COLLECTOR_LIST config with ENABLED_COLLECTORS, ENABLED_DIAGNOSERS, ENABLED_EXPORTERS.

Basically this works by creating a type registry: instantiate a map of type names -> instances of those types, then filter the map against a []string, so that only types whose names are present in the []string are selected for execution.

There are some alternative (and more comprehensive) ideas for a type registry here: https://stackoverflow.com/questions/23030884/is-there-a-way-to-create-an-instance-of-a-struct-from-a-string

I think "dolmen"s answer looks good, as it would let us get rid of the "GetName()" method and just use the name of the struct directly, whilst also allowing us to only instantiate the types we end up executing. But I feel that could probably be done as a later refactor.

diagnosers := []interfaces.Diagnoser{}

//read list of diagnosers that are enabled
enabledDiagnoserString, found := os.LookupEnv("ENABLED_DIAGNOSERS")
Collaborator

Are ENABLED_XXX new variables? If yes, is it possible to remove them completely?
We already discussed this point, and I think for a first iteration, having
//if not defined, default to all diagnosers enabled
enabledDiagnoserString = "networkconfig networkoutbound"
is what we want.
The user should only have to choose exporters.

What do you think? @Tatsinnit

Collaborator Author

Yes, these are new variables - and yes, we discussed this. I argue it is a useful capability to support across all collectors / diagnosers / exporters. Arc needs collectors to be configurable like this too, and it is only a matter of time until diagnosers need to be as well. There is no harm in defaulting them all to on (so the user doesn't have to choose anything) while also supporting overrides like this.

Member
@Tatsinnit Jul 20, 2021

Thanks for the discussion, guys! Given what I know of this tool's current behaviour, I will +1 not adding new variables without thinking through the whole effect they will have, especially since some of the osm and smi work is using COLLECTOR_LIST to uniquely identify new work.

Where adding a new variable needs clarity in this design:

  • Is this a breaking change to the default working behaviour?
  • Why are we adding these new variables? What is the use case, and which requirement needs this as a selectable option?
  • Will the new ENABLED_COLLECTORS be supplied via the deployment.yaml file, and are we getting rid of COLLECTOR_LIST?

Thanks 🙏

Collaborator Author

Correct, COLLECTOR_LIST is intended to be deprecated. There are no backwards-compatibility issues or breaking changes from removing it as far as I am aware, as long as the az CLI and VS Code extension haven't yet been changed to use it?

The Arc scenario requires being able to run only some collectors, not the full set. Values will be provided via the deployment yaml; example here: https://github.com/davidkydd/aks-periscope/blob/0071201dd41cd159e87e5a574bdfbcc271c122b1/deployment/aks-periscope-arc.yaml#L147

Member

💡 Adding this as a thought: adding new configurable vars will also become a breaking change for az-cli and vscode, because those default vars need to be maintained somehow in the cleanest, least scattered manner (not sure how; a proposal was made here). I will think more about this, but at this point see this discussion: https://github.com/Azure/aks-periscope/pull/75/files#r661261039

  • Essentially, in the current state, removing COLLECTOR_LIST and adding new vars will break things, and any new changes on top will be backwards-breaking changes. Thanks.

Collaborator

I don't see the backcompat issue: if we removed this, the logs would still be collected as before (with some extra ones, but that can't be a problem).

Member
@Tatsinnit Jul 28, 2021

💡 Also, going back and possibly asking again: why are we adding these, please? ENABLED_COLLECTORS, ENABLED_DIAGNOSERS, ENABLED_EXPORTERS.

  • Is an Arc-specific requirement the key scenario here, or is this the refactor? I see what Arnaud is asking: how is collecting the default logs an issue? (If this is for Arc, they already have a connectedCluster flag with which they can differentiate their logic.)
  • Another idea: it might be worth adding the associated YAML files so we can see it all flowing through.

Please correct me: ENABLED_COLLECTORS is the COLLECTOR_LIST equivalent? And ENABLED_DIAGNOSERS is the new equivalent for selectable diagnosers? I'm not sure about the need for ENABLED_EXPORTERS - what is it, or is it directly related to ZIP_AND_EXPORT? (I also wonder why diagnosers need to be selectable.)

Also cc: @arnaud-tincelin in case you know. Thanks.

Collaborator Author

@arnaud-tincelin the backcompat issue has to do with integration with the az-cli and vscode extensions, and addressing it would prevent the issue @Tatsinnit described here from being a breaking change: #83 (comment)

Yes, ENABLED_COLLECTORS is a direct replacement for COLLECTOR_LIST. Arc also has a requirement for a LocalMachineExporter, and we need a way to selectively enable and disable this functionality.

The ability to determine which components (collectors, diagnosers, exporters) are executed via the deployment yaml is a generally useful feature to support, as it allows the user to control the behaviour of periscope at a coarse but clearly defined level. We have already seen from just a few feature requests that it is necessary to be able to control which collectors and exporters are executed, so we should be proactive and enable this capability for diagnosers at the same time.

Therefore:
We should standardize on naming and either add the three configs:

ENABLED_COLLECTORS
ENABLED_EXPORTERS
ENABLED_DIAGNOSERS

or the equivalent:

COLLECTOR_LIST
EXPORTER_LIST
DIAGNOSER_LIST

The ENABLED_XYZ form is more explicit than XYZ_LIST about what it actually does, which is why I am suggesting we use it. Imagine that later on we want to invert this logic for some reason (a very plausible scenario) and add a DISABLED_COLLECTORS config flag, for the requirement that an environment can define which collectors cannot run on it while still automatically including and running any new collectors that are added by default.

In this scenario COLLECTOR_LIST would become ambiguous as to its function (is it the list of collectors that are enabled, or disabled?)
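
To make that concrete, a purely hypothetical sketch (neither DISABLED_COLLECTORS nor this helper exists in this PR) of how an explicit enable/disable pair would resolve, whereas a bare COLLECTOR_LIST name would not tell you which of the two meanings it carries:

package selection

import (
    "os"
    "strings"
)

// isCollectorEnabled is hypothetical: it shows how explicit ENABLED_ /
// DISABLED_ variables keep the semantics unambiguous.
func isCollectorEnabled(name string) bool {
    if v, ok := os.LookupEnv("DISABLED_COLLECTORS"); ok {
        // Deny-list: everything runs except the listed names, so newly
        // added collectors are picked up by default.
        for _, n := range strings.Fields(v) {
            if n == name {
                return false
            }
        }
        return true
    }
    if v, ok := os.LookupEnv("ENABLED_COLLECTORS"); ok {
        // Allow-list: only the listed names run.
        for _, n := range strings.Fields(v) {
            if n == name {
                return true
            }
        }
        return false
    }
    return true // neither variable set: default to all collectors enabled
}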

Collaborator

My position is as follows:

  • exporters should be configurable, as they depend on the ecosystem around periscope
  • diagnosers and collectors don't have the same kind of dependencies and should not be configurable. Periscope provides a bunch of logs; it's up to the user to use them or not, and to filter or sort them later on
  • I see no issue with backcompat; these are env variables. If the code doesn't use them, it won't break anything. Why do we need these vars anyway?

Collaborator Author

We are going around in circles... Arc scenarios need the ability to disable some of the current collectors. The back-compat concern is about the existing CLI implementation, which might already make use of COLLECTOR_LIST (Tats can confirm).

@@ -32,6 +32,10 @@ func NewAzureBlobExporter(creationTime, hostname string) *AzureBlobExporter {
}
}

func (exporter *AzureBlobExporter) GetName() string {
Collaborator

Why do we need GetName for exporters?

Collaborator Author

So we can use it to select them. Currently there is only one exporter (azureblob), but I have a PR in the wings for localmachine as well.

Collaborator

I do hope we are not going to select any structure based on a display name :D

Collaborator Author

Why not? This is fine for the interim - as I mentioned in the initial description, a better idea is to use a type registry and remove the need for GetName to be defined at all, but that is overkill for now.

Using the display name means we are not burdened with the unnecessary complexity of introducing an intermediate concept whose only purpose is to map from one name to another, and which we would have to refactor to remove if and when we move to a type registry anyway.
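
For illustration, a hedged sketch of what that later type-registry refactor could look like, loosely following the registry idea in the linked Stack Overflow answer (none of this is in the PR; the names are placeholders):

package registry

import "reflect"

// Collector stands in for interfaces.Collector; with a registry keyed by
// struct name, it no longer needs a GetName() method.
type Collector interface {
    Collect() error
}

var collectorTypes = map[string]reflect.Type{}

// Register records a collector type under its struct name, without keeping
// an instance around.
func Register(c Collector) {
    t := reflect.TypeOf(c)
    if t.Kind() == reflect.Ptr {
        t = t.Elem()
    }
    collectorTypes[t.Name()] = t
}

// New instantiates a registered collector by name, so only the collectors
// that are actually selected for execution ever get created.
func New(name string) (Collector, bool) {
    t, ok := collectorTypes[name]
    if !ok {
        return nil, false
    }
    return reflect.New(t).Interface().(Collector), true
}

Usage would be something along the lines of Register(&DnsCollector{}) at start-up (DnsCollector being a placeholder name), after which the ENABLED_COLLECTORS filtering only has to call New for the names it keeps.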

Collaborator

There is no need to select exporters for now anyway so I believe this is just creating noise. I'd rather come up with an elegant solution the day we need to do this.

Collaborator Author

Well - the day we need to do this is today :) This PR would have included the motivation for doing it (the localMachineExporter), but on request I split the original PR up into multiple smaller PRs, and I haven't submitted the localMachineExporter one yet as I only want to have to rebase in one direction.

How is it inelegant to do this in the same way as collectors and diagnosers, and without having to introduce any new concepts? The pre-existing COLLECTOR_LIST is an inelegant solution by comparison, as it introduces another layer of indirection in the middle.

func selectCollectors(allCollectorsByName map[string]interfaces.Collector) []interfaces.Collector {
collectors := []interfaces.Collector{}

enabledCollectorString, found := os.LookupEnv("ENABLED_COLLECTORS")
Member
@Tatsinnit Jul 20, 2021

💡 What will the associated deployment.yaml changes look like for this new var ENABLED_COLLECTORS?

Collaborator Author

enabledCollectorString, found := os.LookupEnv("ENABLED_COLLECTORS")
if !found {
//if not defined, default to all collectors enabled
enabledCollectorString = "containerlogs dns helm iptables kubeletcmd kubeobjects networkoutbound nodelogs osm smi systemlogs systemperf"
Member
@Tatsinnit Jul 20, 2021

💡 If this is the default list of collectors, then it is incorrect, because osm and smi are both behind conditional checks, which are based on OSM or SMI being a value in collectorList.

Collaborator Author

A collectorList value of OSM currently selects both the OSM and SMI collectors to run; a value of SMI selects just the SMI collector. The intention is for the default behaviour to be running ALL collectors.

Collaborator Author

For the current behaviour when collectorList = OSM, see:

if contains(collectorList, "OSM") {

Collaborator Author

Well, see the two lines under that one 🙂

Collaborator Author

We should try to remove the need to update this string as a requirement of adding a new collector. That should dovetail well with the change to use a proper type registry mentioned in the initial PR description.
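
For example, a small sketch of removing the hard-coded default (assuming a name -> collector registry like the one used for selection above; illustrative only, not part of this PR):

package selection

import (
    "sort"
    "strings"
)

// defaultEnabledCollectors derives the "all collectors enabled" fallback from
// the registry keys, so adding a new collector never requires editing a
// hard-coded default string.
func defaultEnabledCollectors(registry map[string]Collector) string {
    names := make([]string, 0, len(registry))
    for name := range registry {
        names = append(names, name)
    }
    sort.Strings(names)
    return strings.Join(names, " ")
}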

exporters := []interfaces.Exporter{}

//read list of exporters that are enabled
enabledExportersString, found := os.LookupEnv("ENABLED_EXPORTERS")
Member

💡 What will the associated deployment.yaml changes look like for this new var?

Collaborator Author

runDiagnosers(diagnosers, exporters, diagnoserGrp)
diagnoserGrp.Wait()

zipAndExportString, found := os.LookupEnv("ZIP_AND_EXPORT")
Member

💡 What will the associated deployment.yaml changes look like for this new var ZIP_AND_EXPORT?

Collaborator Author

https://github.com/davidkydd/aks-periscope/blob/0071201dd41cd159e87e5a574bdfbcc271c122b1/deployment/aks-periscope-arc.yaml#L164 - minor nitpick, but if you can find a way that the bool value doesn't have to be wrapped in double quotes, that would be preferable.

Collaborator

I'd suggest not modifying this part, as we should remove this zip function ASAP. What do you think?

Member

💡 +1 to what Arnaud is trying to convey. I think ZIP_AND_EXPORT is a new design (or probably a new use case); as mentioned above, what does it achieve when true?

  • i.e. in the current, already-refactored code, in-memory files get uploaded as logs and are zipped at the end.

This new flag adds new behaviour. Why is ZIP_AND_EXPORT required at all?

Collaborator Author

  1. The benefit of removing the zip function is that there is less processing done on the node running periscope.
  2. The benefit of using the zip function is that fewer bits need to be sent over the network.

I agree that the current usage of the zip functionality is not optimized. However, as there are benefits to both zipping and not zipping, instead of removing it we should make it configurable.

Zipping has less benefit when exporting to AzureBlob, because the traffic will usually all be within the MSFT network and hence size is less critical (though still important).

Arc has asked for a LocalMachineExporter, where the CLI is then used to download periscope output directly from the node. This should be zipped first, as size is likely a more important consideration there.

The current function of ZIP_AND_EXPORT is a minimal change which supports the Arc LocalMachineExporter scenario, and allows other clients to still disable the ZIP functionality as an optimization if they wish to just rely on the current behaviour of the AzureBlobExporter.

Later, we should refactor the AzureBlobExporter so that it also respects the ZIP_AND_EXPORT config (or some equivalent config), which will let the user configure the behaviour and decide which of 1. or 2. is more important for their scenario. In-memory files should either be zipped and exported to AzureBlob, or just exported as files - doing both is redundant. The ZIP_AND_EXPORT setting (or whatever we decide to call it) should be configurable per exporter, so that users can, if they wish, have the AzureBlobExporter export un-zipped files and the LocalMachineExporter export a zipped file. Zipping should only be performed once, with the file shared between exporters for efficiency.
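
A hedged sketch of that arrangement (the Exporter methods and the per-exporter setting are hypothetical, not part of this PR): zip at most once, then give each exporter either the zip or the raw files according to its own setting.

package export

// Exporter is a stand-in for interfaces.Exporter, extended with hypothetical
// methods for this sketch.
type Exporter interface {
    GetName() string
    ExportFiles(paths []string) error
    ExportZip(zipPath string) error
}

// exportAll zips the output at most once and shares the result, while each
// exporter's own setting (wantsZip) decides whether it receives the zip or
// the raw files.
func exportAll(exporters []Exporter, files []string, wantsZip map[string]bool, zip func([]string) (string, error)) error {
    zipPath := ""
    for _, e := range exporters {
        if !wantsZip[e.GetName()] {
            if err := e.ExportFiles(files); err != nil {
                return err
            }
            continue
        }
        if zipPath == "" {
            p, err := zip(files) // zipping is performed only once
            if err != nil {
                return err
            }
            zipPath = p
        }
        if err := e.ExportZip(zipPath); err != nil {
            return err
        }
    }
    return nil
}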

Member
@Tatsinnit Jul 30, 2021

💡 Also, why do we need the LocalMachineExporter? Please help me understand: what benchmarking do we have for the processing cycles you mentioned in the comment above - what it uses right now and what improvement this will make? Re: the ZIP function, we don't want to increase the amount of processing cycles periscope consumes on the customer cluster.

There are many open-ended questions again, like what goals this PR eventually has.

We handle the logs in memory, and the current mechanism is to make less noise and then zip and export at the end. What is the use case for this LocalMachineExporter - in what scenario would it happen? Currently this tool doesn't support it at all, because the pods on each node upload these logs.

Collaborator Author

Why restrict the zip functionality to the localMachineExporter? It is a general feature, and other exporters might wish to make use of it in future - even the AzureBlobExporter.

Yes, we are currently sending more data than necessary - with this config flag, other tools / consumers of periscope can choose to disable the zip feature if they wish.

Also note this PR isn't aimed at changing, and doesn't change, the current behaviour of the zip (the flag value defaults to true).
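
For illustration, a minimal sketch of that default-to-true behaviour (the parsing details are mine; only the ZIP_AND_EXPORT name comes from this PR):

package config

import (
    "os"
    "strconv"
)

// zipAndExportEnabled reads ZIP_AND_EXPORT and preserves the existing
// behaviour (zipping enabled) when the variable is unset or unparsable.
func zipAndExportEnabled() bool {
    value, found := os.LookupEnv("ZIP_AND_EXPORT")
    if !found {
        return true
    }
    enabled, err := strconv.ParseBool(value)
    if err != nil {
        return true
    }
    return enabled
}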

Collaborator Author

Arc needs the LocalMachineExporter for the case where a cluster cannot egress to Azure storage due to a firewall or other egress restrictions.

We don't need to benchmark - it is a general statement that zipping requires CPU and reduces file size.

Member

💡 Cool, so PR #57 is somehow connected to this work, i.e. this is groundwork for achieving the localMachineExporter? Because if it is for the localMachineExporter, we should give some thought to the actual use of a local node file save vs. where_to_in_local_save?

  • Suggestion: I would recommend opening a work item that clearly summarises or details the purpose of this refactor, probably at this repo level, and possibly what the end goal of the refactor is - or is this an all-in-one refactor PR?

  • Regarding data and the other discussion, I would recommend thinking about client-go adoption. I might land something as a first adoption point, but any other PR should first consider how we can use client-go, and whether it is feasible (or needs discussion, etc.).

Thanks 🙏

Collaborator Author
@davidkydd Aug 2, 2021

Not PR #57 exactly but yes - I have an equivalent LocalMachineExporter which is rebased on the latest changes that have since gone into master.

This PR does not change the location of the node file save; if we want to debate or change that, we can do so later, outside of this PR.

This is a refactor to enable the current set of requirements, which we have discussed many times - I don't see the value in opening new work items for work that is already completed.

The client-go work is orthogonal to this PR, as this PR does not involve kubectl commands.

@davidkydd davidkydd closed this Aug 4, 2021