Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Jaeger CR #193

Merged
merged 21 commits into from
Feb 26, 2019
Merged

Conversation

jpkrohling
Copy link
Contributor

Signed-off-by: Juraci Paixão Kröhling juraci@kroehling.de

@jpkrohling
Copy link
Contributor Author

This change is Reviewable

@jpkrohling
Copy link
Contributor Author

jpkrohling commented Feb 6, 2019

This PR is an early draft, meant only to get feedback about the direction.

The main changes here are:

  • strategy.S is now a type, instead of an interface, and holds arrays of objects that compose that strategy (like, list of deployments, list of config maps)
  • newAllInOneStrategy and similar methods now return a strategy.S
  • the handleCreate and handleUpdate functions in the controller were removed and we now have an apply method, accepting the strategy. This calls more specialized functions, like applyDeployments, passing what's the desired state for deployments
  • applyDeployments receives the desired state and compares with the current state by querying the cluster. It then builds an Inventory, which is composed of three arrays: Create (things that don't exist yet), Update (things that already exist) and Delete (things that no longer exist).

Still missing:

  • Polishing of the code
  • Fix some broken tests and compilation issues
  • Manual tests (what we have to far is only a draft of how it would work in theory)
  • e2e tests
  • accounts
  • configMaps
  • cronJobs
  • daemonSets
  • dependencies
  • deployments
  • ingresses
  • routes
  • services

Things to decide:

  • The code is more aware of the Kubernetes objects (Deployment vs. runtime.Object), which follows what we have in other parts of the code. I think we can make the code generic in a few places without major problems, but we'll end up having some duplication in some places. In the end, I think this is a fair price to pay, but wanted your opinions on that as well.

Copy link
Contributor

@objectiser objectiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a good approach - only comment is that it seems like any deployments no longer required will be deleted at the same time others are being created/updated. Would it be better to defer the deletions until the create/updates are ready?

@jpkrohling
Copy link
Contributor Author

Good point. This validates then that going for concrete types (Deployment) has clear advantages against the generic runtime.Object.

@jpkrohling
Copy link
Contributor Author

I did some manual tests with this PR in its current state and I believe it's already at a state where we can consider merging, even if updating other components are missing (services, config maps, ...)

Here's what I did for testing:

$ kubectl apply -f deploy/examples/simplest.yaml 
jaeger.io.jaegertracing/simplest created


$ kubectl get pods
NAME                        READY     STATUS    RESTARTS   AGE
cassandra-0                 1/1       Running   0          5h
cassandra-1                 1/1       Running   3          5h
cassandra-2                 1/1       Running   0          5h
elasticsearch-0             1/1       Running   0          5h
simplest-6586f6bb95-mbsvv   1/1       Running   0          17m


$ kubectl apply -f deploy/examples/simplest.yaml  ### changed the Spec to be like the simple-prod
jaeger.io.jaegertracing/simplest configured


$ kubectl get pods
NAME                                  READY     STATUS        RESTARTS   AGE
cassandra-0                           1/1       Running       0          5h
cassandra-1                           1/1       Running       3          5h
cassandra-2                           1/1       Running       0          5h
elasticsearch-0                       1/1       Running       0          5h
simplest-6586f6bb95-mbsvv             0/1       Terminating   0          17m
simplest-collector-784544999c-bqnkp   0/1       Running       0          7s
simplest-query-77499c65d-2cp5c        1/1       Running       0          7s


$ kubectl get pods
NAME                                  READY     STATUS    RESTARTS   AGE
cassandra-0                           1/1       Running   0          5h
cassandra-1                           1/1       Running   3          5h
cassandra-2                           1/1       Running   0          5h
elasticsearch-0                       1/1       Running   0          5h
simplest-collector-784544999c-bqnkp   1/1       Running   0          10s
simplest-query-77499c65d-2cp5c        1/1       Running   0          10s


$ kubectl apply -f deploy/examples/simplest.yaml  #### back to the original simplest.yaml 
jaeger.io.jaegertracing/simplest configured


$ kubectl get pods
NAME                                  READY     STATUS              RESTARTS   AGE
cassandra-0                           1/1       Running             0          5h
cassandra-1                           1/1       Running             3          5h
cassandra-2                           1/1       Running             0          5h
elasticsearch-0                       1/1       Running             0          5h
simplest-6586f6bb95-bvs2r             0/1       ContainerCreating   0          2s
simplest-collector-784544999c-bqnkp   0/1       Terminating         0          53s
simplest-query-77499c65d-2cp5c        0/1       Terminating         0          53s


$ kubectl get pods
NAME                        READY     STATUS    RESTARTS   AGE
cassandra-0                 1/1       Running   0          5h
cassandra-1                 1/1       Running   3          5h
cassandra-2                 1/1       Running   0          5h
elasticsearch-0             1/1       Running   0          5h
simplest-6586f6bb95-bvs2r   0/1       Running   0          4s

@jpkrohling jpkrohling changed the title WIP - Update Jaeger CR Update Jaeger CR Feb 8, 2019
@codecov
Copy link

codecov bot commented Feb 8, 2019

Codecov Report

Merging #193 into master will decrease coverage by 4.82%.
The diff coverage is 81.04%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #193      +/-   ##
==========================================
- Coverage   95.18%   90.35%   -4.83%     
==========================================
  Files          35       59      +24     
  Lines        1848     2645     +797     
==========================================
+ Hits         1759     2390     +631     
- Misses         75      164      +89     
- Partials       14       91      +77
Impacted Files Coverage Δ
pkg/storage/elasticsearch.go 67.16% <ø> (+4.34%) ⬆️
pkg/account/oauth-proxy.go 100% <100%> (ø) ⬆️
pkg/storage/dependency.go 100% <100%> (ø) ⬆️
pkg/strategy/streaming.go 100% <100%> (+3.63%) ⬆️
pkg/route/query.go 100% <100%> (ø) ⬆️
pkg/deployment/ingester.go 100% <100%> (ø) ⬆️
pkg/ingress/query.go 100% <100%> (ø) ⬆️
pkg/storage/elasticsearch_role.go 100% <100%> (ø) ⬆️
pkg/service/query.go 100% <100%> (ø) ⬆️
pkg/cronjob/spark_dependencies.go 97.77% <100%> (+0.21%) ⬆️
... and 66 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2e005fc...86cbee3. Read the comment docs.

@objectiser
Copy link
Contributor

@jpkrohling Looks promising, although main concern is that the new deployment is not confirmed as available before the old is removed - so there will be a time period where no traces are collected.

If we could solve this problem, then atleast if the user is using the same backend storage, it should be able to cope with continously collecting spans through the transition.

It also means that if the new deployment was to fail, we can abort and still have the old deployment available and continuing to collect.

One option is to deal with this scenario as a separate PR, which would be fine - as long as it was addressed soon.

@jpkrohling
Copy link
Contributor Author

Looks promising, although main concern is that the new deployment is not confirmed as available before the old is removed - so there will be a time period where no traces are collected.

Right, I think we discussed that but I missed it. I'll implement this next, still in this PR.

@jpkrohling jpkrohling force-pushed the UpdateJaegerInstance branch 2 times, most recently from bf746f5 to 32671d8 Compare February 11, 2019 14:34
@jpkrohling jpkrohling changed the title Update Jaeger CR WIP - Update Jaeger CR Feb 12, 2019
@jpkrohling
Copy link
Contributor Author

Marking as WIP, as there seems to be a problem with the unit test.

@jpkrohling jpkrohling changed the title WIP - Update Jaeger CR Update Jaeger CR Feb 12, 2019
@jpkrohling
Copy link
Contributor Author

@objectiser this is now ready to be reviewed.

pkg/controller/jaeger/jaeger_controller.go Outdated Show resolved Hide resolved
pkg/controller/jaeger/jaeger_controller.go Outdated Show resolved Hide resolved
pkg/controller/jaeger/jaeger_controller.go Outdated Show resolved Hide resolved
pkg/inventory/deployment_test.go Outdated Show resolved Hide resolved
pkg/strategy/all-in-one_test.go Outdated Show resolved Hide resolved
pkg/strategy/controller_test.go Outdated Show resolved Hide resolved
pkg/strategy/production_test.go Outdated Show resolved Hide resolved
pkg/strategy/streaming_test.go Outdated Show resolved Hide resolved
@objectiser
Copy link
Contributor

objectiser commented Feb 12, 2019

@jpkrohling Tested by deploying simplest.yaml with the name field changed to simple-prod. Once running, I then applied simple-prod.yaml - it started the separate query and collector, but didn't terminate the all-in-one deployment.

$ kubectl get pods
NAME                                     READY   STATUS    RESTARTS   AGE
elasticsearch-0                          1/1     Running   0          7m6s
simple-prod-66bcdbb775-zztnm             1/1     Running   0          3m44s
simple-prod-collector-86db8fb7d5-4pxl9   1/1     Running   0          2m2s
simple-prod-query-74f6c86476-wz8dw       1/1     Running   0          2m2s

UPDATE: Ignore, this was using the wrong image.

@jpkrohling jpkrohling changed the title Update Jaeger CR WIP - Update Jaeger CR Feb 13, 2019
@jpkrohling
Copy link
Contributor Author

As talked on IRC, this PR is missing the basic functionality of creating/updating non-deployment objects :-) I'm updating it with the other parts, but wanted to leave it for a short review while I fix a bug with config maps.

The second commit from this PR contains the changes to support other non-deployment objects (missing a few ones, WIP).

We talked about having a generic vs. concrete approach and decided in favor of the concrete approach. The second PR shows that there's almost a 1-to-1 code duplication for the other resources, but I think the code is simple enough to justify this duplication. Making this code more generic would make things less flexible and would increase complexity, as there's no common interface between all the objects other than runtime.Object, which would require type checking whenever we need to use the object's name (as runtime.Object itself has no Name/Namespace properties).

Even though I believe the way it's done in this PR is the right choice, I'd like to get your opinion before moving forward.

@jpkrohling
Copy link
Contributor Author

The third commit in this PR shows what's the required code for individual object types (ingresses, in the example).

Copy link
Contributor

@objectiser objectiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with the separate types for now - possibly some of the more straightforward types could be handled in a generic way eventually.

pkg/controller/jaeger/ingress.go Show resolved Hide resolved
@jpkrohling jpkrohling changed the title WIP - Update Jaeger CR Update Jaeger CR Feb 21, 2019
@jpkrohling
Copy link
Contributor Author

@pavolloffay : could you please review this one as @objectiser is away?

@pavolloffay
Copy link
Member

sure I will have a look. I would like to first fix the e2e tests so ES can be merged.

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
@jpkrohling
Copy link
Contributor Author

I fixed the OpenShift problems (v3.11), but there's still one point open that might be a bug in the operator SDK. When running in Kubernetes, the reconcile loop is being called only when the CR is changed, while in OpenShift it's being called every second. Not sure yet why, but I opened a thread in the operator-framework mailing list addressing that.

I also addressed the comments from the last review, except one (left a question on the item itself).

@pavolloffay
Copy link
Member

It would be good to add a link to operator ML to this thread.

Do you want me to re-review?

@jpkrohling
Copy link
Contributor Author

It would be good to add a link to operator ML to this thread.

https://groups.google.com/d/msg/operator-framework/DZynMacOgoY/RWaSFxFUBwAJ

I'll track this in a separate issue so that it won't block this PR (#203 is probably related)

Do you want me to re-review?

Yes, please.

Copy link
Member

@pavolloffay pavolloffay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the tests are a bit shallow. They do tests change in annotations - basically ObjectMeta which fine but they should more importantly tests if core fields of objects get updated.

It seems that my previous comments were not fixed, there is repeated logging and missing documentation for the update - how to update and what is supported.

pkg/controller/jaeger/account.go Show resolved Hide resolved
Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
@pavolloffay
Copy link
Member

running on minishift produces following errors. THe serviceaccount error repeates over time.

 ✘ 130 ploffay  ~/projects/golang/src/github.com/jaegertracing/jaeger-operator   PR193  make run WATCH_NAMESPACE=myproject                                                                                                                        11:33 
INFO[0000] Versions                                      arch=amd64 operator-sdk=v0.4.1 os=linux version=go1.11.1
INFO[0000] Auto-detected the platform                    platform=openshift
INFO[0000] Starting the Cmd.                            
INFO[0010] Storage type wasn't provided for the Jaeger instance 'simplest'. Falling back to 'memory' 
ERRO[0011] failed to apply the changes                   error="Deployment.apps \"simplest\" not found" instance=simplest namespace=myproject
INFO[0012] Storage type wasn't provided for the Jaeger instance 'simplest'. Falling back to 'memory' 
INFO[0024] Configured Jaeger instance                    instance=simplest namespace=myproject
ERRO[0024] failed to apply the changes                   error="routes.route.openshift.io \"simplest\" already exists" instance=simplest namespace=myproject
INFO[0025] Configured Jaeger instance                    instance=simplest namespace=myproject
ERRO[0025] failed to apply the changes                   error="Operation cannot be fulfilled on serviceaccounts \"simplest\": the object has been modified; please apply your changes to the latest version and try again" instance=simplest namespace=myproject
INFO[0026] Configured Jaeger instance                    instance=simplest namespace=myproject
INFO[0026] Configured Jaeger instance                    instance=simplest namespace=myproject
ERRO[0026] failed to apply the changes                   error="Operation cannot be fulfilled on serviceaccounts \"simplest-ui-proxy\": the object has been modified; please apply your changes to the latest version and try again" instance=simplest namespace=myproject
INFO[0027] Configured Jaeger instance                    instance=simplest namespace=myproject
ERRO[0027] failed to apply the changes                   error="Operation cannot be fulfilled on serviceaccounts \"simplest-ui-proxy\": the object has been modified; please apply your changes to the latest version and try again" instance=simplest namespace=myproject

@jpkrohling
Copy link
Contributor Author

running on minishift produces following errors. THe serviceaccount error repeates over time.

Are you running on a clean OpenShift, or could this be something left behind from previous iterations? I do get the service account one from time to time, but should be relatively innocuous, once #231 is fixed.

@pavolloffay
Copy link
Member

It's clean OpenShift. The #231 causes the error to pop up more ofter, maybe there is something wrong with SA bc other objects are not logged with the same modified error.

@pavolloffay
Copy link
Member

The problem is that ResourceVersion is SA is always different as it runs the loop. Adding tp.ResourceVersion = v.ResourceVersion to inventory/sa fixes the problem - although I am not sure whether it's correct to do so.

@jpkrohling
Copy link
Contributor Author

The resource version field is the reason we update individual fields :) We are not supposed to touch that one: https://groups.google.com/forum/#!topic/operator-framework/KqgrZ3v2P9U

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
@jpkrohling
Copy link
Contributor Author

They do tests change in annotations - basically ObjectMeta which fine but they should more importantly tests if core fields of objects get updated.

I added a few more tests for fields within the Spec (or related) objects.

@pavolloffay
Copy link
Member

The resource version field is the reason we update individual fields :) We are not supposed to touch that one: https://groups.google.com/forum/#!topic/operator-framework/KqgrZ3v2P9U

thanks for the pointer. There is definitelly something fishy with SA or secrets. I have noticed that every loop creates new secrets.

oc get secrets | wc -l                                                                                                                                           1:54 
597

@jpkrohling
Copy link
Contributor Author

There is definitelly something fishy with SA or secrets. I have noticed that every loop creates new secrets.

That's a great hint! I did not notice this before, let me see if I can figure out what's wrong.

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
@jpkrohling
Copy link
Contributor Author

So, I tracked the problem down to the reconcile logic, where we store back the CR as we might have done some reconciliation. Once that is done, I don't see #231 anymore and the number of service accounts becomes stable.

I'll track that in a different issue, to unblock this PR.

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
@jpkrohling
Copy link
Contributor Author

e2e tests are passing, will merge this:

$ make test
Running unit tests...
?   	github.com/jaegertracing/jaeger-operator/cmd	[no test files]
?   	github.com/jaegertracing/jaeger-operator/cmd/manager	[no test files]
ok  	github.com/jaegertracing/jaeger-operator/pkg/account	0.017s	coverage: 100.0% of statements
?   	github.com/jaegertracing/jaeger-operator/pkg/apis	[no test files]
ok  	github.com/jaegertracing/jaeger-operator/pkg/apis/io/v1alpha1	0.037s	coverage: 15.8% of statements
?   	github.com/jaegertracing/jaeger-operator/pkg/cmd/start	[no test files]
?   	github.com/jaegertracing/jaeger-operator/pkg/cmd/version	[no test files]
ok  	github.com/jaegertracing/jaeger-operator/pkg/config/sampling	0.007s	coverage: 94.1% of statements
ok  	github.com/jaegertracing/jaeger-operator/pkg/config/ui	0.015s	coverage: 94.1% of statements
?   	github.com/jaegertracing/jaeger-operator/pkg/controller	[no test files]
?   	github.com/jaegertracing/jaeger-operator/pkg/controller/deployment	[no test files]
ok  	github.com/jaegertracing/jaeger-operator/pkg/controller/jaeger	2.339s	coverage: 73.9% of statements
ok  	github.com/jaegertracing/jaeger-operator/pkg/cronjob	0.080s	coverage: 93.9% of statements
ok  	github.com/jaegertracing/jaeger-operator/pkg/deployment	0.079s	coverage: 100.0% of statements
ok  	github.com/jaegertracing/jaeger-operator/pkg/ingress	0.101s	coverage: 100.0% of statements
ok  	github.com/jaegertracing/jaeger-operator/pkg/inject	0.026s	coverage: 100.0% of statements
ok  	github.com/jaegertracing/jaeger-operator/pkg/inventory	0.059s	coverage: 91.8% of statements
ok  	github.com/jaegertracing/jaeger-operator/pkg/route	0.059s	coverage: 100.0% of statements
ok  	github.com/jaegertracing/jaeger-operator/pkg/service	0.040s	coverage: 100.0% of statements
ok  	github.com/jaegertracing/jaeger-operator/pkg/storage	0.631s	coverage: 93.2% of statements
?   	github.com/jaegertracing/jaeger-operator/pkg/storage/elasticsearch/v1alpha1	[no test files]
ok  	github.com/jaegertracing/jaeger-operator/pkg/strategy	0.015s	coverage: 92.1% of statements
ok  	github.com/jaegertracing/jaeger-operator/pkg/util	0.010s	coverage: 100.0% of statements
?   	github.com/jaegertracing/jaeger-operator/pkg/version	[no test files]
Formatting code...
Building...
Sending build context to Docker daemon  130.8MB
Step 1/6 : FROM centos
 ---> 1e1148e4cc2c
Step 2/6 : RUN INSTALL_PKGS="       openssl       " &&     yum install -y $INSTALL_PKGS &&     rpm -V $INSTALL_PKGS &&     yum clean all &&     mkdir /tmp/_working_dir &&     chmod og+w /tmp/_working_dir
 ---> Using cache
 ---> 82728e7bfc7b
Step 3/6 : COPY scripts/* /scripts/
 ---> Using cache
 ---> 42400106dc1c
Step 4/6 : USER nobody
 ---> Using cache
 ---> a590f9fcf745
Step 5/6 : ADD build/_output/bin/jaeger-operator /usr/local/bin/jaeger-operator
 ---> 55e2e02811d3
Step 6/6 : ENTRYPOINT ["/usr/local/bin/jaeger-operator"]
 ---> Running in 666da8612545
Removing intermediate container 666da8612545
 ---> 1fc614a5959e
Successfully built 1fc614a5959e
Successfully tagged jpkroehling/jaeger-operator:latest
Pushing image jpkroehling/jaeger-operator:latest...
Running end-to-end tests...
ok  	github.com/jaegertracing/jaeger-operator/test/e2e	371.189s

@jpkrohling jpkrohling merged commit 0337cc9 into jaegertracing:master Feb 26, 2019
}
objs, err := es.CreateElasticsearchObjects(cDep.Spec.Template.Spec.ServiceAccountName, queryDep.Spec.Template.Spec.ServiceAccountName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpkrohling this PR broke Elasticsearch functionality, note that es.CreateElasticsearchObjects was returning ES CR in the object array. Now the ES CR is not being processed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants