kubernetes operator not setting folder permissions before de-escalating permissions #1363

Open
psarossy opened this issue May 23, 2023 · 10 comments
Labels
bug Something isn't working never-stale Issue or PR marked to never go stale

Comments

@psarossy

Describe the bug

This is similar to #1327, but with CephFS PVCs. See also https://stackoverflow.com/questions/67771239/rabbitmq-fails-to-start-with-persistence-storage-on-kubernetes-permission-denie

The pod starts up but has no write access to the mnesia folder.

I've deployed the standard example operator and test cluster from: https://rabbitmq.com/kubernetes/operator/quickstart-operator.html

The only modification I made to the test cluster was setting the storage class.

To Reproduce

Steps to reproduce the behavior:

  1. kubectl apply -f https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml
  2. Enable persistence with a storage class:

     apiVersion: rabbitmq.com/v1beta1
     kind: RabbitmqCluster
     metadata:
       name: hello-world
     spec:
       persistence:
         storageClassName: nvme-pool-ec62
         storage: 20Gi

  3. kubectl apply -f rabbitmq.yaml

Expected behavior

  1. Pod, PVC, and PV are provisioned
  2. Pod attaches the PV
  3. Pod starts

At step 3, the process fails because the RabbitMQ binary does not have write access to the persistence volume:

Stream closed EOF for default/hello-world-server-0 (rabbitmq)                                                                                                                                            
rabbitmq 2023-05-23 19:30:58.187362+00:00 [warning] <0.132.0> Failed to write PID file "/var/lib/rabbitmq/mnesia/rabbit@hello-world-server-0.hello-world-nodes.default.pid": permission denied           
rabbitmq 2023-05-23 19:31:01.039621+00:00 [notice] <0.44.0> Application syslog exited with reason: stopped                                                                                               
rabbitmq 2023-05-23 19:31:01.039898+00:00 [notice] <0.230.0> Logging: switching to configured handler(s); following messages may not be visible in this log output                                       
rabbitmq 2023-05-23 19:31:01.066987+00:00 [notice] <0.230.0> Logging: configured log handlers are now ACTIVE                                                                                             
rabbitmq                                                                                                                                                                                                 
rabbitmq BOOT FAILED                                                                                                                                                                                     
rabbitmq ===========                                                                                                                                                                                     
rabbitmq Error during startup: {error,                                                                                                                                                                   
rabbitmq 2023-05-23 19:31:01.126017+00:00 [error] <0.230.0>                                                                                                                                              
rabbitmq 2023-05-23 19:31:01.126017+00:00 [error] <0.230.0> BOOT FAILED                                                                                                                                  
rabbitmq 2023-05-23 19:31:01.126017+00:00 [error] <0.230.0> ===========                                                                                                                                  
rabbitmq 2023-05-23 19:31:01.126017+00:00 [error] <0.230.0> Error during startup: {error,                                                                                                                
rabbitmq 2023-05-23 19:31:01.126017+00:00 [error] <0.230.0>                           {cannot_create_mnesia_dir,                                                                                         
rabbitmq 2023-05-23 19:31:01.126017+00:00 [error] <0.230.0>                               "/var/lib/rabbitmq/mnesia/rabbit@hello-world-server-0.hello-world-nodes.default/",                             
rabbitmq 2023-05-23 19:31:01.126017+00:00 [error] <0.230.0>                               eacces}}                                                                                                       
rabbitmq                           {cannot_create_mnesia_dir,                                                                                                                                            
rabbitmq 2023-05-23 19:31:01.126017+00:00 [error] <0.230.0>                                                                                                                                              
rabbitmq                               "/var/lib/rabbitmq/mnesia/rabbit@hello-world-server-0.hello-world-nodes.default/",                                                                                
rabbitmq                               eacces}}                                                                                                                                                          
rabbitmq                                                                                                                                                                                                 
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>   crasher:                                                                                                                                   
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     initial call: application_master:init/4                                                                                                  
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     pid: <0.229.0>                                                                                                                           
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     registered_name: []                                                                                                                      
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     exception exit: {{cannot_create_mnesia_dir,                                                                                              
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>                          "/var/lib/rabbitmq/mnesia/rabbit@hello-world-server-0.hello-world-nodes.default/",                                  
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>                          eacces},                                                                                                            
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>                      {rabbit,start,[normal,]]}}                                                                                              
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>       in function  application_master:init/4 (application_master.erl, line 142)                                                              
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     ancestors: [<0.228.0>]                                                                                                                   
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     message_queue_len: 1                                                                                                                     
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     messages: [{'EXIT',<0.230.0>,normal}]                                                                                                    
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     links: [<0.228.0>,<0.44.0>]                                                                                                              
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     dictionary: []                                                                                                                           
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     trap_exit: true                                                                                                                          
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     status: running                                                                                                                          
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     heap_size: 610                                                                                                                           
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     stack_size: 28                                                                                                                           
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     reductions: 178                                                                                                                          
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>   neighbours:                                                                                                                                
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>                                                                                                                                              
rabbitmq 2023-05-23 19:31:02.143479+00:00 [notice] <0.44.0> Application rabbit exited with reason: {{cannot_create_mnesia_dir,"/var/lib/rabbitmq/mnesia/rabbit@hello-world-server-0.hello-world-nodes.default/",eacces},{rabbit,start,[normal,]]}}
rabbitmq {"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{{cannot_create_mnesia_dir,\"/var/lib/rabbitmq/mnesia/rabbit@hello-world-server-0.hello-world-nodes.default/\",eacces},{rabbit,start,[normal,]]}}}"} 
rabbitmq Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{{cannot_create_mnesia_dir,"/var/lib/rabbitmq/mnesia/rabbit@hello-world-server-0.hello-world-nodes.default/", eacces},{rabbit,start,[normal,]]}}})                                                                                                                                                                     
rabbitmq                                                                                                                                                                                                 
rabbitmq Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done                                                                                                                         
Stream closed EOF for default/hello-world-server-0 (setup-container)

The volume that gets created is owned by root by default as with all other PVCs:

psarossy@artemis: ~/ceph/volumes/csi/csi-vol-f6c8aaf0-8c7b-4bdb-a7dd-fb514c9d3639/26f680c7-a460-4789-bb9d-9b085672b406
$ ls -al
total 0
drwxr-xr-x 2 root root 0 May 23 11:29 .
drwxr-xr-x 3 root root 2 May 23 11:29 ..

If I change the folder ownership to UID/GID 999:999 (rabbitmq:rabbitmq), the pod starts up and works fine.

The statefulset is missing the command to claim the folder as part of init before handing over to the non-privileged user to start the process. Unfortunately, this needs to be fixed in the operator: every pod has the same issue when new PVCs are created, and the operator (rightfully) overwrites any manual changes to the configs.

Version and environment information

  • RabbitMQ: 3.11.10
  • RabbitMQ Cluster Operator: 2.2.0
  • Kubernetes: v1.23.2
  • Cloud provider or hardware configuration: baremetal via kubeadm on Dell servers with Ceph Rook
@psarossy psarossy added the bug Something isn't working label May 23, 2023
@lukebakken
Contributor

> The statefulset is missing the command to claim the folder as part of init before handing over to the non-privileged user to start the process

It sounds like you understand the issue well. A pull request to fix it would be very welcome. Thanks.

@psarossy
Author

Did some more digging: the Helm recipe has a specific init container to fix this that can be enabled on demand, and with that the pods start up as expected and work.

I tried to get that option added here, but I couldn't even get the code to build and pass tests without my modifications, so I gave up after about 2 hours...

@github-actions

This issue has been marked as stale due to 60 days of inactivity. Stale issues will be closed after a further 30 days of inactivity; please remove the stale label in order to prevent this occurring.

@github-actions github-actions bot added the stale Issue or PR with long period of inactivity label Jul 31, 2023
@Zerpet Zerpet added never-stale Issue or PR marked to never go stale and removed stale Issue or PR with long period of inactivity labels Jul 31, 2023
@Zerpet
Collaborator

Zerpet commented Jul 31, 2023

Removing the stale label as this issue is legitimate. I recall some work around this last year or so; we'll have to dig into the history a bit to understand what changed and our motivation for the change.

@jonathandavis805

jonathandavis805 commented Jan 18, 2024

I'm running into this issue with cluster-operator v2.6.0 on an EKS cluster, version 1.28. I followed the docs and got this error: Failed to write PID file "/var/lib/rabbitmq/mnesia/rabbit@rabbitmq-dev-server-0.rabbitmq-dev-nodes.rabbitmq-cluster-dev.pid": permission denied

@jonathandavis805

What resolved this for me was the override from the "Using OpenShift" section of the docs:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  ...
spec:
  ...
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers: []
            securityContext: {}

@psarossy
Author

psarossy commented Apr 1, 2024

@jonathandavis805 That worked for me as well, but it only works because it removes some of the StatefulSet security settings :(

@mkuratczyk
Collaborator

If you have this problem, please investigate what changes are necessary and share here. You can pause reconciliation and make modifications to the STS for example.
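
For anyone trying this, pausing reconciliation is done with a label on the RabbitmqCluster object. A minimal sketch, assuming the `rabbitmq.com/pauseReconciliation` label described in the operator documentation and the `hello-world` cluster name from the original report (merge the label into your existing manifest rather than applying this fragment on its own):

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: hello-world                              # cluster name from the original report
  labels:
    rabbitmq.com/pauseReconciliation: "true"     # ask the operator to stop reconciling this cluster
```

While the label is set, manual edits to the StatefulSet should not be reverted by the operator, which makes it possible to experiment with security settings or init containers. Remove the label to resume reconciliation.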

@mlb5000

mlb5000 commented Jul 29, 2024

I have this same problem. Does anyone have a solution? With @jonathandavis805's override it won't even try to start up, as it gets stuck at chown:

chown: changing ownership of '/var/lib/rabbitmq/mnesia': Operation not permitted

@gizmotronic

You can work around this issue by adding an override to your RabbitmqCluster object:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  ...
spec:
  ...
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers: []
            initContainers:
              - command:
                  - sh
                  - '-c'
                  - >-
                    chown -R 999:999 /var/lib/rabbitmq/mnesia;
                    chmod g+ws /var/lib/rabbitmq/mnesia
                image: alpine:latest
                imagePullPolicy: IfNotPresent
                name: setup-fix-permissions
                securityContext:
                  runAsUser: 0
                volumeMounts:
                  - mountPath: /var/lib/rabbitmq/mnesia/
                    name: persistence

Change the uid and gid for the chown command as needed.

The operator will not accept overrides for statefulSet.spec.template that do not include a value for statefulSet.spec.template.spec.containers. You can use an empty array here without affecting the containers that the operator will define.
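
Combined with the reproduction manifest from the original report, the workaround could look like the sketch below. It only merges values that already appear in this thread (cluster name, storage class, UID/GID 999, the `persistence` volume name) and has not been verified against every operator version, so treat it as a starting point:

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: hello-world
spec:
  persistence:
    storageClassName: nvme-pool-ec62   # storage class from the original report; use your own
    storage: 20Gi
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers: []             # must be present; empty leaves the operator's containers as-is
            initContainers:
              - name: setup-fix-permissions
                image: alpine:latest
                imagePullPolicy: IfNotPresent
                securityContext:
                  runAsUser: 0         # runs as root so it can chown the root-owned volume
                command:
                  - sh
                  - '-c'
                  - >-
                    chown -R 999:999 /var/lib/rabbitmq/mnesia;
                    chmod g+ws /var/lib/rabbitmq/mnesia
                volumeMounts:
                  - name: persistence  # data volume name used in the workaround above
                    mountPath: /var/lib/rabbitmq/mnesia/
```

Because the init container runs as root, clusters that enforce restricted pod security policies may reject it; in that case the ownership has to be fixed on the storage side instead.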

7 participants