Description
I've been researching how to get kubernetes to run a single container per OSD. It seems from #562 that there is interest in this feature. I'd like an update on what @leseb has been thinking, what the future plan is or might be, and whether there is anything I can start implementing to get there.
From my own research, I have found that shared PID namespaces are already enabled in k8s, with the caveat that Docker 1.13.1 or higher is required. To my knowledge, that Docker version is not yet qualified for use with k8s. The feature was enabled by this PR: kubernetes/kubernetes#45236
The TL;DR of the feature is that containers within a pod will automatically share the same PID namespace if Docker 1.13.1+ is used.
The problem is still that I don't see any way for kubernetes currently to do what @leseb has said and what I would also like to see.
In an ideal world, someone would ask k8s: deploy me a Ceph cluster on a specific set of labeled machines (storage nodes), take all of their disks, and use them to build the cluster.
The kubernetes v1.4 docs (https://kubernetes-v1-4.github.io/docs/user-guide/pods/multi-container/) state:
"Containers cannot be added or removed once the pod is created."
There is a thread discussing mutable containers within pods, but several comments there suggest that inclusion in kubernetes is unlikely: kubernetes/kubernetes#37838
To my current thinking, the only way to do this with kubernetes is to programmatically inspect each node and generate a kubernetes deployment file for that node containing one container per disk. For consistent management over time, including replacing failed OSDs, I believe /dev/disk/by-... paths are better than /dev/sdX names, which may move around over time. This doesn't seem like a great solution, but in practice it might be the best option.
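To make the idea concrete, here is a minimal sketch of what such a generator could look like, not a proposed implementation. It assumes the list of disks for a node has already been discovered somehow. The image name `example/ceph-osd:latest`, the `OSD_DEVICE` environment variable, the label names, and the `/dev/disk/by-id/` inputs in the usage part are all placeholders chosen for illustration only.

```python
import json
import re


def container_name(disk_path):
    """Derive a container name from a disk path (lowercase letters, digits, '-')."""
    base = disk_path.rstrip("/").split("/")[-1].lower()
    slug = re.sub(r"[^a-z0-9]+", "-", base).strip("-")
    # Kubernetes limits container names to 63 characters.
    return ("osd-" + slug)[:63].rstrip("-")


def osd_deployment(node, disks, image="example/ceph-osd:latest"):
    """Build a Deployment manifest for one node with one container per disk.

    The image and the OSD_DEVICE variable are hypothetical placeholders.
    """
    containers = [
        {
            "name": container_name(disk),
            "image": image,
            "env": [{"name": "OSD_DEVICE", "value": disk}],
            "securityContext": {"privileged": True},
            "volumeMounts": [{"name": "dev", "mountPath": "/dev"}],
        }
        for disk in disks
    ]
    labels = {"app": "ceph-osd", "node": node}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "ceph-osd-" + node, "labels": labels},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Pin the pod to the node whose disks were inspected.
                    "nodeSelector": {"kubernetes.io/hostname": node},
                    "containers": containers,
                    "volumes": [{"name": "dev", "hostPath": {"path": "/dev"}}],
                },
            },
        },
    }


if __name__ == "__main__":
    # Illustrative input only; real paths would come from inspecting the node.
    disks = ["/dev/disk/by-id/ata-DISK_A", "/dev/disk/by-id/ata-DISK_B"]
    print(json.dumps(osd_deployment("node1", disks), indent=2))
```

Because containers cannot be added to a running pod, replacing or adding a disk with this approach means regenerating the file and recreating the pod, which restarts every OSD on that node.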
The configmaps from the disk introspection method in #610 might also be a way to reach a better solution, but I haven't seen how to connect the dots yet.
Looking further ahead than is currently warranted: if this does turn out to be a good way forward, I think it would be beneficial to define a semi-standardized specification for the configmap file (including journal options for filestore and database options for bluestore), so that the configmap file could also be created manually, outside of kubernetes.
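As a rough illustration of what such a specification might contain, here is a sketch that models one entry per disk and renders the result as a ConfigMap manifest. Every field name (`device`, `objectstore`, `journal_device`, `db_device`), the `osds.json` key, and the ConfigMap name are my own assumptions; nothing here reflects what #610 actually produces.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class OsdSpec:
    """Hypothetical per-disk entry; all field names are illustrative."""
    device: str
    objectstore: str = "bluestore"          # "filestore" or "bluestore"
    journal_device: Optional[str] = None    # only meaningful for filestore
    db_device: Optional[str] = None         # only meaningful for bluestore

    def validate(self):
        if self.objectstore not in ("filestore", "bluestore"):
            raise ValueError("unknown objectstore: " + self.objectstore)
        if self.objectstore == "filestore" and self.db_device:
            raise ValueError("db_device only applies to bluestore")
        if self.objectstore == "bluestore" and self.journal_device:
            raise ValueError("journal_device only applies to filestore")


def osd_configmap(node, specs):
    """Render the validated specs as a ConfigMap manifest (a plain dict)."""
    for spec in specs:
        spec.validate()
    payload = json.dumps([asdict(s) for s in specs], indent=2)
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": "ceph-osd-disks-" + node},
        "data": {"osds.json": payload},
    }
```

A file in this shape could be written by hand or produced by an introspection step, and a generator for the per-node deployment could then read it instead of inspecting the node itself.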