What happens when ... Kubernetes

Imagine I want to run nginx in a Kubernetes cluster and make it visible to the outside world. I can do this in a one-liner with kubectl:

kubectl run --image=nginx --replicas=3 --port=80 --expose

But what really happens when you run that command?

One of the beautiful things about Kubernetes is that it offers tremendous power while abstracting complexity through user-friendly interfaces. In order to understand this complexity (and therefore what value Kubernetes offers us), we need to follow the path of a request as it travels through a Kubernetes system. This repo seeks to explain that lifecycle.

This is a living document. Contributions that add, edit, or delete content where necessary are definitely welcome!

api reconciliation

The first thing that kubectl will do is perform some client-side validation. This ensures that requests that will always fail (e.g. creating an unsupported resource type or using a malformed image name) fail fast and are never sent to kube-apiserver.

After validation, kubectl then begins assembling the request it'll send to kube-apiserver. To do this it uses entities called "generators" to generate a Resource object and then serialize it into JSON.

What may not be obvious is that you can actually create multiple resource types with kubectl run, not just Deployments. To make that work, kubectl will infer the resource type if the generator name wasn't explicitly specified with the --generator flag. For example, commands run with --restart=Always are considered Deployments, and those with --restart=Never are considered Pods. kubectl will also figure out whether other actions need to be triggered, such as recording the command (for rollouts or auditing), or whether this command is just a dry run (--dry-run).

After realising that we want to create a Deployment, it will use the generator called DeploymentV1Beta1 to generate a runtime object from our provided parameters. It then uses this runtime object to find the appropriate API group and version for our Deployment (which are extensions and v1beta1 respectively), and then assembles a versioned client that is aware of the various REST semantics for the HTTP operation.

But how is kubectl aware of every possible API group? kubectl is pretty smart because it uses a "discovery" process to do this. Since kube-apiserver exposes its REST schema as an OpenAPI document at the /apis path, kubectl can just retrieve this JSON document and use it to learn what the API looks like. It also caches the OpenAPI document to disk in the ~/.kube/schema directory to improve performance (if you want to see this API discovery in action, try turning the -v flag up to the maximum!).

The final step is to actually send the HTTP request (since we did not specify a dry run). After the request succeeds, kubectl will print out a success message based on the desired output format.

client auth

One thing that we neglected to mention is client authentication, so let's look at that now.

In order to send the request successfully, the client needs to be able to authenticate. User credentials are almost always stored in the kubeconfig file which resides on disk. kubectl will try to auto-detect the correct path to the file by doing the following:

  • if --kubeconfig flag is provided, easy peasy, use that.
  • if the $KUBECONFIG environment variable is defined, use that.
  • otherwise look in a predictable directory like ~/.kube, and use the first file found.
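
As a rough illustration, that precedence could be sketched in Go like this (a simplified sketch, not the actual client-go implementation; the function and parameter names are mine):

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// resolveKubeconfig is a simplified sketch of the precedence kubectl applies:
// an explicit --kubeconfig flag wins, then $KUBECONFIG, then ~/.kube/config.
func resolveKubeconfig(kubeconfigFlag string) string {
	if kubeconfigFlag != "" {
		return kubeconfigFlag // --kubeconfig was provided: easy peasy, use that
	}
	if env := os.Getenv("KUBECONFIG"); env != "" {
		// KUBECONFIG may hold a list of paths; the first entry is enough here
		return filepath.SplitList(env)[0]
	}
	home, _ := os.UserHomeDir()
	return filepath.Join(home, ".kube", "config") // predictable default location
}

func main() {
	fmt.Println(resolveKubeconfig(""))
}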

After parsing the file, it then determines the current context to use, the current cluster to point to, and any auth information associated with the current user. If the user provided flag-specific values (such as --username) these take precedence and will override kubeconfig. Once it has this information, kubectl populates the client's configuration so that it is able to decorate the HTTP request appropriately:

  • x509 certificates are sent using tls.TLSConfig (this also includes the root CA)
  • bearer tokens are sent in the "Authorization" HTTP header
  • username and password are sent via HTTP basic authentication
  • the OpenID auth process is handled manually by the user beforehand, producing a token which is sent like a bearer token
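
To make this concrete, here is a minimal standard-library Go sketch of how a client might decorate its requests with these credential types (illustrative only, not client-go's actual code; file paths and function names are assumptions):

package clientauth

import (
	"crypto/tls"
	"crypto/x509"
	"net/http"
	"os"
)

// newAuthenticatedClient builds an HTTP client that trusts the cluster's root CA
// and presents a client x509 certificate, mirroring the tls.TLSConfig case above.
func newAuthenticatedClient(caPath, certPath, keyPath string) (*http.Client, error) {
	caPEM, err := os.ReadFile(caPath)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM) // root CA used to verify the apiserver

	cert, err := tls.LoadX509KeyPair(certPath, keyPath) // client x509 credentials
	if err != nil {
		return nil, err
	}
	tlsConfig := &tls.Config{RootCAs: pool, Certificates: []tls.Certificate{cert}}
	return &http.Client{Transport: &http.Transport{TLSClientConfig: tlsConfig}}, nil
}

// decorate shows the other two cases: bearer tokens (including OpenID tokens)
// travel in the Authorization header, username/password use HTTP basic auth.
func decorate(req *http.Request, token, user, pass string) {
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	} else if user != "" {
		req.SetBasicAuth(user, pass)
	}
}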

server auth

So our request has been sent, hooray! What next? This is where kube-apiserver enters the picture. In a nutshell, kube-apiserver is the primary interface that clients use to persist and retrieve cluster state. To do this well, it needs to be able to verify that the client is who they say they are; this is called authentication.

How does the apiserver authenticate requests? When the server first starts, it looks at all the CLI flags the user provided and assembles a list of suitable authenticators. Let's take an example: if a --client-ca-file has been passed in, it appends the x509 authenticator; if it sees --token-auth-file provided, it appends the token authenticator to the list. Every time a request is received, it is run through the authenticator chain until one succeeds:

  • the x509 handler will verify that the HTTP request is encoded with a TLS key signed by the CA root cert
  • the bearer token handler will verify that the provided token (specified in the HTTP Authorization header) exists in the provided file on disk
  • the password auth handler will similarly ensure that the HTTP request's basic auth credentials match its own local state.

If every authenticator fails, the request fails and an aggregate error is returned. If authentication succeeds, the Authorization header is removed from the request, and user information is added to its context. This provides future validators (such as authorization and admission controllers) the ability to access the previously established identity of the user.
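
Conceptually, the chain looks something like the following hand-rolled sketch (illustrative only; the real authenticators live in the apiserver codebase and have richer interfaces, so the type and function names here are mine):

package authn

import (
	"context"
	"errors"
	"net/http"
)

// UserInfo is a stand-in for the identity the apiserver attaches to a request.
type UserInfo struct {
	Name   string
	Groups []string
}

// Authenticator mirrors one handler in the chain: it either recognises the
// request (ok=true), passes (ok=false), or errors out.
type Authenticator interface {
	AuthenticateRequest(req *http.Request) (user *UserInfo, ok bool, err error)
}

type contextKey string

// unionAuth runs each authenticator in turn until one succeeds.
func unionAuth(authenticators []Authenticator, req *http.Request) (*http.Request, error) {
	var errs []error
	for _, a := range authenticators {
		user, ok, err := a.AuthenticateRequest(req)
		if err != nil {
			errs = append(errs, err)
			continue
		}
		if ok {
			// Success: strip the Authorization header and stash the user in the
			// request context for later stages (authorization, admission).
			req.Header.Del("Authorization")
			return req.WithContext(context.WithValue(req.Context(), contextKey("user"), user)), nil
		}
	}
	return nil, errors.Join(errs...) // aggregate error if nobody recognised the request
}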

authorization

Okay, the request has been sent, and kube-apiserver has successfully verified we are who we say we are. What a relief! However, we're not done yet. We may be who we say we are, but are we allowed to perform this action? Identity and permission are not the same thing, and in order for us to continue, the apiserver needs to authorize us.

The way kube-apiserver handles this is very similar to authentication: based on flag inputs, at start-up it will assemble a chain of authorizers that will be run for every incoming request. If all authorizers deny the request, the request results in a Forbidden response and goes no further down the chain. If a single authorizer approves, the request proceeds.

Some examples of authorizers that ship with v1.8 are:

  • AllowAll and DenyAll, which approve and deny all traffic respectively;
  • webhook, which interacts with an off-cluster HTTP(S) service;
  • ABAC, which enforces policies defined in a static file;
  • RBAC, which enforces RBAC roles that the administrator has added as Kubernetes resources;
  • Node, which ensures that node clients, i.e. the kubelet, can only access resources hosted on itself.
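
The union-of-authorizers behaviour can be sketched roughly like this (a toy model with made-up type names; the real authorizers return richer decisions and reasons):

package authz

// Decision is a stand-in for the three-way result an authorizer can return.
type Decision int

const (
	DecisionNoOpinion Decision = iota // this authorizer doesn't care; ask the next one
	DecisionAllow
	DecisionDeny
)

// Attributes is a stand-in for the request attributes (user, verb, resource, ...).
type Attributes struct {
	User     string
	Verb     string
	Resource string
}

type Authorizer interface {
	Authorize(a Attributes) Decision
}

// authorize walks the chain: an explicit allow or deny short-circuits,
// and if nobody approves the request is forbidden.
func authorize(chain []Authorizer, attrs Attributes) bool {
	for _, authz := range chain {
		switch authz.Authorize(attrs) {
		case DecisionAllow:
			return true
		case DecisionDeny:
			return false
		}
	}
	return false // no authorizer approved: 403 Forbidden
}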

[side point]: accessing state with informers

As you might have noticed, some authorization controllers like RBAC and Node are dynamic, in that they need to retrieve cluster state to function. To return to the example of the RBAC authorizer, we know that when a request comes in, the authenticator will save an initial representation of user state for later use. The RBAC authorizer will then use this to retrieve all the roles and role bindings that are associated with the user in etcd. How are controllers supposed to access and modify such resources? It turns out this is a common use case and is solved in Kubernetes with informers.

"A what?!" I hear you ask. An informer is a pattern that allows controllers to subscribe to storage events and easily list resources they're interested in. Apart from providing an abstraction which is nice to work with, it also handles a lot of the nuts and bolts such as caching. Caching is important because it reduces unnecessary kube-apiserver connections, and reduces duplicate serialization costs server- and controller-side. By formalising a model like this, it also allows controllers to interact in a threadsafe manner without having to worry about stepping on anybody else's toes.

In the case of the RBAC authorizer, it will not register any event handlers, but what it will do is use the informer to list a collection of roles and retrieve a specific resource in a consistent, supported way. Now that we know what informers are and the basics of how they're used, let's leave the rabbit hole and return to our main journey.
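
To get a feel for what this looks like from a controller's point of view, here is a rough client-go sketch that builds a shared informer and lists RBAC roles from its local cache (assumes an in-cluster config; exact package layouts vary slightly between client-go versions):

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	config, err := rest.InClusterConfig() // assumes we're running inside the cluster
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// A shared informer factory caches resources locally and resyncs periodically,
	// saving round trips to kube-apiserver.
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	roleInformer := factory.Rbac().V1().ClusterRoles()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, roleInformer.Informer().HasSynced)

	// Listing goes through the local cache, not the apiserver.
	roles, err := roleInformer.Lister().List(labels.Everything())
	if err != nil {
		panic(err)
	}
	fmt.Println("cluster roles in cache:", len(roles))
}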

admission controllers

Okay so we've authenticated and been authorized at this point, awesome sauce. So what's left? From kube-apiserver's perspective, it believes who we are and permits us to continue, but with Kubernetes, other parts of the system have strong opinions about what should and should not be permitted to happen. Cue admission controllers.

Whilst authorization is focused on answering whether a user is authorized to perform an action, admission control is focused on whether the wider system will permit the action. They are the last bastion of control before an object is persisted to etcd, so they encapsulate the remaining system checks to ensure an action does not produce unexpected or negative results.

The way admission controllers are initialized is very similar to authenticator and authorizer chains. To promote extensibility, they are stored as plugins in the plugin/pkg/admission directory, made to satisfy a small interface, and are compiled into kubernetes itself. Unlike other control chains we have mentioned, if a single admission controller fails, the whole chain is broken and the request will fail.

Admission controllers are usually categorised into resource management, security, defaulting, and referential consistency. Sometimes an admission controller will permit a request, but reconcile cluster state in accordance with its own policy (the NamespaceExists controller will create a namespace, for example). Commonly used admission controllers for resource management are:

  • InitialResources, which sets default resource limits for a container based on past usage;
  • LimitRanger, which sets defaults for container requests and limits, or enforces upper bounds on certain resources (no more than 2GB of memory, defaulting to 512MB; a toy version of this defaulting is sketched after the list);
  • ResourceQuota, which counts the number of objects (pods, ReplicationControllers, service load balancers) and the total resources consumed (CPU, memory, disk) in a namespace, and denies requests that would exceed the quota.
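
As a toy illustration of the LimitRanger-style defaulting mentioned above (not the real plugin interface, just the general idea, with made-up type names):

package admission

import "fmt"

// podSpec is a pared-down stand-in for the real PodSpec.
type podSpec struct {
	MemoryLimitMB int
}

// admit is a toy LimitRanger-style controller: default missing limits and
// reject anything above an upper bound. Real plugins implement the interfaces
// in plugin/pkg/admission and can mutate many other fields.
func admit(spec *podSpec) error {
	if spec.MemoryLimitMB == 0 {
		spec.MemoryLimitMB = 512 // default to 512MB
	}
	if spec.MemoryLimitMB > 2048 {
		return fmt.Errorf("memory limit %dMB exceeds the 2GB maximum", spec.MemoryLimitMB)
	}
	return nil // request may proceed down the chain
}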

the object is saved to etcd

By this point, Kubernetes has fully vetted the incoming request and has permitted it to go forth and prosper. The next step is for kube-apiserver to deserialize the request, construct the resource from it, and persist it to the datastore. Let's break that down a bit.

How does kube-apiserver know what to do when it accepts our request? Enter our old friend OpenAPI! As we mentioned earlier, all API operations are formalised into an OpenAPI spec, which lists the paths, JSON structures, and query parameters. This spec is generated into the pkg/generated/openapi package when Kubernetes is built and is used to populate the apiserver's config, which the apiserver then iterates over to install each API group into a chain of handlers.

  1. When the kube-apiserver binary is run, it creates a server chain, which allows apiserver aggregation
  2. When this happens, a generic apiserver is created that serves as a default implementation.
  3. The generic server then iterates over all the API groups and configures the storage provider
  4. For every API group it also iterates over each of the group versions and installs the REST mappings for each of the group version's routes.
  5. For our specific use case, a POST handler is registered, which in turn will delegate to a create resource handler.

After all this is set up, the server is in a position to respond. By the time a request comes in, this is what will happen:

  1. If the handler chain can match the request to a set pattern (i.e. to the routes we registered), it will dispatch the dedicated handler that was registered for the route. Otherwise it will use a path-based handler. If no paths are registered, a not found handler is invoked.
  2. Luckily for us, we have a registered route called createHandler! What does it do? Well it will first decode the HTTP request and perform basic validation, such as ensuring the JSON they provided correlates with our expectation of the versioned API resource.
  3. Auditing and final admission will occur.
  4. The resource will be saved to etcd by delegating to the storage provider. Usually the etcd key will be of the form <namespace>/<name>, but this is configurable (a rough sketch of this step follows the list).
  5. Any create errors are caught and finally the storage provider performs a get to ensure the object was created, then invokes any post-create handlers and decorators if additional finalization is required.
  6. The HTTP response is constructed and sent back.
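
A stripped-down sketch of that etcd write using the clientv3 API (the real storage provider also handles versioned codecs, watches and optimistic concurrency; the endpoint below and the use of plain JSON are assumptions for illustration):

package storage

import (
	"context"
	"encoding/json"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func storeDeployment(namespace, name string, obj interface{}) error {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"}, // assumed etcd endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return err
	}
	defer cli.Close()

	// The storage provider keys objects by a resource prefix plus <namespace>/<name>.
	key := "/registry/deployments/" + namespace + "/" + name
	data, err := json.Marshal(obj) // the real apiserver uses a versioned codec, not plain JSON
	if err != nil {
		return err
	}
	_, err = cli.Put(context.TODO(), key, string(data))
	return err
}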

initializers

After an object is persisted to the datastore, it is not made fully visible by the apiserver or scheduled until a series of "initializers" have run for the specific resource. If no initializers are set for the resource type, it is made visible immediately. An initializer is a controller that is associated with a resource type and performs logic on the resource before it's made available to the outside world.

As Ahmet Balkan notes in his great blog post, this allows Kubernetes to perform some cool bootstrap operations, like:

  • Inject a proxy sidecar container to the pod if it has port 80, or has a particular annotation.
  • Inject a volume with test certificates to all pods in the test namespace automatically.
  • If a Secret is shorter than 20 characters (probably a password), prevent its creation.

InitializerConfiguration objects allow you to declare which initializers should run for certain object types. For example, for every Deployment, ensure that MyDeploymentInitializer runs. This would mean that when a Deployment object is received, MyDeploymentInitializer is appended to the object's metadata.initializers.pending field. Each initializer runs sequentially and removes itself from the list when it's finished processing.

All throughout this process, the pod's status will be PodInitializing. When this bootstrapping finishes, the object is considered initialized and other controllers can continue the creation process.

One question which you might have asked is: how can a userland controller process resources if they're not made visible by the apiserver? This problem is solved by the includeUninitialized query parameter, which returns uninitialized objects.

deployment controller creates replicasets

By this stage, our Deployment record exists in etcd and any initialization logic has completed. But when we think about it, a Deployment is really just a collection of ReplicaSets, each of which is a set of Pods. How does Kubernetes go about creating this topology from one HTTP request? This is where Kubernetes' built-in controllers enter the stage.

Kubernetes makes strong use of "controllers", which are processes that run in the background to reconcile the actual state of the system with the desired state. Each controller has a small responsibility and is run in parallel by the kube-controller-manager component. So let's introduce the first controller of our journey, the Deployment controller.

After a deployment record is stored to etcd and initialized, it is made visible via kube-apiserver. When this new resource is available, it is detected by the Deployment controller, whose job it is to listen out for changes to deployment records. In our case, the controller registers a specific callback for create events via an informer.

This handler will be executed when our Deployment first becomes available and will add it to an internal work queue. By the time it gets around to processing our record, the controller will inspect our Deployment and realise that there are no ReplicaSet or Pod records associated with it. It does this by querying kube-apiserver with label selectors.
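
A minimal client-go sketch of this informer-plus-workqueue pattern might look like the following (the Deployment controller of this era watched extensions/v1beta1; the sketch uses the newer apps/v1 group and a plain queue for brevity, so treat it as illustrative):

package controllers

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

func watchDeployments(clientset kubernetes.Interface) {
	queue := workqueue.New() // the real controller uses a rate-limited work queue
	factory := informers.NewSharedInformerFactory(clientset, 0)

	factory.Apps().V1().Deployments().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			// Enqueue a namespace/name key; a worker goroutine picks it up later
			// and compares desired state against the ReplicaSets it owns.
			key, err := cache.MetaNamespaceKeyFunc(obj)
			if err == nil {
				queue.Add(key)
			}
			d := obj.(*appsv1.Deployment)
			fmt.Println("saw new deployment:", d.Name)
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	<-stop
}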

After realising none exist, it will begin a synchronization process to start resolving state. It does this by rolling out (i.e. creating) a ReplicaSet resource, assigning it a label selector, and giving it a revision number of 1. The ReplicaSet's PodSpec is copied from the Deployment's manifest, as well as other relevant metadata. Sometimes the Deployment record will need to be updated after this as well (for instance if the progress deadline is set).

The status is then updated and the controller enters a loop, waiting for the Deployment to complete. Since the Deployment controller is only concerned with creating ReplicaSets, reconciliation needs to be continued by the next controller, called the ReplicaSet controller (whose job is to create Pods).

replicaset controller creates pods

What other controllers come into play when using kubectl run? In the previous step, the Deployment controller created our Deployment's first ReplicaSet, but we still have no Pods. This is where the ReplicaSet controller comes into play! Its job is to monitor the lifecycle of ReplicaSets and their dependent resources (Pods). Like most other controllers, it does this by triggering handlers on certain events.

The event we're interested in is creation. When a ReplicaSet is created (courtesy of the Deployment controller), the RS controller inspects the desired state and realizes there is a skew between what exists and what is required. It then seeks to reconcile this state by bumping the number of pods that belong to the ReplicaSet. It starts creating them in a careful manner, ensuring that the ReplicaSet's burst count (which it inherited from its parent Deployment) is always matched.

Create operations for Pods are also batched, starting with SlowStartInitialBatchSize and doubling with each successful iteration in a kind of "slow start" operation. This aims to mitigate the risk of swamping kube-apiserver with unnecessary HTTP requests when there are numerous pod bootup failures (for example, due to resource quotas). If we're going to fail, we might as well fail gracefully with minimal impact on other system components!
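
The batching logic itself is simple enough to sketch in a few lines (a self-contained toy version, not the real implementation):

package main

import "fmt"

// slowStartBatch sketches the doubling behaviour: create pods in batches of
// 1, 2, 4, 8, ... and stop early if a batch reports failures, so a systemic
// problem only costs a handful of wasted API calls.
func slowStartBatch(count, initialBatchSize int, create func() error) int {
	successes := 0
	remaining := count
	for batch := initialBatchSize; remaining > 0; batch *= 2 {
		if batch > remaining {
			batch = remaining
		}
		failed := 0
		for i := 0; i < batch; i++ {
			if err := create(); err != nil {
				failed++
			} else {
				successes++
			}
		}
		remaining -= batch
		if failed > 0 {
			break // don't swamp the apiserver if pods are failing to create
		}
	}
	return successes
}

func main() {
	created := slowStartBatch(3, 1, func() error { return nil }) // e.g. our three nginx pods
	fmt.Println("created", created, "pods")
}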

As we've hinted at before, Kubernetes enforces object hierarchies through Owner References (a field in the child resource where it references the ID of its parent). Not only does this ensure that child resources are garbage-collected once a resource managed by the controller is deleted (cascading deletion), it also provides an effective way for parent resources to not fight over their children (imagine the scenario where two potential parents think they own the same child!).
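
Roughly speaking, the owner reference the Deployment controller stamps onto each ReplicaSet looks like this (a sketch; the helper name and field values are illustrative):

package owners

import (
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ownerRefFor builds the reference a ReplicaSet would carry back to its
// parent Deployment, enabling adoption checks and cascading deletion.
func ownerRefFor(d *appsv1.Deployment) metav1.OwnerReference {
	isController := true
	blockOwnerDeletion := true
	return metav1.OwnerReference{
		APIVersion:         "apps/v1",
		Kind:               "Deployment",
		Name:               d.Name,
		UID:                d.UID, // ties the child to this exact parent object
		Controller:         &isController,
		BlockOwnerDeletion: &blockOwnerDeletion,
	}
}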

Another subtle benefit of the Owner Reference design is that it's stateful: if any controller were to restart, that downtime would not affect the wider system since resource topology is independent of the controller's lifecycle. This focus on isolation also creeps into the design of controllers themselves: they should not operate on resources they don't explicitly own. Controllers should instead be selective in their ownership assertions, non-interfering, and non-sharing.

Anyway, back to owner references! Sometimes there are "orphaned" resources in the system which usually happens when:

  1. a parent is deleted but not its children
  2. garbage collection policies prohibit child deletion

When this occurs, controllers will ensure that orphans are adopted by a new parent. Multiple parents can race to adopt a child, but only one will be successful (the others will receive a validation error).

scheduler assigns node

By this point we have a Deployment, a ReplicaSet and three Pods. Our Pods, however, are stuck in a Pending state because they have not yet been scheduled to a Node. The final controller that accomplishes this is the scheduler.

The scheduler runs as a standalone component of the control plane and operates in the same way as other controllers: it listens out for events and attempts to reconcile state. In this case, it listens out for pods with an empty NodeName field in their PodSpec and attempts to find a suitable Node that the pod can reside on.

In order to find a suitable node, a specific scheduling algorithm is used. The way the default scheduling algorithm works is the following (a toy version is sketched after this list):

  1. when the scheduler starts, a chain of predicates are registered. These predicates are like functions that, when evaluated, determine the suitability of a Node to host a pod. For example, if the PodSpec explicitly requests CPU or RAM resources, and a Node cannot meet these requests due to lack of capacity, it will be deselected for the Pod (resource capacity is calculated as the total capacity minus the sum of the resource requests of currently running containers).

  2. once appropriate nodes have been selected, a series of priority functions are run against the remaining Nodes in order to rank their suitability. For example, in order to spread workloads across the system, it will favour nodes that have fewer resource requests than others (since this indicates fewer workloads running). As it runs these functions, it assigns each node a numerical rank. The highest ranked node is then selected for scheduling.
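
Here is a toy version of this two-phase predicate/priority flow, considering only CPU requests (nowhere near the breadth of the real scheduler; all names are made up):

package main

import (
	"fmt"
	"sort"
)

type node struct {
	Name                string
	AllocatableMilliCPU int64
	RequestedMilliCPU   int64 // sum of requests of pods already on the node
}

// predicate: can this node fit a pod that requests podMilliCPU?
func fitsResources(n node, podMilliCPU int64) bool {
	return n.AllocatableMilliCPU-n.RequestedMilliCPU >= podMilliCPU
}

// priority: emptier nodes score higher, spreading workloads across the cluster.
func leastRequestedScore(n node) int64 {
	free := n.AllocatableMilliCPU - n.RequestedMilliCPU
	return free * 10 / n.AllocatableMilliCPU // 0..10, like the real scoring range
}

func schedule(nodes []node, podMilliCPU int64) (string, bool) {
	var feasible []node
	for _, n := range nodes {
		if fitsResources(n, podMilliCPU) { // run the predicate chain
			feasible = append(feasible, n)
		}
	}
	if len(feasible) == 0 {
		return "", false // pod stays Pending
	}
	sort.Slice(feasible, func(i, j int) bool { // rank by priority function
		return leastRequestedScore(feasible[i]) > leastRequestedScore(feasible[j])
	})
	return feasible[0].Name, true
}

func main() {
	nodes := []node{
		{"node-1", 4000, 3500},
		{"node-2", 4000, 1000},
	}
	name, ok := schedule(nodes, 500)
	fmt.Println(name, ok) // node-2 true: it has fewer requests
}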

Once the algorithm finds a node, the scheduler then creates a Binding API object whose Name and UID match the Pod, and whose ObjectReference field contains the name of the selected node. This is then POSTed to the apiserver.

When the apiserver receives this Binding object, the registry deserializes the object and updates the following fields on the Pod object: it sets the NodeName to the one in the ObjectReference, it adds any relevant annotations, and it sets the PodScheduled status condition to True.
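
With client-go, the scheduler-side operation looks roughly like this (the Bind signature has changed across client-go versions, so treat this as a sketch; the function name is mine):

package scheduling

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// bindPod posts a Binding to the pod's binding subresource
// (/api/v1/namespaces/<ns>/pods/<name>/binding).
func bindPod(clientset kubernetes.Interface, namespace, podName, nodeName string) error {
	binding := &corev1.Binding{
		ObjectMeta: metav1.ObjectMeta{Namespace: namespace, Name: podName},
		Target:     corev1.ObjectReference{Kind: "Node", Name: nodeName},
	}
	return clientset.CoreV1().Pods(namespace).Bind(context.TODO(), binding, metav1.CreateOptions{})
}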

Customising the scheduler: what's interesting is that both predicate and priority functions are extensible and can be defined by using the --policy-config-file flag. This introduces a degree of flexibility. Administrators can also run custom schedulers (controllers with custom processing logic) in standalone Deployments. If a PodSpec contains schedulerName, Kubernetes will hand over scheduling for that pod to whatever scheduler has registered itself under that name.

kubelet begins pod sync

Okay, the main controller loop has finished, phew! Let's summarise: the HTTP request passed authentication, authorization, and admission control stages; a Deployment, ReplicaSet, and three Pod resources were persisted to etcd; a series of initializers ran; and each Pod was scheduled to a suitable node. So far, however, the state we've been talking about is purely in etcd. The next steps involve distributing state to worker nodes. The way this happens in Kubernetes is through a component called the kubelet. Let's begin!

The kubelet is an agent that runs on every node in a Kubernetes cluster and is responsible for the lifecycle of a Pod. This means it handles all of the translation logic between the abstraction of a "Pod" (which is just a Kubernetes concept) and its containers (the building blocks of a Pod). It also handles mounting volumes, container logging, garbage collection, and many more important things.

A useful way of thinking about the kubelet is, again, like a controller! It queries kube-apiserver every 20 seconds (this is configurable) for Pods whose NodeName matches the name of the node the kubelet is running on. Once it has that list, it detects new additions by comparing against its own internal cache and begins to synchronise state if any discrepancies exist.

What's interesting, however, is that the kubelet doesn't have a concept of "starting" a Pod, since as we've already mentioned, Pods aren't actually concrete things. Instead, it handles synchronization in the following way:

  1. if the pod is being created (ours is!), it registers some startup metrics that are used in Prometheus for tracking pod latency.
  2. generates a PodStatus API object, which represents the state of a Pod's current Phase. The Phase of a Pod is a high-level summary of where the Pod is in its lifecycle. Examples include Pending, Running, Succeeded, Failed and Unknown. Generating this state is quite complicated, so let's dive into exactly what happens (a toy version of the phase calculation is sketched after this list):
    • first, a chain of PodSyncHandlers is executed sequentially. Each handler checks whether the Pod should still reside on the node. If any of them decide that the Pod no longer belongs there, the Pod's phase will change to PodFailed and it will eventually be evicted from the Node. An example is evicting a Pod after its activeDeadlineSeconds has been exceeded (used during Jobs).
    • next, the Pod's Phase is determined by the status of its init and real containers. Since our containers have not been started yet, the containers are classed as waiting. Any pod with a waiting container is considered Pending, which is the case in our situation.
    • finally, the Pod condition is determined by the condition of its containers. Since none of our containers have been created by the container runtime yet, it will set the PodReady condition to False.
  3. After the PodStatus is generated, it will then be sent to the Pod's status manager, which is tasked with asynchronously updating the etcd record via the apiserver.
  4. Next, a series of admission handlers are run to ensure the pod has the correct security permissions to run. These include enforcing AppArmor profiles and NO_NEW_PRIVS. Pods denied at this stage will stay in the Pending state indefinitely.
  5. If the cgroups-per-qos runtime flag has been specified, the kubelet will create cgroups for the pod and apply resource parameters. This is to enable better Quality of Service (QoS) handling for pods.
  6. Data directories are created for the pod. These include the pod dir (usually /var/lib/kubelet/pods/<podID>), its volumes dir (<podDir>/volumes) and its plugins dir (<podDir>/plugins).
  7. The volume manager will attach and wait for any relevant volumes defined in Spec.Volumes. Depending on the type of volume being mounted, some pods will need to wait longer (e.g. cloud or NFS volumes).
  8. All secrets defined in Spec.ImagePullSecrets are retrieved from the apiserver so that they can later be injected into the container.
  9. The container runtime then runs the container (described in more detail next).
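
As a toy illustration of the phase calculation from step 2 (heavily simplified; the real kubelet also looks at init containers, restart policies and exit codes):

package main

import "fmt"

// containerState is a stand-in for the waiting/running/terminated state
// reported by the container runtime.
type containerState string

const (
	waiting    containerState = "waiting"
	running    containerState = "running"
	terminated containerState = "terminated"
)

// podPhase is a toy version of the phase calculation: any waiting container
// keeps the pod Pending, all running means Running, and so on.
func podPhase(containers []containerState) string {
	allTerminated := true
	for _, c := range containers {
		switch c {
		case waiting:
			return "Pending"
		case running:
			allTerminated = false
		}
	}
	if allTerminated && len(containers) > 0 {
		return "Succeeded" // the real kubelet also distinguishes Failed by exit codes
	}
	return "Running"
}

func main() {
	fmt.Println(podPhase([]containerState{waiting, waiting, waiting})) // Pending: nothing started yet
}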

CRI and pause containers

We're at the point now where most of the set-up is done and the container is ready to be launched. This step is similar to doing docker run, except it's handled by the kubelet in a much more abstracted way. The software that deploys the container itself is called the container runtime (Docker and rkt are examples).

In an effort to be more extensible, since Kubernetes 1.5 the kubelet has been using the Container Runtime Interface (CRI) for interacting with concrete container runtimes. CRI provides an intermediary abstraction between the kubelet and a specific runtime implementation, allowing them to communicate via protocol buffers (it's like an efficient JSON) and a gRPC API (a type of API well-suited to performing Kubernetes operations). By using a defined contract between kubelet and runtime, the implementation details become largely irrelevant because all that matters is the contract. This allows new runtimes to be added with minimal overhead since no core Kubernetes code needs to change, which is pretty cool!

Let's get back to it... When a pod is first started, the kubelet invokes the RunPodSandbox remote procedure call (RPC). A "sandbox" is a CRI term to describe a set of containers, which in Kubernetes parlance is a pod. The term is deliberately loose so it doesn't lose meaning for other runtimes that may not use containers (such as hypervisor-based runtimes, where a sandbox might represent a VM).
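
A trimmed-down Go rendering of the CRI calls used in this walkthrough might look like this (the real interface is a gRPC service generated from protobufs, with many more methods and fields; the struct shapes here are simplified assumptions):

package cri

// PodSandboxConfig carries pod-level settings (name, namespace, ports,
// DNS config, ...) that the runtime needs in order to create the sandbox.
type PodSandboxConfig struct {
	Name      string
	Namespace string
}

// ContainerConfig describes one workload container inside the sandbox.
type ContainerConfig struct {
	Image   string
	Command []string
}

// RuntimeService is a trimmed-down rendering of the CRI calls the kubelet
// makes in this walkthrough.
type RuntimeService interface {
	RunPodSandbox(config *PodSandboxConfig) (sandboxID string, err error)
	CreateContainer(sandboxID string, config *ContainerConfig) (containerID string, err error)
	StartContainer(containerID string) error
}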

In our case, we're using Docker. In this runtime, creating a sandbox involves creating a "pause" container. The pause container acts a lot like a parent container, since it hosts a lot of the pod-level resources that workload containers will end up using. Examples of these "resources" are Linux namespaces (IPC, network, PID). If you're not familiar with how containers work in Linux, let's take a quick refresher. The Linux kernel has the concept of a namespace, which allows the system to carve out a dedicated set of resources (CPU or memory, for example) and offer them to a process as if it's the only thing in the world using them. Cgroups are the way in which Linux governs resource allocation (it's kinda like a cop that polices things). Docker uses both of these kernel features to host a process that has guaranteed resources and enforced isolation. For more information, check out What even is a Container.

The pause container provides a way to host all of these namespaces and allow sibling containers to share them. By being part of the same network namespace, one end-user benefit we see is that containers in a pod can refer to one another using localhost. The second role of a pause container is related to how PID namespaces work. In these types of namespaces, processes form a hierarchical tree and the "init" process at the top takes responsibility for "reaping" dead processes. For more information on how this works, check out this blog post. After the pause container has been created, it is checkpointed to disk and started.

CNI and pod networking

Our pod now has its bare bones: a pause container which hosts all of the namespaces to allow inter-pod communication. But how does networking work and how is it set up?

When the kubelet sets up networking for a pod it delegates the task to a "CNI" plugin. CNI stands for Container Network Interface and operates in a similar way to the Container Runtime Interface. In a nutshell, CNI is an abstraction that allows different network providers to use different networking implementations for containers. Plugins are registered and the kubelet interacts with them by streaming JSON data (config files are located in /etc/cni/net.d) to the relevant CNI binary (located in /opt/cni/bin) via stdin. This is an example of the JSON configuration:

{
    "cniVersion": "0.3.1",
    "name": "bridge",
    "type": "bridge",
    "bridge": "cnio0",
    "isGateway": true,
    "ipMasq": true,
    "ipam": {
        "type": "host-local",
        "ranges": [
          [{"subnet": "${POD_CIDR}"}]
        ],
        "routes": [{"dst": "0.0.0.0/0"}]
    }
}

It also specifies additional metadata for the pod, such as its name and namespace, via the CNI_ARGS environment variable.
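
Roughly, shelling out to a CNI plugin looks like this (a sketch: the kubelet actually goes through the CNI library rather than exec'ing plugins by hand, and the paths and argument values here are illustrative):

package cni

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
)

// invokeCNI sketches a CNI "ADD" call: the network config is piped to the
// plugin binary on stdin and the pod-specific details travel in CNI_*
// environment variables.
func invokeCNI(netConfJSON []byte, containerID, netnsPath, podNamespace, podName string) ([]byte, error) {
	cmd := exec.Command("/opt/cni/bin/bridge")
	cmd.Stdin = bytes.NewReader(netConfJSON)
	cmd.Env = append(os.Environ(),
		"CNI_COMMAND=ADD",
		"CNI_CONTAINERID="+containerID,
		"CNI_NETNS="+netnsPath,
		"CNI_IFNAME=eth0",
		"CNI_PATH=/opt/cni/bin",
		fmt.Sprintf("CNI_ARGS=K8S_POD_NAMESPACE=%s;K8S_POD_NAME=%s", podNamespace, podName),
	)
	return cmd.Output() // the plugin prints a JSON result (IPs, routes, DNS) on stdout
}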

What happens next is dependent on the CNI plugin, but let's look at the bridge CNI plugin:

  1. The plugin will first set up a local Linux bridge in the root network namespace to serve all containers on that host
  2. It will then insert an interface (one end of a veth pair) into the pause container's network namespace and attach the other end to the bridge. The best way to think about a veth pair is like a tube: one side is connected to the container and the other side is in the root network namespace, allowing packets to travel in between.
  3. It should then assign an IP to the pause container's interface and set up the routes. This will result in the Pod having its own IP address. IP assignment is delegated to the IPAM providers specified in the JSON configuration.
    • IPAM plugins are similar to main network plugins: they are invoked via a binary and have a standardised interface. Each must determine the IP/subnet of the container's interface, along with the gateway and routes, and return this information back to the main plugin. The most common IPAM plugin is called host-local and allocates IP addresses out of a predefined set of address ranges. It stores the state locally on the host filesystem, therefore ensuring uniqueness of IP addresses on a single host.
  4. For DNS, the kubelet will specify the internal DNS server IP address to the CNI plugin, which will ensure that the container's resolv.conf file is set appropriately.

Once the process is complete, the plugin will return JSON data back to the kubelet indicating the result of the operation.

Inter-host networking

So far we've described how containers connect to the host, but how do hosts communicate? This obviously needs to happen if two Pods on different machines want to communicate.

This is usually accomplished using a concept called overlay networking, which is a way to dynamically synchronize routes across multiple hosts. One popular overlay network provider is Flannel. When installed, its core responsibility is to provide a layer-3 IPv4 network between multiple nodes in a cluster. Flannel does not control how containers are networked to the host (this is the job of CNI, remember), but rather how the traffic is transported between hosts. To do this, it selects a subnet for the host and registers it with etcd. It then keeps a local representation of the cluster routes and encapsulates outgoing packets in UDP datagrams, ensuring they reach the right host. For more information, check out CoreOS's documentation.

containers are started

Okay, phew. All the networking shenanigans are done and out of the way. What's left? Well, we need to actually start our workload containers.

Once the sandbox has finished initializing and is active, the kubelet can begin creating containers for it. It first starts any init containers defined in the PodSpec, and will then start the main containers themselves. The process for doing this is:

  1. Pull the image for the container. Any secrets that are defined in the PodSpec are used for private registries.
  2. Create the container via CRI. It does this by populating a ContainerConfig struct (in which the command, image, labels, mounts, devices, environment variables etc. are defined) from the parent PodSpec and then sending that via protobufs to the CRI plugin. For Docker, it deserializes the payload and populates its own config structures to send to the Daemon API. In the process it applies a few metadata labels (such as container type, log path, sandbox ID) to the container.
  3. It then registers the container with CPU manager, which is a new alpha feature in 1.8 that assigns containers to sets of CPUs on the local node by using the UpdateContainerResources CRI method.
  4. The container is then started.
  5. If any post-start container lifecycle hooks are registered, they are run. Hooks can either be of the type Exec (executes a specific command inside the container) or HTTP (performs an HTTP request against a container endpoint). If the PostStart hook takes too long to run, hangs, or fails, the container will never reach a running state.

After all this, we should have 3 containers running on one or more worker nodes. All of the networking, volumes and secrets have been set up by the kubelet, and the containers have been created via the CRI plugin.