An open source system for automating deployment, scaling, and operations of applications.

Monday, January 15, 2018

Core Workloads API GA

DaemonSet, Deployment, ReplicaSet, and StatefulSet are GA


Editor’s Note: We’re happy to announce that the Core Workloads API is GA in Kubernetes 1.9! This blog post from Kenneth Owens reviews how Core Workloads got to GA from its origins, reveals changes in 1.9, and talks about what you can expect going forward.

In the Beginning …

There were Pods, tightly coupled containers that share resource requirements, networking, storage, and a lifecycle. Pods were useful, but, as it turns out, users wanted to seamlessly, reproducibly, and automatically create many identical replicas of the same Pod, so we created ReplicationController.

Replication was a step forward, but what users really needed was higher level orchestration of their replicated Pods. They wanted rolling updates, roll backs, and roll overs. So the OpenShift team created DeploymentConfig. DeploymentConfigs were also useful, and OpenShift users were happy. In order to allow all OSS Kubernetes uses to share in the elation, and to take advantage of set-based label selectors, ReplicaSet and Deployment were added to the extensions/v1beta1 group version providing rolling updates, roll backs, and roll overs for all Kubernetes users.

That mostly solved the problem of orchestrating containerized 12 factor apps on Kubernetes, so the community turned its attention to a different problem. Replicating a Pod <n> times isn’t the right hammer for every nail in your cluster. Sometimes, you need to run a Pod on every Node, or on a subset of Nodes (for example, shared side cars like log shippers and metrics collectors, Kubernetes add-ons, and Distributed File Systems). The state of the art was Pods combined with NodeSelectors, or static Pods, but this is unwieldy. After having grown used to the ease of automation provided by Deployments, users demanded the same features for this category of application, so DaemonSet was added to extension/v1beta1 as well.

For a time, users were content, until they decided that Kubernetes needed to be able to orchestrate more than just 12 factor apps and cluster infrastructure. Whether your architecture is N-tier, service oriented, or micro-service oriented, your 12 factor apps depend on stateful workloads (for example, RDBMSs, distributed key value stores, and messaging queues) to provide services to end users and other applications. These stateful workloads can have availability and durability requirements that can only be achieved by distributed systems, and users were ready to use Kubernetes to orchestrate the entire stack.

While Deployments are great for stateless workloads, they don’t provide the right guarantees for the orchestration of distributed systems. These applications can require stable network identities, ordered, sequential deployment, updates, and deletion, and stable, durable storage. PetSet was added to the apps/v1beta1 group version to address this category of application. Unfortunately, we were less than thoughtful with its naming, and, as we always strive to be an inclusive community, we renamed the kind to StatefulSet.

Finally, we were done.



...Or were we?

Kubernetes 1.8 and apps/v1beta2

Pod, ReplicationController, ReplicaSet, Deployment, DaemonSet, and StatefulSet came to collectively be known as the core workloads API. We could finally orchestrate all of the things, but the API surface was spread across three groups, had many inconsistencies, and left users wondering about the stability of each of the core workloads kinds. It was time to stop adding new features and focus on consistency and stability.

Pod and ReplicationController were at GA stability, and even though you can run a workload in a Pod, it’s a nucleus primitive that belongs in core. As Deployments are the recommended way to manage your stateless apps, moving ReplicationController would serve no purpose. In Kubernetes 1.8, we moved all the other core workloads API kinds (Deployment, DaemonSet, ReplicaSet, and StatefulSet) to the apps/v1beta2 group version. This had the benefit of providing a better aggregation across the API surface, and allowing us to break backward compatibility to fix inconsistencies. Our plan was to promote this new surface to GA, wholesale and as is, when we were satisfied with its completeness. The modifications in this release, which are also implemented in apps/v1, are described below.

Selector Defaulting Deprecated

In prior versions of the apps and extensions groups, label selectors of the core workloads API kinds were, when left unspecified, defaulted to a label selector generated from the kind’s template’s labels.

This was completely incompatible with strategic merge patch and kubectl apply. Moreover, we’ve found that defaulting the value of a field from the value of another field of the same object is an anti-pattern, in general, and particularly dangerous for the API objects used to orchestrate workloads.

Immutable Selectors

Selector mutation, while allowing for some use cases like promotable Deployment canaries, is not handled gracefully by our workload controllers, and we have always strongly cautioned users against it. To provide a consistent, usable, and stable API, selectors were made immutable for all kinds in the workloads API.

We believe that there are better ways to support features like promotable canaries and orchestrated Pod relabeling, but, if restricted selector mutation is a necessary feature for our users, we can relax immutability in the future without breaking backward compatibility.

The development of features like promotable canaries, orchestrated Pod relabeling, and restricted selector mutability is driven by demand signals from our users. If you are currently modifying the selectors of your core workload API objects, please tell us about your use case via a GitHub issue, or by participating in SIG apps.

Default Rolling Updates

Prior to apps/v1beta2, some kinds defaulted their update strategy to something other than RollingUpdate (e.g. app/v1beta1/StatefulSet uses OnDelete by default). We wanted to be confident that RollingUpdate worked well prior to making it the default update strategy, and we couldn’t change the default behavior in released versions without breaking our promise with respect to backward compatibility. In apps/v1beta2 we enabled RollingUpdate for all core workloads kinds by default.

CreatedBy Annotation Deprecated

The "kubernetes.io/created-by" was a legacy hold over from the days before garbage collection. Users should use an object’s ControllerRef from its ownerReferences to determine object ownership. We deprecated this feature in 1.8 and removed it in 1.9.

Scale Subresources

A scale subresource was added to all of the applicable kinds in apps/v1beta2 (DaemonSet scales based on its node selector).

Kubernetes 1.9 and apps/v1

In Kubernetes 1.9, as planned, we promoted the entire core workloads API surface to GA in the apps/v1 group version. We made a few more changes to make the API consistent, but apps/v1 is mostly identical to apps/v1beta2. The reality is that most users have been treating the beta versions of the core workloads API as GA for some time now. Anyone who is still using ReplicationControllers and shying away from DaemonSets, Deployments, and StatefulSets, due to a perceived lack of stability, should plan migrate their workloads (where applicable) to apps/v1. The minor changes that were made during promotion are described below.

Garbage Collection Defaults to Delete

Prior to apps/v1 the default garbage collection policy for Pods in a DaemonSet, Deployment, ReplicaSet, or StatefulSet, was to orphan the Pods. That is, if you deleted one of these kinds, the Pods that they owned would not be deleted automatically unless cascading deletion was explicitly specified. If you use kubectl, you probably didn’t notice this, as these kinds are scaled to zero prior to deletion. In apps/v1 all core worloads API objects will now, by default, be deleted when their owner is deleted. For most users, this change is transparent.
Status Conditions

Prior to apps/v1 only Deployment and ReplicaSet had Conditions in their Status objects. For consistency's sake, either all of the objects or none of them should have conditions. After some debate, we decided that Conditions are useful, and we added Conditions to StatefulSetStatus and DaemonSetStatus. The StatefulSet and DaemonSet controllers currently don’t populate them, but we may choose communicate conditions to clients, via this mechanism, in the future.


Scale Subresource Migrated to autoscale/v1

We originally added a scale subresource to the apps group. This was the wrong direction for integration with the autoscaling, and, at some point, we would like to use custom metrics to autoscale StatefulSets. So the apps/v1 group version uses the autoscaling/v1 scale subresource.

Migration and Deprecation

The question most you’re probably asking now is, “What’s my migration path onto apps/v1 and how soon should I plan on migrating?” All of the group versions prior to apps/v1 are deprecated as of Kubernetes 1.9, and all new code should be developed against apps/v1, but, as discussed above, many of our users treat extensions/v1beta1 as if it were GA. We realize this, and the minimum support timelines in our deprecation policy are just that, minimums.

In future releases, before completely removing any of the group versions, we will disable them by default in the API Server. At this point, you will still be able to use the group version, but you will have to explicitly enable it. We will also provide utilities to upgrade the storage version of the API objects to apps/v1. Remember, all of the versions of the core workloads kinds are bidirectionally convertible. If you want to manually update your core workloads API objects now, you can use kubectl convert to convert manifests between group versions.

What’s Next?

The core workloads API surface is stable, but it’s still software, and software is never complete. We often add features to stable APIs to support new use cases, and we will likely do so for the core workloads API as well. GA stability means that any new features that we do add will be strictly backward compatible with the existing API surface. From this point forward, nothing we do will break our backwards compatibility guarantees. If you’re looking to participate in the evolution of this portion of the API, please feel free to get involved in GitHub or to participate in SIG Apps.

--Kenneth Owens, Software Engineer, Google

Friday, January 12, 2018

Introducing client-go version 6

The Kubernetes API server exposes a REST interface consumable by any client. client-go is the official client library for the Go programming language. It is used both internally by Kubernetes itself (for example, inside kubectl) as well as by numerous external consumers:operators like the etcd-operator or prometheus-operator; higher level frameworks like KubeLess and OpenShift; and many more.

The version 6 update to client-go adds support for Kubernetes 1.9, allowing access to the latest Kubernetes features. While the changelog contains all the gory details, this blog post highlights the most prominent changes and intends to guide on how to upgrade from version 5.

This blog post is one of a number of efforts to make client-go more accessible to third party consumers. Easier access is a joint effort by a number of people from numerous companies, all meeting in the #client-go-docs channel of the Kubernetes Slack. We are happy to hear feedback and ideas for further improvement, and of course appreciate anybody who wants to contribute.

API group changes

The following API group promotions are part of Kubernetes 1.9:
  • Workload objects (Deployments, DaemonSets, ReplicaSets, and StatefulSets) have been promoted to the apps/v1 API group in Kubernetes 1.9. client-go follows this transition and allows developers to use the latest version by importing the k8s.io/api/apps/v1 package instead of k8s.io/api/apps/v1beta1 and by using Clientset.AppsV1().
  • Admission Webhook Registration has been promoted to the admissionregistration.k8s.io/v1beta1 API group in Kubernetes 1.9. The former ExternalAdmissionHookConfiguration type has been replaced by the incompatible ValidatingWebhookConfiguration and MutatingWebhookConfiguration types. Moreover, the webhook admission payload type AdmissionReview in admission.k8s.io has been promoted to v1beta1. Note that versioned objects are now passed to webhooks. Refer to the admission webhook documentation for details.

Validation for CustomResources

In Kubernetes 1.8 we introduced CustomResourceDefinitions (CRD) pre-persistence schema validation as an alpha feature. With 1.9, the feature got promoted to beta and will be enabled by default. As a client-go user, you will find the API types at k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1beta1.

The OpenAPI v3 schema can be defined in the CRD spec as:

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata: ...
spec:
 ...
 validation:
   openAPIV3Schema:
     properties:
       spec:
         properties:
           version:
               type: string
               enum:
               - "v1.0.0"
               - "v1.0.1"
           replicas:
               type: integer
               minimum: 1
               maximum: 10

The schema in the above CRD applies following validations for the instance:
  1. spec.version must be a string and must be either “v1.0.0” or “v1.0.1”.
  2. spec.replicas must be an integer and must have a minimum value of 1 and a maximum value of 10.
A CustomResource with invalid values for spec.version (v1.0.2) and spec.replicas (15) will be rejected:

apiVersion: mygroup.example.com/v1
kind: App
metadata:
 name: example-app
spec:
 version: "v1.0.2"
 replicas: 15
$ kubectl create -f app.yaml
The App "example-app" is invalid: []: Invalid value: map[string]interface {}{"apiVersion":"mygroup.example.com/v1", "kind":"App", "metadata":map[string]interface {}{"creationTimestamp":"2017-08-31T20:52:54Z", "uid":"5c674651-8e8e-11e7-86ad-f0761cb232d1", "selfLink":"", "clusterName":"", "name":"example-app", "namespace":"default", "deletionTimestamp":interface {}(nil), "deletionGracePeriodSeconds":(*int64)(nil)}, "spec":map[string]interface {}{"replicas":15, "version":"v1.0.2"}}:
validation failure list:
spec.replicas in body should be less than or equal to 10
spec.version in body should be one of [v1.0.0 v1.0.1]

Note that with Admission Webhooks, Kubernetes 1.9 provides another beta feature to validate objects before they are created or updated. Starting with 1.9, these webhooks also allow mutation of objects (for example, to set defaults or to inject values). Of course, webhooks work with CRDs as well. Moreover, webhooks can be used to implement validations that are not easily expressible with CRD validation. Note that webhooks are harder to implement than CRD validation, so for many purposes, CRD validation is the right tool.

Creating namespaced informers

Often objects in one namespace or only with certain labels are to be processed in a controller. Informers now allow you to tweak the ListOptions used to query the API server to list and watch objects. Uninitialized objects (for consumption by initializers) can be made visible by setting IncludeUnitialized to true. All this can be done using the new NewFilteredSharedInformerFactory constructor for shared informers:
import “k8s.io/client-go/informers”
...
sharedInformers := informers.NewFilteredSharedInformerFactory(
client,
30*time.Minute,
“some-namespace”,
func(opt *metav1.ListOptions) {
opt.LabelSelector = “foo=bar”
},
)  

Note that the corresponding lister will only know about the objects matching the namespace and the given ListOptions. Note that the same restrictions apply for a List or Watch call on a client.

This production code example of a cert-manager demonstrates how namespace informers can be used in real code.

Polymorphic scale client

Historically, only types in the extensions API group would work with autogenerated Scale clients. Furthermore, different API groups use different Scale types for their /scale subresources. To remedy these issues, k8s.io/client-go/scale provides a polymorphic scale client to scale different resources in different API groups in a coherent way:

import (

apimeta "k8s.io/apimachinery/pkg/api/meta"
discocache "k8s.io/client-go/discovery/cached"
"k8s.io/client-go/discovery"
"k8s.io/client-go/dynamic"
“k8s.io/client-go/scale”
)
...
cachedDiscovery := discocache.NewMemCacheClient(client.Discovery())
restMapper := discovery.NewDeferredDiscoveryRESTMapper(
cachedDiscovery,
apimeta.InterfacesForUnstructured,
)
scaleKindResolver := scale.NewDiscoveryScaleKindResolver(
client.Discovery(),
)
scaleClient, err := scale.NewForConfig(
client, restMapper,
dynamic.LegacyAPIPathResolverFunc,
scaleKindResolver,
)
scale, err := scaleClient.Scales("default").Get(groupResource, "foo") 

The returned scale object is generic and is exposed as the autoscaling/v1.Scale object. It is backed by an internal Scale type, with conversions defined to and from all the special Scale types in the API groups supporting scaling. We planto extend this to CustomResources in 1.10.

If you’re implementing support for the scale subresource, we recommend that you expose the autoscaling/v1.Scale object.

Type-safe DeepCopy

Deeply copying an object formerly required a call to Scheme.Copy(Object) with the notable disadvantage of losing type safety. A typical piece of code from client-go version 5 required type casting:

newObj, err := runtime.NewScheme().Copy(node)

if err != nil {
   return fmt.Errorf("failed to copy node %v: %s”, node, err)
}

newNode, ok := newObj.(*v1.Node)
if !ok {
   return fmt.Errorf("failed to type-assert node %v", newObj)

}


Thanks to k8s.io/code-generator, Copy has now been replaced by a type-safe DeepCopy method living on each object, allowing you to simplify code significantly both in terms of volume and API error surface:

newNode := node.DeepCopy()

No error handling is necessary: this call never fails. If and only if the node is nil does DeepCopy() return nil.

To copy runtime.Objects there is an additional DeepCopyObject() method in the runtime.Object interface.

With the old method gone for good, clients need to update their copy invocations accordingly.

Code generation and CustomResources

Using client-go’s dynamic client to access CustomResources is discouraged and superseded by type-safe code using the generators in k8s.io/code-generator. Check out the Deep Dive on the Open Shift blog to learn about using code generation with client-go.

Comment Blocks

You can now place tags in the comment block just above a type or function, or in the second block above. There is no distinction anymore between these two comment blocks. This used to a be a source of subtle errors when using the generators:

// second block above
// +k8s:some-tag

// first block above
// +k8s:another-tag
type Foo struct {}

Custom Client Methods

You can now use extended tag definitions to create custom verbs . This lets you expand beyond the verbs defined by HTTP. This opens the door to higher levels of customization.

For example, this block leads to the generation of the method UpdateScale(s *autoscaling.Scale) (*autoscaling.Scale, error):

// genclient:method=UpdateScale,verb=update,subresource=scale,input=k8s.io/kubernetes/pkg/apis/autoscaling.Scale,result=k8s.io/kubernetes/pkg/apis/autoscaling.Scale

Resolving Golang Naming Conflicts

In more complex API groups it’s possible for Kinds, the group name, the Go package name, and the Go group alias name to conflict. This was not handled correctly prior to 1.9. The following tags resolve naming conflicts and make the generated code prettier:


// +groupName=example2.example.com
// +groupGoName=SecondExample

These are usually in the doc.go file of an API package. The first is used as the CustomResource group name when RESTfully speaking to the API server using HTTP. The second is used in the generated Golang code (for example, in the clientset) to access the group version:

clientset.SecondExampleV1()

It’s finally possible to have dots in Go package names. In this section’s example, you would put the groupName snippet into the pkg/apis/example2.example.com directory of your project.

Example projects

Kubernetes 1.9 includes a number of example projects which can serve as a blueprint for your own projects:

Vendoring

In order to update from the previous version 5 to version 6 of client-go, the library itself as well as certain third-party dependencies must be updated. Previously, this process had been tedious due to the fact that a lot of code got refactored or relocated within the existing package layout across releases. Fortunately, far less code had to move in the latest version, which should ease the upgrade procedure for most users.

State of the published repositories

In the past k8s.io/client-go, k8s.io/api, and k8s.io/apimachinery were updated infrequently. Tags (for example, v4.0.0) were created quite some time after the Kubernetes releases. With the 1.9 release we resumed running a nightly bot that updates all the repositories for public consumption, even before manual tagging. This includes the branches:
  • master
  • release-1.8 / release-5.0
  • release-1.9 / release-6.0 
Kubernetes tags (for example, v1.9.1-beta1) are also applied automatically to the published repositories, prefixed with kubernetes- (for example, kubernetes-1.9.1-beta1).

These tags have limited test coverage, but can be used by early adopters of client-go and the other libraries. Moreover, they help to vendor the correct version of k8s.io/api and k8s.io/apimachinery. Note that we only create a v6.0.3-like semantic versioning tag on k8s.io/client-go. The corresponding tag for k8s.io/api and k8s.io/apimachinery is kubernetes-1.9.3.

Also note that only these tags correspond to tested releases of Kubernetes. If you depend on the release branch, e.g., release-1.9, your client is running on unreleased Kubernetes code.

State of vendoring of client-go

In general, the list of which dependencies to vendor is automatically generated and written to the file Godeps/Godeps.json. Only the revisions listed there are tested. This means especially that we do not and cannot test the code-base against master branches of our dependencies. This puts us in the following situation depending on the used vendoring tool:
  • godep reads Godeps/Godeps.json by running godep restore from k8s.io/client-go in your GOPATH. Then use godep save to vendor in your project. godep will choose the correct versions from your GOPATH.
  • glide reads Godeps/Godeps.json automatically from its dependencies including from k8s.io/client-go, both on init and on update. Hence, glide should be mostly automatic as long as there are no conflicts.
  • dep does not currently respect Godeps/Godeps.json in a consistent way, especially not on updates. It is crucial to specify client-go dependencies manually as constraints or overrides, also for non k8s.io/* dependencies. Without those, dep simply chooses the dependency master branches, which can cause problems as they are updated frequently.
  • The Kubernetes and golang/dep community are aware of the problems [issue #1124, issue #1236] and are working together on solutions. Until then special care must be taken.
Please see client-go’s INSTALL.md for more details.

Updating dependencies – golang/dep

Even with the deficiencies of golang/dep today, dep is slowly becoming the de-facto standard in the Go ecosystem. With the necessary care and the awareness of the missing features, dep can be (and is!) used successfully. Here’s a demonstration of how to update a project with client-go 5 to the latest version 6 using dep:

(If you are still running client-go version 4 and want to play it safe by not skipping a release, now is a good time to check out this excellent blog post describing how to upgrade to version 5, put together by our friends at Heptio.)

Before starting, it is important to understand that client-go depends on two other Kubernetes projects: k8s.io/apimachinery and k8s.io/api. In addition, if you are using CRDs, you probably also depend on k8s.io/apiextensions-apiserver for the CRD client. The first exposes lower-level API mechanics (such as schemes, serialization, and type conversion), the second holds API definitions, and the third provides APIs related to CustomResourceDefinitions. In order for client-go to operate correctly, it needs to have its companion libraries vendored in correspondingly matching versions. Each library repository provides a branch named release-<version> where <version> refers to a particular Kubernetes version; for client-go version 6, it is imperative to refer to the release-1.9 branch on each repository.

Assuming the latest version 5 patch release of client-go being vendored through dep, the Gopkg.toml manifest file should look something like this (possibly using branches instead of versions):



[[constraint]]

 name = "k8s.io/api"
 version = "kubernetes-1.8.1"

[[constraint]]
 name = "k8s.io/apimachinery"
 version = "kubernetes-1.8.1"

[[constraint]]
 name = "k8s.io/apiextensions-apiserver"
 version = "kubernetes-1.8.1"

[[constraint]]
 name = "k8s.io/client-go"


 version = "5.0.1"

Note that some of the libraries could be missing if they are not actually needed by the client.

Upgrading to client-go version 6 means bumping the version and tag identifiers as following (emphasis given):




[constraint]]

 name = "k8s.io/api"
 version = "kubernetes-1.9.0"

[[constraint]]
 name = "k8s.io/apimachinery"
 version = "kubernetes-1.9.0"

[[constraint]]
 name = "k8s.io/apiextensions-apiserver"
 version = "kubernetes-1.9.0"

[[constraint]]
 name = "k8s.io/client-go"


 version = "6.0.0"


The result of the upgrade can be found here.

A note of caution: dep cannot capture the complete set of dependencies in a reliable and reproducible fashion as described above. This means that for a 100% future-proof project you have to add constraints (or even overrides) to many other packages listed in client-go’s Godeps/Godeps.json. Be prepared to add them if something breaks. We are working with the golang/dep community to make this an easier and more smooth experience.

Finally, we need to tell dep to upgrade to the specified versions by executing dep ensure. If everything goes well, the output of the command invocation should be empty, with the only indication that it was successful being a number of updated files inside the vendor folder.

If you are using CRDs, you probably also use code-generation. The following block for Gopkg.toml will add the required code-generation packages to your project:

required = [
 "k8s.io/code-generator/cmd/client-gen",
 "k8s.io/code-generator/cmd/conversion-gen",
 "k8s.io/code-generator/cmd/deepcopy-gen",
 "k8s.io/code-generator/cmd/defaulter-gen",
 "k8s.io/code-generator/cmd/informer-gen",
 "k8s.io/code-generator/cmd/lister-gen",
]

[[constraint]]
 branch = "kubernetes-1.9.0"

 name = "k8s.io/code-generator"


Whether you would also like to prune unneeded packages (such as test files) through dep or commit the changes into the VCS at this point is up to you -- but from an upgrade perspective, you should now be ready to harness all the fancy new features that Kubernetes 1.9 brings through client-go.

Thursday, January 11, 2018

Extensible Admission is Beta

In this post we review a feature, available in the Kubernetes API server, that allows you to implement arbitrary control decisions and which has matured considerably in Kubernetes 1.9.

The admission stage of API server processing is one of the most powerful tools for securing a Kubernetes cluster by restricting the objects that can be created, but it has always been limited to compiled code. In 1.9, we promoted webhooks for admission to beta, allowing you to leverage admission from outside the API server process.

What is Admission?

Admission is the phase of handling an API server request that happens before a resource is persisted, but after authorization. Admission gets access to the same information as authorization (user, URL, etc) and the complete body of an API request (for most requests).



The admission phase is composed of individual plugins, each of which are narrowly focused and have semantic knowledge of what they are inspecting. Examples include: PodNodeSelector (influences scheduling decisions), PodSecurityPolicy (prevents escalating containers), and ResourceQuota (enforces resource allocation per namespace).

Admission is split into two phases:
  1. Mutation, which allows modification of the body content itself as well as rejection of an API request.
  2. Validation, which allows introspection queries and rejection of an API request.
An admission plugin can be in both phases, but all mutation happens before validation.

Mutation

The mutation phase of admission allows modification of the resource content before it is persisted. Because the same field can be mutated multiple times while in the admission chain, the order of the admission plugins in the mutation matters.

One example of a mutating admission plugin is the `PodNodeSelector` plugin, which uses an annotation on a namespace `namespace.annotations[“scheduler.alpha.kubernetes.io/node-selector”]` to find a label selector and add it to the `pod.spec.nodeselector` field. This positively restricts which nodes the pods in a particular namespace can land on, as opposed to taints, which provide negative restriction (also with an admission plugin).

Validation

The validation phase of admission allows the enforcement of invariants on particular API resources. The validation phase runs after all mutators finish to ensure that the resource isn’t going to change again.

One example of a validation admission plugin is also the `PodNodeSelector` plugin, which ensures that all pods’ `spec.nodeSelector` fields are constrained by the node selector restrictions on the namespace. Even if a mutating admission plugin tries to change the `spec.nodeSelector` field after the PodNodeSelector runs in the mutating chain, the PodNodeSelector in the validating chain prevents the API resource from being created because it fails validation.

What are admission webhooks?

Admission webhooks allow a Kubernetes installer or a cluster-admin to add mutating and validating admission plugins to the admission chain of `kube-apiserver` as well as any extensions apiserver based on k8s.io/apiserver 1.9, like metrics, service-catalog, or kube-projects, without recompiling them. Both kinds of admission webhooks run at the end of their respective chains and have the same powers and limitations as compiled admission plugins.

What are they good for?

Webhook admission plugins allow for mutation and validation of any resource on any API server, so the possible applications are vast. Some common use-cases include:
  1. Mutation of resources like pods. Istio has talked about doing this to inject side-car containers into pods. You could also write a plugin which forcefully resolves image tags into image SHAs.
  2. Name restrictions. On multi-tenant systems, reserving namespaces has emerged as a use-case.
  3. Complex CustomResource validation. Because the entire object is visible, a clever admission plugin can perform complex validation on dependent fields (A requires B) and even external resources (compare to LimitRanges).
  4. Security response. If you forced image tags into image SHAs, you could write an admission plugin that prevents certain SHAs from running.

Registration

Webhook admission plugins of both types are registered in the API, and all API servers (kube-apiserver and all extension API servers) share a common config for them. During the registration process, a webhook admission plugin describes:

  1. How to connect to the webhook admission server
  2. How to verify the webhook admission server (Is it really the server I expect?)
  3. Where to send the data at that server (which URL path)
  4. Which resources and which HTTP verbs it will handle
  5. What an API server should do on connection failures (for example, if the admission webhook server goes down)
1 apiVersion: admissionregistration.k8s.io/v1beta1
2 kind: ValidatingWebhookConfiguration
3 metadata:
4   name: namespacereservations.admission.online.openshift.io
5 webhooks:
6 - name: namespacereservations.admission.online.openshift.io
7   clientConfig:
8     service:
9       namespace: default
10      name: kubernetes
11     path: /apis/admission.online.openshift.io/v1alpha1/namespacereservations
12    caBundle: KUBE_CA_HERE
13  rules:
14  - operations:
15    - CREATE
16    apiGroups:
17    - ""
18    apiVersions:
19    - "*"
20    resources:
21    - namespaces
22  failurePolicy: Fail

Line 6: `name` - the name for the webhook itself. For mutating webhooks, these are sorted to provide ordering.
Line 7: `clientConfig` - provides information about how to connect to, trust, and send data to the webhook admission server.
Line 13: `rules` - describe when an API server should call this admission plugin. In this case, only for creates of namespaces. You can specify any resource here so specifying creates of `serviceinstances.servicecatalog.k8s.io` is also legal.
Line 22: `failurePolicy` - says what to do if the webhook admission server is unavailable. Choices are “Ignore” (fail open) or “Fail” (fail closed). Failing open makes for unpredictable behavior for all clients.

Authentication and trust


Because webhook admission plugins have a lot of power (remember, they get to see the API resource content of any request sent to them and might modify them for mutating plugins), it is important to consider:
  • How individual API servers verify their connection to the webhook admission server 
  • How the webhook admission server authenticates precisely which API server is contacting it
  • Whether that particular API server has authorization to make the request
There are three major categories of connection:
  1. From kube-apiserver or extension-apiservers to externally hosted admission webhooks (webhooks not hosted in the cluster)
  2. From kube-apiserver to self-hosted admission webhooks
  3. From extension-apiservers to self-hosted admission webhooks
To support these categories, the webhook admission plugins accept a kubeconfig file which describes how to connect to individual servers. For interacting with externally hosted admission webhooks, there is really no alternative to configuring that file manually since the authentication/authorization and access paths are owned by the server you’re hooking to.

For the self-hosted category, a cleverly built webhook admission server and topology can take advantage of the safe defaulting built into the admission plugin and have a secure, portable, zero-config topology that works from any API server.

Simple, secure, portable, zero-config topology

If you build your webhook admission server to also be an extension API server, it becomes possible to aggregate it as a normal API server. This has a number of advantages:
  • Your webhook becomes available like any other API under default kube-apiserver service `kubernetes.default.svc` (e.g. https://kubernetes.default.svc/apis/admission.example.com/v1/mymutatingadmissionreviews). Among other benefits, you can test using `kubectl`.
  • Your webhook automatically (without any config) makes use of the in-cluster authentication and authorization provided by kube-apiserver. You can restrict access to your webhook with normal RBAC rules.
  • Your extension API servers and kube-apiserver automatically (without any config) make use of their in-cluster credentials to communicate with the webhook.
  • Extension API servers do not leak their service account token to your webhook because they go through kube-apiserver, which is a secure front proxy.


Source: https://drive.google.com/a/redhat.com/file/d/12nC9S2fWCbeX_P8nrmL6NgOSIha4HDNp

In short: a secure topology makes use of all security mechanisms of API server aggregation and additionally requires no additional configuration.

Other topologies are possible but require additional manual configuration as well as a lot of effort to create a secure setup, especially when extension API servers like service catalog come into play. The topology above is zero-config and portable to every Kubernetes cluster.

How do I write a webhook admission server?

Writing a full server complete with authentication and authorization can be intimidating. To make it easier, there are projects based on Kubernetes 1.9 that provide a library for building your webhook admission server in 200 lines or less. Take a look at the generic-admission-apiserver and the kubernetes-namespace-reservation projects for the library and an example of how to build your own secure and portable webhook admission server.

With the admission webhooks introduced in 1.9 we’ve made Kubernetes even more adaptable to your needs. We hope this work, driven by both Red Hat and Google, will enable many more workloads and support ecosystem components. (Istio is one example.) Now is a good time to give it a try!

If you’re interested in giving feedback or contributing to this area, join us in the SIG API machinery.