An open source system for automating deployment, scaling, and operations of applications.

Friday, March 31, 2017

Advanced Scheduling in Kubernetes

Editor’s note: this post is part of a series of in-depth articles on what's new in Kubernetes 1.6

The Kubernetes scheduler’s default behavior works well for most cases -- for example, it ensures that pods are only placed on nodes that have sufficient free resources, it tries to spread pods from the same set (ReplicaSet, StatefulSet, etc.) across nodes, and it tries to balance out the resource utilization of nodes.

But sometimes you want to control how your pods are scheduled. For example, perhaps you want to ensure that certain pods only schedule on nodes with specialized hardware, or you want to co-locate services that communicate frequently, or you want to dedicate a set of nodes to a particular set of users. Ultimately, you know much more about how your applications should be scheduled and deployed than Kubernetes ever will. So Kubernetes 1.6 offers four advanced scheduling features: node affinity/anti-affinity, taints and tolerations, pod affinity/anti-affinity, and custom schedulers. Each of these features is now in beta in Kubernetes 1.6.

Node Affinity/Anti-Affinity

Node Affinity/Anti-Affinity is one way to set rules on which nodes are selected by the scheduler. This feature is a generalization of the nodeSelector feature which has been in Kubernetes since version 1.0. The rules are defined using the familiar concepts of custom labels on nodes and selectors specified in pods, and they can be either required or preferred, depending on how strictly you want the scheduler to enforce them. 

Required rules must be met for a pod to schedule on a particular node. If no node matches the criteria (plus all of the other normal criteria, such as having enough free resources for the pod’s resource request), then the pod won’t be scheduled. Required rules are specified in the requiredDuringSchedulingIgnoredDuringExecution field of nodeAffinity.

For example, if we want to require scheduling on a node that is in the us-central1-a GCE zone of a multi-zone Kubernetes cluster, we can specify the following affinity rule as part of the Pod spec:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "failure-domain.beta.kubernetes.io/zone"
          operator: In
          values: ["us-central1-a"]

“IgnoredDuringExecution” means that the pod will still run if labels on a node change and affinity rules are no longer met. There are future plans to offer requiredDuringSchedulingRequiredDuringExecution, which will evict pods from nodes as soon as they don’t satisfy the node affinity rule(s).

Preferred rules mean that if nodes match the rules, they will be chosen first, and only if no preferred nodes are available will non-preferred nodes be chosen. You can prefer instead of require that pods are deployed to us-central1-a by slightly changing the pod spec to use preferredDuringSchedulingIgnoredDuringExecution:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1
      preference:
        matchExpressions:
        - key: "failure-domain.beta.kubernetes.io/zone"
          operator: In
          values: ["us-central1-a"]

Node anti-affinity can be achieved by using negative operators. So for instance if we want our pods to avoid us-central1-a we can do this:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "failure-domain.beta.kubernetes.io/zone"
          operator: NotIn
          values: ["us-central1-a"]

Valid operators you can use are In, NotIn, Exists, DoesNotExist, Gt, and Lt.

Additional use cases for this feature are to restrict scheduling based on nodes’ hardware architecture, operating system version, or specialized hardware. Node affinity/anti-affinity is beta in Kubernetes 1.6. 
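
For instance, a required rule that restricts a pod to Linux nodes on amd64 could look like the following sketch. It uses the beta.kubernetes.io/arch and beta.kubernetes.io/os labels that kubelets apply to nodes in this release; verify the label names and values on your own nodes before relying on them:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "beta.kubernetes.io/arch"
          operator: In
          values: ["amd64"]
        - key: "beta.kubernetes.io/os"
          operator: In
          values: ["linux"]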

Taints and Tolerations

A related feature is “taints and tolerations,” which allows you to mark (“taint”) a node so that no pods can schedule onto it unless a pod explicitly “tolerates” the taint. Marking nodes instead of pods (as in node affinity/anti-affinity) is particularly useful for situations where most pods in the cluster should avoid scheduling onto the node. For example, you might want to mark your master node as schedulable only by Kubernetes system components, or dedicate a set of nodes to a particular group of users, or keep regular pods away from nodes that have special hardware so as to leave room for pods that need the special hardware.

The kubectl command allows you to set taints on nodes, for example:

kubectl taint nodes node1 key=value:NoSchedule

creates a taint that marks the node as unschedulable by any pods that do not have a toleration for a taint with key key, value value, and effect NoSchedule. (The other taint effects are PreferNoSchedule, which is the preferred version of NoSchedule, and NoExecute, which means any pods that are running on the node when the taint is applied will be evicted unless they tolerate the taint.) The toleration you would add to a PodSpec to have the corresponding pod tolerate this taint would look like this:

tolerations:
- key: "key"
  operator: "Equal"
  value: "value"
  effect: "NoSchedule"
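
To remove the taint from the node later, kubectl uses the same key with a trailing minus, for example:

kubectl taint nodes node1 key:NoSchedule-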

In addition to moving taints and tolerations to beta in Kubernetes 1.6, we have introduced an alpha feature that uses taints and tolerations to allow you to customize how long a pod stays bound to a node when the node experiences a problem like a network partition instead of using the default five minutes. See this section of the documentation for more details.
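
For instance, a pod that should stay bound for ten minutes when its node becomes unreachable could carry a toleration like the following sketch. The taint key shown follows the 1.6 alpha naming for node problems and requires the relevant alpha feature to be enabled, so treat it as illustrative rather than definitive:

tolerations:
- key: "node.alpha.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 600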

Pod Affinity/Anti-Affinity

Node affinity/anti-affinity allows you to constrain which nodes a pod can run on based on the nodes’ labels. But what if you want to specify rules about how pods should be placed relative to one another, for example to spread or pack pods within a service or relative to pods in other services? For that you can use pod affinity/anti-affinity, which is also beta in Kubernetes 1.6.

Let’s look at an example. Say you have front-ends in service S1, and they communicate frequently with back-ends that are in service S2 (a “north-south” communication pattern). So you want these two services to be co-located in the same cloud provider zone, but you don’t want to have to choose the zone manually--if the zone fails, you want the pods to be rescheduled to another (single) zone. You can specify this with a pod affinity rule that looks like this (assuming you give the pods of this service a label “service=S2” and the pods of the other service a label “service=S1”):

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: service
          operator: In
          values: ["S1"]
      topologyKey: failure-domain.beta.kubernetes.io/zone

As with node affinity/anti-affinity, there is also a preferredDuringSchedulingIgnoredDuringExecution variant.

Pod affinity/anti-affinity is very flexible. Imagine you have profiled the performance of your services and found that containers from service S1 interfere with containers from service S2 when they share the same node, perhaps due to cache interference effects or saturating the network link. Or maybe due to security concerns you never want containers of S1 and S2 to share a node. To implement these rules, just make two changes to the snippet above -- change podAffinity to podAntiAffinity and change topologyKey to kubernetes.io/hostname.
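
Concretely, the anti-affinity version of the earlier snippet would look roughly like this sketch (keeping the same hypothetical service=S1 selector):

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: service
          operator: In
          values: ["S1"]
      topologyKey: kubernetes.io/hostname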

Custom Schedulers

If the Kubernetes scheduler’s various features don’t give you enough control over the scheduling of your workloads, you can delegate responsibility for scheduling arbitrary subsets of pods to your own custom scheduler(s) that run(s) alongside, or instead of, the default Kubernetes scheduler. Support for multiple schedulers is beta in Kubernetes 1.6.

Each new pod is normally scheduled by the default scheduler. But if you provide the name of your own custom scheduler, the default scheduler will ignore that Pod and allow your scheduler to schedule the Pod to a node. Let’s look at an example.

Here we have a Pod where we specify the schedulerName field:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  schedulerName: my-scheduler
  containers:
  - name: nginx
    image: nginx:1.10

If we create this Pod without deploying a custom scheduler, the default scheduler will ignore it and it will remain in a Pending state. So we need a custom scheduler that looks for, and schedules, pods whose schedulerName field is my-scheduler.

A custom scheduler can be written in any language and can be as simple or complex as you need. Here is a very simple example of a custom scheduler written in Bash that assigns a node randomly. Note that you need to run this along with kubectl proxy for it to work.

#!/bin/bash
SERVER='localhost:8001'
while true;
do
    # find pods that request my-scheduler and have not yet been assigned a node
    for PODNAME in $(kubectl --server $SERVER get pods -o json | jq '.items[] | select(.spec.schedulerName == "my-scheduler") | select(.spec.nodeName == null) | .metadata.name' | tr -d '"');
    do
        NODES=($(kubectl --server $SERVER get nodes -o json | jq '.items[].metadata.name' | tr -d '"'))
        NUMNODES=${#NODES[@]}
        CHOSEN=${NODES[$((RANDOM % NUMNODES))]}   # pick a node at random
        curl --header "Content-Type:application/json" --request POST --data '{"apiVersion":"v1", "kind": "Binding", "metadata": {"name": "'$PODNAME'"}, "target": {"apiVersion": "v1", "kind": "Node", "name": "'$CHOSEN'"}}' http://$SERVER/api/v1/namespaces/default/pods/$PODNAME/binding/
        echo "Assigned $PODNAME to $CHOSEN"
    done
    sleep 1
done

Learn more

The Kubernetes 1.6 release notes have more information about these features, including details about how to change your configurations if you are already using the alpha version of one or more of these features (this is required, as the move from alpha to beta is a breaking change for these features).


The features described here, both in their alpha and beta forms, were a true community effort, involving engineers from Google, Huawei, IBM, Red Hat and more.

Get Involved

  • Share your voice at our weekly community meeting
  • Post questions (or answer questions) on Stack Overflow 
  • Follow us on Twitter @Kubernetesio for latest updates
  • Connect with the community on Slack (room #sig-scheduling)

Many thanks for your contributions.

--Ian Lewis, Developer Advocate, and David Oppenheimer, Software Engineer, Google

Thursday, March 30, 2017

Scalability updates in Kubernetes 1.6: 5,000 node and 150,000 pod clusters

Editor’s note: this post is part of a series of in-depth articles on what's new in Kubernetes 1.6

Last summer we shared updates on Kubernetes scalability. Since then we’ve been working hard, and we’re proud to announce that Kubernetes 1.6 can handle 5,000-node clusters with up to 150,000 pods. Moreover, those clusters have even better end-to-end pod startup time than the previous 2,000-node clusters in the 1.3 release, and the latency of API calls is within the one-second SLO.

In this blog post we review what metrics we monitor in our tests and describe our performance results from Kubernetes 1.6. We also discuss what changes we made to achieve the improvements, and our plans for upcoming releases in the area of system scalability.

X-node clusters - what does it mean?

Now that Kubernetes 1.6 is released, it is a good time to review what it means when we say we “support” X-node clusters. As described in detail in a previous blog post, we currently have two performance-related Service Level Objectives (SLOs):
  • API-responsiveness: 99% of all API calls return in less than 1s
  • Pod startup time: 99% of pods and their containers (with pre-pulled images) start within 5s.
As before, it is possible to run larger deployments than the stated supported 5,000-node cluster (and users have), but performance may be degraded and the cluster may not meet the strict SLOs defined above.

We are aware of the limited scope of these SLOs. There are many aspects of the system that they do not exercise. For example, we do not measure how soon a new pod that is part of a service will be reachable through the service IP address after the pod is started. If you are considering using large Kubernetes clusters and have performance requirements not covered by our SLOs, please contact the Kubernetes Scalability SIG so we can help you understand whether Kubernetes is ready to handle your workload now.

The top scalability-related priority for upcoming Kubernetes releases is to enhance our definition of what it means to support X-node clusters by:
  • refining currently existing SLOs
  • adding more SLOs (that will cover various areas of Kubernetes, including networking)

Kubernetes 1.6 performance metrics at scale

So how does performance in large clusters look in Kubernetes 1.6? The following graph shows the end-to-end pod startup latency with 2000- and 5000-node clusters. For comparison, we also show the same metric from Kubernetes 1.3, which we published in our previous scalability blog post that described support for 2000-node clusters. As you can see, Kubernetes 1.6 has better pod startup latency with both 2000 and 5000 nodes compared to Kubernetes 1.3 with 2000 nodes [1].

The next graph shows API response latency for a 5000-node Kubernetes 1.6 cluster. The latencies at all percentiles are less than 500ms, and even the 90th percentile is around 100ms.

How did we get here?

Over the past nine months (since the last scalability blog post), there have been a huge number of performance and scalability related changes in Kubernetes. In this post we will focus on the two biggest ones and will briefly enumerate a few others.

etcd v3
In Kubernetes 1.6 we switched the default storage backend (the key-value store where the whole cluster state is stored) from etcd v2 to etcd v3. The initial work towards this transition started during the 1.3 release cycle. You might wonder why it took us so long, given that:
  • the first stable version of etcd supporting the v3 API was announced on June 30, 2016
  • the new API was designed together with the Kubernetes team to support our needs (from both a feature and scalability perspective)
  • the integration of etcd v3 with Kubernetes had already mostly been finished when etcd v3 was announced (indeed CoreOS used Kubernetes as a proof-of-concept for the new etcd v3 API)
As it turns out, there were a lot of reasons. We will describe the most important ones below.
  • Changing storage in a backward incompatible way, as is the case for the etcd v2 to v3 migration, is a big change, and thus one for which we needed a strong justification. We found this justification in September when we determined that we would not be able to scale to 5000-node clusters if we continued to use etcd v2 (kubernetes/32361 contains some discussion about it). In particular, what didn’t scale was the watch implementation in etcd v2. In a 5000-node cluster, we need to be able to send at least 500 watch events per second to a single watcher, which wasn’t possible in etcd v2.
  • Once we had the strong incentive to actually update to etcd v3, we started thoroughly testing it. As you might expect, we found some issues. There were some minor bugs in Kubernetes, and in addition we requested a performance improvement in etcd v3’s watch implementation (watch was the main bottleneck in etcd v2 for us). This led to the 3.0.10 etcd patch release.
  • Once those changes had been made, we were convinced that new Kubernetes clusters would work with etcd v3. But the large challenge of migrating existing clusters remained. For this we needed to automate the migration process, thoroughly test the underlying CoreOS etcd upgrade tool, and figure out a contingency plan for rolling back from v3 to v2.
But finally, we are confident that it should work.

Switching storage data format to protobuf
In the Kubernetes 1.3 release, we enabled protobufs as the data format for Kubernetes components to communicate with the API server (in addition to maintaining support for JSON). This gave us a huge performance improvement.

However, we were still using JSON as the format in which data was stored in etcd, even though technically we were ready to change that. The reason for delaying this migration was related to our plans to migrate to etcd v3. You are probably wondering how this change depended on the migration to etcd v3. The reason is that with etcd v2 we couldn’t really store data in binary format (to work around this we additionally base64-encoded the data), whereas with etcd v3 it just worked. So, to simplify the transition to etcd v3 and to avoid a non-trivial transformation of the data stored in etcd during it, we waited to switch the storage data format to protobuf until the migration to the etcd v3 storage backend was done.

Other optimizations
We made tens of optimizations throughout the Kubernetes codebase during the last three releases, including:
  • optimizing the scheduler (which resulted in 5-10x higher scheduling throughput)
  • switching all controllers to a new recommended design using shared informers, which reduced resource consumption of controller-manager - for reference see this document
  • optimizing individual operations in the API server (conversions, deep-copies, patch)
  • reducing memory allocation in the API server (which significantly impacts the latency of API calls)
We want to emphasize that the optimization work we have done during the last few releases, and indeed throughout the history of the project, is a joint effort by many different companies and individuals from the whole Kubernetes community.

What’s next?

People frequently ask how far we are going to go in improving Kubernetes scalability. Currently we do not have plans to increase scalability beyond 5000-node clusters (within our SLOs) in the next few releases. If you need clusters larger than 5000 nodes, we recommend using federation to aggregate multiple Kubernetes clusters.

However, that doesn’t mean we are going to stop working on scalability and performance. As we mentioned at the beginning of this post, our top priority is to refine our two existing SLOs and introduce new ones that will cover more parts of the system, e.g. networking. This effort has already started within the Scalability SIG. We have made significant progress on how we would like to define performance SLOs, and this work should be finished in the coming month.

Join the effort
If you are interested in scalability and performance, please join our community and help us shape Kubernetes; there are many ways to participate.
Thanks for the support and contributions! Read more in-depth posts on what's new in Kubernetes 1.6 here.

-- Wojciech Tyczynski, Software Engineer, Google

[1] We are investigating why 5000-node clusters have better startup time than 2000-node clusters. The current theory is that it is related to running the 5000-node experiments using a 64-core master and the 2000-node experiments using a 32-core master.

Wednesday, March 29, 2017

Dynamic Provisioning and Storage Classes in Kubernetes

Editor’s note: this post is part of a series of in-depth articles on what's new in Kubernetes 1.6

Storage is a critical part of running stateful containers, and Kubernetes offers powerful primitives for managing it. Dynamic volume provisioning, a feature unique to Kubernetes, allows storage volumes to be created on-demand. Before dynamic provisioning, cluster administrators had to manually make calls to their cloud or storage provider to provision new storage volumes, and then create PersistentVolume objects to represent them in Kubernetes. With dynamic provisioning, these two steps are automated, eliminating the need for cluster administrators to pre-provision storage. Instead, the storage resources can be dynamically provisioned using the provisioner specified by the StorageClass object (see user-guide). StorageClasses are essentially blueprints that abstract away the underlying storage provider, as well as other parameters, like disk type (e.g., solid-state vs. standard disks).

StorageClasses use provisioners that are specific to the storage platform or cloud provider to give Kubernetes access to the physical media being used. Several storage provisioners are provided in-tree (see user-guide), but additionally out-of-tree provisioners are now supported (see kubernetes-incubator).

In the Kubernetes 1.6 release, dynamic provisioning has been promoted to stable (having entered beta in 1.4). This is a big step forward in completing the Kubernetes storage automation vision, allowing cluster administrators to control how resources are provisioned and giving users the ability to focus more on their application. With all of these benefits, there are a few important user-facing changes (discussed below) that are important to understand before using Kubernetes 1.6.

Storage Classes and How to Use them

StorageClasses are the foundation of dynamic provisioning, allowing cluster administrators to define abstractions for the underlying storage platform. Users simply refer to a StorageClass by name in the PersistentVolumeClaim (PVC) using the “storageClassName” parameter.

In the following example, a PVC refers to a specific storage class named “gold”.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mypvc
  namespace: testns
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: gold

In order to promote the usage of dynamic provisioning, this feature permits the cluster administrator to specify a default StorageClass. When present, the user can create a PVC without specifying a storageClassName, further reducing the user’s responsibility to be aware of the underlying storage provider. When using default StorageClasses, there are some operational subtleties to be aware of when creating PersistentVolumeClaims (PVCs). This is particularly important if you already have existing PersistentVolumes (PVs) that you want to re-use:
  • PVs that are already “Bound” to PVCs will remain bound with the move to 1.6
    • They will not have a StorageClass associated with them unless the user manually adds it
    • If PVs become “Available” (i.e., if you delete a PVC and the corresponding PV is recycled), then they are subject to the rules below
  • If storageClassName is not specified in the PVC, the default storage class will be used for provisioning
    • Existing, “Available”, PVs that do not have the default storage class label will not be considered for binding to the PVC
  • If storageClassName is set to an empty string (‘’) in the PVC, no storage class will be used (i.e., dynamic provisioning is disabled for this PVC; see the example after this list)
    • Existing, “Available”, PVs (that do not have a specified storageClassName) will be considered for binding to the PVC
  • If storageClassName is set to a specific value, then the matching storage class will be used
    • Existing, “Available”, PVs that have a matching storageClassName will be considered for binding to the PVC
    • If no corresponding storage class exists, the PVC will fail.
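
As an illustration of the empty-string case above, a claim that opts out of dynamic provisioning entirely (and so will only bind to a pre-existing PV) might look like this sketch, reusing the mypvc example:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mypvc
  namespace: testns
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: ""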
To reduce the burden of setting up default StorageClasses in a cluster, beginning with 1.6, Kubernetes installs (via the add-on manager) default storage classes for several cloud providers. To use these default StorageClasses, users do not need to refer to them by name – that is, storageClassName need not be specified in the PVC.

Default storage classes, each backed by the corresponding provider-specific provisioner, are pre-installed for Amazon Web Services, Microsoft Azure, Google Cloud Platform, and VMware vSphere.

While these pre-installed default storage classes are chosen to be “reasonable” for most storage users, this guide provides instructions on how to specify your own default.

Dynamically Provisioned Volumes and the Reclaim Policy

All PVs have a reclaim policy associated with them that dictates what happens to a PV once it becomes released from a claim (see user-guide). Since the goal of dynamic provisioning is to completely automate the lifecycle of storage resources, the default reclaim policy for dynamically provisioned volumes is “delete”. This means that when a PersistentVolumeClaim (PVC) is released, the dynamically provisioned volume is de-provisioned (deleted) on the storage provider and the data is likely irretrievable. If this is not the desired behavior, the user must change the reclaim policy on the corresponding PersistentVolume (PV) object after the volume is provisioned.

How do I change the reclaim policy on a dynamically provisioned volume?
You can change the reclaim policy by editing the PV object and changing the “persistentVolumeReclaimPolicy” field to the desired value. For more information on various reclaim policies see user-guide.
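
For example, a sketch using kubectl patch (the PV name pvc-12345678 is hypothetical; substitute the name of your dynamically provisioned volume):

kubectl patch pv pvc-12345678 -p '{"spec": {"persistentVolumeReclaimPolicy": "Retain"}}'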


How do I use a default StorageClass?
If your cluster has a default StorageClass that meets your needs, then all you need to do is create a PersistentVolumeClaim (PVC) and the default provisioner will take care of the rest – there is no need to specify the storageClassName:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mypvc
  namespace: testns
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi

Can I add my own storage classes?
Yes. To add your own storage class, first determine which provisioners will work in your cluster. Then, create a StorageClass object with parameters customized to meet your needs (see user-guide for more detail). For many users, the easiest way to create the object is to write a YAML file and apply it with “kubectl create -f”. The following is an example of a StorageClass for Google Cloud Platform named “gold” that creates a “pd-ssd” persistent disk. Since multiple classes can exist within a cluster, the administrator may leave the default enabled for most workloads (since it uses a “pd-standard”), with the “gold” class reserved for workloads that need extra performance.

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gold
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd

How do I check if I have a default StorageClass Installed?
You can use kubectl to check for StorageClass objects. In the example below there are two storage classes: “gold” and “standard”. The “gold” class is user-defined, and the “standard” class is installed by Kubernetes and is the default.

$ kubectl get sc
NAME                 TYPE
gold                 kubernetes.io/gce-pd
standard (default)   kubernetes.io/gce-pd

$ kubectl describe storageclass standard
Name:            standard
IsDefaultClass:  Yes
Parameters:      type=pd-standard
Events:          <none>

Can I delete/turn off the default StorageClasses?
You cannot delete the default storage class objects provided. Since they are installed as cluster addons, they will be recreated if they are deleted.

You can, however, disable the defaulting behavior by removing (or setting to false) the following annotation: storageclass.beta.kubernetes.io/is-default-class.
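
For example, the following sketch turns off defaulting for a class named “standard” (the annotation shown is the beta one used by Kubernetes 1.6):

kubectl patch storageclass standard -p '{"metadata": {"annotations": {"storageclass.beta.kubernetes.io/is-default-class": "false"}}}'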

If there are no StorageClass objects marked with the default annotation, then PersistentVolumeClaim objects (without a StorageClass specified) will not trigger dynamic provisioning. They will, instead, fall back to the legacy behavior of binding to an available PersistentVolume object.

Can I assign my existing PVs to a particular StorageClass?
Yes, you can assign a StorageClass to an existing PV by editing the appropriate PV object and adding (or setting) the desired storageClassName field to it.
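
For example, a minimal sketch using kubectl patch (assuming an existing PV named pv0001 and the “gold” class defined earlier):

kubectl patch pv pv0001 -p '{"spec": {"storageClassName": "gold"}}'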

What happens if I delete a PersistentVolumeClaim (PVC)?
If the volume was dynamically provisioned, then the default reclaim policy is set to “delete”. This means that, by default, when the PVC is deleted, the underlying PV and storage asset will also be deleted. If you want to retain the data stored on the volume, then you must change the reclaim policy from “delete” to “retain” after the PV is provisioned.

--Saad Ali & Michelle Au, Software Engineers, and Matthew De Lio, Product Manager, Google

  • Post questions (or answer questions) on Stack Overflow
  • Join the community portal for advocates on K8sPort
  • Get involved with the Kubernetes project on GitHub
  • Follow us on Twitter @Kubernetesio for latest updates
  • Connect with the community on Slack
  • Download Kubernetes