An open source system for automating deployment, scaling, and operations of applications.

Thursday, March 31, 2016

Kubernetes 1.2 and simplifying advanced networking with Ingress

Editor's note: This is the sixth post in a series of in-depth posts on what's new in Kubernetes 1.2. 
Ingress is currently in beta and under active development. 

In Kubernetes, Services and Pods have IPs only routable by the cluster network, by default. All traffic that ends up at an edge router is either dropped or forwarded elsewhere. In Kubernetes 1.2, we’ve made improvements to the Ingress object, to simplify allowing inbound connections to reach the cluster services. It can be configured to give services externally-reachable URLs, load balance traffic, terminate SSL, offer name based virtual hosting and lots more.

Ingress controllers 

Today, with containers or VMs, configuring a web server or load balancer is harder than it should be. Most web server configuration files are very similar. There are some applications that have weird little quirks that tend to throw a wrench in things, but for the most part, you can apply the same logic to them and achieve a desired result. In Kubernetes 1.2, the Ingress resource embodies this idea, and an Ingress controller is meant to handle all the quirks associated with a specific "class" of Ingress (be it a single instance of a load balancer, or a more complicated setup of frontends that provide GSLB, CDN, DDoS protection etc). An Ingress Controller is a daemon, deployed as a Kubernetes Pod, that watches the ApiServer's /ingresses endpoint for updates to the Ingress resource. Its job is to satisfy requests for ingress.

Your Kubernetes cluster must have exactly one Ingress controller that supports TLS for the following example to work. If you’re on a cloud-provider, first check the “kube-system” namespace for an Ingress controller RC. If there isn’t one, you can deploy the nginx controller, or write your own in < 100 lines of code.

Please take a minute to look over the known limitations of existing controllers (gce, nginx).

TLS termination and HTTP load-balancing 

Since the Ingress spans Services, it’s particularly suited for load balancing and centralized security configuration. If you’re familiar with the go programming language, Ingress is like net/http’s “Server” for your entire cluster. The following example shows you how to configure TLS termination. Load balancing is not optional when dealing with ingress traffic, so simply creating the object will configure a load balancer.

First create a test Service. We’ll run a simple echo server for this example so you know exactly what’s going on. The source is here.

$ kubectl run echoheaders --port=8080
$ kubectl expose deployment echoheaders --target-port=8080 

If you’re on a cloud-provider, make sure you can reach the Service from outside the cluster through its node port.
$ NODE_IP=$(kubectl get node `kubectl get po -l run=echoheaders 
--template '{{range .items}}{{.spec.nodeName}}{{end}}'` --template
'{{range $i, $n := .status.addresses}}{{if eq $n.type 
$ NODE_PORT=$(kubectl get svc echoheaders --template '{{range $i, $e 
:= .spec.ports}}{{$e.nodePort}}{{end}}')
This is a sanity check that things are working as expected. If the last step hangs, you might need a firewall rule.

Now lets create our TLS secret:

$ openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout 
/tmp/tls.key -out /tmp/tls.crt -subj "/CN=echoheaders/O=echoheaders"
$ echo "
apiVersion: v1
kind: Secret
 name: tls
 tls.crt: `base64 -w 0 /tmp/tls.crt`
 tls.key: `base64 -w 0 /tmp/tls.key`
" | kubectl create -f

And the Ingress:

$ echo "
apiVersion: extensions/v1beta1
kind: Ingress
 name: test
 - secretName: tls
   serviceName: echoheaders
   servicePort: 8080
" | kubectl create -f -

You should get a load balanced IP soon:

$ kubectl get ing
NAME      RULE      BACKEND            ADDRESS         AGE
test      -         echoheaders:8080   130.X.X.X     4m

And if you wait till the Ingress controller marks your backends as healthy, you should see requests to that IP on :80 getting redirected to :443 and terminated using the given TLS certificates.

$ curl 130.X.X.X
<head><title>301 Moved Permanently</title></head> <body bgcolor="white"> <center><h1>301 Moved Permanently</h1></center>

$ curl https://130.X.X.X -k CLIENT VALUES: client_address= command=GET real path=/

$ curl 130.X.X.X -Lk
CLIENT VALUES: client_address= command=GET real path=/

Future work 

You can read more about the Ingress API or controllers by following the links. The Ingress is still in beta, and we would love your input to grow it. You can contribute by writing controllers or evolving the API. All things related to the meaning of the word “ingress” are in scope, this includes DNS, different TLS modes, SNI, load balancing at layer 4, content caching, more algorithms, better health checks; the list goes on.

There are many ways to participate. If you’re particularly interested in Kubernetes and networking, you’ll be interested in:

And of course for more information about the project in general, go to

-- Prashanth Balasubramanian, Software Engineer

Wednesday, March 30, 2016

Using Spark and Zeppelin to process big data on Kubernetes 1.2

Editor's note: this is the fifth post in a series of in-depth posts on what's new in Kubernetes 1.2 

With big data usage growing exponentially, many Kubernetes customers have expressed interest in running Apache Spark on their Kubernetes clusters to take advantage of the portability and flexibility of containers. Fortunately, with Kubernetes 1.2, you can now have a platform that runs Spark and Zeppelin, and your other applications side-by-side.

Why Zeppelin? 

Apache Zeppelin is a web-based notebook that enables interactive data analytics. As one of its backends, Zeppelin connects to Spark. Zeppelin allows the user to interact with the Spark cluster in a simple way, without having to deal with a command-line interpreter or a Scala compiler.

Why Kubernetes? 

There are many ways to run Spark outside of Kubernetes:

  • You can run it using dedicated resources, in Standalone mode 
  • You can run it on a YARN cluster, co-resident with Hadoop and HDFS 
  • You can run it on a Mesos cluster alongside other Mesos applications 

So why would you run Spark on Kubernetes?

  • A single, unified interface to your cluster: Kubernetes can manage a broad range of workloads; no need to deal with YARN/HDFS for data processing and a separate container orchestrator for your other applications. 
  • Increased server utilization: share nodes between Spark and cloud-native applications. For example, you may have a streaming application running to feed a streaming Spark pipeline, or a nginx pod to serve web traffic — no need to statically partition nodes. 
  • Isolation between workloads: Kubernetes' Quality of Service mechanism allows you to safely co-schedule batch workloads like Spark on the same nodes as latency-sensitive servers. 

Launch Spark 

For this demo, we’ll be using Google Container Engine (GKE), but this should work anywhere you have installed a Kubernetes cluster. First, create a Container Engine cluster with storage-full scopes. These Google Cloud Platform scopes will allow the cluster to write to a private Google Cloud Storage Bucket (we’ll get to why you need that later): 
$ gcloud container clusters create spark --scopes storage-full 
--machine-type n1-standard-4
Note: We’re using n1-standard-4 (which are larger than the default node size) to demonstrate some features of Horizontal Pod Autoscaling. However, Spark works just fine on the default node size of n1-standard-1.

After the cluster’s created, you’re ready to launch Spark on Kubernetes using the config files in the Kubernetes GitHub repo:
$ git clone
$ kubectl create -f kubernetes/examples/spark
‘kubernetes/examples/spark’ is a directory, so this command tells kubectl to create all of the Kubernetes objects defined in all of the YAML files in that directory. You don’t have to clone the entire repository, but it makes the steps of this demo just a little easier.

The pods (especially Apache Zeppelin) are somewhat large, so may take some time for Docker to pull the images. Once everything is running, you should see something similar to the following:
$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
spark-master-controller-v4v4y   1/1       Running   0          21h
spark-worker-controller-7phix   1/1       Running   0          21h
spark-worker-controller-hq9l9   1/1       Running   0          21h
spark-worker-controller-vwei5   1/1       Running   0          21h
zeppelin-controller-t1njl       1/1       Running   0          21h
You can see that Kubernetes is running one instance of Zeppelin, one Spark master and three Spark workers.

Set up the Secure Proxy to Zeppelin 

Next you’ll set up a secure proxy from your local machine to Zeppelin, so you can access the Zeppelin instance running in the cluster from your machine. (Note: You’ll need to change this command to the actual Zeppelin pod that was created on your cluster.)
$ kubectl port-forward zeppelin-controller-t1njl 8080:8080
This establishes a secure link to the Kubernetes cluster and pod (zeppelin-controller-t1njl) and then forwards the port in question (8080) to local port 8080, which will allow you to use Zeppelin safely.

Now that I have Zeppelin up and running, what do I do with it? 

For our example, we’re going to show you how to build a simple movie recommendation model. This is based on the code on the Spark website, modified slightly to make it interesting for Kubernetes. 

Now that the secure proxy is up, visit http://localhost:8080/. You should see an intro page like this:

Click “Import note,” give it an arbitrary name (e.g. “Movies”), and click “Add from URL.” For a URL, enter:
Then click “Import Note.” This will give you a ready-made Zeppelin note for this demo. You should now have a “Movies” notebook (or whatever you named it). If you click that note, you should see a screen similar to this:

You can now click the Play button, near the top-right of the PySpark code block, and you’ll create a new, in-memory movie recommendation model! In the Spark application model, Zeppelin acts as a Spark Driver Program, interacting with the Spark cluster master to get its work done. In this case, the driver program that’s running in the Zeppelin pod fetches the data and sends it to the Spark master, which farms it out to the workers, which crunch out a movie recommendation model using the code from the driver. With a larger data set in Google Cloud Storage (GCS), it would be easy to pull the data from GCS as well. In the next section, we’ll talk about how to save your data to GCS.

Working with Google Cloud Storage (Optional) 

For this demo, we’ll be using Google Cloud Storage, which will let us store our model data beyond the life of a single pod. Spark for Kubernetes is built with the Google Cloud Storage connector built-in. As long as you can access your data from a virtual machine in the Google Container Engine project where your Kubernetes nodes are running, you can access your data with the GCS connector on the Spark image.

If you want, you can change the variables at the top of the note so that the example will actually save and restore a model for the movie recommendation engine — just point those variables at a GCS bucket that you have access to. If you want to create a GCS bucket, you can do something like this on the command line:
$ gsutil mb gs://my-spark-models
You’ll need to change this URI to something that is unique for you. This will create a bucket that you can use in the example above.

Note: Computing the model and saving it is much slower than computing the model and throwing it away. This is expected. However, if you plan to reuse a model, it’s faster to compute the model and save it and then restore it each time you want to use it, rather than throw away and recompute the model each time.

Using Horizontal Pod Autoscaling with Spark (Optional) 

Spark is somewhat elastic to workers coming and going, which means we have an opportunity: we can use use Kubernetes Horizontal Pod Autoscaling to scale-out the Spark worker pool automatically, setting a target CPU threshold for the workers and a minimum/maximum pool size. This obviates the need for having to configure the number of worker replicas manually.

Create the Autoscaler like this (note: if you didn’t change the machine type for the cluster, you probably want to limit the --max to something smaller): 
$ kubectl autoscale --min=1 --cpu-percent=80 --max=10 \
To see the full effect of autoscaling, wait for the replication controller to settle back to one replica. Use ‘kubectl get rc’ and wait for the “replicas” column on spark-worker-controller to fall back to 1.

The workload we ran before ran too quickly to be terribly interesting for HPA. To change the workload to actually run long enough to see autoscaling become active, change the “rank = 100” line in the code to “rank = 200.” After you hit play, the Spark worker pool should rapidly increase to 20 pods. It will take up to 5 minutes after the job completes before the worker pool falls back down to one replica.


In this article, we showed you how to run Spark and Zeppelin on Kubernetes, as well as how to use Google Cloud Storage to store your Spark model and how to use Horizontal Pod Autoscaling to dynamically size your Spark worker pool.

This is the first in a series of articles we’ll be publishing on how to run big data frameworks on Kubernetes — so stay tuned!

Please join our community and help us build the future of Kubernetes! There are many ways to participate. If you’re particularly interested in Kubernetes and big data, you’ll be interested in:
And of course for more information about the project in general, go to

 -- Zach Loafman, Software Engineer, Google

Tuesday, March 29, 2016

AppFormix: Helping Enterprises Operationalize Kubernetes

Today’s guest post is written Sumeet Singh, founder and CEO of AppFormix, a cloud infrastructure performance optimization service helping enterprise operators streamline their cloud operations on any OpenStack or Kubernetes cloud.

If you run clouds for a living, you’re well aware that the tools we've used since the client/server era for monitoring, analytics and optimization just don’t cut it when applied to the agile, dynamic and rapidly changing world of modern cloud infrastructure.

And, if you’re an operator of enterprise clouds, you know that implementing containers and container cluster management is all about giving your application developers a more agile, responsive and efficient cloud infrastructure. Applications are being rewritten and new ones developed – not for legacy environments where relatively static workloads are the norm, but for dynamic, scalable cloud environments. The dynamic nature of cloud native applications coupled with the shift to continuous deployment means that the demands placed by the applications on the infrastructure are constantly changing.

This shift necessitates infrastructure transparency and real-time monitoring and analytics. Without these key pieces, neither applications nor their underlying plumbing can deliver the low-latency user experience end users have come to expect.
AppFormix Architectural Review
From an operational standpoint, it is necessary to understand how applications are consuming infrastructure resources in order to maximize ROI and guarantee SLAs. AppFormix software empowers operators and developers to monitor, visualize, and control how physical resources are utilized by cloud workloads. 

At the center of the software, the AppFormix Data Platform provides a distributed analysis engine that performs configurable, real-time evaluation of in-depth, high-resolution metrics. On each host, the resource-efficient AppFormix Agent collects and evaluates multi-layer metrics from the hardware, virtualization layer, and up to the application. Intelligent agents offer sub-second response times that make it possible to detect and solve problems before they start to impact applications and users. The raw data is associated with the elements that comprise a cloud-native environment: applications, virtual machines, containers, hosts. The AppFormix Agent then publishes metrics and events to a Data Manager that stores and forwards the data to Analytics modules. Events are based on predefined or dynamic conditions set by users or infrastructure operators to make sure that SLAs and policies are being met.

Figure 1: Roll-up summary view of the Kubernetes cluster. Operators and Users can define their SLA policies and AppFormix provides with a real-time view of the health of all elements in the Kubernetes cluster. 

Figure 2: Real-Time visualization of telemetry from a Kubernetes node provides a quick overview of resource utilization on the host as well as resources consumed by the pods and containers. The user defined Labels make is easy to capture namespaces, and other metadata.
Additional subsystems are the Policy Controller and Analytics. The Policy Controller manages policies for resource monitoring, analysis, and control. It also provides role-based access control. The Analytics modules analyze metrics and events produced by Data Platform, enabling correlation across multiple elements to provide higher-level information to operators and developers. The Analytics modules may also configure policies in Policy Controller in response to conditions in the infrastructure.

AppFormix organizes elements of cloud infrastructure around hosts and instances (either containers or virtual machines), and logical groups of such elements. AppFormix integrates with cloud platforms using Adapter modules that discover the physical and virtual elements in the environment and configure those elements into the Policy Controller.

Integrating AppFormix with Kubernetes
Enterprises often run many environments located on- or off-prem, as well as running different compute technologies (VMs, containers, bare metal). The analytics platform we’ve developed at AppFormix gives Kubernetes users a single pane of glass from which to monitor and manage container clusters in private and hybrid environments.

The AppFormix Kubernetes Adapter leverages the REST-based APIs of Kubernetes to discover nodes, pods, containers, services, and replication controllers. With the relational information about each element, Kubernetes Adapter is able to represent all of these elements in our system. A pod is a group of containers. A service and a replication controller are both different types of pod groups. In addition, using the watch endpoint, Kubernetes Adapter stays aware of changes to the environment.

DevOps in the Enterprise with AppFormix
With AppFormix, developers and operators can work collaboratively to optimize applications and infrastructure. Users can access a self-service IT experience that delivers visibility into CPU, memory, storage, and network consumption by each layer of the stack: physical hardware, platform, and application software. 

  • Real-time multi-layer performance metrics - In real-time, developers can view multi-layer metrics that show container resource consumption in context of the physical node on which it executes. With this context, developers can determine if application performance is limited by the physical infrastructure, due to contention or resource exhaustion, or by application design.  
  • Proactive resource control - AppFormix Health Analytics provides policy-based actions in response to conditions in the cluster. For example, when resource consumption exceeds threshold on a worker node, Health Analytics can remove the node from the scheduling pool by invoking Kubernetes REST APIs. This dynamic control is driven by real-time monitoring at each node.
  • Capacity planning - Kubernetes will schedule workloads, but operators need to understand how the resources are being utilized. What resources have the most demand? How is demand trending over time? Operators can generate reports that provide necessary data for capacity planning.

As you can see, we’re working hard to give Kubernetes users a useful, performant toolset for both OpenStack and Kubernetes environments that allows operators to deliver self-service IT to their application developers. We’re excited to be partner contributing to the Kubernetes ecosystem and community.

-- Sumeet Singh, Founder and CEO, AppFormix

Building highly available applications using Kubernetes new multi-zone clusters (a.k.a. "Ubernetes Lite")

Editor's note: this is the third post in a series of in-depth posts on what's new in Kubernetes 1.2


One of the most frequently-requested features for Kubernetes is the ability to run applications across multiple zones. And with good reason — developers need to deploy applications across multiple domains, to improve availability in the advent of a single zone outage.

Kubernetes 1.2, released two weeks ago, adds support for running a single cluster across multiple failure zones (GCP calls them simply "zones," Amazon calls them "availability zones," here we'll refer to them as "zones"). This is the first step in a broader effort to allow federating multiple Kubernetes clusters together (sometimes referred to by the affectionate nickname "Ubernetes"). This initial version (referred to as "Ubernetes Lite") offers improved application availability by spreading applications across multiple zones within a single cloud provider.

Multi-zone clusters are deliberately simple, and by design, very easy to use — no Kubernetes API changes were required, and no application changes either. You simply deploy your existing Kubernetes application into a new-style multi-zone cluster, and your application automatically becomes resilient to zone failures.

Now into some details . . .  

Ubernetes Lite works by leveraging the Kubernetes platform’s extensibility through labels. Today, when nodes are started, labels are added to every node in the system. With Ubernetes Lite, the system has been extended to also add information about the zone it's being run in. With that, the scheduler can make intelligent decisions about placing application instances.

Specifically, the scheduler already spreads pods to minimize the impact of any single node failure. With Ubernetes Lite, via SelectorSpreadPriority, the scheduler will make a best-effort placement to spread across zones as well. We should note, if the zones in your cluster are heterogenous (e.g., different numbers of nodes or different types of nodes), you may not be able to achieve even spreading of your pods across zones. If desired, you can use homogenous zones (same number and types of nodes) to reduce the probability of unequal spreading.

This improved labeling also applies to storage. When persistent volumes are created, the PersistentVolumeLabel admission controller automatically adds zone labels to them. The scheduler (via the VolumeZonePredicate predicate) will then ensure that pods that claim a given volume are only placed into the same zone as that volume, as volumes cannot be attached across zones.


We're now going to walk through setting up and using a multi-zone cluster on both Google Compute Engine (GCE) and Amazon EC2 using the default kube-up script that ships with Kubernetes. Though we highlight GCE and EC2, this functionality is available in any Kubernetes 1.2 deployment where you can make changes during cluster setup. This functionality will also be available in Google Container Engine (GKE) shortly.

Bringing up your cluster 

Creating a multi-zone deployment for Kubernetes is the same as for a single-zone cluster, but you’ll need to pass an environment variable ("MULTIZONE”) to tell the cluster to manage multiple zones. We’ll start by creating a multi-zone-aware cluster on GCE and/or EC2.

KUBE_GCE_ZONE=us-central1-a NUM_NODES=3 bash
KUBE_AWS_ZONE=us-west-2a NUM_NODES=3 bash
At the end of this command, you will have brought up a cluster that is ready to manage nodes running in multiple zones. You’ll also have brought up NUM_NODES nodes and the cluster's control plane (i.e., the Kubernetes master), all in the zone specified by KUBE_{GCE,AWS}_ZONE. In a future iteration of Ubernetes Lite, we’ll support a HA control plane, where the master components are replicated across zones. Until then, the master will become unavailable if the zone where it is running fails. However, containers that are running in all zones will continue to run and be restarted by Kubelet if they fail, thus the application itself will tolerate such a zone failure.

Nodes are labeled 

To see the additional metadata added to the node, simply view all the labels for your cluster (the example here is on GCE):
$ kubectl get nodes --show-labels

NAME                     STATUS                     AGE       LABELS
kubernetes-master        Ready,SchedulingDisabled   6m,failure-domain.beta.kubernetes.
kubernetes-minion-87j9   Ready                      6m,failure-domain.beta.kubernetes.
kubernetes-minion-9vlv   Ready                      6m,failure-domain.beta.kubernetes.
kubernetes-minion-a12q   Ready                      6m,failure-domain.beta.kubernetes.
The scheduler will use the labels attached to each of the nodes ( for the region, and for the zone) in its scheduling decisions.

Add more nodes in a second zone 

Let's add another set of nodes to the existing cluster, but running in a different zone (us-central1-b for GCE, us-west-2b for EC2). We run kube-up again, but by specifying KUBE_USE_EXISTING_MASTER=1 kube-up will not create a new master, but will reuse one that was previously created.

KUBE_GCE_ZONE=us-central1-b NUM_NODES=3 kubernetes/cluster/
On EC2, we also need to specify the network CIDR for the additional subnet, along with the master internal IP address:
MASTER_INTERNAL_IP= kubernetes/cluster/
View the nodes again; 3 more nodes will have been launched and labelled (the example here is on GCE):
$ kubectl get nodes --show-labels

NAME                     STATUS                     AGE       LABELS
kubernetes-master        Ready,SchedulingDisabled   16m,failure-domain.beta.kubernetes.
kubernetes-minion-281d   Ready                      2m,failure-domain.beta.kubernetes.
kubernetes-minion-87j9   Ready                      16m,failure-domain.beta.kubernetes.
kubernetes-minion-9vlv   Ready                      16m,failure-domain.beta.kubernetes.
kubernetes-minion-a12q   Ready                      17m,failure-domain.beta.kubernetes.
kubernetes-minion-pp2f   Ready                      2m,failure-domain.beta.kubernetes.
kubernetes-minion-wf8i   Ready                      2m,failure-domain.beta.kubernetes.
Let’s add one more zone:

KUBE_GCE_ZONE=us-central1-f NUM_NODES=3 kubernetes/cluster/
MASTER_INTERNAL_IP= kubernetes/cluster/
Verify that you now have nodes in 3 zones:
kubectl get nodes --show-labels
Highly available apps, here we come.

Deploying a multi-zone application 

Create the guestbook-go example, which includes a ReplicationController of size 3, running a simple web app. Download all the files from here, and execute the following command (the command assumes you downloaded them to a directory named “guestbook-go”:
kubectl create -f guestbook-go/
You’re done! Your application is now spread across all 3 zones. Prove it to yourself with the following commands:
$  kubectl describe pod -l app=guestbook | grep Node
Node:       kubernetes-minion-9vlv/
Node:       kubernetes-minion-281d/
Node:       kubernetes-minion-olsh/

$ kubectl get node kubernetes-minion-9vlv kubernetes-minion-281d 
kubernetes-minion-olsh --show-labels
NAME                     STATUS    AGE       LABELS
kubernetes-minion-9vlv   Ready     34m,failure-domain.beta.kubernetes.
kubernetes-minion-281d   Ready     20m,failure-domain.beta.kubernetes.
kubernetes-minion-olsh   Ready     3m,failure-domain.beta.kubernetes.
Further, load-balancers automatically span all zones in a cluster; the guestbook-go example includes an example load-balanced service:
$ kubectl describe service guestbook | grep LoadBalancer.Ingress
LoadBalancer Ingress:


$ curl -s http://${ip}:3000/env | grep HOSTNAME
  "HOSTNAME": "guestbook-44sep",

$ (for i in `seq 20`; do curl -s http://${ip}:3000/env | grep HOSTNAME; done)  
| sort | uniq
  "HOSTNAME": "guestbook-44sep",
  "HOSTNAME": "guestbook-hum5n",
  "HOSTNAME": "guestbook-ppm40",
The load balancer correctly targets all the pods, even though they’re in multiple zones.

Shutting down the cluster 

When you're done, clean up:

KUBE_GCE_ZONE=us-central1-f kubernetes/cluster/
KUBE_GCE_ZONE=us-central1-b kubernetes/cluster/


A core philosophy for Kubernetes is to abstract away the complexity of running highly available, distributed applications. As you can see here, other than a small amount of work at cluster spin-up time, all the complexity of launching application instances across multiple failure domains requires no additional work by application developers, as it should be. And we’re just getting started!

Please join our community and help us build the future of Kubernetes! There are many ways to participate. If you’re particularly interested in scalability, you’ll be interested in:

And of course for more information about the project in general, go to

 -- Quinton Hoole, Staff Software Engineer, Google, and Justin Santa Barbara

Monday, March 28, 2016

How container metadata changes your point of view

Today’s guest post is brought to you by Apurva Davé, VP of Marketing at Sysdig, who’ll discuss using Kubernetes metadata & Sysdig to understand what’s going on in your Kubernetes cluster. 

Sure, metadata is a fancy word. It actually means “data that describes other data.” While that definition isn’t all that helpful, it turns out metadata itself is especially helpful in container environments. When you have any complex system, the availability of metadata helps you sort and process the variety of data coming out of that system, so that you can get to the heart of an issue with less headache.

In a Kubernetes environment, metadata can be a crucial tool for organizing and understanding the way containers are orchestrated across your many services, machines, availability zones or (in the future) multiple clouds. This metadata can also be consumed by other services running on top of your Kubernetes system and can help you manage your applications.

We’ll take a look at some examples of this below, but first...

A quick intro to Kubernetes metadata 

Kubernetes metadata is abundant in the form of labels and annotations. Labels are designed to be identifying metadata for your infrastructure, whereas annotations are designed to be non-identifying. For both, they’re simply generic key:value pairs that look like this:
"labels": {
  "key1" : "value1",
  "key2" : "value2"
Labels are not designed to be unique; you can expect any number of objects in your environment to carry the same label, and you can expect that an object could have many labels.

What are some examples of labels you might use? Here are just a few. WARNING: Once you start, you might find more than a few ways to use this functionality!

  • Environment: Dev, Prod, Test, UAT 
  • Customer: Cust A, Cust B, Cust C 
  • Tier: Frontend, Backend 
  • App: Cache, Web, Database, Auth 

In addition to custom labels you might define, Kubernetes also automatically applies labels to your system with useful metadata. Default labels supply key identifying information about your entire Kubernetes hierarchy: Pods, Services, Replication Controllers,and Namespaces.

Putting your metadata to work 

Once you spend a little time with Kubernetes, you’ll see that labels have one particularly powerful application that makes them essential:

Kubernetes labels allows you to easily move between a “physical” view of your hosts and containers, and a “logical” view of your applications and micro-services. 

At its core, a platform like Kubernetes is designed to orchestrate the optimal use of underlying physical resources. This is a powerful way to consume private or public cloud resources very efficiently, and sometimes you need to visualize those physical resources. In reality, however, most of the time you care about the performance of the service first and foremost.

But in a Kubernetes world, achieving that high utilization means a service’s containers may be scattered all over the place! So how do you actually measure the service’s performance? That’s where the metadata comes in. With Kubernetes metadata, you can create a deep understanding of your service’s performance, regardless of where the underlying containers are physically located.

Paint me a picture 

Let’s look at a quick example to make this more concrete: monitoring your application. Let’s work with a small, 3 node deployment running on GKE. For visualizing the environment we’ll use Sysdig Cloud. Here’s a list of the the nodes — note the “gke” prepended to the name of each host. We see some basic performance details like CPU, memory and network.

Each of these hosts has a number of containers running on it. Drilling down on the hosts, we see the containers associated with each:

Simply scanning this list of containers on a single host, I don’t see much organization to the responsibilities of these objects. For example, some of these containers run Kubernetes services (like kube-ui) and we presume others have to do with the application running (like javaapp.x).

Now let’s use some of the metadata provided by Kubernetes to take an application-centric view of the system. Let’s start by creating a hierarchy of components based on labels, in this order:

Kubernetes namespace -> replication controller -> pod -> container

This aggregates containers at corresponding levels based on the above labels. In the app UI below, this aggregation and hierarchy are shown in the grey “grouping” bar above the data about our hosts. As you can see, we have a “prod” namespace with a group of services (replication controllers) below it. Each of those replication controllers can then consist of multiple pods, which are in turn made up of containers.

In addition to organizing containers via labels, this view also aggregates metrics across relevant containers, giving a singular view into the performance of a namespace or replication controller.

In other words, with this aggregated view based on metadata, you can now start by monitoring and troubleshooting services, and drill into hosts and containers only if needed. 

Let’s do one more thing with this environment — let’s use the metadata to create a visual representation of services and the topology of their communications. Here you see our containers organized by services, but also a map-like view that shows you how these services relate to each other.

The boxes represent services that are aggregates of containers (the number in the upper right of each box tells you how many containers), and the lines represent communications between services and their latencies.

This kind of view provides yet another logical, instead of physical, view of how these application components are working together. From here I can understand service performance, relationships and underlying resource consumption (CPU in this example).

Metadata: love it, use it 

This is a pretty quick tour of metadata, but I hope it inspires you to spend a little time thinking about the relevance to your own system and how you could leverage it. Here we built a pretty simple example — apps and services — but imagine collecting metadata across your apps, environments, software components and cloud providers. You could quickly assess performance differences across any slice of this infrastructure effectively, all while Kubernetes is efficiently scheduling resource usage.

Get started with metadata for visualizing these resources today, and in a followup post we’ll talk about the power of adaptive alerting based on metadata.

-- Apurva Davé is a closet Kubernetes fanatic, loves data, and oh yeah is also the VP of Marketing at Sysdig.

1000 nodes and beyond: updates to Kubernetes performance and scalability in 1.2

Editor's note: this is the first in a series of in-depth posts on what's new in Kubernetes 1.2

We're proud to announce that with the release of 1.2, Kubernetes now supports 1000-node clusters, with a reduction of 80% in 99th percentile tail latency for most API operations. This means in just six months, we've increased our overall scale by 10 times while maintaining a great user experience  the 99th percentile pod startup times are less than 3 seconds, and 99th percentile latency of most API operations is tens of milliseconds (the exception being LIST operations, which take hundreds of milliseconds in very large clusters).

Words are fine, but nothing speaks louder than a demo. Check this out!

In the above video, you saw the cluster scale up to 10 M queries per second (QPS) over 1,000 nodes, including a rolling update, with zero downtime and no impact to tail latency. That’s big enough to be one of the top 100 sites on the Internet!

In this blog post, we’ll cover the work we did to achieve this result, and discuss some of our future plans for scaling even higher.


We benchmark Kubernetes scalability against the following Service Level Objectives (SLOs):
  1. API responsiveness1: 99% of all API calls return in less than 1s 
  2. Pod startup time: 99% of pods and their containers (with pre-pulled images) start within 5s. 
We say Kubernetes scales to a certain number of nodes only if both of these SLOs are met. We continuously collect and report the measurements described above as part of the project test framework. This battery of tests breaks down into two parts: API responsiveness and Pod Startup Time.

API responsiveness for user-level abstractions2 

Kubernetes offers high-level abstractions for users to represent their applications. For example, the ReplicationController is an abstraction representing a collection of pods. Listing all ReplicationControllers or listing all pods from a given ReplicationController is a very common use case. On the other hand, there is little reason someone would want to list all pods in the system  for example, 30,000 pods (1000 nodes with 30 pods per node) represent ~150MB of data (~5kB/pod * 30k pods). So this test uses ReplicationControllers.

For this test (assuming N to be number of nodes in the cluster), we:
  1. Create roughly 3xN ReplicationControllers of different sizes (5, 30 and 250 replicas), which altogether have 30xN replicas. We spread their creation over time (i.e. we don’t start all of them at once) and wait until all of them are running. 

  2. Perform a few operations on every ReplicationController (scale it, list all its instances, etc.), spreading those over time, and measuring the latency of each operation. This is similar to what a real user might do in the course of normal cluster operation. 

  3. Stop and delete all ReplicationControllers in the system. 
For results of this test see the “Metrics for Kubernetes 1.2” section below.

For the v1.3 release, we plan to extend this test by also creating Services, Deployments, DaemonSets, and other API objects.

Pod startup end-to-end latency3 

Users are also very interested in how long it takes Kubernetes to schedule and start a pod. This is true not only upon initial creation, but also when a ReplicationController needs to create a replacement pod to take over from one whose node failed.

We (assuming N to be the number of nodes in the cluster):
  1. Create a single ReplicationController with 30xN replicas and wait until all of them are running. We are also running high-density tests, with 100xN replicas, but with fewer nodes in the cluster. 

  2. Launch a series of single-pod ReplicationControllers - one every 200ms. For each, we measure “total end-to-end startup time” (defined below). 

  3. Stop and delete all pods and replication controllers in the system. 
We define “total end-to-end startup time” as the time from the moment the client sends the API server a request to create a ReplicationController, to the moment when “running & ready” pod status is returned to the client via watch. That means that “pod startup time” includes the ReplicationController being created and in turn creating a pod, scheduler scheduling that pod, Kubernetes setting up intra-pod networking, starting containers, waiting until the pod is successfully responding to health-checks, and then finally waiting until the pod has reported its status back to the API server and then API server reported it via watch to the client.

While we could have decreased the “pod startup time” substantially by excluding for example waiting for report via watch, or creating pods directly rather than through ReplicationControllers, we believe that a broad definition that maps to the most realistic use cases is the best for real users to understand the performance they can expect from the system.

Metrics from Kubernetes 1.2 

So what was the result?We run our tests on Google Compute Engine, setting the size of the master VM based on on the size of the Kubernetes cluster. In particular for 1000-node clusters we use a n1-standard-32 VM for the master (32 cores, 120GB RAM).

API responsiveness 

The following two charts present a comparison of 99th percentile API call latencies for the Kubernetes 1.2 release and the 1.0 release on 100-node clusters. (Smaller bars are better)

We present results for LIST operations separately, since these latencies are significantly higher. Note that we slightly modified our tests in the meantime, so running current tests against v1.0 would result in higher latencies than they used to.

We also ran these tests against 1000-node clusters. Note: We did not support clusters larger than 100 on GKE, so we do not have metrics to compare these results to. However, customers have reported running on 1,000+ node clusters since Kubernetes 1.0.

Since LIST operations are significantly larger, we again present them separately: All latencies, in both cluster sizes, are well within our 1 second SLO.

Pod startup end-to-end latency 

The results for “pod startup latency” (as defined in the “Pod-Startup end-to-end latency” section) are presented in the following graph. For reference we are presenting also results from v1.0 for 100-node clusters in the first part of the graph.

As you can see, we substantially reduced tail latency in 100-node clusters, and now deliver low pod startup latency up to the largest cluster sizes we have measured. It is noteworthy that the metrics for 1000-node clusters, for both API latency and pod startup latency, are generally better than those reported for 100-node clusters just six months ago!

How did we make these improvements? 

To make these significant gains in scale and performance over the past six months, we made a number of improvements across the whole system. Some of the most important ones are listed below.

  • Created a “read cache” at the API server level 
    ( )

    Since most Kubernetes control logic operates on an ordered, consistent snapshot kept up-to-date by etcd watches (via the API server), a slight delay in that arrival of that data has no impact on the correct operation of the cluster. These independent controller loops, distributed by design for extensibility of the system, are happy to trade a bit of latency for an increase in overall throughput.

    In Kubernetes 1.2 we exploited this fact to improve performance and scalability by adding an API server read cache. With this change, the API server’s clients can read data from an in-memory cache in the API server instead of reading it from etcd. The cache is updated directly from etcd via watch in the background. Those clients that can tolerate latency in retrieving data (usually the lag of cache is on the order of tens of milliseconds) can be served entirely from cache, reducing the load on etcd and increasing the throughput of the server. This is a continuation of an optimization begun in v1.1, where we added support for serving watch directly from the API server instead of etcd:

  • Thanks to contributions from Wojciech Tyczynski at Google and Clayton Coleman and Timothy St. Clair at Red Hat, we were able to join careful system design with the unique advantages of etcd to improve the scalability and performance of Kubernetes. 
  • Introduce a “Pod Lifecycle Event Generator” (PLEG) in the Kubelet (

    Kubernetes 1.2 also improved density from a pods-per-node perspective  for v1.2 we test and advertise up to 100 pods on a single node (vs 30 pods in the 1.1 release). This improvement was possible because of diligent work by the Kubernetes community through an implementation of the Pod Lifecycle Event Generator (PLEG).

    The Kubelet (the Kubernetes node agent) has a worker thread per pod which is responsible for managing the pod’s lifecycle. In earlier releases each worker would periodically poll the underlying container runtime (Docker) to detect state changes, and perform any necessary actions to ensure the node’s state matched the desired state (e.g. by starting and stopping containers). As pod density increased, concurrent polling from each worker would overwhelm the Docker runtime, leading to serious reliability and performance issues (including additional CPU utilization which was one of the limiting factors for scaling up).

    To address this problem we introduced a new Kubelet subcomponent  the PLEG  to centralize state change detection and generate lifecycle events for the workers. With concurrent polling eliminated, we were able to lower the steady-state CPU usage of Kubelet and the container runtime by 4x. This also allowed us to adopt a shorter polling period, so as to detect and react to changes more quickly. 

  • Improved scheduler throughput Kubernetes community members from CoreOS (Hongchao Deng and Xiang Li) helped to dive deep into the Kubernetes scheduler and dramatically improve throughput without sacrificing accuracy or flexibility. They cut total time to schedule 30,000 pods by nearly 1400%! You can read a great blog post on how they approached the problem here: 

  • A more efficient JSON parser Go’s standard library includes a flexible and easy-to-use JSON parser that can encode and decode any Go struct using the reflection API. But that flexibility comes with a cost reflection allocates lots of small objects that have to be tracked and garbage collected by the runtime. Our profiling bore that out, showing that a large chunk of both client and server time was spent in serialization. Given that our types don’t change frequently, we suspected that a significant amount of reflection could be bypassed through code generation.

    After surveying the Go JSON landscape and conducting some initial tests, we found the ugorji codec library offered the most significant speedups - a 200% improvement in encoding and decoding JSON when using generated serializers, with a significant reduction in object allocations. After contributing fixes to the upstream library to deal with some of our complex structures, we switched Kubernetes and the go-etcd client library over. Along with some other important optimizations in the layers above and below JSON, we were able to slash the cost in CPU time of almost all API operations, especially reads. 

Kubernetes 1.3 and Beyond 

Of course, our job is not finished. We will continue to invest in improving Kubernetes performance, as we would like it to scale to many thousands of nodes, just like Google’s Borg. Thanks to our investment in testing infrastructure and our focus on how teams use containers in production, we have already identified the next steps on our path to improving scale. 

On deck for Kubernetes 1.3: 
  1.  Our main bottleneck is still the API server, which spends the majority of its time just marshaling and unmarshaling JSON objects. We plan to add support for protocol buffers to the API as an optional path for inter-component communication and for storing objects in etcd. Users will still be able to use JSON to communicate with the API server, but since the majority of Kubernetes communication is intra-cluster (API server to node, scheduler to API server, etc.) we expect a significant reduction in CPU and memory usage on the master. 

  2.  Kubernetes uses labels to identify sets of objects; For example, identifying which pods belong to a given ReplicationController requires iterating over all pods in a namespace and choosing those that match the controller’s label selector. The addition of an efficient indexer for labels that can take advantage of the existing API object cache will make it possible to quickly find the objects that match a label selector, making this common operation much faster. 

  3. Scheduling decisions are based on a number of different factors, including spreading pods based on requested resources, spreading pods with the same selectors (e.g. from the same Service, ReplicationController, Job, etc.), presence of needed container images on the node, etc. Those calculations, in particular selector spreading, have many opportunities for improvement  see for just one suggested change. 

  4. We are also excited about the upcoming etcd v3.0 release, which was designed with Kubernetes use case in mind  it will both improve performance and introduce new features. Contributors from CoreOS have already begun laying the groundwork for moving Kubernetes to etcd v3.0 (see 
While this list does not capture all the efforts around performance, we are optimistic we will achieve as big a performance gain as we saw going from Kubernetes 1.0 to 1.2. 


In the last six months we’ve significantly improved Kubernetes scalability, allowing v1.2 to run 1000-node clusters with the same excellent responsiveness (as measured by our SLOs) as we were previously achieving only on much smaller clusters. But that isn’t enough  we want to push Kubernetes even further and faster. Kubernetes v1.3 will improve the system’s scalability and responsiveness further, while continuing to add features that make it easier to build and run the most demanding container-based applications. 

Please join our community and help us build the future of Kubernetes! There are many ways to participate. If you’re particularly interested in scalability, you’ll be interested in: 
 And of course for more information about the project in general, go to

- Wojciech Tyczynski, Software Engineer, Google

1We exclude operations on “events” since these are more like system logs and are not required for the system to operate properly.
2This is test/e2e/load.go from the Kubernetes github repository.
3This is test/e2e/density.go test from the Kubernetes github repository 
4We are looking into optimizing this in the next release, but for now using a smaller master can result in significant (order of magnitude) performance degradation. We encourage anyone running benchmarking against Kubernetes or attempting to replicate these findings to use a similarly sized master, or performance will suffer.

Five Days of Kubernetes 1.2

The Kubernetes project has had some huge milestones over the past few weeks. We released Kubernetes 1.2, had our first conference in Europe, and were accepted into the Cloud Native Computing Foundation. While we catch our breath, we would like to take a moment to highlight some of the great work contributed by the community since our last milestone, just four months ago.

Our mission is to make building distributed systems easy and accessible for all. While Kubernetes 1.2 has LOTS of new features, there are a few that really highlight the strides we’re making towards that goal. Over the course of the next week, we’ll be publishing a series of in-depth posts covering what’s new, so come back daily this week to read about the new features that continue to make Kubernetes the easiest way to run containers at scale. Thanks, and stay tuned!

You can follow us on twitter here @Kubernetesio

--David Aronchick, Senior Product Manager for Kubernetes, Google