An open source system for automating deployment, scaling, and operations of applications.

Monday, June 29, 2015

The Distributed System ToolKit: Patterns for Composite Containers

Having had the privilege of presenting some ideas from Kubernetes at DockerCon 2015, I thought I would make a blog post to share some of these ideas for those of you who couldn’t be there.

Over the past two years containers have become an increasingly popular way to package and deploy code. Container images solve many real-world problems with existing packaging and deployment tools, but in addition to these significant benefits, containers offer us an opportunity to fundamentally re-think the way we build distributed applications. Just as service oriented architectures (SOA) encouraged the decomposition of applications into modular, focused services, containers should encourage the further decomposition of these services into closely cooperating modular containers.  By virtue of establishing a boundary, containers enable users to build their services using modular, reusable components, and this in turn leads to services that are more reliable, more scalable and faster to build than applications built from monolithic containers.

In many ways the switch from VMs to containers is like the switch from monolithic programs of the 1970s and early 80s to modular object-oriented programs of the late 1980s and onward. The abstraction layer provided by the container image has a great deal in common with the abstraction boundary of the class in object-oriented programming, and it allows the same opportunities to improve developer productivity and application quality.  Just like the right way to code is the separation of concerns into modular objects, the right way to package applications in containers is the separation of concerns into modular containers.  Fundamentally  this means breaking up not just the overall application, but also the pieces within any one server into multiple modular containers that are easy to parameterize and re-use. In this way, just like the standard libraries that are ubiquitous in modern languages, most application developers can compose together modular containers that are written by others, and build their applications more quickly and with higher quality components.

The benefits of thinking in terms of modular containers are enormous, in particular, modular containers provide the following:
  • Speed application development, since containers can be re-used between teams and even larger communities
  • Codify expert knowledge, since everyone collaborates on a single containerized implementation that reflects best-practices rather than a myriad of different home-grown containers with roughly the same functionality
  • Enable agile teams, since the container boundary is a natural boundary and contract for team responsibilities
  • Provide separation of concerns and focus on specific functionality that reduces spaghetti dependencies and un-testable components

Building an application from modular containers means thinking about symbiotic groups of containers that cooperate to provide a service, not one container per service.  In Kubernetes, the embodiment of this modular container service is a Pod.  A Pod is a group of containers that share resources like file systems, kernel namespaces and an IP address.  The Pod is the atomic unit of scheduling in a Kubernetes cluster, precisely because the symbiotic nature of the containers in the Pod require that they be co-scheduled onto the same machine, and the only way to reliably achieve this is by making container groups atomic scheduling units.

When you start thinking in terms of Pods, there are naturally some general patterns of modular application development that re-occur multiple times.  I’m confident that as we move forward in the development of Kubernetes more of these patterns will be identified, but here are three that we see commonly:

Example #1: Sidecar containers

Sidecar containers extend and enhance the "main" container, they take existing containers and make them better.  As an example, consider a container that runs the Nginx web server.  Add a different container that syncs the file system with a git repository, share the file system between the containers and you have built Git push-to-deploy.  But you’ve done it in a modular manner where the git synchronizer can be built by a different team, and can be reused across many different web servers (Apache, Python, Tomcat, etc).  Because of this modularity, you only have to write and test your git synchronizer once and reuse it across numerous apps. And if someone else writes it, you don’t even need to do that.

Example #2: Ambassador containers

Ambassador containers proxy a local connection to the world.  As an example, consider a Redis cluster with read-replicas and a single write master.  You can create a Pod that groups your main application with a Redis ambassador container.  The ambassador is a proxy is responsible for splitting reads and writes and sending them on to the appropriate servers.  Because these two containers share a network namespace, they share an IP address and your application can open a connection on “localhost” and find the proxy without any service discovery.  As far as your main application is concerned, it is simply connecting to a Redis server on localhost.  This is powerful, not just because of separation of concerns and the fact that different teams can easily own the components, but also because in the development environment, you can simply skip the proxy and connect directly to a Redis server that is running on localhost.

Example #3: Adapter containers

Adapter containers standardize and normalize output.  Consider the task of monitoring N different applications.  Each application may be built with a different way of exporting monitoring data. (e.g. JMX, StatsD, application specific statistics) but every monitoring system expects a consistent and uniform data model for the monitoring data it collects.  By using the adapter pattern of composite containers, you can transform the heterogeneous monitoring data from different systems into a single unified representation by creating Pods that groups the application containers with adapters that know how to do the transformation.  Again because these Pods share namespaces and file systems, the coordination of these two containers is simple and straightforward.

In all of these cases, we've used the container boundary as an encapsulation/abstraction boundary that allows us to build modular, reusable components that we combine to build out applications.  This reuse enables us to more effectively share containers between different developers, reuse our code across multiple applications, and generally build more reliable, robust distributed systems more quickly.  I hope you’ve seen how Pods and composite container patterns can enable you to build robust distributed systems more quickly, and achieve container code re-use.  To try these patterns out yourself in your own applications. I encourage you to go check out open source Kubernetes or Google Container Engine.

- Brendan Burns, Software Engineer at Google

Thursday, June 25, 2015

Slides: Cluster Management with Kubernetes, talk given at the University of Edinburgh

On Friday 5 June 2015 I gave a talk called Cluster Management with Kubernetes to a general audience at the University of Edinburgh. The talk includes an example of a music store system with a Kibana front end UI and an Elasticsearch based back end which helps to make concrete concepts like pods, replication controllers and services. Please use the gears on the bottom row to select speaker notes. For a larger view of the slides please follow the link Cluster Management with Kubernetes.

Thursday, June 11, 2015

Cluster Level Logging with Kubernetes

A Kubernetes cluster will typically be humming along running many system and application pods. How does the system administrator collect, manage and query the logs of the system pods? How does a user query the logs of their application which is composed of many pods which may be restarted or automatically generated by the Kubernetes system? These questions are addressed by the Kubernetes cluster level logging services.

Cluster level logging for Kubernetes allows us to collect logs which persist beyond the lifetime of the pod’s container images or the lifetime of the pod or even cluster. In this article we assume that a Kubernetes cluster has been created with cluster level logging support for sending logs to Google Cloud Logging. This is an option when creating a Google Container Engine (GKE) cluster, and is enabled by default for the open source Google Compute Engine (GCE) Kubernetes distribution. After a cluster has been created you will have a collection of system pods running that support monitoring, logging and DNS resolution for names of Kubernetes services:

$ kubectl get pods
NAME                                           READY     REASON    RESTARTS   AGE
fluentd-cloud-logging-kubernetes-minion-0f64   1/1       Running   0          32m
fluentd-cloud-logging-kubernetes-minion-27gf   1/1       Running   0          32m
fluentd-cloud-logging-kubernetes-minion-pk22   1/1       Running   0          31m
fluentd-cloud-logging-kubernetes-minion-20ej   1/1       Running   0          31m
kube-dns-v3-pk22                               3/3       Running   0          32m

monitoring-heapster-v1-20ej                    0/1       Running   9          32m

Here is the same information in a picture which shows how the pods might be placed on specific nodes.

Here is a close up of what is running on each node.

The first diagram shows four nodes created on a GCE cluster with the name of each VM node on a purple background. The internal and public IPs of each node are shown on gray boxes and the pods running in each node are shown in green boxes. Each pod box shows the name of the pod and the namespace it runs in, the IP address of the pod and the images which are run as part of the pod’s execution. Here we see that every node is running a fluentd-cloud-logging pod which is collecting the log output of the containers running on the same node and sending them to Google Cloud Logging. A pod which provides a cluster DNS service runs on one of the nodes and a pod which provides monitoring support runs on another node.

To help explain how cluster level logging works let’s start off with a synthetic log generator pod specification counter-pod.yaml:

 apiVersion: v1
 kind: Pod
   name: counter
   - name: count
     image: ubuntu:14.04
     args: [bash, -c, 
            'for ((i = 0; ; i++)); do echo "$i: $(date)"; sleep 1; done']

This pod specification has one container which runs a bash script when the container is born. This script simply writes out the value of a counter and the date once per second and runs indefinitely. Let’s create the pod.

$ kubectl create -f counter-pod.yaml


We can observe the running pod:
$ kubectl get pods
NAME                                           READY     REASON    RESTARTS   AGE
counter                                        1/1       Running   0          5m
fluentd-cloud-logging-kubernetes-minion-0f64   1/1       Running   0          55m
fluentd-cloud-logging-kubernetes-minion-27gf   1/1       Running   0          55m
fluentd-cloud-logging-kubernetes-minion-pk22   1/1       Running   0          55m
fluentd-cloud-logging-kubernetes-minion-20ej   1/1       Running   0          55m
kube-dns-v3-pk22                               3/3       Running   0          55m
monitoring-heapster-v1-20ej                    0/1       Running   9          56m

This step may take a few minutes to download the ubuntu:14.04 image during which the pod status will be shown as Pending.

One of the nodes is now running the counter pod:

When the pod status changes to Running we can use the kubectl logs command to view the output of this counter pod.

$ kubectl logs counter
0: Tue Jun  2 21:37:31 UTC 2015
1: Tue Jun  2 21:37:32 UTC 2015
2: Tue Jun  2 21:37:33 UTC 2015
3: Tue Jun  2 21:37:34 UTC 2015
4: Tue Jun  2 21:37:35 UTC 2015
5: Tue Jun  2 21:37:36 UTC 2015

This command fetches the log text from the Docker log file for the image that is running in this container. We can connect to the running container and observe the running counter bash script.

$ kubectl exec -i counter bash
ps aux
root         1  0.0  0.0  17976  2888 ?        Ss   00:02   0:00 bash -c for ((i = 0; ; i++)); do echo "$i: $(date)"; sleep 1; done
root       468  0.0  0.0  17968  2904 ?        Ss   00:05   0:00 bash
root       479  0.0  0.0   4348   812 ?        S    00:05   0:00 sleep 1
root       480  0.0  0.0  15572  2212 ?        R    00:05   0:00 ps aux

What happens if for any reason the image in this pod is killed off and then restarted by Kubernetes? Will we still see the log lines from the previous invocation of the container followed by the log lines for the started container? Or will we lose the log lines from the original container’s execution and only see the log lines for the new container? Let’s find out. First let’s stop the currently running counter.

$ kubectl stop pod counter

Now let’s restart the counter.

$ kubectl create -f counter-pod.yaml

Let’s wait for the container to restart and get the log lines again.

$ kubectl logs counter
0: Tue Jun  2 21:51:40 UTC 2015
1: Tue Jun  2 21:51:41 UTC 2015
2: Tue Jun  2 21:51:42 UTC 2015
3: Tue Jun  2 21:51:43 UTC 2015
4: Tue Jun  2 21:51:44 UTC 2015
5: Tue Jun  2 21:51:45 UTC 2015
6: Tue Jun  2 21:51:46 UTC 2015
7: Tue Jun  2 21:51:47 UTC 2015
8: Tue Jun  2 21:51:48 UTC 2015

Oh no! We’ve lost the log lines from the first invocation of the container in this pod! Ideally, we want to preserve all the log lines from each invocation of each container in the pod. Furthermore, even if the pod is restarted we would still like to preserve all the log lines that were ever emitted by the containers in the pod. But don’t fear, this is the functionality provided by cluster level logging in Kubernetes. When a cluster is created, the standard output and standard error output of each container can be ingested using a Fluentd agent running on each node into either Google Cloud Logging or into Elasticsearch and viewed with Kibana. This blog article focuses on Google Cloud Logging.

When a Kubernetes cluster is created with logging to Google Cloud Logging enabled, the system creates a pod called fluentd-cloud-logging on each node of the cluster to collect Docker container logs. These pods were shown at the start of this blog article in the response to the first get pods command.

This log collection pod has a specification which looks something like this fluentd-gcp.yaml:

apiVersion: v1
kind: Pod
 name: fluentd-cloud-logging
 - name: fluentd-cloud-logging
   - name: FLUENTD_ARGS
     value: -qq
   - name: containers
     mountPath: /var/lib/docker/containers
 - name: containers
     path: /var/lib/docker/containers

This pod specification maps the the directory on the host containing the Docker log files, /var/lib/docker/containers, to a directory inside the container which has the same path. The pod runs one image,, which is configured to collect the Docker log files from the logs directory and ingest them into Google Cloud Logging. One instance of this pod runs on each node of the cluster. Kubernetes will notice if this pod fails and automatically restart it.

We can click on the Logs item under the Monitoring section of the Google Developer Console and select the logs for the counter container, which will be called kubernetes.counter_default_count.  This identifies the name of the pod (counter), the namespace (default) and the name of the container (count) for which the log collection occurred. Using this name we can select just the logs for our counter container from the drop down menu:

When we view the logs in the Developer Console we observe the logs for both invocations of the container.


Note the first container counted to 108 and then it was terminated. When the next container image restarted the counting process resumed from 0. Similarly if we deleted the pod and restarted it we would capture the logs for all instances of the containers in the pod whenever the pod was running.

Logs ingested into Google Cloud Logging may be exported to various other destinations including Google Cloud Storage buckets and BigQuery. Use the Exports tab in the Cloud Logging console to specify where logs should be streamed to (or follow this link to the settings tab).

We could query the ingested logs from BigQuery using the SQL query which reports the counter log lines showing the newest lines first.

SELECT metadata.timestamp, structPayload.log FROM [mylogs.kubernetes_counter_default_count_20150611] ORDER BY metadata.timestamp DESC

Here is some sample output:


We could also fetch the logs from Google Cloud Storage buckets to our desktop or laptop and then search them locally. The following command fetches logs for the counter pod running in a cluster which is itself in a GCE project called myproject. Only logs for the date 2015-06-11 are fetched.

$ gsutil -m cp -r gs://myproject/kubernetes.counter_default_count/2015/06/11 .

Now we can run queries over the ingested logs. The example below uses the jq program to extract just the log lines.

$ cat 21\:00\:00_21\:59\:59_S0.json | jq '.structPayload.log'
"0: Thu Jun 11 21:39:38 UTC 2015\n"
"1: Thu Jun 11 21:39:39 UTC 2015\n"
"2: Thu Jun 11 21:39:40 UTC 2015\n"
"3: Thu Jun 11 21:39:41 UTC 2015\n"
"4: Thu Jun 11 21:39:42 UTC 2015\n"
"5: Thu Jun 11 21:39:43 UTC 2015\n"
"6: Thu Jun 11 21:39:44 UTC 2015\n"
"7: Thu Jun 11 21:39:45 UTC 2015\n"

This article has touched briefly on the underlying mechanisms that support gathering cluster level logs on a Kubernetes deployment. The approach here only works for gathering the standard output and standard error output of the processes running in the pod’s containers. To gather other logs that are stored in files one can use a sidecar container to gather the required files as described at the page Collecting log files within containers with Fluentd and sending them to the Google Cloud Logging service.

Tuesday, June 2, 2015

Weekly Kubernetes Community Hangout Notes - May 22 2015

Every week the Kubernetes contributing community meet virtually over Google Hangouts. We want anyone who's interested to know what's discussed in this forum.

Discussion / Topics
  • Code Freeze
  • Upgrades of cluster
  • E2E test issues

Code Freeze process starts EOD 22-May, including
  • Code Slush -- draining PRs that are active. If there are issues for v1 to raise, please do so today.
    • Drain timeframe will be about 1-week.
  • Community PRs -- plan is to reopen in ~6 weeks.
  • Key areas for fixes in v1 -- docs, the experience.

E2E issues and LGTM process
  • Seen end-to-end tests go red.
  • Plan is to limit merging to on-call. Quinton to communicate.
    • Community committers, please label with LGTM and on-call will merge based on on-call’s judgement.
  • Can we expose Jenkins runs to community? (Paul)
    • Question/concern to work out is securing Jenkins. Short term conclusion: Will look at pushing Jenkins logs into GCS bucket. Lavalamp will follow up with Jeff Grafton.
    • Longer term solution may be a merge queue, where e2e runs for each merge (as opposed to multiple merges). This exists in Openshift today.

Cluster Upgrades for Kubernetes as final v1 feature
  • GCE will use Persistent Disk (PD) to mount new image.
  • OpenShift will follow a tradition update model, with “yum update”.
  • A strawman approach is to have an analog of “kube-push” to update the master, in-place. Feedback in the meeting was
    • Upgrading Docker daemon on the master will kill the master’s pods. Agreed. May consider an ‘upgrade’ phase or explicit step.
    • How is this different than HA master upgrade? See HA case as a superset. The work to do an upgrade would be a prerequisite for HA master upgrade.
  • Mesos scheduler implements a rolling node upgrade.

Attention requested for v1 in the Hangout
  • Downward plug-in #5093.
    • Discussed that it’s an eventually consistent design.
    • In the meeting, the outcome was: seeking a pattern for atomicity of update across multiple piece. Paul to ping Tim when ready to review.
  • Regression in e2e #8499 (Eric Paris)
  • Asking for review of direction, if not review. #8334 (Mark)
  • Handling graceful termination (e.g. sigterm to postgres) is not implemented. #2789 (Clayton)
    • Need is to bump up grace period or finish plumbing. In API, client tools, missing is kubelet does use and we don’t set the timeout (>0) value.
    • Brendan will look into this graceful term issue.
  • Load balancer almost ready by JustinSB.