An open source system for automating deployment, scaling, and operations of applications.

Friday, May 19, 2017

Kubernetes: a monitoring guide

Today’s post is by Jean-Mathieu Saponaro, Research & Analytics Engineer at Datadog, discussing what Kubernetes changes for monitoring, and how you can prepare to properly monitor a containerized infrastructure orchestrated by Kubernetes.

Container technologies are taking the infrastructure world by storm. While containers solve or simplify infrastructure management processes, they also introduce significant complexity in terms of orchestration. That’s where Kubernetes comes to our rescue. Just like a conductor directs an orchestra, Kubernetes oversees our ensemble of containers—starting, stopping, creating, and destroying them automatically to keep our applications humming along.

Kubernetes makes managing a containerized infrastructure much easier by creating levels of abstractions such as pods and services. We no longer have to worry about where applications are running or if they have enough resources to work properly. But that doesn’t change the fact that, in order to ensure good performance, we need to monitor our applications, the containers running them, and Kubernetes itself.

Rethinking monitoring for the Kubernetes era

Just as containers have completely transformed how we think about running services on virtual machines, Kubernetes has changed the way we interact with containers. The good news is that with proper monitoring, the abstraction levels inherent to Kubernetes provide a comprehensive view of your infrastructure, even if the containers and applications are constantly moving. But Kubernetes monitoring requires us to rethink and reorient our strategies, since it differs from monitoring traditional hosts such as VMs or physical machines in several ways.

Tags and labels become essential
With containers and their orchestration completely managed by Kubernetes, labels are now the only way we have to interact with pods and containers. That’s why they are absolutely crucial for monitoring since all metrics and events will be sliced and diced using labels across the different layers of your infrastructure. Defining your labels with a logical and easy-to-understand schema is essential so your metrics will be as useful as possible.

There are now more components to monitor
In traditional, host-centric infrastructure, we were used to monitoring only two layers: applications and the hosts running them. Now with containers in the middle and Kubernetes itself needing to be monitored, there are four different components to monitor and collect metrics from.

Applications are constantly moving
Kubernetes schedules applications dynamically based on scheduling policy, so you don’t always know where applications are running. But they still need to be monitored. That’s why using a monitoring system or tool with service discovery is a must. It will automatically adapt metric collection to moving containers so applications can be continuously monitored without interruption.

Be prepared for distributed clusters
Kubernetes has the ability to distribute containerized applications across multiple data centers and potentially different cloud providers. That means metrics must be collected and aggregated among all these different sources. 
For more details about all these new monitoring challenges inherent to Kubernetes and how to overcome them, we recently published an in-depth Kubernetes monitoring guide. Part 1 of the series covers how to adapt your monitoring strategies to the Kubernetes era.

Metrics to monitor

Whether you use Heapster data or a monitoring tool integrating with Kubernetes and its different APIs, there are several key types of metrics that need to be closely tracked:
  • Running pods and their deployments
  • Usual resource metrics such as CPU, memory usage, and disk I/O
  • Container-native metrics
  • Application metrics for which a service discovery feature in your monitoring tool is essential 
All these metrics should be aggregated using Kubernetes labels and correlated with events from Kubernetes and container technologies.
Part 2 of our series on Kubernetes monitoring guides you through all the data that needs to be collected and tracked.

Collecting these metrics

Whether you want to track these key performance metrics by combining Heapster, a storage backend, and a graphing tool, or by integrating a monitoring tool with the different components of your infrastructure, Part 3, about Kubernetes metric collection, has you covered.
Anchors aweigh!

Using Kubernetes drastically simplifies container management. But it requires us to rethink our monitoring strategies on several fronts, and to make sure all the key metrics from the different components are properly collected, aggregated, and tracked. We hope our monitoring guide will help you to effectively monitor your Kubernetes clusters. Feedback and suggestions are more than welcome.

--Jean-Mathieu Saponaro, Research & Analytics Engineer, Datadog

  • Get involved with the Kubernetes project on GitHub 
  • Post questions (or answer questions) on Stack Overflow 
  • Connect with the community on Slack
  • Follow us on Twitter @Kubernetesio for latest updates

Thursday, May 18, 2017

Kargo Ansible Playbooks foster Collaborative Kubernetes Ops

Today’s guest post is by Rob Hirschfeld, co-founder of open infrastructure automation project, Digital Rebar and co-chair of the SIG Cluster Ops.  

Why Kargo?

Making Kubernetes operationally strong is a widely held priority and I track many deployment efforts around the project. The incubated Kargo project is of particular interest for me because it uses the popular Ansible toolset to build robust, upgradable clusters on both cloud and physical targets. I believe using tools familiar to operators grows our community.

We’re excited to see the breadth of platforms enabled by Kargo and how well it handles a wide range of options like integrating Ceph for StatefulSet persistence and Helm for easier application uploads. Those additions have allowed us to fully integrate the OpenStack Helm charts (demo video).

By working with the upstream source instead of creating different install scripts, we get the benefits of a larger community. This requires some extra development effort; however, we believe helping share operational practices makes the whole community stronger. That was also the motivation behind the SIG-Cluster Ops.

With Kargo delivering robust installs, we can focus on broader operational concerns.

For example, we can now drive parallel deployments, so it’s possible to fully exercise the options enabled by Kargo simultaneously for development and testing.  

That’s helpful to built-test-destroy coordinated Kubernetes installs on CentOS, Red Hat and Ubuntu as part of an automation pipeline. We can also set up a full classroom environment from a single command using Digital Rebar’s providers, tenants and cluster definition JSON.

Let’s explore the classroom example:

First, we define a student cluster in JSON like the snippet below

 "attribs": {
   "k8s-version": "v1.6.0",
   "k8s-kube_network_plugin": "calico",
   "k8s-docker_version": "1.12"
 "name": "cluster01",
 "tenant": "cluster01",
 "public_keys": {
   "cluster01": "ssh-rsa AAAAB....."
 "provider": {
   "name": "google-provider"
 "nodes": [
     "roles": [ "etcd","k8s-addons", "k8s-master" ],
     "count": 1
     "roles": [ "k8s-worker" ],
     "count": 3

Then we run the Digital Rebar workloads reference script which inspects the deployment files to pull out key information.  Basically, it automates the following steps:

rebar provider create {“name”:“google-provider”, [secret stuff]}
rebar tenants create {“name”:“cluster01”}
rebar deployments create [contents from cluster01 file]

The deployments create command will automatically request nodes from the provider. Since we’re using tenants and SSH key additions, each student only gets access to their own cluster. When we’re done, adding the --destroy flag will reverse the process for the nodes and deployments but leave the providers and tenants.

We are invested in operational scripts like this example using Kargo and Digital Rebar because if we cannot manage variation in a consistent way then we’re doomed to operational fragmentation.  

I am excited to see and be part of the community progress towards enterprise-ready Kubernetes operations on both cloud and on-premises. That means I am seeing reasonable patterns emerge with sharable/reusable automation. I strongly recommend watching (or better, collaborating in) these efforts if you are deploying Kubernetes even at experimental scale. Being part of the community requires more upfront effort but returns dividends as you get the benefits of shared experience and improvement.

When deploying at scale, how do you set up a system to be both repeatable and multi-platform without compromising scale or security?

With Kargo and Digital Rebar as a repeatable base, extensions get much faster and easier. Even better, using upstream directly allows improvements to be quickly cycled back into upstream. That means we’re closer to building a community focused on the operational side of Kubernetes with an SRE mindset.

If this is interesting, please engage with us in the Cluster Ops SIG, Kargo or Digital Rebar communities. 

-- Rob Hirschfeld, co-founder of RackN and co-chair of the Cluster Ops SIG

  • Get involved with the Kubernetes project on GitHub
  • Post questions (or answer questions) on Stack Overflow
  • Connect with the community on Slack
  • Follow us on Twitter @Kubernetesio for latest updates

Dancing at the Lip of a Volcano: The Kubernetes Security Process - Explained

Editor's note: Today’s post is by Jess Frazelle of Google and Brandon Philips of CoreOS about the Kubernetes security disclosures and response policy. 

Software running on servers underpins ever growing amounts of the world's commerce, communications, and physical infrastructure. And nearly all of these systems are connected to the internet; which means vital security updates must be applied rapidly. As software developers and IT professionals, we often find ourselves dancing on the edge of a volcano: we may either fall into magma induced oblivion from a security vulnerability exploited before we can fix it, or we may slide off the side of the mountain because of an inadequate process to address security vulnerabilities. 

The Kubernetes community believes that we can help teams restore their footing on this volcano with a foundation built on Kubernetes. And the bedrock of this foundation requires a process for quickly acknowledging, patching, and releasing security updates to an ever growing community of Kubernetes users. 

With over 1,200 contributors and over a million lines of code, each release of Kubernetes is a massive undertaking staffed by brave volunteer release managers. These normal releases are fully transparent and the process happens in public. However, security releases must be handled differently to keep potential attackers in the dark until a fix is made available to users.

We drew inspiration from other open source projects in order to create the Kubernetes security release process. Unlike a regularly scheduled release, a security release must be delivered on an accelerated schedule, and we created the Product Security Team to handle this process.

This team quickly selects a lead to coordinate work and manage communication with the persons that disclosed the vulnerability and the Kubernetes community. The security release process also documents ways to measure vulnerability severity using the Common Vulnerability Scoring System (CVSS) Version 3.0 Calculator. This calculation helps inform decisions on release cadence in the face of holidays or limited developer bandwidth. By making severity criteria transparent we are able to better set expectations and hit critical timelines during an incident where we strive to:
  • Respond to the person or team who reported the vulnerability and staff a development team responsible for a fix within 24 hours
  • Disclose a forthcoming fix to users within 7 days of disclosure
  • Provide advance notice to vendors within 14 days of disclosure
  • Release a fix within 21 days of disclosure

As we continue to harden Kubernetes, the security release process will help ensure that Kubernetes remains a secure platform for internet scale computing. If you are interested in learning more about the security release process please watch the presentation from KubeCon Europe 2017 on YouTube and follow along with the slides. If you are interested in learning more about authentication and authorization in Kubernetes, along with the Kubernetes cluster security model, consider joining Kubernetes SIG Auth. We also hope to see you at security related presentations and panels at the next Kubernetes community event: CoreOS Fest 2017 in San Francisco on May 31 and June 1.

As a thank you to the Kubernetes community, a special 25 percent discount to CoreOS Fest is available using k8s25code or via this special 25 percent off link to register today for CoreOS Fest 2017. 

--Brandon Philips of CoreOS and Jess Frazelle of Google

  • Post questions (or answer questions) on Stack Overflow
  • Join the community portal for advocates on K8sPort
  • Follow us on Twitter @Kubernetesio for latest updates
  • Connect with the community on Slack
  • Get involved with the Kubernetes project on GitHub