An open source system for automating deployment, scaling, and operations of applications.

Friday, April 21, 2017

How Bitmovin is Doing Multi-Stage Canary Deployments with Kubernetes in the Cloud and On-Prem

Editor's Note: Today’s post is by Daniel Hoelbling-Inzko, Infrastructure Architect at Bitmovin, a company that provides services that transcode digital video and audio to streaming formats, sharing insights about their use of Kubernetes.

Running a large scale video encoding infrastructure on multiple public clouds is tough. At Bitmovin, we have been doing it successfully for the last few years, but from an engineering perspective, it’s neither been enjoyable nor particularly fun. 

So obviously, one of the main things that really sold us on using Kubernetes, was it’s common abstraction from the different supported cloud providers and the well thought out programming interface it provides. More importantly, the Kubernetes project did not settle for the lowest common denominator approach. Instead, they added the necessary abstract concepts that are required and useful to run containerized workloads in a cloud and then did all the hard work to map these concepts to the different cloud providers and their offerings.

The great stability, speed and operational reliability we saw in our early tests in mid-2016 made the migration to Kubernetes a no-brainer.

And, it didn’t hurt that the vision for scale the Kubernetes project has been pursuing is closely aligned with our own goals as a company. Aiming for >1,000 node clusters might be a lofty goal, but for a fast growing video company like ours, having your infrastructure aim to support future growth is essential. Also, after initial brainstorming for our new infrastructure, we immediately knew that we would be running a huge number of containers and having a system, with the expressed goal of working at global scale, was the perfect fit for us. Now with the recent Kubernetes 1.6 release and its support for 5,000 node clusters, we feel even more validated in our choice of a container orchestration system.

During the testing and migration phase of getting our infrastructure running on Kubernetes, we got quite familiar with the Kubernetes API and the whole ecosystem around it. So when we were looking at expanding our cloud video encoding offering for customers to use in their own datacenters or cloud environments, we quickly decided to leverage Kubernetes as our ubiquitous cloud operating system to base the solution on.

Just a few months later this effort has become our newest service offering: Bitmovin Managed On-Premise encoding. Since all Kubernetes clusters share the same API, adapting our cloud encoding service to also run on Kubernetes enabled us to deploy into our customer’s datacenter, regardless of the hardware infrastructure running underneath. With great tools from the community, like kube-up and turnkey solutions, like Google Container Engine, anyone can easily provision a new Kubernetes cluster, either within their own infrastructure or in their own cloud accounts. 

To give us the maximum flexibility for customers that deploy to bare metal and might not have any custom cloud integrations for Kubernetes yet, we decided to base our solution solely on facilities that are available in any Kubernetes install and don’t require any integration into the surrounding infrastructure (it will even run inside Minikube!). We don’t rely on Services of type LoadBalancer, primarily because enterprise IT is usually reluctant to open up ports to the open internet - and not every bare metal Kubernetes install supports externally provisioned load balancers out of the box. To avoid these issues, we deploy a BitmovinAgent that runs inside the Cluster and polls our API for new encoding jobs without requiring any network setup. This agent then uses the locally available Kubernetes credentials to start up new deployments that run the encoders on the available hardware through the Kubernetes API.

Even without having a full cloud integration available, the consistent scheduling, health checking and monitoring we get from using the Kubernetes API really enabled us to focus on making the encoder work inside a container rather than spending precious engineering resources on integrating a bunch of different hypervisors, machine provisioners and monitoring systems.

Multi-Stage Canary Deployments

Our first encounters with the Kubernetes API were not for the On-Premise encoding product. Building our containerized encoding workflow on Kubernetes was rather a decision we made after seeing how incredibly easy and powerful the Kubernetes platform proved during development and rollout of our Bitmovin API infrastructure. We migrated to Kubernetes around four months ago and it has enabled us to provide rapid development iterations to our service while meeting our requirements of downtime-free deployments and a stable development to production pipeline. To achieve this we came up with an architecture that runs almost a thousand containers and meets the following requirements we had laid out on day one:

  1. Zero downtime deployments for our customers
  2. Continuous deployment to production on each git mainline push
  3. High stability of deployed services for customers

Obviously #2 and #3 are at odds with each other, if each merged feature gets deployed to production right away - how can we ensure these releases are bug-free and don’t have adverse side effects for our customers?

To overcome this oxymoron, we came up with a four-stage canary pipeline for each microservice where we simultaneously deploy to production and keep changes away from customers until the new build has proven to work reliably and correctly in the production environment.

Once a new build is pushed, we deploy it to an internal stage that’s only accessible for our internal tests and the integration test suite. Once the internal test suite passes, QA reports no issues, and we don’t detect any abnormal behavior, we push the new build to our free stage. This means that 5% of our free users would get randomly assigned to this new build. After some time in this stage the build gets promoted to the next stage that gets 5% of our paid users routed to it. Only once the build has successfully passed all 3 of these hurdles, does it get deployed to the production tier, where it will receive all traffic from our remaining users as well as our enterprise customers, which are not part of the paid bucket and never see their traffic routed to a canary track.


This setup makes us a pretty big Kubernetes installation by default, since all of our canary tiers are available at a minimum replication of 2. Since we are currently deploying around 30 microservices (and growing) to our clusters, it adds up to a minimum of 10 pods per service (8 application pods + minimum 2 HAProxy pods that do the canary routing). Although, in reality our preferred standard configuration is usually running 2 internal, 4 free, 4 others and 10 production pods alongside 4 HAProxy pods - totalling around 700 pods in total. This also means that we are running at least 150 services that provide a static ClusterIP to their underlying microservice canary tier.


A typical deployment looks like this:

Services (ClusterIP)
Deployments
#Pods
account-service
account-service-haproxy
4
account-service-internal
account-service-internal-v1.18.0
2
account-service-canary
account-service-canary-v1.17.0
4
account-service-paid
account-service-paid-v1.15.0
4
account-service-production
account-service-production-v1.15.0
10

An example service definition the production track will have the following label selectors:


apiVersion: v1
kind: Service
metadata:
 name: account-service-production
 labels:
   app: account-service-production
   tier: service
   lb: private
spec:
 ports:
 - port: 8080
   name: http
   targetPort: 8080
   protocol: TCP
 selector:
   app: account-service
   tier: service
   track: production

In front of the Kubernetes services, load balancing the different canary versions of the service, lives a small cluster of HAProxy pods that get their haproxy.conf from the Kubernetes ConfigMaps that looks something like this:

frontend http-in
 bind *:80
 log 127.0.0.1 local2 debug

 acl traffic_internal    hdr(X-Traffic-Group) -m str -i INTERNAL
 acl traffic_free        hdr(X-Traffic-Group) -m str -i FREE
 acl traffic_enterprise  hdr(X-Traffic-Group) -m str -i ENTERPRISE

 use_backend internal   if traffic_internal
 use_backend canary     if traffic_free
 use_backend enterprise if traffic_enterprise

 default_backend paid

backend internal
 balance roundrobin
 server internal-lb        user-resource-service-internal:8080   resolvers dns check inter 2000
backend canary
 balance roundrobin
 server canary-lb          user-resource-service-canary:8080     resolvers dns check inter 2000 weight 5
 server production-lb      user-resource-service-production:8080 resolvers dns check inter 2000 weight 95
backend paid
 balance roundrobin
 server canary-paid-lb     user-resource-service-paid:8080       resolvers dns check inter 2000 weight 5
 server production-lb      user-resource-service-production:8080 resolvers dns check inter 2000 weight 95
backend enterprise
 balance roundrobin
 server production-lb      user-resource-service-production:8080 resolvers dns check inter 2000 weight 100

Each HAProxy will inspect a header that gets assigned by our API-Gateway called X-Traffic-Group that determines which bucket of customers this request belongs to. Based on that, a decision is made to hit either a canary deployment or the production deployment.

Obviously, at this scale, kubectl (while still our main day-to-day tool to work on the cluster) doesn’t really give us a good overview of whether everything is actually running as it’s supposed to and what is maybe over or under replicated.

Since we do blue/green deployments, we sometimes forget to shut down the old version after the new one comes up, so some services might be running over replicated and finding these issues in a soup of 25 deployments listed in kubectl is not trivial, to say the least.
So, having a container orchestrator like Kubernetes, that’s very API driven, was really a godsend for us, as it allowed us to write tools that take care of that.

We built tools that either run directly off kubectl (eg bash-scripts) or interact directly with the API and understand our special architecture to give us a quick overview of the system. These tools were mostly built in Go using the client-go library.

One of these tools is worth highlighting, as it’s basically our only way to really see service health at a glance. It goes through all our Kubernetes services that have the tier: service selector and checks if the accompanying HAProxy deployment is available and all pods are running with 4 replicas. It also checks if the 4 services behind the HAProxys (internal, free, others and production) have at least 2 endpoints running. If any of these conditions are not met, we immediately get a notification in Slack and by email.

Managing this many pods with our previous orchestrator proved very unreliable and the overlay network frequently caused issues. Not so with Kubernetes - even doubling our current workload for test purposes worked flawlessly and in general, the cluster has been working like clockwork ever since we installed it.

Another advantage of switching over to Kubernetes was the availability of the kubernetes resource specifications, in addition to the API (which we used to write some internal tools for deployment). This enabled us to have a Git repo with all our Kubernetes specifications, where each track is generated off a common template and only contains placeholders for variable things like the canary track and the names.

All changes to the cluster have to go through tools that modify these resource specifications and get checked into git automatically so, whenever we see issues, we can debug what changes the infrastructure went through over time!

To summarize this post - by migrating our infrastructure to Kubernetes, Bitmovin is able to have:
  • Zero downtime deployments, allowing our customers to encode 24/7 without interruption
  • Fast development to production cycles, enabling us to ship new features faster
  • Multiple levels of quality assurance and high confidence in production deployments
  • Ubiquitous abstractions across cloud architectures and on-premise deployments
  • Stable and reliable health-checking and scheduling of services
  • Custom tooling around our infrastructure to check and validate the system
  • History of deployments (resource specifications in git + custom tooling)

We want to thank the Kubernetes community for the incredible job they have done with the project. The velocity at which the project moves is just breathtaking! Maintaining such a high level of quality and robustness in such a diverse environment is really astonishing. 

--Daniel Hoelbling-Inzko, Infrastructure Architect, Bitmovin

  • Post questions (or answer questions) on Stack Overflow
  • Join the community portal for advocates on K8sPort
  • Get involved with the Kubernetes project on GitHub
  • Follow us on Twitter @Kubernetesio for latest updates
  • Connect with the community on Slack
  • Download Kubernetes




Thursday, April 6, 2017

RBAC Support in Kubernetes

Editor’s note: this post is part of a series of in-depth articles on what's new in Kubernetes 1.6


One of the highlights of the Kubernetes 1.6 release is the RBAC authorizer feature moving to beta. RBAC, Role-based access control, is an an authorization mechanism for managing permissions around Kubernetes resources. RBAC allows configuration of flexible authorization policies that can be updated without cluster restarts.

The focus of this post is to highlight some of the interesting new capabilities and best practices.

RBAC vs ABAC

Currently there are several authorization mechanisms available for use with Kubernetes. Authorizers are the mechanisms that decide who is permitted to make what changes to the cluster using the Kubernetes API. This affects things like kubectl, system components, and also certain applications that run in the cluster and manipulate the state of the cluster, like Jenkins with the Kubernetes plugin, or Helm that runs in the cluster and uses the Kubernetes API to install applications in the cluster. Out of the available authorization mechanisms, ABAC and RBAC are the mechanisms local to a Kubernetes cluster that allow configurable permissions policies.

ABAC, Attribute Based Access Control, is a powerful concept. However, as implemented in Kubernetes, ABAC is difficult to manage and understand. It requires ssh and root filesystem access on the master VM of the cluster to make authorization policy changes. For permission changes to take effect the cluster API server must be restarted.

RBAC permission policies are configured using kubectl or the Kubernetes API directly. Users can be authorized to make authorization policy changes using RBAC itself, making it possible to delegate resource management without giving away ssh access to the cluster master. RBAC policies map easily to the resources and operations used in the Kubernetes API.

Based on where the Kubernetes community is focusing their development efforts, going forward RBAC should be preferred over ABAC.

Basic Concepts

The are a few basic ideas behind RBAC that are foundational in understanding it. At its core, RBAC is a way of granting users granular access to Kubernetes API resources.


The connection between user and resources is defined in RBAC using two objects.

Roles
A Role is a collection of permissions. For example, a role could be defined to include read permission on pods and list permission for pods. A ClusterRole is just like a Role, but can be used anywhere in the cluster.

Role Bindings
A RoleBinding maps a Role to a user or set of users, granting that Role's permissions to those users for resources in that namespace. A ClusterRoleBinding allows users to be granted a ClusterRole for authorization across the entire cluster.



Additionally there are cluster roles and cluster role bindings to consider. Cluster roles and cluster role bindings function like roles and role bindings except they have wider scope. The exact differences and how cluster roles and cluster role bindings interact with roles and role bindings are covered in the Kubernetes documentation.

RBAC in Kubernetes

RBAC is now deeply integrated into Kubernetes and used by the system components to grant the permissions necessary for them to function. System roles are typically prefixed with system: so they can be easily recognized.

➜  kubectl get clusterroles --namespace=kube-system
NAME                    KIND
admin ClusterRole.v1beta1.rbac.authorization.k8s.io
cluster-admin ClusterRole.v1beta1.rbac.authorization.k8s.io
edit ClusterRole.v1beta1.rbac.authorization.k8s.io
kubelet-api-admin ClusterRole.v1beta1.rbac.authorization.k8s.io
system:auth-delegator ClusterRole.v1beta1.rbac.authorization.k8s.io
system:basic-user ClusterRole.v1beta1.rbac.authorization.k8s.io
system:controller:attachdetach-controller ClusterRole.v1beta1.rbac.authorization.k8s.io
system:controller:certificate-controller ClusterRole.v1beta1.rbac.authorization.k8s.io
...

The RBAC system roles have been expanded to cover the necessary permissions for running a Kubernetes cluster with RBAC only.

During the permission translation from ABAC to RBAC, some of the permissions that were enabled by default in many deployments of ABAC authorized clusters were identified as unnecessarily broad and were scoped down in RBAC. The area most likely to impact workloads on a cluster is the permissions available to service accounts. With the permissive ABAC configuration, requests from a pod using the pod mounted token to authenticate to the API server have broad authorization. As a concrete example, the curl command at the end of this sequence will return a JSON formatted result when ABAC is enabled and an error when only RBAC is enabled.

➜  kubectl run nginx --image=nginx:latest
➜  kubectl exec -it $(kubectl get pods -o jsonpath='{.items[0].metadata.name}') bash
➜  apt-get update && apt-get install -y curl
➜  curl -ik \
 -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
 https://kubernetes/api/v1/namespaces/default/pods

Any applications you run in your Kubernetes cluster that interact with the Kubernetes API have the potential to be affected by the permissions changes when transitioning from ABAC to RBAC.

To smooth the transition from ABAC to RBAC, you can create Kubernetes 1.6 clusters with both ABAC and RBAC authorizers enabled. When both ABAC and RBAC are enabled, authorization for a resource is granted if either authorization policy grants access. However, under that configuration the most permissive authorizer is used and it will not be possible to use RBAC to fully control permissions.

At this point, RBAC is complete enough that ABAC support should be considered deprecated going forward. It will still remain in Kubernetes for the foreseeable future but development attention is focused on RBAC.

Two different talks at the at the Google Cloud Next conference touched on RBAC related changes in Kubernetes 1.6, jump to the relevant parts here and here. For more detailed information about using RBAC in Kubernetes 1.6 read the full RBAC documentation.

Get Involved

If you’d like to contribute or simply help provide feedback and drive the roadmap, join our community. Specifically interested in security and RBAC related conversation, participate through one of these channels:

Thanks for your support and contributions. Read more in-depth posts on what's new in Kubernetes 1.6 here.


-- Jacob Simpson, Greg Castle & CJ Cullen, Software Engineers at Google


  • Post questions (or answer questions) on Stack Overflow
  • Join the community portal for advocates on K8sPort
  • Get involved with the Kubernetes project on GitHub
  • Follow us on Twitter @Kubernetesio for latest updates
  • Connect with the community on Slack
  • Download Kubernetes


Tuesday, April 4, 2017

Configuring Private DNS Zones and Upstream Nameservers in Kubernetes

Editor’s note: this post is part of a series of in-depth articles on what's new in Kubernetes 1.6

Many users have existing domain name zones that they would like to integrate into their Kubernetes DNS namespace. For example, hybrid-cloud users may want to resolve their internal “.corp” domain addresses within the cluster. Other users may have a zone populated by a non-Kubernetes service discovery system (like Consul). We’re pleased to announce that, in Kubernetes 1.6, kube-dns adds support for configurable private DNS zones (often called “stub domains”) and external upstream DNS nameservers. In this blog post, we describe how to configure and use this feature.

Default lookup flow


Kubernetes currently supports two DNS policies specified on a per-pod basis using the dnsPolicy flag: “Default” and “ClusterFirst”. If dnsPolicy is not explicitly specified, then “ClusterFirst” is used:
  • If dnsPolicy is set to “Default”, then the name resolution configuration is inherited from the node the pods run on. Note: this feature cannot be used in conjunction with dnsPolicy: “Default”.
  • If dnsPolicy is set to “ClusterFirst”, then DNS queries will be sent to the kube-dns service. Queries for domains rooted in the configured cluster domain suffix (any address ending in “.cluster.local” in the example above) will be answered by the kube-dns service. All other queries (for example, www.kubernetes.io) will be forwarded to the upstream nameserver inherited from the node.
Before this feature, it was common to introduce stub domains by replacing the upstream DNS with a custom resolver. However, this caused the custom resolver itself to become a critical path for DNS resolution, where issues with scalability and availability may cause the cluster to lose DNS functionality. This feature allows the user to introduce custom resolution without taking over the entire resolution path.

Customizing the DNS Flow

Beginning in Kubernetes 1.6, cluster administrators can specify custom stub domains and upstream nameservers by providing a ConfigMap for kube-dns. For example, the configuration below inserts a single stub domain and two upstream nameservers. As specified, DNS requests with the “.acme.local” suffix will be forwarded to a DNS listening at 1.2.3.4. Additionally, Google Public DNS will serve upstream queries. See ConfigMap Configuration Notes at the end of this section for a few notes about the data format.

apiVersion: v1
kind: ConfigMap
metadata:
 name: kube-dns
 namespace: kube-system
data:
 stubDomains: |
   {“acme.local”: [“1.2.3.4”]}
 upstreamNameservers: |
   [“8.8.8.8”, “8.8.4.4”]

The diagram below shows the flow of DNS queries specified in the configuration above. With the
dnsPolicy set to “ClusterFirst” a DNS query is first sent to the DNS caching layer in kube-dns. From here, the suffix of the request is examined and then forwarded to the appropriate DNS.  In this case, names with the cluster suffix (e.g.; “.cluster.local”) are sent to kube-dns. Names with the stub domain suffix (e.g.; “.acme.local”) will be sent to the configured custom resolver. Finally, requests that do not match any of those suffixes will be forwarded to the upstream DNS.

Below is a table of example domain names and the destination of the queries for those domain names:
Domain name
Server answering the query
kubernetes.default.svc.cluster.local
kube-dns
foo.acme.local
custom DNS (1.2.3.4)
widget.com
upstream DNS (one of 8.8.8.8, 8.8.4.4)

ConfigMap Configuration Notes
  • stubDomains (optional)
    • Format: a JSON map using a DNS suffix key (e.g.; “acme.local”) and a value consisting of a JSON array of DNS IPs.
    • Note: The target nameserver may itself be a Kubernetes service. For instance, you can run your own copy of dnsmasq to export custom DNS names into the ClusterDNS namespace.
  • upstreamNameservers (optional)
    • Format: a JSON array of DNS IPs.
    • Note: If specified, then the values specified replace the nameservers taken by default from the node’s /etc/resolv.conf
    • Limits: a maximum of three upstream nameservers can be specified
Example #1: Adding a Consul DNS Stub Domain

In this example, the user has Consul DNS service discovery system they wish to integrate with kube-dns. The consul domain server is located at 10.150.0.1, and all consul names have the suffix “.consul.local”.  To configure Kubernetes, the cluster administrator simply creates a ConfigMap object as shown below.  Note: in this example, the cluster administrator did not wish to override the node’s upstream nameservers, so they didn’t need to specify the optional upstreamNameservers field.

apiVersion: v1
kind: ConfigMap
metadata:
 name: kube-dns
 namespace: kube-system
data:
 stubDomains: |
   {“consul.local”: [“10.150.0.1”]}

Example #2: Replacing the Upstream Nameservers

In this example the cluster administrator wants to explicitly force all non-cluster DNS lookups to go through their own nameserver at 172.16.0.1.  Again, this is easy to accomplish; they just need to create a ConfigMap with the upstreamNameservers field specifying the desired nameserver.

apiVersion: v1
kind: ConfigMap
metadata:
 name: kube-dns
 namespace: kube-system
data:
 upstreamNameservers: |
   [“172.16.0.1”]

Get involved

If you’d like to contribute or simply help provide feedback and drive the roadmap, join our community. Specifically for network related conversations participate though one of these channels:
Thanks for your support and contributions. Read more in-depth posts on what's new in Kubernetes 1.6 here.


--Bowei Du, Software Engineer and Matthew DeLio, Product Manager, Google


  • Post questions (or answer questions) on Stack Overflow
  • Join the community portal for advocates on K8sPort
  • Get involved with the Kubernetes project on GitHub
  • Follow us on Twitter @Kubernetesio for latest updates
  • Connect with the community on Slack
  • Download Kubernetes