An open source system for automating deployment, scaling, and operations of applications.

Friday, February 24, 2017

Deploying PostgreSQL Clusters using StatefulSets

Editor’s note: Today’s guest post is by Jeff McCormick, a developer at Crunchy Data, showing how to build a PostgreSQL cluster using the new Kubernetes StatefulSet feature.

In an earlier post, I described how to deploy a PostgreSQL cluster using Helm, a Kubernetes package manager. The following example provides the steps for building a PostgreSQL cluster using the new Kubernetes StatefulSets feature. 

StatefulSets Example

Step 1 - Create Kubernetes Environment

StatefulSets is a new feature implemented in Kubernetes 1.5 (prior versions it was known as PetSets). As a result, running this example will require an environment based on Kubernetes 1.5.0 or above.  

The example in this blog deploys on Centos7 using kubeadm. Some instructions on what kubeadm provides and how to deploy a Kubernetes cluster is located here.

Step 2 - Install NFS

The example in this blog uses NFS for the Persistent Volumes, but any shared file system would also work (ex: ceph, gluster).  

The example script assumes your NFS server is running locally and your hostname resolves to a known IP address. 

In summary, the steps used to get NFS working on a Centos 7 host are as follows:

sudo setsebool -P virt_use_nfs 1
sudo yum -y install nfs-utils libnfsidmap
sudo systemctl enable rpcbind nfs-server
sudo systemctl start rpcbind nfs-server rpc-statd nfs-idmapd
sudo mkdir /nfsfileshare
sudo chmod 777 /nfsfileshare/
sudo vi /etc/exports
sudo exportfs -r

The /etc/exports file should contain a line similar to this one except with the applicable IP address specified:


After these steps NFS should be running in the test environment.

Step 3 - Clone the Crunchy PostgreSQL Container Suite

The example used in this blog is found at in the Crunchy Containers GitHub repo here. Clone the Crunchy Containers repository to your test Kubernertes host and go to the example:

cd $HOME
git clone
cd crunchy-containers/examples/kube/statefulset

Next, pull down the Crunchy PostgreSQL container image:

docker pull crunchydata/crunchy-postgres:centos7-9.5-1.2.6

Step 4 - Run the Example

To begin, it is necessary to set a few of the environment variables used in the example:

export BUILDBASE=$HOME/crunchy-containers
export CCP_IMAGE_TAG=centos7-9.5-1.2.6

BUILDBASE is where you cloned the repository and CCP_IMAGE_TAG is the container image version we want to use.

Next, run the example:


That script will create several Kubernetes objects including:
  •  Persistent Volumes (pv1, pv2, pv3)
  •  Persistent Volume Claim (pgset-pvc)
  •  Service Account (pgset-sa)
  •  Services (pgset, pgset-master, pgset-replica)
  •  StatefulSet (pgset)
  •  Pods (pgset-0, pgset-1)
At this point, two pods will be running in the Kubernetes environment: 

$ kubectl get pod
pgset-0   1/1       Running   0          2m
pgset-1   1/1       Running   1          2m

Immediately after the pods are created, the deployment will be as depicted below:

Step 5 - What Just Happened?

This example will deploy a StatefulSet, which in turn creates two pods.

The containers in those two pods run the PostgreSQL database. For a PostgreSQL cluster, we need one of the containers to assume the master role and the other containers to assume the replica role. 

So, how do the containers determine who will be the master, and who will be the replica?

This is where the new StateSet mechanics come into play. The StateSet mechanics assign a unique ordinal value to each pod in the set.

The StatefulSets provided unique ordinal value always start with 0. During the initialization of the container, each container examines its assigned ordinal value. An ordinal value of 0 causes the container to assume the master role within the PostgreSQL cluster. For all other ordinal values, the container assumes a replica role. This is a very simple form of discovery made possible by the StatefulSet mechanics.

PostgreSQL replicas are configured to connect to the master database via a Service dedicated to the master database. In order to support this replication, the example creates a separate Service for each of the master role and the replica role. Once the replica has connected, the replica will begin replicating state from the master.  

During the container initialization, a master container will use a Service Account (pgset-sa) to change it’s container label value to match the master Service selector.  Changing the label is important to enable traffic destined to the master database to reach the correct container within the Stateful Set.  All other pods in the set assume the replica Service label by default.

Step 6 - Deployment Diagram

The example results in a deployment depicted below:
In this deployment, there is a Service for the master and a separate Service for the replica.  The replica is connected to the master and replication of state has started.

The Crunchy PostgreSQL container supports other forms of cluster deployment, the style of deployment is dictated by setting the PG_MODE environment variable for the container.  In the case of a StatefulSet deployment, that value is set to: PG_MODE=set

This environment variable is a hint to the container initialization logic as to the style of deployment we intend.

Step 7 - Testing the Example

The tests below assume that the psql client has been installed on the test system.  If if not, the psql client has been previously installed, it can be installed as follows:

sudo yum -y install postgresql

In addition, the tests below assume that the tested environment DNS resolves to the Kube DNS and that the tested environment DNS search path is specified to match the applicable Kube namespace and domain. The master service is named pgset-master and the replica service is named pgset-replica.

Test the master as follows (the password is password):

psql -h pgset-master -U postgres postgres -c 'table pg_stat_replication'

If things are working, the command above will return output indicating that a single replica is connecting to the master.

Next, test the replica as follows:

psql -h pgset-replica -U postgres postgres  -c 'create table foo (id int)'

The command above should fail as the replica is read-only within a PostgreSQL cluster.

Next, scale up the set as follows:

kubectl scale statefulset pgset --replicas=3

The command above should successfully create a new replica pod called pgset-2 as depicted below:

Step 8 - Persistence Explained

Take a look at the persisted PostgreSQL data files on the resulting NFS mount path:

$ ls -l /nfsfileshare/
total 12
drwx------ 20   26   26 4096 Jan 17 16:35 pgset-0
drwx------ 20   26   26 4096 Jan 17 16:35 pgset-1
drwx------ 20   26   26 4096 Jan 17 16:48 pgset-2

Each container in the stateful set binds to the single NFS Persistent Volume Claim (pgset-pvc) created in the example script.  

Since NFS and the PVC can be shared, each pod can write to this NFS path.  

The container is designed to create a subdirectory on that path using the pod host name for uniqueness.


StatefulSets is an exciting feature added to Kubernetes for container builders that are implementing clustering. The ordinal values assigned to the set provide a very simple mechanism to make clustering decisions when deploying a PostgreSQL cluster.  

--Jeff McCormick, Developer, Crunchy Data

Tuesday, February 21, 2017

Containers as a Service, the foundation for next generation PaaS

Today’s post is by Brendan Burns, Partner Architect, at Microsoft & Kubernetes co-founder.

Containers are revolutionizing the way that people build, package and deploy software. But what is often overlooked is how they are revolutionizing the way that people build the software that builds, packages and deploys software. (it’s ok if you have to read that sentence twice…) Today, and in a talk at Container World tomorrow, I’m taking a look at how container orchestrators like Kubernetes form the foundation for next generation platform as a service (PaaS). In particular, I’m interested in how cloud container as a service (CaaS) platforms like Azure Container Service, Google Container Engine and others are becoming the new infrastructure layer that PaaS is built upon.

To see this, it’s important to consider the set of services that have traditionally been provided by PaaS platforms:

  • Source code and executable packaging and distribution
  • Reliable, zero-downtime rollout of software versions
  • Healing, auto-scaling, load balancing

When you look at this list, it’s clear that most of these traditional “PaaS” roles have now been taken over by containers. The container image and container image build tooling has become the way to package up your application. Container registries have become the way to distribute your application across the world. Reliable software rollout is achieved using orchestrator concepts like Deployment in Kubernetes, and service healing, auto-scaling and load-balancing are all properties of an application deployed in Kubernetes using ReplicaSets and Services.

What then is left for PaaS? Is PaaS going to be replaced by container as a service? I think the answer is “no.” The piece that is left for PaaS is the part that was always the most important part of PaaS in the first place, and that’s the opinionated developer experience. In addition to all of the generic parts of PaaS that I listed above, the most important part of a PaaS has always been the way in which the developer experience and application framework made developers more productive within the boundaries of the platform. PaaS enables developers to go from source code on their laptop to a world-wide scalable service in less than an hour. That’s hugely powerful. 

However, in the world of traditional PaaS, the skills needed to build PaaS infrastructure itself, the software on which the user’s software ran, required very strong skills and experience with distributed systems. Consequently, PaaS tended to be built by distributed system engineers rather than experts in a particular vertical developer experience. This means that PaaS platforms tended towards general purpose infrastructure rather than targeting specific verticals. Recently, we have seen this start to change, first with PaaS targeted at mobile API backends, and later with PaaS targeting “function as a service”. However, these products were still built from the ground up on top of raw infrastructure.

More recently, we are starting to see these platforms build on top of container infrastructure. Taking for example “function as a service” there are at least two (and likely more) open source implementations of functions as a service that run on top of Kubernetes (fission and funktion). This trend will only continue. Building a platform as a service, on top of container as a service is easy enough that you could imagine giving it out as an undergraduate computer science assignment. This ease of development means that individual developers with specific expertise in a vertical (say software for running three-dimensional simulations) can and will build PaaS platforms targeted at that specific vertical experience. In turn, by targeting such a narrow experience, they will build an experience that fits that narrow vertical perfectly, making their solution a compelling one in that target market.

This then points to the other benefit of next generation PaaS being built on top of container as a service. It frees the developer from having to make an “all-in” choice on a particular PaaS platform. When layered on top of container as a service, the basic functionality (naming, discovery, packaging, etc) are all provided by the CaaS and thus common across multiple PaaS that happened to be deployed on top of that CaaS. This means that developers can mix and match, deploying multiple PaaS to the same container infrastructure, and choosing for each application the PaaS platform that best suits that particular platform. Also, importantly, they can choose to “drop down” to raw CaaS infrastructure if that is a better fit for their application. Freeing PaaS from providing the infrastructure layer, enables PaaS to diversify and target specific experiences without fear of being too narrow. The experiences become more targeted, more powerful, and yet by building on top of container as a service, more flexible as well.

Kubernetes is infrastructure for next generation applications, PaaS and more. Given this, I’m really excited by our announcement today that Kubernetes on Azure Container Service has reached general availability. When you deploy your next generation application to Azure, whether on a PaaS or deployed directly onto Kubernetes itself (or both) you can deploy it onto a managed, supported Kubernetes cluster.

Furthermore, because we know that the world of PaaS and software development in general is a hybrid one, we’re excited to announce the preview availability of Windows clusters in Azure Container Service. We’re also working on hybrid clusters in ACS-Engine and expect to roll those out to general availability in the coming months.

I’m thrilled to see how containers and container as a service is changing the world of compute, I’m confident that we’re only scratching the surface of the transformation we’ll see in the coming months and years.

--Brendan Burns, Partner Architect, at Microsoft and co-founder of Kubernetes

Friday, February 10, 2017

Inside's Shift to Kubernetes from OpenStack

Editor's note: Today’s post is by the Infrastructure Platform Department team at about their transition from OpenStack to Kubernetes. is one of China’s largest companies and the first Chinese Internet company to make the Global Fortune 500 list.

History of cluster building

The era of physical machines (2004-2014)

Before 2014, our company's applications were all deployed on the physical machine. In the age of physical machines, we needed to wait an average of one week for the allocation to application coming on-line. Due to the lack of isolation, applications would affected each other, resulting in a lot of potential risks. At that time, the average number of tomcat instances on each physical machine was no more than nine. The resource of physical machine was seriously wasted and the scheduling was inflexible. The time of application migration took hours due to the breakdown of physical machines. And the auto-scaling cannot be achieved. To enhance the efficiency of application deployment, we developed compilation-packaging, automatic deployment, log collection, resource monitoring and some other systems.

Containerized era (2014-2016)

The Infrastructure Platform Department (IPD) led by Liu Haifeng--Chief Architect of JD.COM, sought a new resolution in the fall of 2014. Docker ran into our horizon. At that time, docker had been rising, but was slightly weak and lacked of experience in production environment. We had repeatedly tested docker. In addition, docker was customized to fix a couple of issues, such as system crash caused by device mapper and some Linux kernel bugs. We also added plenty of new features into docker, including disk speed limit, capacity management, and layer merging in image building and so on.

To manage the container cluster properly, we chose the architecture of OpenStack + Novadocker driver. Containers are managed as virtual machines. It is known as the first generation of JD container engine platform--JDOS1.0 (JD Datacenter Operating System). The main purpose of JDOS 1.0 is to containerize the infrastructure. All applications run in containers rather than physical machines since then. As for the operation and maintenance of applications, we took full advantage of existing tools. The time for developers to request computing resources in production environment reduced to several minutes rather than a week. After the pooling of computing resources, even the scaling of 1,000 containers would be finished in seconds. Application instances had been isolated from each other. Both the average deployment density of applications and the physical machine utilization had increased by three times, which brought great economic benefits.

We deployed clusters in each IDC and provided unified global APIs to support deployment across the IDC. There are 10,000 compute nodes at most and 4,000 at least in a single OpenStack distributed container cluster in our production environment. The first generation of container engine platform (JDOS 1.0) successfully supported the “6.18” and “11.11” promotional activities in both 2015 and 2016. There are already 150,000 running containers online by November 2016.

“6.18” and “11.11” are known as the two most popular online promotion of JD.COM, similar to the black Friday promotions. Fulfilled orders in November 11, 2016 reached 30 million. 

In the practice of developing and promoting JDOS 1.0, applications were migrated directly from physical machines to containers. Essentially, JDOS 1.0 was an implementation of IaaS. Therefore, deployment of applications was still heavily dependent on compilation-packaging and automatic deployment tools. However, the practice of JDOS1.0 is very meaningful. Firstly, we successfully moved business into containers. Secondly, we have a deep understanding of container network and storage, and know how to polish them to the best. Finally, all the experiences lay a solid foundation for us to develop a brand new application container platform.

New container engine platform (JDOS 2.0)

Platform architecture

When JDOS 1.0 grew from 2,000 containers to 100,000, we launched a new container engine platform (JDOS 2.0). The goal of JDOS 2.0 is not just an infrastructure management platform, but also a container engine platform faced to applications. On the basic of JDOS 1.0 and Kubernetes, JDOS 2.0 integrates the storage and network of JDOS 1.0, gets through the process of CI/CD from the source to the image, and finally to the deployment. Also, JDOS 2.0 provides one-stop service such as log, monitor, troubleshooting, terminal and orchestration. The platform architecture of JDOS 2.0 is shown below.


Source Code Management
Container Tool
Container Networking
Container Engine
Image Registry
CI Tool
Log Management
Logstash + Elastic Search

In JDOS 2.0, we define two levels, system and application. A system consists of several applications and an application consists of several Pods which provide the same service. In general, a department can apply for one or more systems which directly corresponds to the namespace of Kubernetes. This means that the Pods of the same system will be in the same namespace.

Most of the JDOS 2.0 components (GitLab / Jenkins / Harbor / Logstash / Elastic Search / Prometheus) are also containerized and deployed on the Kubernetes platform.

One Stop Solution


  1. JDOS 2.0 takes docker image as the core to implement continuous integration and continuous deployment.
  2. Developer pushes code to git.
  3. Git triggers the jenkins master to generate build job.
  4. Jenkins master invokes Kubernetes to create jenkins slave Pod.
  5. Jenkins slave pulls the source code, compiles and packs.
  6. Jenkins slave sends the package and the Dockerfile to the image build node with docker.
  7. The image build node builds the image.
  8. The image build node pushes the image to the image registry Harbor.
  9. User creates or updates app Pods in different zone.

The docker image in JDOS 1.0 consisted primarily of the operating system and the runtime software stack of the application. So, the deployment of applications was still dependent on the auto-deployment and some other tools. While in JDOS 2.0, the deployment of the application is done during the image building. And the image contains the complete software stack, including App. With the image, we can achieve the goal of running applications as designed in any environment.


Networking and External Service Load Balancing

JDOS 2.0 takes the network solution of JDOS 1.0, which is implemented with the VLAN model of OpenStack Neutron. This solution enables highly efficient communication between containers, making it ideal for a cluster environment within a company. Each Pod occupies a port in Neutron, with a separate IP. Based on the Container Network Interface standard (CNI) standard, we have developed a new project Cane for integrating kubelet and Neutron.


At the same time, Cane is also responsible for the management of LoadBalancer in Kubernetes service. When a LoadBalancer is created / deleted / modified, Cane will call the creating / removing / modifying interface of the lbaas service in Neutron. In addition, the Hades component in the Cane project provides an internal DNS resolution service for the Pods.

The source code of the Cane project is currently being finished and will be released on GitHub soon.

Flexible Scheduling

D:\百度云同步盘\徐新坤-新人培训计划\docker\MAE\分享\schedule.pngJDOS 2.0 accesses applications, including big data, web applications, deep learning and some other types, and takes more diverse and flexible scheduling approaches. In some IDCs, we experimentally mixed deployment of online tasks and offline tasks. Compared to JDOS 1.0, overall resource utilization increased by about 30%.


The rich functionality of Kubernetes allows us to pay more attention to the entire ecosystem of the platform, such as network performance, rather than the platform itself. In particular, the SREs highly appreciated the functionality of replication controller. With it, the scaling of the applications is achieved in several seconds. JDOS 2.0 now has accessed about 20% of the applications, and deployed 2 clusters with about 20,000 Pods running daily. We plan to access more applications of our company, to replace the current JDOS 1.0. And we are also glad to share our experience in this process with the community.

Thank you to all the contributors of Kubernetes and the other open source projects.

--Infrastructure Platform Department team at

Wednesday, February 8, 2017

Run Deep Learning with PaddlePaddle on Kubernetes

Editor's note: Today's post is a joint post from the deep learning team at Baidu and the etcd team at CoreOS. 

What is PaddlePaddle

PaddlePaddle is an easy-to-use, efficient, flexible and scalable deep learning platform originally developed at Baidu for applying deep learning to Baidu products since 2014. 

There have been more than 50 innovations created using PaddlePaddle supporting 15 Baidu products ranging from the search engine, online advertising, to Q&A and system security. 

In September 2016, Baidu open sourced PaddlePaddle, and it soon attracted many contributors from outside of Baidu.

Why Run PaddlePaddle on Kubernetes

PaddlePaddle is designed to be slim and independent of computing infrastructure. Users can run it on top of Hadoop, Spark, Mesos, Kubernetes and others.. We have a strong interest with Kubernetes because of its flexibility, efficiency and rich features.

While we are applying PaddlePaddle in various Baidu products, we noticed two main kinds of PaddlePaddle usage -- research and product. Research data does not change often, and the focus is fast experiments to reach the expected scientific measurement. Products data changes often. It usually comes from log messages generated from the Web services.

A successful deep learning project includes both the research and the data processing pipeline. There are many parameters to be tuned. A lot of engineers work on the different parts of the project simultaneously.

To ensure the project is easy to manage and utilize hardware resource efficiently, we want to run all parts of the project on the same infrastructure platform.

The platform should provide:

  • fault-tolerance. It should abstract each stage of the pipeline as a service, which consists of many processes that provide high throughput and robustness through redundancy.
  • auto-scaling. In the daytime, there are usually many active users, the platform should scale out online services. While during nights, the platform should free some resources for deep learning experiments.
  • job packing and isolation. It should be able to assign a PaddlePaddle trainer process requiring the GPU, a web backend service requiring large memory, and a CephFS process requiring disk IOs to the same node to fully utilize its hardware.

What we want is a platform which runs the deep learning system, the Web server (e.g., Nginx), the log collector (e.g., fluentd), the distributed queue service (e.g., Kafka), the log joiner and other data processors written using Storm, Spark, and Hadoop MapReduce on the same cluster. We want to run all jobs -- online and offline, production and experiments -- on the same cluster, so we could make full utilization of the cluster, as different kinds of jobs require different hardware resource. 

We chose container based solutions since the overhead introduced by VMs is contradictory to our goal of efficiency and utilization. 

Based on our research of different container based solutions, Kubernetes fits our requirement the best.

Distributed Training on Kubernetes

PaddlePaddle supports distributed training natively. There are two roles in a PaddlePaddle cluster: parameter server and trainer. Each parameter server process maintains a shard of the global model. Each trainer has its local copy of the model, and uses its local data to update the model. During the training process, trainers send model updates to parameter servers, parameter servers are responsible for aggregating these updates, so that trainers can synchronize their local copy with the global model.

Figure 1: Model is partitioned into two shards. Managed by two parameter servers respectively. 

Some other approaches use a set of parameter servers to collectively hold a very large model in the CPU memory space on multiple hosts. But in practice, it is not often that we have such big models, because it would be very inefficient to handle very large model due to the limitation of GPU memory. In our configuration, multiple parameter servers are mostly for fast communications. Suppose there is only one parameter server process working with all trainers, the parameter server would have to aggregate gradients from all trainers and becomes a bottleneck. In our experience, an experimentally efficient configuration includes the same number of trainers and parameter servers. And we usually run a pair of trainer and parameter server on the same node. In the following Kubernetes job configuration, we start a job that runs N Pods, and in each Pod there are a parameter server and a trainer process.

apiVersion: batch/v1
kind: Job
 name: PaddlePaddle-cluster-job
 parallelism: 3
 completions: 3
     name: PaddlePaddle-cluster-job
     - name: jobpath
         path: /home/admin/efs
     - name: trainer
       image: your_repo/paddle:mypaddle
       command: ["bin/bash",  "-c", "/root/"]
       - name: JOB_NAME
         value: paddle-cluster-job
       - name: JOB_PATH
         value: /home/jobpath
       - name: JOB_NAMESPACE
         value: default
       - name: jobpath
         mountPath: /home/jobpath
     restartPolicy: Never

We can see from the config that parallelism, completions are both set to 3. So this job will simultaneously start up 3 PaddlePaddle pods, and this job will be finished when all 3 pods finishes.

Figure 2: Job A of three pods and Job B of one pod running on two nodes.

The entrypoint of each pod is It downloads data from a storage service, so that trainers can read quickly from the pod-local disk space. After downloading completes, it runs a Python script,, which starts a parameter server, waits until parameter servers of all pods are ready to serve, and then starts the trainer process in the pod.

This waiting is necessary because each trainer needs to talk to all parameter servers, as shown in Figure. 1. Kubernetes API enables trainers to check the status of pods, so the Python script could wait until all parameter servers’ status change to "running" before it triggers the training process.

Currently, the mapping from data shards to pods/trainers is static. If we are going to run N trainers, we would need to partition the data into N shards, and statically assign each shard to a trainer. Again we rely on the Kubernetes API to enlist pods in a job so could we index pods / trainers from 1 to N. The i-th trainer would read the i-th data shard.

Training data is usually served on a distributed filesystem. In practice we use CephFS on our on-premise clusters and Amazon Elastic File System on AWS. If you are interested in building a Kubernetes cluster to run distributed PaddlePaddle training jobs, please follow this tutorial.

What’s Next

We are working on running PaddlePaddle with Kubernetes more smoothly.

As you might notice the current trainer scheduling fully relies on Kubernetes based on a static partition map. This approach is simple to start, but might cause a few efficiency problems.

First, slow or dead trainers block the entire job. There is no controlled preemption or rescheduling after the initial deployment. Second, the resource allocation is static. So if Kubernetes has more available resources than we anticipated, we have to manually change the resource requirements. This is tedious work, and is not aligned with our efficiency and utilization goal.

To solve the problems mentioned above, we will add a PaddlePaddle master that understands Kubernetes API, can dynamically add/remove resource capacity, and dispatches shards to trainers in a more dynamic manner. The PaddlePaddle master uses etcd as a fault-tolerant storage of the dynamic mapping from shards to trainers. Thus, even if the master crashes, the mapping is not lost. Kubernetes can restart the master and the job will keep running. 

Another potential improvement is better PaddlePaddle job configuration. Our experience of having the same number of trainers and parameter servers was mostly collected from using special-purpose clusters. That strategy was observed performant on our clients' clusters that run only PaddlePaddle jobs. However, this strategy might not be optimal on general-purpose clusters that run many kinds of jobs.

PaddlePaddle trainers can utilize multiple GPUs to accelerate computations. GPU is not a first class resource in Kubernetes yet. We have to manage GPUs semi-manually. We would love to work with Kubernetes community to improve GPU support to ensure PaddlePaddle runs the best on Kubernetes. 

--Yi Wang, Baidu Research and Xiang Li, CoreOS