
Wednesday, August 31, 2016

Security Best Practices for Kubernetes Deployment

Editor’s note: today’s post is by Amir Jerbi and Michael Cherny of Aqua Security, describing security best practices for Kubernetes deployments, based on data they’ve collected from various use-cases seen in both on-premises and cloud deployments. 

Kubernetes provides many controls that can greatly improve your application security. Configuring them requires intimate knowledge of Kubernetes and of your deployment’s security requirements. The best practices we highlight here are aligned to the container lifecycle: build, ship and run, and are specifically tailored to Kubernetes deployments. We adopted these best practices in our own SaaS deployment that runs Kubernetes on Google Cloud Platform.

The following are our recommendations for deploying a secured Kubernetes application:

Ensure That Images Are Free of Vulnerabilities 
Running containers with vulnerabilities exposes your environment to the risk of being easily compromised. Many attacks can be mitigated simply by making sure that there are no software components with known vulnerabilities.

  • Implement Continuous Security Vulnerability Scanning -- Containers might include outdated packages with known vulnerabilities (CVEs). This cannot be a ‘one off’ process, as new vulnerabilities are published every day. An ongoing process, where images are continuously assessed, is crucial to ensure the required security posture. 
  • Regularly Apply Security Updates to Your Environment -- Once vulnerabilities are found in running containers, you should always update the source image and redeploy the containers. Try to avoid direct updates (e.g. running ‘apt-get update/upgrade’) inside running containers, as this breaks the image-container relationship. Upgrading containers is extremely easy with the Kubernetes rolling updates feature, which gradually updates a running application by upgrading its images to the latest version, as sketched below.
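For example, rolling out a rebuilt image with a Deployment might look like this (a sketch; the deployment name, registry and tag are illustrative):

kubectl set image deployment/my-app my-app=registry.example.com/my-app:1.2.1
kubectl rollout status deployment/my-app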

Ensure That Only Authorized Images are Used in Your Environment

Without a process that ensures that only images adhering to the organization’s policy are allowed to run, the organization is exposed to the risk of running vulnerable or even malicious containers. Downloading and running images from unknown sources is dangerous. It is equivalent to running software from an unknown vendor on a production server. Don’t do that.

Use private registries to store your approved images - make sure you only push approved images to these registries. This alone already narrows the playing field, reducing the number of potential images that enter your pipeline to a fraction of the hundreds of thousands of publicly available images. Build a CI pipeline that integrates security assessment (like vulnerability scanning), making it part of the build process.  

The CI pipeline should ensure that only vetted code (approved for production) is used for building the images. Once an image is built, it should be scanned for security vulnerabilities, and only if no issues are found should the image be pushed to the private registry from which production deployments are made. A failure in the security assessment should fail the pipeline, preventing images with poor security quality from being pushed to the image registry.
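A minimal sketch of such a gate in a CI script (the ‘scan-image’ command and registry stand in for whatever scanning tool and registry you use):

#!/bin/sh
set -e
IMAGE=registry.example.com/my-app:${BUILD_NUMBER}

# build the image from vetted code
docker build -t "$IMAGE" .
# fail the pipeline if the scanner reports known CVEs
scan-image "$IMAGE"
# only images that pass the scan reach the private registry used for deployment
docker push "$IMAGE"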

Work is in progress in Kubernetes on an image authorization plugin (expected in Kubernetes 1.4), which will allow preventing unauthorized images from being shipped. For more info see this pull request.

Limit Direct Access to Kubernetes Nodes
You should limit SSH access to Kubernetes nodes, reducing the risk of unauthorized access to host resources. Instead, you should ask users to use "kubectl exec", which provides direct access to the container environment without the ability to access the host.
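For example, to open a shell inside a running container without logging into the node (the pod and container names are illustrative):

kubectl exec -it my-pod -c my-container -- /bin/sh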

You can use Kubernetes Authorization Plugins to further control user access to resources. This allows defining fine-grained access control rules for specific namespaces, containers and operations.

Create Administrative Boundaries between Resources
Limiting the scope of user permissions can reduce the impact of mistakes or malicious activities. A Kubernetes namespace allows you to partition created resources into logically named groups. Resources created in one namespace can be hidden from other namespaces. By default, each resource created by a user in a Kubernetes cluster runs in the default namespace, called default. You can create additional namespaces and attach resources and users to them. You can use Kubernetes Authorization plugins to create policies that segregate access to namespace resources between different users.
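A namespace is created with a single kubectl command (shown here with the ‘fronto’ namespace used in the policy example that follows):

kubectl create namespace fronto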

For example: the following policy will allow ‘alice’ to read pods from namespace ‘fronto’.

{
 "apiVersion": "abac.authorization.kubernetes.io/v1beta1",
 "kind": "Policy",
 "spec": {
   "user": "alice",
   "namespace": "fronto",
   "resource": "pods",
   "readonly": true
 }
}
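For such ABAC policies to take effect, the API server must be started with ABAC authorization enabled (a sketch; the policy file path is illustrative and other flags are omitted):

kube-apiserver --authorization-mode=ABAC --authorization-policy-file=/etc/kubernetes/abac-policy.jsonl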

Define Resource Quota
Running resource-unbound containers puts your system at risk of DoS or “noisy neighbor” scenarios. To prevent and minimize those risks you should define resource quotas. By default, all resources in a Kubernetes cluster are created with unbounded CPU and memory requests/limits. You can create resource quota policies, attached to a Kubernetes namespace, in order to limit the CPU and memory a pod is allowed to consume.

The following is an example of a namespace resource quota definition that limits the number of pods in the namespace to 4, caps the total CPU requested by those pods at 1 CPU (with a total limit of 2 CPUs), and caps total memory requests at 1GiB (with a total limit of 2GiB).

compute-resources.yaml:


apiVersion: v1
kind: ResourceQuota
metadata:
 name: compute-resources
spec:
 hard:
   pods: "4"
   requests.cpu: "1"
   requests.memory: 1Gi
   limits.cpu: "2"
   limits.memory: 2Gi

Assign the resource quota to a namespace:



kubectl create -f ./compute-resources.yaml --namespace=myspace
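You can then verify the quota and its current usage in the namespace:

kubectl describe resourcequota compute-resources --namespace=myspace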

Implement Network Segmentation
Running different applications on the same Kubernetes cluster creates a risk of one compromised application attacking a neighboring application. Network segmentation is important to ensure that containers can communicate only with those they are supposed to. 
One of the challenges in Kubernetes deployments is creating network segmentation between pods, services and containers. This is a challenge due to the “dynamic” nature of container network identities (IPs), along with the fact that containers can communicate both inside the same node or between nodes.

Users of Google Cloud Platform can benefit from automatic firewall rules, preventing cross-cluster communication. A similar implementation can be deployed on-premises using network firewalls or SDN solutions. There is work being done in this area by the Kubernetes Network SIG, which will greatly improve the pod-to-pod communication policies. A new network policy API should address the need to create firewall rules around pods, limiting the network access that a containerized application can have.

The following is an example of a network policy that controls the network for “backend” pods, only allowing inbound network access from “frontend” pods:


POST /apis/net.alpha.kubernetes.io/v1alpha1/namespaces/tenant-a/networkpolicys
{
 "kind": "NetworkPolicy",
 "metadata": {
   "name": "pol1"
 },
 "spec": {
   "allowIncoming": {
     "from": [{
       "pods": { "segment": "frontend" }
     }],
     "toPorts": [{
       "port": 80,
       "protocol": "TCP"
     }]
   },
   "podSelector": {
     "segment": "backend"
   }
 }
}

Read more about Network policies here.

Apply Security Context to Your Pods and Containers
When designing your containers and pods, make sure that you configure the security context for your pods, containers and volumes. A security context is a property defined in the deployment yaml. It controls the security parameters that will be assigned to the pod/container/volume. Some of the important parameters are:


  • SecurityContext->runAsNonRoot -- Indicates that containers should run as a non-root user.
  • SecurityContext->Capabilities -- Controls the Linux capabilities assigned to the container.
  • SecurityContext->readOnlyRootFilesystem -- Controls whether a container will be able to write into the root filesystem.
  • PodSecurityContext->runAsNonRoot -- Prevents running a container with the ‘root’ user as part of the pod.


The following is an example of a pod definition with security context parameters:



apiVersion: v1
kind: Pod
metadata:
 name: hello-world
spec:
 containers:
 # specification of the pod’s containers
 # ...
 securityContext:
   readOnlyRootFilesystem: true
   runAsNonRoot: true

Reference here
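Note that readOnlyRootFilesystem and Capabilities are container-level settings; the following is a sketch of a per-container security context (the image name and capability choices are illustrative):

apiVersion: v1
kind: Pod
metadata:
 name: hello-world-hardened
spec:
 containers:
 - name: app
   image: myregistry/hello-world:1.0
   securityContext:
     runAsNonRoot: true
     readOnlyRootFilesystem: true
     capabilities:
       drop: ["ALL"]
       add: ["NET_BIND_SERVICE"]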

In case you are running containers with elevated privileges (--privileged) you should consider using the “DenyEscalatingExec” admission control. This control denies exec and attach commands to pods that run with escalated privileges that allow host access. This includes pods that run as privileged, have access to the host IPC namespace, and have access to the host PID namespace. For more details on admission controls, see the Kubernetes documentation.
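Admission controls are enabled through the API server’s admission control list; a sketch of adding DenyEscalatingExec (the other plugins listed are just one typical set):

kube-apiserver --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,ResourceQuota,DenyEscalatingExec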

Log Everything
Kubernetes supplies cluster-based logging, allowing you to log container activity to a central log hub. When a cluster is created, the standard output and standard error of each container can be ingested, using a Fluentd agent running on each node, into either Google Stackdriver Logging or into Elasticsearch and viewed with Kibana.
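Independently of the logging backend, a container’s output can also be inspected directly with kubectl (pod and container names are illustrative):

kubectl logs my-pod -c my-container
kubectl logs my-pod -c my-container --previous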

Summary
Kubernetes supplies many options to create a secured deployment. There is no one-size-fits-all solution that can be used everywhere, so a certain degree of familiarity with these options is required, as well as an understanding of how they can enhance your application’s security.

We recommend implementing the best practices highlighted in this blog post, and using Kubernetes’ flexible configuration capabilities to incorporate security processes into the continuous integration pipeline, automating the entire process with security seamlessly “baked in”.


--Michael Cherny, Head of Security Research, and Amir Jerbi, CTO and co-founder, Aqua Security


Monday, August 29, 2016

Scaling Stateful Applications using Kubernetes Pet Sets and FlexVolumes with Datera Elastic Data Fabric

Editor’s note: today’s guest post is by Shailesh Mittal, Software Architect, and Ashok Rajagopalan, Sr. Director of Product at Datera Inc, talking about stateful application provisioning with Kubernetes on Datera Elastic Data Fabric.

Introduction

Persistent volumes in Kubernetes are foundational as customers move beyond stateless workloads to run stateful applications. While Kubernetes has supported stateful applications such as MySQL, Kafka, Cassandra, and Couchbase for a while, the introduction of Pet Sets has significantly improved this support. In particular, Pet Sets sequence provisioning and startup, and give each instance a durable identity and durable storage, making it possible to automate the scaling of “Pets” (applications that require consistent handling and durable placement).

Datera, an elastic block storage system for cloud deployments, has seamlessly integrated with Kubernetes through the FlexVolume framework. Based on the first principles of containers, Datera allows application resource provisioning to be decoupled from the underlying physical infrastructure. This brings clean contracts (aka, no dependency on or direct knowledge of the underlying physical infrastructure), declarative formats, and eventually portability to stateful applications.

While Kubernetes allows for great flexibility to define the underlying application infrastructure through yaml configurations, Datera allows that configuration to be passed to the storage infrastructure to provide persistence. Through the notion of Datera AppTemplates, stateful applications in a Kubernetes environment can be automatically scaled. 



Deploying Persistent Storage

Persistent storage is defined using the Kubernetes PersistentVolume subsystem. PersistentVolumes are volume plugins that define volumes whose lifecycle is independent of the pod using them. They are implemented as NFS, iSCSI, or cloud-provider-specific storage systems. Datera has developed a volume plugin for PersistentVolumes that can provision iSCSI block storage on the Datera Data Fabric for Kubernetes pods.

The Datera volume plugin gets invoked by kubelets on minion nodes and relays the calls to the Datera Data Fabric over its REST API. Below is a sample deployment of a PersistentVolume with the Datera plugin:

 apiVersion: v1
 kind: PersistentVolume
 metadata:
   name: pv-datera-0
 spec:
   capacity:
     storage: 100Gi
   accessModes:
     - ReadWriteOnce
   persistentVolumeReclaimPolicy: Retain
   flexVolume:
     driver: "datera/iscsi"
     fsType: "xfs"
     options:
       volumeID: "kube-pv-datera-0"
       size: "100"
       replica: "3"
       backstoreServer: "tlx170.tlx.daterainc.com:7717"

This manifest defines a PersistentVolume of 100 GB to be provisioned in the Datera Data Fabric, should a pod request the persistent storage.
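The manifest can be created like any other Kubernetes object (the file name is illustrative):

[root@tlx241 /]# kubectl create -f pv-datera-0.yaml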


[root@tlx241 /]# kubectl get pv
NAME          CAPACITY   ACCESSMODES   STATUS      CLAIM     REASON    AGE
pv-datera-0   100Gi        RWO         Available                       8s
pv-datera-1   100Gi        RWO         Available                       2s
pv-datera-2   100Gi        RWO         Available                       7s
pv-datera-3   100Gi        RWO         Available                       4s

Configuration

The Datera PersistentVolume plugin is installed on all minion nodes. When a pod lands on a minion node with a valid claim bound to the persistent storage provisioned earlier, the Datera plugin forwards the request to create the volume on the Datera Data Fabric. All the options that are specified in the PersistentVolume manifest are sent to the plugin upon the provisioning request.

Once a volume is provisioned in the Datera Data Fabric, it is presented as an iSCSI block device to the minion node, and the kubelet mounts this device for the containers in the pod to access.

Using Persistent Storage

Kubernetes PersistentVolumes are consumed by pods through PersistentVolume Claims. Once a claim is defined, it is bound to a PersistentVolume matching the claim’s specification. A typical claim for the PersistentVolume defined above looks like this:


kind: PersistentVolumeClaim
apiVersion: v1
metadata:
 name: pv-claim-test-petset-0
spec:
 accessModes:
   - ReadWriteOnce
 resources:
   requests:
     storage: 100Gi

When this claim is created, it is bound to a PersistentVolume, and its resources can then be used by a pod specification:


[root@tlx241 /]# kubectl get pv
NAME          CAPACITY   ACCESSMODES   STATUS      CLAIM                            REASON    AGE
pv-datera-0   100Gi      RWO           Bound       default/pv-claim-test-petset-0             6m
pv-datera-1   100Gi      RWO           Bound       default/pv-claim-test-petset-1             6m
pv-datera-2   100Gi      RWO           Available                                              7s
pv-datera-3   100Gi      RWO           Available                                              4s

[root@tlx241 /]# kubectl get pvc
NAME                     STATUS    VOLUME        CAPACITY   ACCESSMODES   AGE
pv-claim-test-petset-0   Bound     pv-datera-0   0                        3m
pv-claim-test-petset-1   Bound     pv-datera-1   0                        3m

A pod can use a PersistentVolume Claim like below:

apiVersion: v1
kind: Pod
metadata:
 name: kube-pv-demo
spec:
 containers:
 - name: data-pv-demo
   image: nginx
   volumeMounts:
   - name: test-kube-pv1
     mountPath: /data
   ports:
   - containerPort: 80
 volumes:
 - name: test-kube-pv1
   persistentVolumeClaim:
     claimName: pv-claim-test-petset-0

The result is a pod using a PersistentVolume Claim as a volume, which in turn sends a request to the Datera volume plugin to provision storage in the Datera Data Fabric.


[root@tlx241 /]# kubectl describe pods kube-pv-demo
Name:       kube-pv-demo
Namespace:  default
Node:       tlx243/172.19.1.243
Start Time: Sun, 14 Aug 2016 19:17:31 -0700
Labels:     <none>
Status:     Running
IP:         10.40.0.3
Controllers: <none>
Containers:
 data-pv-demo:
   Image:   nginx
   Port:    80/TCP
   State:   Running
     Started:  Sun, 14 Aug 2016 19:17:34 -0700
   Ready:   True
   Restart Count:  0
   Environment Variables:  <none>
Conditions:
 Type           Status
 Initialized    True
 Ready          True
 PodScheduled   True
Volumes:
 test-kube-pv1:
   Type:  PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
   ClaimName:   pv-claim-test-petset-0
   ReadOnly:    false
 default-token-q3eva:
   Type:        Secret (a volume populated by a Secret)
   SecretName:  default-token-q3eva
   QoS Tier:  BestEffort
Events:
 FirstSeen LastSeen Count From SubobjectPath Type Reason Message
 --------- -------- ----- ---- ------------- -------- ------ -------
 43s 43s 1 {default-scheduler } Normal Scheduled Successfully assigned kube-pv-demo to tlx243
 42s 42s 1 {kubelet tlx243} spec.containers{data-pv-demo} Normal Pulling pulling image "nginx"
 40s 40s 1 {kubelet tlx243} spec.containers{data-pv-demo} Normal Pulled Successfully pulled image "nginx"
 40s 40s 1 {kubelet tlx243} spec.containers{data-pv-demo} Normal Created Created container with docker id ae2a50c25e03
 40s 40s 1 {kubelet tlx243} spec.containers{data-pv-demo} Normal Started Started container with docker id ae2a50c25e03

The persistent volume is presented as an iSCSI device on the minion node (tlx243 in this case):

[root@tlx243 ~]# lsscsi
[0:2:0:0]    disk    SMC      SMC2208          3.24  /dev/sda 
[11:0:0:0]   disk    DATERA   IBLOCK           4.0   /dev/sdb

[root@tlx243 datera~iscsi]# mount | grep sdb
/dev/sdb on /var/lib/kubelet/pods/6b99bd2a-628e-11e6-8463-0cc47ab41442/volumes/datera~iscsi/pv-datera-0 type xfs (rw,relatime,attr2,inode64,noquota)

Containers running in the pod see this device mounted at /data as specified in the manifest:

[root@tlx241 /]# kubectl exec kube-pv-demo -c data-pv-demo -it bash
root@kube-pv-demo:/# mount | grep data
/dev/sdb on /data type xfs (rw,relatime,attr2,inode64,noquota)

Using Pet Sets

Typically, pods are treated as stateless units, so if one of them is unhealthy or gets superseded, Kubernetes just disposes of it. In contrast, a PetSet is a group of stateful pods with a stronger notion of identity. The goal of a PetSet is to decouple application instances from the underlying physical infrastructure by assigning each instance an identity that is not anchored to any particular node.

A PetSet of n replicas consists of Pets with ordinals {0..n-1}. Each Pet has a deterministic name, PetSetName-Ordinal, and a unique identity. Each Pet has at most one pod, and each PetSet has at most one Pet with a given identity. A PetSet ensures that a specified number of “pets” with unique identities are running at any given time. The identity of a Pet comprises:
  • a stable hostname, available in DNS
  • an ordinal index
  • stable storage: linked to the ordinal & hostname
A typical PetSet definition using a PersistentVolume Claim looks like below:

# A headless service to create DNS records
apiVersion: v1
kind: Service
metadata:
 name: test-service
 labels:
   app: nginx
spec:
 ports:
 - port: 80
   name: web
 clusterIP: None
 selector:
   app: nginx
---
apiVersion: apps/v1alpha1
kind: PetSet
metadata:
 name: test-petset
spec:
 serviceName: "test-service"
 replicas: 2
 template:
   metadata:
     labels:
       app: nginx
     annotations:
   spec:
     terminationGracePeriodSeconds: 0
     containers:
     - name: nginx
       ports:
       - containerPort: 80
         name: web
       volumeMounts:
       - name: pv-claim
         mountPath: /data
 volumeClaimTemplates:
 - metadata:
     name: pv-claim
     annotations:
   spec:
     accessModes: [ "ReadWriteOnce" ]
     resources:
       requests:
         storage: 100Gi
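Through the headless service above, each Pet also gets a stable DNS name of the form <pet-name>.<service-name> (a sketch, assuming the default namespace and the cluster.local cluster domain):

[root@tlx241 /]# nslookup test-petset-0.test-service.default.svc.cluster.local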

We have the following PersistentVolume Claims available:

[root@tlx241 /]# kubectl get pvc
NAME                     STATUS    VOLUME        CAPACITY   ACCESSMODES   AGE
pv-claim-test-petset-0   Bound     pv-datera-0   0                        41m
pv-claim-test-petset-1   Bound     pv-datera-1   0                        41m
pv-claim-test-petset-2   Bound     pv-datera-2   0                        5s
pv-claim-test-petset-3   Bound     pv-datera-3   0                        2s

When this PetSet is provisioned, two pods get instantiated:

[root@tlx241 /]# kubectl get pods
NAMESPACE     NAME                        READY     STATUS    RESTARTS   AGE
default       test-petset-0               1/1       Running   0          7s
default       test-petset-1               1/1       Running   0          3s

Here is what the PetSet test-petset instantiated earlier looks like:


[root@tlx241 /]# kubectl describe petset test-petset
Name: test-petset
Namespace: default
Selector: app=nginx
Labels: app=nginx
Replicas: 2 current / 2 desired
Annotations: <none>
CreationTimestamp: Sun, 14 Aug 2016 19:46:30 -0700
Pods Status: 2 Running / 0 Waiting / 0 Succeeded / 0 Failed
No volumes.
No events.

Once a PetSet such as test-petset is instantiated, increasing the number of replicas (i.e. the number of pods started with that PetSet) instantiates more pods, and more PersistentVolume Claims get bound to the new pods:

[root@tlx241 /]# kubectl patch petset test-petset -p '{"spec":{"replicas":3}}'
"test-petset" patched

[root@tlx241 /]# kubectl describe petset test-petset
Name: test-petset
Namespace: default
Selector: app=nginx
Labels: app=nginx
Replicas: 3 current / 3 desired
Annotations: <none>
CreationTimestamp: Sun, 14 Aug 2016 19:46:30 -0700
Pods Status: 3 Running / 0 Waiting / 0 Succeeded / 0 Failed
No volumes.
No events.

[root@tlx241 /]# kubectl get pods
NAME                        READY     STATUS    RESTARTS   AGE
test-petset-0               1/1       Running   0          29m
test-petset-1               1/1       Running   0          28m
test-petset-2               1/1       Running   0          9s

After the patch is applied, the PetSet is running 3 pods.

When the above PetSet definition is patched to have one more replica, one more pod is introduced in the system, which in turn results in one more volume being provisioned on the Datera Data Fabric. So volumes are dynamically provisioned and attached to pods as the PetSet scales up.

To support durability and consistency, if a pod moves from one minion to another, the volume is attached (mounted) to the new minion node and detached (unmounted) from the old minion, maintaining persistent access to the data.

Conclusion

This demonstrates Kubernetes with Pet Sets orchestrating stateful and stateless workloads. While the Kubernetes community is working on expanding the FlexVolume framework’s capabilities, we are excited that this solution makes it possible for Kubernetes to be run more widely in datacenters. 

Join and contribute: Kubernetes Storage SIG.