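An Introduction to Pod Security Admission, the New Pod Security Mechanism in K8S

K8S Internals Series: Part 1

Once Kubernetes won the container orchestration war, K8S became the new-generation operating system of the cloud-native era. K8S has made everything simpler, while itself growing steadily more complex. The K8S Internals column covers many facets of the K8S ecosystem: the BoCloud (博云) container cloud R&D team will regularly share hot topics such as scheduling, security, networking, performance, storage, and application scenarios. We hope that, while enjoying the efficiency and convenience K8S brings, you can also appreciate how its internals work, dissecting them as deftly as Cook Ding carved his ox.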

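Pod Security Policy Overview

Pod Security Policy is a built-in admission controller that lets cluster administrators control the security specification of Pods. It covers many aspects of Pod security, for example forbidding containers from running as root and guarding against container escape. A Pod Security Policy defines a set of conditions that a Pod must satisfy, together with default values for the related fields; a Pod is created only if it meets these conditions. The fields of the PodSecurityPolicy object's spec correspond to the aspects that a Pod Security Policy can control.

AppArmor and seccomp are configured by adding annotations to the PodSecurityPolicy object: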

seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'docker/default'
seccomp.security.alpha.kubernetes.io/defaultProfileName: 'docker/default'
apparmor.security.beta.kubernetes.io/allowedProfileNames: 'runtime/default'
apparmor.security.beta.kubernetes.io/defaultProfileName: 'runtime/default'

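Pod Security Policy is a cluster-level resource. Its usage flow is as follows:

PSP usage flow

Using a PSP requires creating a ClusterRole/Role and binding it to a service account through a ClusterRoleBinding/RoleBinding. As a result, it is hard to see which PSPs are actually in use, and harder still to tell which security rules constrain the creation of a given Pod.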

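Why Pod Security Admission Was Introduced

Anyone who has used PodSecurityPolicy will have run into its problems: it has no dry-run or audit mode, it is awkward to enable and disable, and its behavior is not transparent. Because of these shortcomings, PodSecurityPolicy was deprecated in Kubernetes v1.21 and is scheduled for removal in v1.25, while Kubernetes v1.22 added the new Pod Security Admission feature.

Introduction to Pod Security Admission

Pod Security Admission is an admission controller built into Kubernetes. In Kubernetes v1.23 its feature gate is enabled by default; in v1.22 it must be enabled with the kube-apiserver flag --feature-gates="...,PodSecurity=true". On Kubernetes versions older than v1.22 you can install the Pod Security Admission Webhook yourself.

Pod Security Admission restricts the creation of Pods in a cluster by enforcing the built-in Pod Security Standards.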

1. Pod Security Standards

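To cover a broad range of security scenarios, the Pod Security Standards define three policies that range progressively from permissive to restrictive:

Privileged: an unrestricted policy that provides the widest possible level of permissions.
Baseline: a minimally restrictive policy that prevents known privilege escalations.
Restricted: a heavily restricted policy that follows current Pod hardening best practices.

For details, see Pod Security Standards (https://kubernetes.io/docs/concepts/security/pod-security-standards).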

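2. Enforcing Pod Security Standards

Once the Pod Security Admission feature gate is enabled in a Kubernetes cluster, Pod Security Standards are applied by setting labels on a namespace. Three modes are available:

enforce: Pods that violate the policy are rejected.
audit: violations are recorded as annotations in the audit log, but the Pod is allowed.
warn: violations trigger a user-facing warning, but the Pod is allowed.

The label format is as follows: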

# Sets the mode and the Pod Security Standards level.
# MODE must be one of `enforce`, `audit`, or `warn`.
# LEVEL must be one of `privileged`, `baseline`, or `restricted`.
pod-security.kubernetes.io/<MODE>: <LEVEL>

# Optional: pins the policy to the version of the standards shipped with a given release.
# MODE must be one of `enforce`, `audit`, or `warn`.
# VERSION must be a valid Kubernetes minor version (for example v1.23) or `latest`.
pod-security.kubernetes.io/<MODE>-version: <VERSION>

A namespace can set any combination of modes, and different modes can be assigned different Pod Security Standards levels.
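
To make the label format concrete, here is a minimal Go sketch of how such labels could be turned into a per-mode policy. The names modePolicy and parseLabels are hypothetical and are not taken from the Kubernetes source. The sketch assumes that unset modes fall back to privileged/latest, rejects unknown modes under the prefix as a simplification, and does not validate version strings.

```go
package main

import (
	"fmt"
	"strings"
)

const labelPrefix = "pod-security.kubernetes.io/"

// modePolicy is a hypothetical type holding the level and version of one mode.
type modePolicy struct {
	Level   string
	Version string
}

var validModes = map[string]bool{"enforce": true, "audit": true, "warn": true}
var validLevels = map[string]bool{"privileged": true, "baseline": true, "restricted": true}

// parseLabels extracts the per-mode policy from namespace labels.
// Modes without a label keep the assumed default of privileged/latest.
func parseLabels(labels map[string]string) (map[string]modePolicy, error) {
	policies := map[string]modePolicy{}
	for mode := range validModes {
		policies[mode] = modePolicy{Level: "privileged", Version: "latest"}
	}
	for key, value := range labels {
		if !strings.HasPrefix(key, labelPrefix) {
			continue // not a Pod Security label
		}
		name := strings.TrimPrefix(key, labelPrefix)
		mode := strings.TrimSuffix(name, "-version")
		if !validModes[mode] {
			return nil, fmt.Errorf("unknown mode in label %q", key)
		}
		p := policies[mode]
		if strings.HasSuffix(name, "-version") {
			p.Version = value // version strings are not validated in this sketch
		} else {
			if !validLevels[value] {
				return nil, fmt.Errorf("invalid level %q for mode %q", value, mode)
			}
			p.Level = value
		}
		policies[mode] = p
	}
	return policies, nil
}

func main() {
	policies, err := parseLabels(map[string]string{
		"pod-security.kubernetes.io/enforce":         "baseline",
		"pod-security.kubernetes.io/enforce-version": "v1.23",
		"pod-security.kubernetes.io/warn":            "restricted",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(policies["enforce"], policies["warn"], policies["audit"])
}
```

In this example enforce and warn use different levels, and audit keeps the default because it has no label.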

Defaults for Pod Security Admission can be set through the admission controller configuration file:

apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity
  configuration:
    apiVersion: pod-security.admission.config.k8s.io/v1beta1
    kind: PodSecurityConfiguration
    # Defaults applied when a mode label is not set.
    #
    # Level label values must be one of:
    # - "privileged" (default)
    # - "baseline"
    # - "restricted"
    #
    # Version label values must be one of:
    # - "latest" (default) 
    # - specific version like "v1.23"
    defaults:
      enforce: "privileged"
      enforce-version: "latest"
      audit: "privileged"
      audit-version: "latest"
      warn: "privileged"
      warn-version: "latest"
    exemptions:
      # Array of authenticated usernames to exempt.
      usernames: []
      # Array of runtime class names to exempt.
      runtimeClassNames: []
      # Array of namespaces to exempt.
      namespaces: []

Pod Security Admission can exempt Pods from the security standard checks along three dimensions: username, runtimeClassName, and namespace.
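
The exemption logic can be pictured with a short Go sketch. The exemptions struct mirrors the three lists in the configuration file above, while the request type and the isExempt function are hypothetical simplifications, not the actual admission code. The sketch assumes that a match in any one dimension is enough to skip the checks.

```go
package main

import "fmt"

// exemptions mirrors the three lists under "exemptions" in the configuration.
type exemptions struct {
	Usernames         []string
	RuntimeClassNames []string
	Namespaces        []string
}

// request is a hypothetical, simplified view of an admission request.
type request struct {
	Username         string
	Namespace        string
	RuntimeClassName string // empty when the Pod sets no runtime class
}

func contains(list []string, s string) bool {
	for _, item := range list {
		if item == s {
			return true
		}
	}
	return false
}

// isExempt reports whether any of the three dimensions exempts the request.
func isExempt(e exemptions, r request) bool {
	if contains(e.Namespaces, r.Namespace) {
		return true
	}
	if contains(e.Usernames, r.Username) {
		return true
	}
	return r.RuntimeClassName != "" && contains(e.RuntimeClassNames, r.RuntimeClassName)
}

func main() {
	e := exemptions{Namespaces: []string{"kube-system"}}
	fmt.Println(isExempt(e, request{Username: "alice", Namespace: "kube-system"}))
	fmt.Println(isExempt(e, request{Username: "alice", Namespace: "default"}))
}
```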

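3. Pod Security Standards in Practice

Environment: Kubernetes v1.23

Containers face many attack risks at run time, for example container escape and resource-exhaustion attacks launched from inside a container.

3.1 Baseline Policy

The Baseline policy is meant for common containerized applications and prevents known privilege escalations. According to the official documentation, it targets application operators and developers of non-critical applications. The policy includes:

Disallowing shared host namespaces, disallowing privileged containers, restricting Linux capabilities, disallowing hostPath volumes, restricting host ports, and constraining AppArmor, SELinux, seccomp, and sysctls settings, among others.

The following demonstrates how to set the Baseline policy.

Risks of violating the Baseline policy:

A privileged container can see host devices.
Mounting procfs exposes host processes and breaks process isolation.
Network isolation can be broken.
Mounting the container runtime socket allows unrestricted communication with the runtime.

Any of these risks can lead to container escape.

Create a namespace named my-baseline-namespace, with both the enforce and warn modes set to the Baseline level: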

apiVersion: v1
kind: Namespace
metadata:
  name: my-baseline-namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline  
    pod-security.kubernetes.io/enforce-version: v1.23
    pod-security.kubernetes.io/warn: baseline
    pod-security.kubernetes.io/warn-version: v1.23

Creating Pods

Create a Pod that violates the Baseline policy:

apiVersion: v1
kind: Pod
metadata:
  name: hostnamespaces2
  namespace: my-baseline-namespace
spec:
  containers:
  - image: bitnami/prometheus:2.33.5
    name: prometheus
    securityContext:
      allowPrivilegeEscalation: true
      privileged: true
      capabilities:
        drop:
        - ALL
  hostPID: true
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault

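Running kubectl apply reports that hostPID=true and securityContext.privileged=true are not allowed, and the Pod is rejected. A privileged container running with hostPID enabled is not isolated from host processes, which makes container escape easy: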

[root@localhost podSecurityStandard]# kubectl apply -f fail-hostnamespaces2.yaml
Error from server (Forbidden): error when creating "fail-hostnamespaces2.yaml": pods "hostnamespaces2" is forbidden: violates PodSecurity "baseline:v1.23": host namespaces (hostPID=true), privileged (container "prometheus" must not set securityContext.privileged=true)

Create a Pod that complies with the Baseline policy by setting hostPID=false and securityContext.privileged=false:

apiVersion: v1
kind: Pod
metadata:
  name: hostnamespaces2
  namespace: my-baseline-namespace
spec:
  containers:
  - image: bitnami/prometheus:2.33.5
    name: prometheus
    securityContext:
      allowPrivilegeEscalation: false
      privileged: false
      capabilities:
        drop:
        - ALL
  hostPID: false
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault

Run kubectl apply; the Pod is admitted:

[root@localhost podSecurityStandard]# kubectl apply -f pass-hostnamespaces2.yaml
pod/hostnamespaces2 created

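3.2 Restricted Policy

The Restricted policy enforces current best practices for hardening Pods. According to the official documentation, it targets operators and developers of security-critical applications, as well as lower-trust users. It includes everything in the Baseline policy and additionally: limits non-core volume types to those defined through PersistentVolumes, disallows privilege escalation (via set-user-ID or set-group-ID file modes), requires containers to run as non-root users, forbids containers from setting runAsUser to 0, and requires containers to drop ALL capabilities, allowing only NET_BIND_SERVICE to be added back.

The Restricted policy further limits the ability to gain root privileges inside a container and to use Linux kernel capabilities. For example, a man-in-the-middle attack on the Kubernetes network needs the Linux CAP_NET_RAW capability in order to send ARP packets.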
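
As an illustration of the capabilities rule, the following Go sketch checks that a container drops ALL and adds back nothing other than NET_BIND_SERVICE. The container type and the capabilityViolations function are hypothetical simplifications rather than the upstream check, and the wording of the second message is made up for this example.

```go
package main

import "fmt"

// container is a hypothetical, simplified stand-in for corev1.Container,
// keeping only the capability lists from its securityContext.
type container struct {
	Name string
	Drop []string
	Add  []string
}

// capabilityViolations returns one message per violated rule:
// the container must drop ALL and may only add NET_BIND_SERVICE.
func capabilityViolations(c container) []string {
	var out []string
	droppedAll := false
	for _, d := range c.Drop {
		if d == "ALL" {
			droppedAll = true
		}
	}
	if !droppedAll {
		out = append(out, fmt.Sprintf("container %q must set securityContext.capabilities.drop=[\"ALL\"]", c.Name))
	}
	for _, a := range c.Add {
		if a != "NET_BIND_SERVICE" {
			// Illustrative wording only.
			out = append(out, fmt.Sprintf("container %q must not add capability %s", c.Name, a))
		}
	}
	return out
}

func main() {
	bad := container{Name: "prometheus", Add: []string{"NET_RAW"}}
	good := container{Name: "prometheus", Drop: []string{"ALL"}, Add: []string{"NET_BIND_SERVICE"}}
	fmt.Println(capabilityViolations(bad))
	fmt.Println(len(capabilityViolations(good)))
}
```

A container that keeps NET_RAW, the capability needed for the ARP-based attack mentioned above, fails both rules in this sketch.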

Create a namespace named my-restricted-namespace, with both the enforce and warn modes set to the Restricted level:

apiVersion: v1
kind: Namespace
metadata:
  name: my-restricted-namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.23
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: v1.23

Creating Pods

Create a Pod that violates the Restricted policy:

apiVersion: v1
kind: Pod
metadata:
  name: runasnonroot0
  namespace: my-restricted-namespace
spec:
  containers:
  - image: bitnami/prometheus:2.33.5
    name: prometheus
    securityContext:
      allowPrivilegeEscalation: false
  securityContext:
    seccompProfile:
      type: RuntimeDefault

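Running kubectl apply reports that securityContext.runAsNonRoot=true and securityContext.capabilities.drop=["ALL"] must be set, and the Pod is rejected. A container running as root holds excessive privileges; combined with Linux capabilities that have not been dropped, this exposes the Kubernetes network to man-in-the-middle attacks: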

[root@localhost podSecurityStandard]# kubectl apply -f fail-runasnonroot0.yaml
Error from server (Forbidden): error when creating "fail-runasnonroot0.yaml": pods "runasnonroot0" is forbidden: violates PodSecurity "restricted:v1.23": unrestricted capabilities (container "prometheus" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "prometheus" must set securityContext.runAsNonRoot=true)

Create a Pod that complies with the Restricted policy by setting securityContext.runAsNonRoot=true and dropping all Linux capabilities:

apiVersion: v1
kind: Pod
metadata:
  name: runasnonroot0
  namespace: my-restricted-namespace
spec:
  containers:
  - image: bitnami/prometheus:2.33.5
    name: prometheus
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault

Run kubectl apply; the Pod is admitted:

[root@localhost podSecurityStandard]# kubectl apply -f pass-runasnonroot0.yaml
pod/runasnonroot0 created

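4. Current Limitations of Pod Security Admission

If your cluster already uses PodSecurityPolicy, migrating to Pod Security Admission takes some work.

First consider whether Pod Security Admission in its current form suits your cluster. It aims to cover the most common security needs out of the box, and it differs from PSP in the following ways:

Pod Security Admission only checks Pods against the security standards. It cannot modify Pods, so it cannot set default security settings for them.
Pod Security Admission supports only the three officially defined policies and does not allow custom ones. PSP rules therefore cannot always be migrated one-to-one, and each security rule needs to be assessed individually.
Unlike PSP, Pod Security Admission cannot be bound to specific users; it only supports exempting specific users, RuntimeClasses, or namespaces.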

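Pod Security Admission Source Code Analysis

Kubernetes admission controllers are plugins decoupled at the code level from the API server logic. They intercept requests to create, update, or delete objects before the objects are persisted to etcd, and run specific logic on them. A typical request to the API server flows as shown in the figure below:

API request processing flow diagram

1. Main Logic Flow

Pod Security Admission code flow diagram

The figure shows the main logic of Pod Security Admission. The admission controller first parses the intercepted request and then handles it according to the resource type:

Namespace: if the resource is a Namespace, the admission controller first parses the namespace's labels to obtain the configured policy level, mode, and pinned Pod Security Standards version. If the labels contain no Pod Security Standards information, the request is allowed. Otherwise the controller checks whether the request creates a new namespace or updates an existing one: on create it validates the configuration, and on update it evaluates whether the Pods in the namespace comply with the newly configured policy.
Pod: if the resource is a Pod, the admission controller first fetches the Pod Security Standards settings of the Pod's namespace. If the namespace sets no policy, the request is allowed; otherwise the controller evaluates whether the Pod complies with the policy.
Others: the admission controller first fetches the Pod Security Standards settings of the resource's namespace. If the namespace sets no policy, the request is allowed. Otherwise it checks whether the resource is one that embeds a PodSpec, such as PodTemplate, ReplicationController, ReplicaSet, Deployment, DaemonSet, StatefulSet, Job, or CronJob, extracts the PodSpec, and evaluates it against the policy.

2. Initializing Pod Security Admission

Like most Go programs, Pod Security Admission builds its start command with github.com/spf13/cobra. On startup the command calls runServer to initialize and start the webhook server. The Options argument carries default settings such as DefaultClientQPSLimit, DefaultClientQPSBurst, DefaultPort, and DefaultInsecurePort.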

// NewServerCommand creates a *cobra.Command object with default parameters.
func NewServerCommand() *cobra.Command {
 opts := options.NewOptions()

 cmdName := "podsecurity-webhook"
 if executable, err := os.Executable(); err == nil {
  cmdName = filepath.Base(executable)
 }
 cmd := &cobra.Command{
  Use: cmdName,
  Long: `The PodSecurity webhook is a standalone webhook server implementing the Pod
Security Standards.`,
  RunE: func(cmd *cobra.Command, _ []string) error {
   verflag.PrintAndExitIfRequested()
   // Initialize and start the webhook server.
   return runServer(cmd.Context(), opts)
  },
  Args: cobra.NoArgs,
 }
 opts.AddFlags(cmd.Flags())
 verflag.AddFlags(cmd.Flags())

 return cmd
}

The runServer function loads the admission controller configuration, initializes the server, and finally starts it.

func runServer(ctx context.Context, opts *options.Options) error {
 // Load the configuration.
 config, err := LoadConfig(opts)
 if err != nil {
  return err
 }
 // Initialize the server from the configuration.
 server, err := Setup(config)
 if err != nil {
  return err
 }
 
 ctx, cancel := context.WithCancel(ctx)
 defer cancel()
 go func() {
  stopCh := apiserver.SetupSignalHandler()
  <-stopCh
  cancel()
 }()
 // Start the server.
 return server.Start(ctx)
}

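Below is an excerpt of the main parts of the Setup function. Setup creates an Admission object that contains:

PodSecurityConfig: the admission controller configuration, including the default policy level and pinned Kubernetes version for each mode, as well as the exempted Usernames, RuntimeClasses, and Namespaces.
Evaluator: the evaluator, which defines the concrete checks for each Pod Security Standards policy.
Metrics: collects Prometheus metrics.
PodSpecExtractor: extracts the PodSpec from the request object.
PodLister: lists the Pods in a given namespace.
NamespaceGetter: fetches the namespace of the resource in the intercepted request.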

// Setup creates an Admission object to handle the admission logic.
func Setup(c *Config) (*Server, error) {
    ...
 s.delegate = &admission.Admission{
  Configuration:    c.PodSecurityConfig,
  Evaluator:        evaluator,
  Metrics:          metrics,
  PodSpecExtractor: admission.DefaultPodSpecExtractor{},
  PodLister:        admission.PodListerFromClient(client),
  NamespaceGetter:  admission.NamespaceGetterFromListerAndClient(namespaceLister, client),
 }
   ...
 return s, nil
}

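Once started, the admission controller server registers the HandleValidate method to handle admission validation. HandleValidate calls Validate to perform the actual Pod Security Standards checks.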

// HandleValidate handles the requests intercepted by the webhook.
func (s *Server) HandleValidate(w http.ResponseWriter, r *http.Request) {
 defer utilruntime.HandleCrash(func(_ interface{}) {
  // Assume the crash happened before the response was written.
  http.Error(w, "internal server error", http.StatusInternalServerError)
 })
     ...
 // Run the actual validation.
 response := s.delegate.Validate(ctx, attributes)
 response.UID = review.Request.UID // Response UID must match request UID
 review.Response = response
 writeResponse(w, review)
}

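3. Admission Validation Logic

The Validate method dispatches to a different validation method depending on the resource type in the request. All three branches below eventually call EvaluatePod to evaluate the Pod against the Pod Security Standards.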

// Validate admits an API request.
// The objects in admission attributes are expected to be external v1 objects that we care about.
// The returned response may be shared and must not be mutated.
func (a *Admission) Validate(ctx context.Context, attrs api.Attributes) *admissionv1.AdmissionResponse {
 var response *admissionv1.AdmissionResponse
 switch attrs.GetResource().GroupResource() {
 case namespacesResource:
  response = a.ValidateNamespace(ctx, attrs)
 case podsResource:
  response = a.ValidatePod(ctx, attrs)
 default:
  response = a.ValidatePodController(ctx, attrs)
 }
 return response
}

EvaluatePod looks at the policy level and version set on the namespace and selects the corresponding checks to run against the Pod.

func (r *checkRegistry) EvaluatePod(lv api.LevelVersion, podMetadata *metav1.ObjectMeta, podSpec *corev1.PodSpec) []CheckResult {
 // If the configured level is Privileged (the permissive policy), return immediately.
 if lv.Level == api.LevelPrivileged {
  return nil
 }
 // If the highest registered check version is older than the version set on the namespace, use the highest registered version.
 if r.maxVersion.Older(lv.Version) {
  lv.Version = r.maxVersion
 }

 var checks []CheckPodFn
 // If the configured level is Baseline, use the Baseline checks.
 if lv.Level == api.LevelBaseline {
  checks = r.baselineChecks[lv.Version]
 } else {
  // includes non-overridden baseline checks
  // Any other level runs the Restricted checks.
  checks = r.restrictedChecks[lv.Version]
 }

 var results []CheckResult
 // Run every check and collect the results.
 for _, check := range checks {
  results = append(results, check(podMetadata, podSpec))
 }
 return results
}
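
The version handling in EvaluatePod can be illustrated with a simplified registry. In this sketch versions are plain 1.x minor numbers and each check carries the minor version in which it first applies; the check and registry types, the function names, and the sample minimum versions are assumptions made for illustration, not the upstream data structures.

```go
package main

import "fmt"

// check is a hypothetical, simplified registration record.
type check struct {
	Name       string
	MinVersion int // first 1.x minor version in which the check applies
}

// latest stands in for the "latest" version label.
const latest = 1 << 30

// registry precomputes, for every minor version up to the highest
// registered one, the names of the checks that apply to it.
type registry struct {
	maxVersion int
	byVersion  map[int][]string
}

func newRegistry(checks []check) *registry {
	r := &registry{byVersion: map[int][]string{}}
	for _, c := range checks {
		if c.MinVersion > r.maxVersion {
			r.maxVersion = c.MinVersion
		}
	}
	for v := 0; v <= r.maxVersion; v++ {
		for _, c := range checks {
			if c.MinVersion <= v {
				r.byVersion[v] = append(r.byVersion[v], c.Name)
			}
		}
	}
	return r
}

// checksFor clamps the requested version to the highest registered
// version, mirroring the maxVersion comparison in EvaluatePod.
func (r *registry) checksFor(version int) []string {
	if version > r.maxVersion {
		version = r.maxVersion
	}
	return r.byVersion[version]
}

func main() {
	r := newRegistry([]check{
		{Name: "hostNamespaces", MinVersion: 0},
		{Name: "allowPrivilegeEscalation", MinVersion: 8},
		{Name: "capabilitiesRestricted", MinVersion: 22},
	})
	fmt.Println(r.checksFor(7))
	fmt.Println(r.checksFor(23))
	fmt.Println(r.checksFor(latest))
}
```

Pinning a namespace to an older version therefore runs fewer checks, while a version newer than anything registered, including latest, falls back to the full current set.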

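The following check shows how a Pod Security Standards check is implemented. It verifies that every container in the Pod has disabled allowPrivilegeEscalation. AllowPrivilegeEscalation controls whether a process in the container can gain more privileges than its parent process, and is usually set together with the non-root requirement (MustRunAsNonRoot).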

func allowPrivilegeEscalation_1_8(podMetadata *metav1.ObjectMeta, podSpec *corev1.PodSpec) CheckResult {
 var badContainers []string
 visitContainers(podSpec, func(container *corev1.Container) {
  // A container fails if its security context is unset, AllowPrivilegeEscalation is unset, or AllowPrivilegeEscalation is true.
  if container.SecurityContext == nil || container.SecurityContext.AllowPrivilegeEscalation == nil || *container.SecurityContext.AllowPrivilegeEscalation {
   badContainers = append(badContainers, container.Name)
  }
 })

 if len(badContainers) > 0 {
  // At least one container violates the policy, so return the details.
  return CheckResult{
   Allowed:         false,
   ForbiddenReason: "allowPrivilegeEscalation != false",
   ForbiddenDetail: fmt.Sprintf(
    "%s %s must set securityContext.allowPrivilegeEscalation=false",
    pluralize("container", "containers", len(badContainers)),
    joinQuote(badContainers),
   ),
  }
 }
 return CheckResult{Allowed: true}
}
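
To show how individual check results end up in the error printed by kubectl in section 3.1, here is a self-contained sketch with two simplified checks and an aggregation step. The pod, container, and checkResult types and all function names are hypothetical stand-ins; the message layout of "reason (detail)" joined by commas is inferred from the kubectl output shown earlier, not copied from the source.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical, simplified stand-ins for the Pod types.
type container struct {
	Name       string
	Privileged bool
}

type pod struct {
	HostPID    bool
	Containers []container
}

type checkResult struct {
	Allowed         bool
	ForbiddenReason string
	ForbiddenDetail string
}

func checkHostPID(p pod) checkResult {
	if p.HostPID {
		return checkResult{ForbiddenReason: "host namespaces", ForbiddenDetail: "hostPID=true"}
	}
	return checkResult{Allowed: true}
}

func checkPrivileged(p pod) checkResult {
	var bad []string
	for _, c := range p.Containers {
		if c.Privileged {
			bad = append(bad, fmt.Sprintf("%q", c.Name))
		}
	}
	if len(bad) == 0 {
		return checkResult{Allowed: true}
	}
	noun := "container"
	if len(bad) > 1 {
		noun = "containers"
	}
	return checkResult{
		ForbiddenReason: "privileged",
		ForbiddenDetail: fmt.Sprintf("%s %s must not set securityContext.privileged=true", noun, strings.Join(bad, ", ")),
	}
}

// forbiddenMessage joins every failed result as "reason (detail)".
func forbiddenMessage(results []checkResult) string {
	var parts []string
	for _, r := range results {
		if !r.Allowed {
			parts = append(parts, fmt.Sprintf("%s (%s)", r.ForbiddenReason, r.ForbiddenDetail))
		}
	}
	return strings.Join(parts, ", ")
}

func main() {
	p := pod{HostPID: true, Containers: []container{{Name: "prometheus", Privileged: true}}}
	fmt.Println(forbiddenMessage([]checkResult{checkHostPID(p), checkPrivileged(p)}))
}
```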

Summary

In Kubernetes v1.23, Pod Security Admission graduated to beta. Although its functionality is still limited, the feature is promising and is likely to play a growing role in the near future.