Kubernetes 中大量用到了证书, 比如 ca证书、以及 kubelet、apiserver、proxy、etcd等组件,还有 kubeconfig 文件。
如果证书过期,轻则无法登录 Kubernetes 集群,重则整个集群异常。
为了解决证书过期的问题,一般有以下几种方式:
本次主要介绍关于 Kubernetes 集群证书过期的监控,这里提供 3 种监控方案:
/etc/kubernetes/pki
和 /var/lib/kubelet
下的证书以及 kubeconfig 文件Blackbox Exporter 用于探测 HTTPS、HTTP、TCP、DNS、ICMP 和 grpc 等 Endpoint。在你定义 Endpoint 后,Blackbox Exporter 会生成指标,可以使用 Grafana 等工具进行可视化。Blackbox Exporter 最重要的功能之一是测量 Endpoint 的可用性。
当然, Blackbox Exporter 探测 HTTPS 后就可以获取到证书的相关信息, 就是利用这种方式实现对 Kubernetes apiserver 证书过期时间的监控.
调整 Blackbox Exporter 的配置, 增加 insecure_tls_verify: true
, 如下:
重启 blackbox exporter: kubectl rollout restart deploy ...
增加对 Kubernetes APIServer 内部端点https://kubernetes.default.svc.cluster.local/readyz的监控.
如果你没有使用 Prometheus Operator, 使用的是原生的 Prometheus, 则需要修改 Prometheus 配置文件的 configmap 或 secret, 添加 scrape config, 示例如下:
如果在使用 Prometheus Operator, 则可以增加如下 Probe CRD, Prometheus Operator 会自动将其转换并 merge 到 Prometheus 中.
apiVersion: monitoring.coreos.com/v1 kind: Probe metadata: name: kubernetes-apiserver spec: interval: 60s module: http_2xx prober: path: /probe url: monitor-prometheus-blackbox-exporter.default.svc.cluster.local:9115 targets: staticConfig: static: - https://kubernetes.default.svc.cluster.local/readyz
最后, 可以增加 Prometheus 告警 Rule, 这里就直接用 Prometheus Operator 创建 PrometheusRule CRD 做示例了, 示例如下:
apiVersion: monitoring.coreos.com/v1 kind: PrometheusRule metadata: name: prometheus-blackbox-exporter spec: groups: - name: prometheus-blackbox-exporter rules: - alert: BlackboxSslCertificateWillExpireSoon expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30 for: 0m labels: severity: warning - alert: BlackboxSslCertificateWillExpireSoon expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14 for: 0m labels: severity: critical - alert: BlackboxSslCertificateExpired annotations: description: |- SSL certificate has expired already VALUE = {{ $value }} LABELS = {{ $labels }} summary: SSL certificate expired (instance {{ $labels.instance }}) expr: probe_ssl_earliest_cert_expiry - time() <= 0 for: 0m labels: severity: emergency
这里可以参考我的文章:Prometheus Operator 与 kube-prometheus 之二 - 如何监控 1.23+ kubeadm 集群, 安装完成后, 开箱即用.
开箱即用内容包括:
这里用到的指标有:
apiserver_client_certificate_expiration_seconds_count
apiserver_client_certificate_expiration_seconds_bucket
kubelet_certificate_manager_client_expiration_renew_errors
kubelet_server_expiration_renew_errors
kubelet_certificate_manager_client_ttl_seconds
kubelet_certificate_manager_server_ttl_seconds
对应的 Prometheus 告警规则如下:
该 Exporter 是通过监控集群所有node的指定目录或 path 下的证书文件以及 kubeconfig 文件来获取证书信息.
如果是使用 kubeadm 搭建的 Kubernetes 集群, 则可以监控如下包含证书的文件和 kubeconfig:
watchFiles: - /var/lib/kubelet/pki/kubelet-client-current.pem - /etc/kubernetes/pki/apiserver.crt - /etc/kubernetes/pki/apiserver-etcd-client.crt - /etc/kubernetes/pki/apiserver-kubelet-client.crt - /etc/kubernetes/pki/ca.crt - /etc/kubernetes/pki/front-proxy-ca.crt - /etc/kubernetes/pki/front-proxy-client.crt - /etc/kubernetes/pki/etcd/ca.crt - /etc/kubernetes/pki/etcd/healthcheck-client.crt - /etc/kubernetes/pki/etcd/peer.crt - /etc/kubernetes/pki/etcd/server.crt watchKubeconfFiles: - /etc/kubernetes/admin.conf - /etc/kubernetes/controller-manager.conf - /etc/kubernetes/scheduler.conf
编辑 values.yaml:
kubeVersion: '' extraLabels: {} nameOverride: '' fullnameOverride: '' imagePullSecrets: [] image: registry: docker.io repository: enix/x509-certificate-exporter tag: pullPolicy: IfNotPresent psp: create: false rbac: create: true secretsExporter: serviceAccountName: serviceAccountAnnotations: {} clusterRoleAnnotations: {} clusterRoleBindingAnnotations: {} hostPathsExporter: serviceAccountName: serviceAccountAnnotations: {} clusterRoleAnnotations: {} clusterRoleBindingAnnotations: {} podExtraLabels: {} podAnnotations: {} exposePerCertificateErrorMetrics: false exposeRelativeMetrics: false metricLabelsFilterList: null secretsExporter: enabled: true debugMode: false replicas: 1 restartPolicy: Always strategy: {} resources: limits: cpu: 200m memory: 150Mi requests: cpu: 20m memory: 20Mi nodeSelector: {} tolerations: [] affinity: {} podExtraLabels: {} podAnnotations: {} podSecurityContext: {} securityContext: runAsUser: 65534 runAsGroup: 65534 readOnlyRootFilesystem: true capabilities: drop: - ALL secretTypes: - type: kubernetes.io/tls key: tls.crt includeNamespaces: [] excludeNamespaces: [] includeLabels: [] excludeLabels: [] cache: enabled: true maxDuration: 300 hostPathsExporter: debugMode: false restartPolicy: Always updateStrategy: {} resources: limits: cpu: 100m memory: 40Mi requests: cpu: 10m memory: 20Mi nodeSelector: {} tolerations: [] affinity: {} podExtraLabels: {} podAnnotations: {} podSecurityContext: {} securityContext: runAsUser: 0 runAsGroup: 0 readOnlyRootFilesystem: true capabilities: drop: - ALL watchDirectories: [] watchFiles: [] watchKubeconfFiles: [] daemonSets: cp: nodeSelector: node-role.kubernetes.io/master: '' tolerations: - effect: NoSchedule key: node-role.kubernetes.io/master operator: Exists watchFiles: - /var/lib/kubelet/pki/kubelet-client-current.pem - /etc/kubernetes/pki/apiserver.crt - /etc/kubernetes/pki/apiserver-etcd-client.crt - /etc/kubernetes/pki/apiserver-kubelet-client.crt - /etc/kubernetes/pki/ca.crt - /etc/kubernetes/pki/front-proxy-ca.crt - /etc/kubernetes/pki/front-proxy-client.crt - /etc/kubernetes/pki/etcd/ca.crt - /etc/kubernetes/pki/etcd/healthcheck-client.crt - /etc/kubernetes/pki/etcd/peer.crt - /etc/kubernetes/pki/etcd/server.crt watchKubeconfFiles: - /etc/kubernetes/admin.conf - /etc/kubernetes/controller-manager.conf - /etc/kubernetes/scheduler.conf nodes: watchFiles: - /var/lib/kubelet/pki/kubelet-client-current.pem - /etc/kubernetes/pki/ca.crt rbacProxy: enabled: false podListenPort: 9793 hostNetwork: false service: create: true port: 9793 annotations: {} extraLabels: {} prometheusServiceMonitor: create: true scrapeInterval: 60s scrapeTimeout: 30s extraLabels: {} relabelings: {} prometheusPodMonitor: create: false prometheusRules: create: true alertOnReadErrors: true readErrorsSeverity: warning alertOnCertificateErrors: true certificateErrorsSeverity: warning certificateRenewalsSeverity: warning certificateExpirationsSeverity: critical warningDaysLeft: 30 criticalDaysLeft: 14 extraLabels: {} alertExtraLabels: {} rulePrefix: '' disableBuiltinAlertGroup: false extraAlertGroups: [] extraDeploy: []
通过 Helm Chart 安装:
helm repo add enix https://charts.enix.io helm install x509-certificate-exporter enix/x509-certificate-exporter
通过这个 Helm Chart 也会自动安装:
其监控指标为:
x509_cert_not_after
该 Exporter 还提供了一个比较花哨的 Grafana Dashboard, 如下:
Alert Rules 如下:
为了监控 Kubernetes 集群的证书过期时间, 我们提供了 3 种方案, 各有优劣:
/etc/kubernetes/pki
和 /var/lib/kubelet
下的证书以及 kubeconfig 文件
可以根据您的实际情况灵活进行选择.
🎉🎉🎉
三人行, 必有我师; 知识共享, 天下为公. 本文由东风微鸣技术博客 EWhisper.cn 编写.