Applications fall into two categories: stateful and stateless. In Kubernetes, controllers such as Deployment, ReplicaSet, and DaemonSet are commonly used to manage stateless applications; in practice, however, many applications are themselves distributed clusters, and quite a few of them are stateful. This section looks at how such stateful applications are managed.
Stateless applications are managed by controllers such as ReplicaSet and DaemonSet, so what do we use to manage stateful ones? That is the job of the StatefulSet controller, introduced below.
When an application communicates with users, devices, other applications, or external components, it can be classified by whether it needs to record information from one or more previous exchanges in order to serve the next one: applications that must keep such records are called stateful, and those that need not are called stateless.
The ReplicaSet controller creates one or more Pod resources from a predefined Pod template. Apart from their hostnames and IP addresses, these Pods have no essential differences; even their names are generated with the same hash pattern, so they are highly similar to one another.
If the Pod template used by a ReplicaSet references a PVC (PersistentVolumeClaim), every Pod the controller creates will share that storage volume. When the PV behind the PVC is configured with the ReadOnlyMany or ReadWriteMany access mode, the containerized applications in those Pods all see the same data set once the volume is mounted.
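A minimal sketch of this shared-volume pattern, assuming a pre-existing PVC named shared-data bound to a PV that supports ReadWriteMany (all names here are hypothetical):

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: web-rs
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.21
        volumeMounts:
        - name: shared
          mountPath: /usr/share/nginx/html
      volumes:
      - name: shared
        persistentVolumeClaim:
          claimName: shared-data   # every replica mounts this same PVC

All three replicas read and write the same backing volume, which is exactly the behavior that per-instance data sets cannot tolerate.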
In most cases, though, each instance in a distributed, clustered application needs to store and work with a different data set, or to keep its own dedicated copy of the data. For example, each instance of the distributed storage system GlusterFS or the distributed document store MongoDB uses its own data set, while each instance of the distributed service framework ZooKeeper, or of Redis in a master-slave replication cluster, holds its own dedicated data replica.
Because the ReplicaSet controller generates Pod resources from a single template, it clearly cannot create a dedicated storage volume for each Pod, and assembling many ReplicaSet controllers that each manage just one Pod makes scaling inconvenient. Standalone (unmanaged) Pod resources, for their part, have no self-healing capability.
Second, beyond needing dedicated persistent volumes, the instances of some clustered, distributed applications also differ in role at run time and reference one another, unidirectionally or bidirectionally, by IP address or hostname; the references made by the slave nodes in a MySQL master-slave replication cluster are one example. Each such instance must be treated as an independent individual. Yet after a Pod under a ReplicaSet is rebuilt, both its name and its IP address may change, so ReplicaSet cannot satisfy this kind of requirement.
StatefulSet (stateful replica set) is therefore the controller type designed specifically to serve such applications: every Pod object it manages has a fixed hostname and a dedicated storage volume, and both remain unchanged even after the Pod is rebuilt.
StatefulSet is one implementation of a Pod controller. It is used to deploy the Pod resources of stateful applications, guaranteeing both the order of their operations and the uniqueness of each Pod resource.
It differs from ReplicaSet in that, although all of its Pod objects are built from the same spec configuration, a StatefulSet maintains a unique and fixed identifier for each Pod and, when necessary, creates a dedicated storage volume for it. StatefulSet is mainly suited to applications that depend on resources of the following kinds:
- stable, unique network identifiers;
- stable, persistent storage;
- ordered, graceful deployment and scaling;
- ordered rolling updates.
The definition of the StatefulSet resource can be examined with kubectl explain:
[root@mh-k8s-master-247-10 ~]# kubectl explain statefulset
KIND:     StatefulSet
VERSION:  apps/v1

DESCRIPTION:
     StatefulSet represents a set of pods with consistent identities.
     Identities are defined as:
      - Network: A single stable DNS and hostname.
      - Storage: As many VolumeClaims as requested.
     The StatefulSet guarantees that a given network identity will always map
     to the same storage identity.

FIELDS:
   apiVersion   <string>
     APIVersion defines the versioned schema of this representation of an
     object. Servers should convert recognized schemas to the latest internal
     value, and may reject unrecognized values. More info:
     https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources

   kind <string>
     Kind is a string value representing the REST resource this object
     represents. Servers may infer this from the endpoint the client submits
     requests to. Cannot be updated. In CamelCase. More info:
     https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds

   metadata     <Object>

   spec <Object>
     Spec defines the desired identities of pods in this set.

   status       <Object>
     Status is the current status of Pods in this StatefulSet. This data may
     be out of date by some window of time.
The .spec field of a StatefulSet is defined as follows:
[root@mh-k8s-master-247-10 ~]# kubectl explain statefulset.spec
KIND:     StatefulSet
VERSION:  apps/v1

RESOURCE: spec <Object>

DESCRIPTION:
     Spec defines the desired identities of pods in this set.

     A StatefulSetSpec is the specification of a StatefulSet.

FIELDS:
   podManagementPolicy  <string>
     podManagementPolicy controls how pods are created during initial scale
     up, when replacing pods on nodes, or when scaling down. The default
     policy is `OrderedReady`, where pods are created in increasing order
     (pod-0, then pod-1, etc) and the controller will wait until each pod is
     ready before continuing. When scaling down, the pods are removed in the
     opposite order. The alternative policy is `Parallel` which will create
     pods in parallel to match the desired scale without waiting, and on scale
     down will delete all pods at once.

   replicas     <integer>
     replicas is the desired number of replicas of the given Template. These
     are replicas in the sense that they are instantiations of the same
     Template, but individual replicas also have a consistent identity. If
     unspecified, defaults to 1.

   revisionHistoryLimit <integer>
     revisionHistoryLimit is the maximum number of revisions that will be
     maintained in the StatefulSet's revision history. The revision history
     consists of all revisions not represented by a currently applied
     StatefulSetSpec version. The default value is 10.

   selector     <Object> -required-
     selector is a label query over pods that should match the replica count.
     It must match the pod template's labels. More info:
     https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors

   serviceName  <string> -required-
     serviceName is the name of the service that governs this StatefulSet.
     This service must exist before the StatefulSet, and is responsible for
     the network identity of the set. Pods get DNS/hostnames that follow the
     pattern: pod-specific-string.serviceName.default.svc.cluster.local where
     "pod-specific-string" is managed by the StatefulSet controller.

   template     <Object> -required-
     template is the object that describes the pod that will be created if
     insufficient replicas are detected. Each pod stamped out by the
     StatefulSet will fulfill this Template, but have a unique identity from
     the rest of the StatefulSet.

   updateStrategy       <Object>
     updateStrategy indicates the StatefulSetUpdateStrategy that will be
     employed to update Pods in the StatefulSet when a revision is made to
     Template.

   volumeClaimTemplates <[]Object>
     volumeClaimTemplates is a list of claims that pods are allowed to
     reference. The StatefulSet controller is responsible for mapping network
     identities to claims in a way that maintains the identity of a pod. Every
     claim in this list must have at least one matching (by name) volumeMount
     in one container in the template. A claim in this list takes precedence
     over any volumes in the template, with the same name.
A typical, fully usable StatefulSet is normally made up of three components: a Headless Service, the StatefulSet object itself, and a volumeClaimTemplate. The Headless Service gives the Pods fixed identifiers and DNS records, and the volumeClaimTemplate provides each Pod with its own dedicated, persistent storage, as the sketch below shows.
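A minimal sketch that puts the three components together; the names myapp and myapp-svc (and the nginx image) are hypothetical placeholders, not objects taken from the cluster shown elsewhere in this section:

apiVersion: v1
kind: Service
metadata:
  name: myapp-svc
  labels:
    app: myapp
spec:
  clusterIP: None          # headless: no cluster IP, only DNS records for the Pods
  ports:
  - port: 80
    name: web
  selector:
    app: myapp
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: myapp
spec:
  serviceName: myapp-svc   # must reference the headless Service above
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: nginx:1.21
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: myappdata
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:    # one dedicated PVC per Pod: myappdata-myapp-0, myappdata-myapp-1, ...
  - metadata:
      name: myappdata
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 2Gi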
Since version 1.7, Kubernetes has also let users choose the update strategy: it remains compatible with the delete-then-update (OnDelete) strategy of earlier versions and adds the new rolling-update (RollingUpdate) strategy.
[root@mh-k8s-master-247-10 ~]# kubectl explain statefulset.spec.updateStrategy.rollingUpdate
KIND:     StatefulSet
VERSION:  apps/v1

RESOURCE: rollingUpdate <Object>

DESCRIPTION:
     RollingUpdate is used to communicate parameters when Type is
     RollingUpdateStatefulSetStrategyType.

     RollingUpdateStatefulSetStrategy is used to communicate parameter for
     RollingUpdateStatefulSetStrategyType.

FIELDS:
   partition    <integer>
     Partition indicates the ordinal at which the StatefulSet should be
     partitioned. Default value is 0.
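The partition field makes staged (canary-style) rollouts possible. As a hedged sketch on the hypothetical myapp StatefulSet above, the following keeps Pods with an ordinal lower than 1 on the old revision while Pods with ordinal 1 and above are rolled to the new Pod template:

spec:
  updateStrategy:
    type: RollingUpdate      # the other accepted value is OnDelete
    rollingUpdate:
      partition: 1           # only Pods with ordinal >= 1 are updated

The prometheus-k8s StatefulSet below, created by the Prometheus Operator in the kubesphere-monitoring-system namespace, shows what these fields look like on a live object: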
[root@mh-k8s-master-247-10 ~]# kubectl get statefulsets/prometheus-k8s -n kubesphere-monitoring-system
NAME             READY   AGE
prometheus-k8s   2/2     75d
[root@mh-k8s-master-247-10 ~]# kubectl get statefulsets/prometheus-k8s -n kubesphere-monitoring-system -o yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    prometheus-operator-input-hash: "11602265415068396751"
  creationTimestamp: "2022-04-12T05:25:11Z"
  generation: 1
  labels:
    prometheus: k8s
  managedFields:
  - apiVersion: apps/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:prometheus-operator-input-hash: {}
        f:labels:
          .: {}
          f:prometheus: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"d1762bda-7b94-4554-b369-c2335bf4a692"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        f:podManagementPolicy: {}
        f:replicas: {}
        f:revisionHistoryLimit: {}
        f:selector:
          f:matchLabels:
            .: {}
            f:app: {}
            f:prometheus: {}
        f:serviceName: {}
        f:template:
          f:metadata:
            f:labels:
              .: {}
              f:app: {}
              f:prometheus: {}
          f:spec:
            f:affinity:
              .: {}
              f:nodeAffinity:
                .: {}
                f:preferredDuringSchedulingIgnoredDuringExecution: {}
              f:podAntiAffinity:
                .: {}
                f:preferredDuringSchedulingIgnoredDuringExecution: {}
            f:containers:
              k:{"name":"prometheus"}:
                .: {}
                f:args: {}
                f:image: {}
                f:imagePullPolicy: {}
                f:livenessProbe:
                  .: {}
                  f:failureThreshold: {}
                  f:httpGet:
                    .: {}
                    f:path: {}
                    f:port: {}
                    f:scheme: {}
                  f:periodSeconds: {}
                  f:successThreshold: {}
                  f:timeoutSeconds: {}
                f:name: {}
                f:ports:
                  .: {}
                  k:{"containerPort":9090,"protocol":"TCP"}:
                    .: {}
                    f:containerPort: {}
                    f:name: {}
                    f:protocol: {}
                f:readinessProbe:
                  .: {}
                  f:failureThreshold: {}
                  f:httpGet:
                    .: {}
                    f:path: {}
                    f:port: {}
                    f:scheme: {}
                  f:periodSeconds: {}
                  f:successThreshold: {}
                  f:timeoutSeconds: {}
                f:resources:
                  .: {}
                  f:limits:
                    .: {}
                    f:cpu: {}
                    f:memory: {}
                  f:requests:
                    .: {}
                    f:cpu: {}
                    f:memory: {}
                f:terminationMessagePath: {}
                f:terminationMessagePolicy: {}
                f:volumeMounts:
                  .: {}
                  k:{"mountPath":"/etc/prometheus/certs"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                    f:readOnly: {}
                  k:{"mountPath":"/etc/prometheus/config_out"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                    f:readOnly: {}
                  k:{"mountPath":"/etc/prometheus/rules/prometheus-k8s-rulefiles-0"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                  k:{"mountPath":"/etc/prometheus/secrets/kube-etcd-client-certs"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                    f:readOnly: {}
                  k:{"mountPath":"/prometheus"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                    f:subPath: {}
              k:{"name":"prometheus-config-reloader"}:
                .: {}
                f:args: {}
                f:command: {}
                f:env:
                  .: {}
                  k:{"name":"POD_NAME"}:
                    .: {}
                    f:name: {}
                    f:valueFrom:
                      .: {}
                      f:fieldRef:
                        .: {}
                        f:apiVersion: {}
                        f:fieldPath: {}
                f:image: {}
                f:imagePullPolicy: {}
                f:name: {}
                f:resources:
                  .: {}
                  f:limits:
                    .: {}
                    f:memory: {}
                  f:requests:
                    .: {}
                    f:memory: {}
                f:terminationMessagePath: {}
                f:terminationMessagePolicy: {}
                f:volumeMounts:
                  .: {}
                  k:{"mountPath":"/etc/prometheus/config"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                  k:{"mountPath":"/etc/prometheus/config_out"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
              k:{"name":"rules-configmap-reloader"}:
                .: {}
                f:args: {}
                f:image: {}
                f:imagePullPolicy: {}
                f:name: {}
                f:resources:
                  .: {}
                  f:limits:
                    .: {}
                    f:memory: {}
                  f:requests:
                    .: {}
                    f:memory: {}
                f:terminationMessagePath: {}
                f:terminationMessagePolicy: {}
                f:volumeMounts:
                  .: {}
                  k:{"mountPath":"/etc/prometheus/rules/prometheus-k8s-rulefiles-0"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
            f:dnsPolicy: {}
            f:nodeSelector:
              .: {}
              f:kubernetes.io/os: {}
            f:restartPolicy: {}
            f:schedulerName: {}
            f:securityContext:
              .: {}
              f:fsGroup: {}
              f:runAsNonRoot: {}
              f:runAsUser: {}
            f:serviceAccount: {}
            f:serviceAccountName: {}
            f:terminationGracePeriodSeconds: {}
            f:tolerations: {}
            f:volumes:
              .: {}
              k:{"name":"config"}:
                .: {}
                f:name: {}
                f:secret:
                  .: {}
                  f:defaultMode: {}
                  f:secretName: {}
              k:{"name":"config-out"}:
                .: {}
                f:emptyDir: {}
                f:name: {}
              k:{"name":"prometheus-k8s-rulefiles-0"}:
                .: {}
                f:configMap:
                  .: {}
                  f:defaultMode: {}
                  f:name: {}
                f:name: {}
              k:{"name":"secret-kube-etcd-client-certs"}:
                .: {}
                f:name: {}
                f:secret:
                  .: {}
                  f:defaultMode: {}
                  f:secretName: {}
              k:{"name":"tls-assets"}:
                .: {}
                f:name: {}
                f:secret:
                  .: {}
                  f:defaultMode: {}
                  f:secretName: {}
        f:updateStrategy:
          f:type: {}
        f:volumeClaimTemplates: {}
      f:status:
        f:replicas: {}
    manager: operator
    operation: Update
    time: "2022-04-12T05:25:12Z"
  - apiVersion: apps/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:readyReplicas: {}
    manager: kube-controller-manager
    operation: Update
    time: "2022-04-12T05:27:18Z"
  name: prometheus-k8s
  namespace: kubesphere-monitoring-system
  ownerReferences:
  - apiVersion: monitoring.coreos.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: Prometheus
    name: k8s
    uid: d1762bda-7b94-4554-b369-c2335bf4a692
  resourceVersion: "9637"
  selfLink: /apis/apps/v1/namespaces/kubesphere-monitoring-system/statefulsets/prometheus-k8s
  uid: 4c1b145b-5bce-4ee7-bbb1-ec4c7ac0ba5f
spec:
  podManagementPolicy: Parallel
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: prometheus
      prometheus: k8s
  serviceName: prometheus-operated
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: prometheus
        prometheus: k8s
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: node-role.kubernetes.io/monitoring
                operator: Exists
            weight: 100
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: prometheus
                  operator: In
                  values:
                  - k8s
              namespaces:
              - kubesphere-monitoring-system
              topologyKey: kubernetes.io/hostname
            weight: 100
      containers:
      - args:
        - --web.console.templates=/etc/prometheus/consoles
        - --web.console.libraries=/etc/prometheus/console_libraries
        - --config.file=/etc/prometheus/config_out/prometheus.env.yaml
        - --storage.tsdb.path=/prometheus
        - --storage.tsdb.retention.time=7d
        - --web.enable-lifecycle
        - --storage.tsdb.no-lockfile
        - --query.max-concurrency=1000
        - --web.route-prefix=/
        image: registry.cn-beijing.aliyuncs.com/kubesphereio/prometheus:v2.26.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /-/healthy
            port: web
            scheme: HTTP
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 3
        name: prometheus
        ports:
        - containerPort: 9090
          name: web
          protocol: TCP
        readinessProbe:
          failureThreshold: 120
          httpGet:
            path: /-/ready
            port: web
            scheme: HTTP
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 3
        resources:
          limits:
            cpu: "4"
            memory: 16Gi
          requests:
            cpu: 200m
            memory: 400Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/prometheus/config_out
          name: config-out
          readOnly: true
        - mountPath: /etc/prometheus/certs
          name: tls-assets
          readOnly: true
        - mountPath: /prometheus
          name: prometheus-k8s-db
          subPath: prometheus-db
        - mountPath: /etc/prometheus/rules/prometheus-k8s-rulefiles-0
          name: prometheus-k8s-rulefiles-0
        - mountPath: /etc/prometheus/secrets/kube-etcd-client-certs
          name: secret-kube-etcd-client-certs
          readOnly: true
      - args:
        - --log-format=logfmt
        - --reload-url=http://localhost:9090/-/reload
        - --config-file=/etc/prometheus/config/prometheus.yaml.gz
        - --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
        command:
        - /bin/prometheus-config-reloader
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: registry.cn-beijing.aliyuncs.com/kubesphereio/prometheus-config-reloader:v0.42.1
        imagePullPolicy: IfNotPresent
        name: prometheus-config-reloader
        resources:
          limits:
            memory: 25Mi
          requests:
            memory: 25Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/prometheus/config
          name: config
        - mountPath: /etc/prometheus/config_out
          name: config-out
      - args:
        - --webhook-url=http://localhost:9090/-/reload
        - --volume-dir=/etc/prometheus/rules/prometheus-k8s-rulefiles-0
        image: registry.cn-beijing.aliyuncs.com/kubesphereio/configmap-reload:v0.3.0
        imagePullPolicy: IfNotPresent
        name: rules-configmap-reloader
        resources:
          limits:
            memory: 25Mi
          requests:
            memory: 25Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/prometheus/rules/prometheus-k8s-rulefiles-0
          name: prometheus-k8s-rulefiles-0
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 0
        runAsNonRoot: false
        runAsUser: 0
      serviceAccount: prometheus-k8s
      serviceAccountName: prometheus-k8s
      terminationGracePeriodSeconds: 600
      tolerations:
      - effect: NoSchedule
        key: dedicated
        operator: Equal
        value: monitoring
      volumes:
      - name: config
        secret:
          defaultMode: 420
          secretName: prometheus-k8s
      - name: tls-assets
        secret:
          defaultMode: 420
          secretName: prometheus-k8s-tls-assets
      - emptyDir: {}
        name: config-out
      - configMap:
          defaultMode: 420
          name: prometheus-k8s-rulefiles-0
        name: prometheus-k8s-rulefiles-0
      - name: secret-kube-etcd-client-certs
        secret:
          defaultMode: 420
          secretName: kube-etcd-client-certs
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: prometheus-k8s-db
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
      volumeMode: Filesystem
    status:
      phase: Pending
status:
  collisionCount: 0
  currentReplicas: 2
  currentRevision: prometheus-k8s-957d4c968
  observedGeneration: 1
  readyReplicas: 2
  replicas: 2
  updateRevision: prometheus-k8s-957d4c968
  updatedReplicas: 2
[root@mh-k8s-master-247-10 ~]#
When defining a StatefulSet resource, the fields that must be nested under spec are serviceName and template, which identify the governing Headless Service and the Pod template to use (in apps/v1 the selector field is required as well and must match the Pod template's labels, as the explain output above shows). The volumeClaimTemplates field supplies the PVC template used to create a dedicated storage volume for each Pod; the fields that can be nested inside it are exactly the usable fields of a persistentVolumeClaim resource, and the field itself is optional for a StatefulSet.
By default, the StatefulSet controller creates the Pod replicas serially. To create them in parallel instead, set the .spec.podManagementPolicy field to "Parallel"; its default value is "OrderedReady". The field is defined as follows:
[root@mh-k8s-master-247-10 ~]# kubectl explain statefulset.spec.podManagementPolicy
KIND:     StatefulSet
VERSION:  apps/v1

FIELD:    podManagementPolicy <string>

DESCRIPTION:
     podManagementPolicy controls how pods are created during initial scale
     up, when replacing pods on nodes, or when scaling down. The default
     policy is `OrderedReady`, where pods are created in increasing order
     (pod-0, then pod-1, etc) and the controller will wait until each pod is
     ready before continuing. When scaling down, the pods are removed in the
     opposite order. The alternative policy is `Parallel` which will create
     pods in parallel to match the desired scale without waiting, and on scale
     down will delete all pods at once.
[root@mh-k8s-master-247-10 ~]#
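The prometheus-k8s StatefulSet shown earlier uses exactly this setting. A minimal sketch on the hypothetical myapp StatefulSet, showing only the relevant field:

spec:
  podManagementPolicy: Parallel   # create and delete Pods in parallel; the default is OrderedReady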
Pod resources created by the StatefulSet controller have fixed, unique identifiers and dedicated storage volumes: even after being rescheduled or terminated and rebuilt, a Pod's name stays the same, and neither its previous storage volume nor the data on it is lost.
Pods created by the StatefulSet controller carry fixed, unique identifiers generated from a unique ordinal index and the name of the owning StatefulSet object, in the format "<statefulset name>-<ordinal index>", for example:
[root@mh-k8s-master-247-10 ~]# kubectl get statefulsets/prometheus-k8s -n kubesphere-monitoring-system
NAME             READY   AGE
prometheus-k8s   2/2     75d
[root@mh-k8s-master-247-10 ~]# kubectl get pods -n kubesphere-monitoring-system -l app=prometheus
NAME               READY   STATUS    RESTARTS   AGE
prometheus-k8s-0   3/3     Running   1          75d
prometheus-k8s-1   3/3     Running   1          75d
[root@mh-k8s-master-247-10 ~]#
The hostname of each Pod resource is the same as its resource name, so it too follows the index-suffixed naming format; observe:
[root@mh-k8s-master-247-10 ~]# kubectl get pods -n kubesphere-monitoring-system -l app=prometheus
NAME               READY   STATUS    RESTARTS   AGE
prometheus-k8s-0   3/3     Running   1          75d
prometheus-k8s-1   3/3     Running   1          75d
[root@mh-k8s-master-247-10 ~]# for i in 0 1; do kubectl exec prometheus-k8s-$i -n kubesphere-monitoring-system -c prometheus -- sh -c 'hostname'; done
prometheus-k8s-0
prometheus-k8s-1
[root@mh-k8s-master-247-10 ~]#
The names of the Pod objects in a StatefulSet are also registered as DNS resource records through the governing Headless Service, following the pattern <pod name>.<service name>.<namespace>.svc.<cluster domain>, for example:
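A hedged illustration using the hypothetical myapp StatefulSet and its myapp-svc Headless Service from earlier, deployed in the default namespace of a cluster whose domain is the common default cluster.local:

# DNS records created for the Pods of the hypothetical myapp StatefulSet
myapp-0.myapp-svc.default.svc.cluster.local
myapp-1.myapp-svc.default.svc.cluster.local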
Scaling a StatefulSet resource up or down works much like it does for a Deployment: you change the number of target Pod resources by modifying the resource's replica count.
For a StatefulSet resource, both the kubectl scale and kubectl patch commands can do this; you can also edit the replica count directly with kubectl edit, or modify the configuration file and declare it again with kubectl apply.
For example, scale the number of Pod replicas in myapp out to 6:
kubectl scale statefulset myapp --replicas=6
statefulset.apps "myapp" scaled
When a StatefulSet is scaled out, the Pod creation policy is the same as during initial creation: sequential by default, with the new Pods' names continuing the ordinal sequence from the last existing Pod.
Compared with scaling out, scaling in only requires lowering the replica count. The default policy for terminating Pods when scaling in is likewise ordinal, in reverse order, one Pod at a time, until the number of Pods matches the desired target.
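As a sketch of one of the methods mentioned above, bringing the hypothetical myapp StatefulSet from 6 replicas back down to 2 with kubectl patch would look like this; the controller then removes myapp-5, myapp-4, myapp-3, and myapp-2 one by one in reverse ordinal order:

kubectl patch statefulset myapp -p '{"spec":{"replicas":2}}'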