https://kubernetes.io/docs/concepts/cluster-administration/logging/
Broadly, there are three approaches:
Container logging drivers:
https://docs.docker.com/config/containers/logging/configure/
Check the current logging driver on the Docker host:
$ docker info --format '{{.LoggingDriver}}'
With the default json-file driver, Docker saves each container's stdout and stderr as a file on the host, at:
/var/lib/docker/containers/<container-id>/<container-id>-json.log
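To confirm where a particular container's log file lives, Docker can report the path directly (the container name/ID below is a placeholder):

$ docker inspect --format '{{.LogPath}}' <container-id>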
Log rotation can be configured as well:
{ "log-driver": "json-file", "log-opts": { "max-size": "10m", "max-file": "3", "labels": "production_status", "env": "os,customer" } }
Pros:
- only one log agent runs per node, so resource overhead is low and applications need no changes
Cons:
- only stdout/stderr is captured; log files written inside the container are not collected
Idea: run a sidecar container in the pod that streams the container's log files to stdout, so the node-level log-collection agent can pick them up.
$ cat count-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: counter
spec:
  containers:
  - name: count
    image: busybox
    args:
    - /bin/sh
    - -c
    - >
      i=0;
      while true;
      do
        echo "$i: $(date)" >> /var/log/1.log;
        echo "$(date) INFO $i" >> /var/log/2.log;
        i=$((i+1));
        sleep 1;
      done
    volumeMounts:
    - name: varlog
      mountPath: /var/log
  - name: count-log-1
    image: busybox
    args: [/bin/sh, -c, 'tail -n+1 -f /var/log/1.log']
    volumeMounts:
    - name: varlog
      mountPath: /var/log
  - name: count-log-2
    image: busybox
    args: [/bin/sh, -c, 'tail -n+1 -f /var/log/2.log']
    volumeMounts:
    - name: varlog
      mountPath: /var/log
  volumes:
  - name: varlog
    emptyDir: {}

$ kubectl create -f count-pod.yaml
$ kubectl logs -f counter -c count-log-1
Pros:
- in-container log files are exposed on stdout with only small changes to the pod, and the node-level agent can then collect them uniformly
Cons:
- each log line is stored twice (once in the pod's emptyDir, and again on the node by the logging driver), which can double disk usage; one sidecar is needed per log file
Idea: run a log-collection component (e.g. fluentd) as a sidecar directly inside the business pod, so the collector can read the container's log files as local files.
Pros: no logs need to be stored on the host, and log files inside the container can be collected in full; a minimal sketch follows below.
Cons: every application pod runs an extra log agent, which adds resource overhead.
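A minimal sketch of this sidecar-agent pattern, reusing the fluentd image from the EFK section below; the ConfigMap name fluentd-sidecar-config and the fluent.conf it holds are hypothetical:

apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-agent
spec:
  containers:
  - name: app
    image: busybox
    args: [/bin/sh, -c, 'i=0; while true; do echo "$(date) INFO $i" >> /var/log/app.log; i=$((i+1)); sleep 1; done']
    volumeMounts:
    - name: varlog
      mountPath: /var/log
  - name: log-agent                 # sidecar agent reads /var/log/app.log as a local file
    image: 192.168.136.10:5000/fluentd_elasticsearch/fluentd:v2.5.2
    volumeMounts:
    - name: varlog
      mountPath: /var/log
    - name: agent-config
      mountPath: /fluentd/etc/fluent.conf
      subPath: fluent.conf
  volumes:
  - name: varlog
    emptyDir: {}
  - name: agent-config
    configMap:
      name: fluentd-sidecar-config  # hypothetical ConfigMap holding the agent's fluent.conf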
At present, the node-level logging agent is the recommended approach.
Option 1: roll your own, i.e. implement a custom log-collection agent.
Option 2: collect logs with an open-source agent (the EFK stack). It is widely applicable and covers the vast majority of log collection and visualization needs.
Elasticsearch
An open-source, distributed, RESTful search and analytics engine built on the Apache Lucene library. It can be accurately described as: a distributed real-time document store where every field is indexed and searchable; a distributed search engine with real-time analytics; able to scale to hundreds of servers and petabytes of structured and unstructured data.
Kibana
Kibana is an open-source analytics and visualization platform designed to work with Elasticsearch. You use Kibana to search, view, and interact with data stored in Elasticsearch indices, and to perform advanced data analysis and visualize the results in a variety of charts, tables, and maps.
Fluentd
A system for collecting, processing, and forwarding logs. Through its rich plugin ecosystem it can collect logs from all kinds of systems and applications, transform them into a user-specified format, and forward them to the user's chosen log storage backend.
Fluentd scrapes log data from a given set of sources, processes it (converting it into structured records), and forwards it to services such as Elasticsearch, object storage, or Kafka. Fluentd supports more than 300 log storage and analysis services, so it is very flexible on that front. Its main steps are: ingest events via input plugins, parse and filter them, buffer them, then route them to the configured outputs.
Why is fluentd the recommended log collector for Kubernetes?
Cloud native: https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/fluentd-elasticsearch
Normalizes log files to JSON
Pluggable architecture
Minimal resource footprint
Written in C and Ruby; roughly 30-40 MB of memory; ~13,000 events/second/core
Very high reliability
https://docs.fluentd.org/v/0.12/quickstart/life-of-a-fluentd-event
Input -> filter 1 -> ... -> filter N -> Buffer -> Output
Directives:
source, the data source, corresponding to Input
The source directive selects and configures an input plugin to enable a Fluentd input source; source submits events to fluentd's routing engine, and @type identifies the kind of data source. The following config watches a file and picks up lines appended to it:
<source>
  @type tail
  path /var/log/httpd-access.log
  pos_file /var/log/td-agent/httpd-access.log.pos
  tag myapp.access
  format apache2
</source>
filter, the event processing pipeline
Filters can be chained into a pipeline, processing events serially before handing them to match for output. For example, the following processes event content:
<source>
  @type http
  port 9880
</source>

<filter myapp.access>
  @type record_transformer
  <record>
    host_param "#{Socket.gethostname}"
  </record>
</filter>
After the filter receives the data, it calls the built-in record_transformer plugin (@type record_transformer) to insert a new field host_param into the event's record, then passes the event on to match for output.
The label directive
You can specify @label inside a source; events emitted by that source are then dispatched only to the tasks inside the matching <label> block, and are not picked up by the other downstream tasks.
<source>
  @type forward
</source>

<source>
  ### this source is labeled @SYSTEM,
  ### so its events are sent to <label @SYSTEM>
  ### and skip the filter and match immediately below
  @type tail
  @label @SYSTEM
  path /var/log/httpd-access.log
  pos_file /var/log/td-agent/httpd-access.log.pos
  tag myapp.access
  format apache2
</source>

<filter access.**>
  @type record_transformer
  <record>
    # ...
  </record>
</filter>

<match **>
  @type elasticsearch
  # ...
</match>

<label @SYSTEM>
  ### receives the events from the @type tail source above
  <filter var.log.middleware.**>
    @type grep
    # ...
  </filter>
  <match **>
    @type s3
    # ...
  </match>
</label>
match, matching output
match looks for events whose tags match a pattern and processes them. Its most common use is to output events to other systems (hence the plugins used with match are called output plugins):
<source>
  @type http
  port 9880
</source>

<filter myapp.access>
  @type record_transformer
  <record>
    host_param "#{Socket.gethostname}"
  </record>
</filter>

<match myapp.access>
  @type file
  path /var/log/fluent/access
</match>
Event structure:
time: the event's timestamp
tag: the event's origin, configured in fluentd.conf
record: the actual log content, a JSON object
For example, this raw log line:
192.168.0.1 - - [28/Feb/2013:12:00:00 +0900] "GET / HTTP/1.1" 200 777
after processing by the fluentd engine might look like:
2020-07-16 08:40:35 +0000 apache.access: {"user":"-","method":"GET","code":200,"size":777,"host":"192.168.0.1","path":"/"}
Input -> filter 1 -> ... -> filter N -> Buffer -> Output
Because each event is usually small, fluentd does not write every event to the output as soon as it is processed; that would hurt transfer efficiency and stability. Instead it buffers, using two main concepts:
- chunk: a block that buffered events are appended to as they arrive
- queue: the queue of full chunks waiting to be flushed to the output
The main tunable parameters are (see the sketch after this discussion):
- buffer_chunk_limit: the maximum size of a chunk
- buffer_queue_limit: the maximum length of the chunk queue
- flush_interval: how long before a chunk is enqueued even if not full
- retry_wait: how long to wait before retrying a failed flush
- retry_limit: how many retries before the chunk is discarded
The process, roughly:
As fluentd generates events and appends them to a chunk, the chunk keeps growing; once it reaches buffer_chunk_limit, or flush_interval has elapsed since the chunk was created, it is pushed onto the tail of the queue, whose length is capped by buffer_queue_limit.
Each time a new chunk is enqueued, the chunk at the head of the queue is immediately written to the configured backend; for example, with kafka configured, the data is pushed into kafka right away.
Ideally, each chunk entering the queue is written out immediately, and chunks are enqueued no faster than they are dequeued, so the queue stays essentially empty, holding at most one chunk at a time.
In practice, network and other factors mean writes to the backend are often delayed or fail outright. When a chunk fails to be written, it stays in the queue and is retried after retry_wait; once the retry count reaches retry_limit, the chunk is destroyed and its data dropped.
Meanwhile new chunks keep arriving. If many chunks pile up unwritten and the queue length reaches buffer_queue_limit, new events are rejected and fluentd raises: error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data".
A slow network causes the same failure: if a new chunk is produced every 3 seconds but writing one to the backend takes 30 seconds, then with a queue length of 100, ten new chunks arrive for every one dequeued, so the queue soon fills up and the error appears.
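A minimal sketch of these knobs in the v0.12-style syntax the quickstart above documents; the output plugin, host, and all values here are illustrative, not recommendations:

<match myapp.**>
  @type forward                  # any buffered output plugin accepts these parameters
  buffer_type file               # keep chunks on disk rather than in memory
  buffer_path /var/log/fluent/myapp.*.buffer
  buffer_chunk_limit 8m          # a chunk is enqueued once it reaches 8 MB...
  flush_interval 10s             # ...or once 10s have passed since it was created
  buffer_queue_limit 64          # at most 64 chunks may wait in the queue
  retry_wait 5s                  # initial wait before retrying a failed flush
  retry_limit 17                 # after 17 failed retries the chunk is dropped
  <server>
    host 192.168.136.11          # hypothetical downstream fluentd
    port 24224
  </server>
</match>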
Goal: collect the access.log of an nginx application running in a container and parse the log fields into JSON. The raw log format:
$ tail -f access.log
...
53.49.146.149 1561620585.973 0.005 502 [27/Jun/2019:15:29:45 +0800] 178.73.215.171 33337 GET https
and turn each line into:
{ "serverIp": "53.49.146.149", "timestamp": "1561620585.973", "respondTime": "0.005", "httpCode": "502", "eventTime": "27/Jun/2019:15:29:45 +0800", "clientIp": "178.73.215.171", "clientPort": "33337", "method": "GET", "protocol": "https" }
Approach: tail the access.log with the tail input, route its events through an @label block, split out the fields with the parser filter and a regexp, and print the result with a stdout match for verification.
fluent.conf
<source>
  @type tail
  @label @nginx_access
  path /fluentd/access.log
  pos_file /fluentd/nginx_access.posg
  tag nginx_access
  format none
  @log_level trace
</source>

<label @nginx_access>
  <filter nginx_access>
    @type parser
    key_name message
    format /(?<serverIp>[^ ]*) (?<timestamp>[^ ]*) (?<respondTime>[^ ]*) (?<httpCode>[^ ]*) \[(?<eventTime>[^\]]*)\] (?<clientIp>[^ ]*) (?<clientPort>[^ ]*) (?<method>[^ ]*) (?<protocol>[^ ]*)/
  </filter>
  <match nginx_access>
    @type stdout
  </match>
</label>
Start the service, then append a line to the file:
$ docker run -u root --rm -ti 192.168.136.10:5000/fluentd_elasticsearch/fluentd:v2.5.2 sh
/ # cd /fluentd/
/ # touch access.log
/ # fluentd -c /fluentd/etc/fluent.conf
/ # echo '53.49.146.149 1561620585.973 0.005 502 [27/Jun/2019:15:29:45 +0800] 178.73.215.171 33337 GET https' >> /fluentd/access.log
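fluentd stays in the foreground, so run the final echo from a second shell inside the same container (e.g. via docker exec). If the regexp matches, the stdout match prints something along these lines (the exact timestamp formatting may differ):

2019-06-27 07:29:45.000000000 +0000 nginx_access: {"serverIp":"53.49.146.149","timestamp":"1561620585.973","respondTime":"0.005","httpCode":"502","eventTime":"27/Jun/2019:15:29:45 +0800","clientIp":"178.73.215.171","clientPort":"33337","method":"GET","protocol":"https"}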
You can validate the regular expression with this site: http://fluentular.herokuapp.com
The config can then be extended with a record_transformer filter to inject extra fields:

<source>
  @type tail
  @label @nginx_access
  path /fluentd/access.log
  pos_file /fluentd/nginx_access.posg
  tag nginx_access
  format none
  @log_level trace
</source>

<label @nginx_access>
  <filter nginx_access>
    @type parser
    key_name message
    format /(?<serverIp>[^ ]*) (?<timestamp>[^ ]*) (?<respondTime>[^ ]*) (?<httpCode>[^ ]*) \[(?<eventTime>[^\]]*)\] (?<clientIp>[^ ]*) (?<clientPort>[^ ]*) (?<method>[^ ]*) (?<protocol>[^ ]*)/
  </filter>
  <filter nginx_access>
    @type record_transformer
    enable_ruby
    <record>
      host_name "#{Socket.gethostname}"
      my_key "my_val"
      tls ${record["protocol"].index("https") ? "true" : "false"}
    </record>
  </filter>
  <match nginx_access>
    @type stdout
  </match>
</label>
efk/elasticsearch.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    k8s-app: elasticsearch-logging
    version: v7.4.2
  name: elasticsearch-logging
  namespace: logging
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: elasticsearch-logging
      version: v7.4.2
  serviceName: elasticsearch-logging
  template:
    metadata:
      labels:
        k8s-app: elasticsearch-logging
        version: v7.4.2
    spec:
      nodeSelector:
        es: "true"    ## pins the pod to a node; adjust for your environment
      containers:
      - env:
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: cluster.initial_master_nodes
          value: elasticsearch-logging-0
        - name: ES_JAVA_OPTS
          value: "-Xms512m -Xmx512m"
        image: 192.168.136.10:5000/elasticsearch/elasticsearch:7.4.2
        name: elasticsearch-logging
        ports:
        - containerPort: 9200
          name: db
          protocol: TCP
        - containerPort: 9300
          name: transport
          protocol: TCP
        volumeMounts:
        - mountPath: /usr/share/elasticsearch/data
          name: elasticsearch-logging
      dnsConfig:
        options:
        - name: single-request-reopen
      initContainers:
      - command:
        - /sbin/sysctl
        - -w
        - vm.max_map_count=262144
        image: alpine:3.6
        imagePullPolicy: IfNotPresent
        name: elasticsearch-logging-init
        resources: {}
        securityContext:
          privileged: true
      - name: fix-permissions
        image: alpine:3.6
        command: ["sh", "-c", "chown -R 1000:1000 /usr/share/elasticsearch/data"]
        securityContext:
          privileged: true
        volumeMounts:
        - name: elasticsearch-logging
          mountPath: /usr/share/elasticsearch/data
      volumes:
      - name: elasticsearch-logging
        hostPath:
          path: /esdata
---
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: elasticsearch-logging
  name: elasticsearch
  namespace: logging
spec:
  ports:
  - port: 9200
    protocol: TCP
    targetPort: db
  selector:
    k8s-app: elasticsearch-logging
  type: ClusterIP
$ kubectl create namespace logging
## label the slave1 node so the es pod is scheduled there
$ kubectl label node k8s-slave1 es=true
## deploy; you may pre-pull the image on that node first
$ kubectl create -f elasticsearch.yaml
statefulset.apps/elasticsearch-logging created
service/elasticsearch created
## after a moment, verify the es pod landed on k8s-slave1 and is Running
$ kubectl -n logging get po -o wide
NAME                      READY   STATUS    RESTARTS   AGE   IP             NODE
elasticsearch-logging-0   1/1     Running   0          69m   10.244.1.104   k8s-slave1
# then curl the service to verify es is up
$ kubectl -n logging get svc
NAME            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
elasticsearch   ClusterIP   10.109.174.58   <none>        9200/TCP   71m
$ curl 10.109.174.58:9200
{
  "name" : "elasticsearch-logging-0",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "uic8xOyNSlGwvoY9DIBT1g",
  "version" : {
    "number" : "7.4.2",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "2f90bbf7b93631e52bafb59b3b049cb44ec25e96",
    "build_date" : "2019-10-28T20:40:44.881551Z",
    "build_snapshot" : false,
    "lucene_version" : "8.2.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
Kibana needs to expose its web UI to users, so we configure an ingress with a domain name to reach it
Kibana is stateless, so a plain Deployment is enough to run it
Kibana needs to reach es; k8s service discovery provides the address directly: http://elasticsearch:9200
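Before deploying kibana, you can sanity-check that the elasticsearch Service name resolves from inside the cluster; the throwaway pod below exists only for this test:

$ kubectl -n logging run es-check --rm -ti --image=alpine:3.6 --restart=Never -- \
    wget -q -O - http://elasticsearch:9200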
efk/kibana.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
  namespace: logging
  labels:
    app: kibana
spec:
  selector:
    matchLabels:
      app: "kibana"
  template:
    metadata:
      labels:
        app: kibana
    spec:
      nodeSelector:
        kibana: "true"    ## pins the pod to a node; adjust for your environment
      containers:
      - name: kibana
        image: 192.168.136.10:5000/kibana/kibana:7.4.2
        resources:
          limits:
            cpu: 1000m
          requests:
            cpu: 100m
        env:
        - name: ELASTICSEARCH_URL
          value: http://elasticsearch:9200
        ports:
        - containerPort: 5601
---
apiVersion: v1
kind: Service
metadata:
  name: kibana
  namespace: logging
  labels:
    app: kibana
spec:
  ports:
  - port: 5601
    protocol: TCP
    targetPort: 5601
  type: ClusterIP
  selector:
    app: kibana
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: kibana
  namespace: logging
spec:
  rules:
  - host: kibana.luffy.com
    http:
      paths:
      - path: /
        backend:
          serviceName: kibana
          servicePort: 5601
$ kubectl label node k8s-slave2 kibana=true
$ kubectl create -f kibana.yaml
deployment.apps/kibana created
service/kibana created
ingress/kibana created
# wait for the pod to reach Running
$ kubectl -n logging get po
NAME                      READY   STATUS    RESTARTS   AGE
elasticsearch-logging-0   1/1     Running   0          88m
kibana-944c57766-ftlcw    1/1     Running   0          15m
## add a DNS entry for kibana.luffy.com and open it in a browser; if the page loads, kibana can reach es
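For a quick test without real DNS, point kibana.luffy.com at the node running the ingress controller; the IP below is a placeholder for your environment:

$ echo '192.168.136.10 kibana.luffy.com' | sudo tee -a /etc/hosts   # replace with your ingress node's IP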
efk/fluentd-es-config-main.yaml
apiVersion: v1
data:
  fluent.conf: |-
    # This is the root config file, which only includes components of the actual configuration
    #
    # Do not collect fluentd's own logs to avoid infinite loops.
    <match fluent.**>
      @type null
    </match>

    @include /fluentd/etc/config.d/*.conf
kind: ConfigMap
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
  name: fluentd-es-config-main
  namespace: logging
Config file, fluentd-configmap.yaml. Points to note:
- the tail source reads every container log under /var/log/containers/*.log, tracking progress in a pos_file
- multi-line stack traces are merged into single events by the detect_exceptions plugin
- the kubernetes_metadata filter enriches each record with pod, namespace, and node metadata
- output goes to the elasticsearch service on port 9200, through a file buffer for reliability
efk/fluentd-configmap.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: fluentd-config
  namespace: logging
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
data:
  containers.input.conf: |-
    <source>
      @id fluentd-containers.log
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/es-containers.log.pos
      time_format %Y-%m-%dT%H:%M:%S.%NZ
      localtime
      tag raw.kubernetes.*
      format json
      read_from_head true
    </source>
    # Detect exceptions in the log output and forward them as one log entry.
    # https://github.com/GoogleCloudPlatform/fluent-plugin-detect-exceptions
    <match raw.kubernetes.**>
      @id raw.kubernetes
      @type detect_exceptions
      remove_tag_prefix raw
      message log
      stream stream
      multiline_flush_interval 5
      max_bytes 500000
      max_lines 1000
    </match>
  output.conf: |-
    # Enriches records with Kubernetes metadata
    <filter kubernetes.**>
      @type kubernetes_metadata
    </filter>
    <match **>
      @id elasticsearch
      @type elasticsearch
      @log_level info
      include_tag_key true
      host elasticsearch
      port 9200
      logstash_format true
      request_timeout 30s
      <buffer>
        @type file
        path /var/log/fluentd-buffers/kubernetes.system.buffer
        flush_mode interval
        retry_type exponential_backoff
        flush_thread_count 2
        flush_interval 5s
        retry_forever
        retry_max_interval 30
        chunk_limit_size 2M
        queue_limit_length 8
        overflow_action block
      </buffer>
    </match>
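Note the logstash_format true setting in the output: with it, fluent-plugin-elasticsearch writes to daily indices named logstash-YYYY.MM.DD, which is the pattern to match later when creating the index pattern in Kibana.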
DaemonSet definition, fluentd.yaml. Points to note:
- fluentd needs RBAC (a ServiceAccount bound to a ClusterRole) so the kubernetes_metadata filter can read pod and namespace info from the apiserver
- the host's /var/log and /var/lib/docker/containers directories are mounted in, so the agent can read the container log files
- a nodeSelector (fluentd: "true") restricts the DaemonSet to labeled nodes
- the two ConfigMaps are mounted as /fluentd/etc/fluent.conf and /fluentd/etc/config.d
efk/fluentd.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd-es
  namespace: logging
  labels:
    k8s-app: fluentd-es
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: fluentd-es
  labels:
    k8s-app: fluentd-es
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
- apiGroups:
  - ""
  resources:
  - "namespaces"
  - "pods"
  verbs:
  - "get"
  - "watch"
  - "list"
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: fluentd-es
  labels:
    k8s-app: fluentd-es
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
subjects:
- kind: ServiceAccount
  name: fluentd-es
  namespace: logging
  apiGroup: ""
roleRef:
  kind: ClusterRole
  name: fluentd-es
  apiGroup: ""
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    k8s-app: fluentd-es
  name: fluentd-es
  namespace: logging
spec:
  selector:
    matchLabels:
      k8s-app: fluentd-es
  template:
    metadata:
      labels:
        k8s-app: fluentd-es
    spec:
      containers:
      - env:
        - name: FLUENTD_ARGS
          value: --no-supervisor -q
        image: 192.168.136.10:5000/fluentd_elasticsearch/fluentd:v2.5.2
        imagePullPolicy: IfNotPresent
        name: fluentd-es
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - mountPath: /var/log
          name: varlog
        - mountPath: /var/lib/docker/containers
          name: varlibdockercontainers
          readOnly: true
        - mountPath: /fluentd/etc/config.d
          name: config-volume
        - mountPath: /fluentd/etc/fluent.conf
          name: config-volume-main
          subPath: fluent.conf
      nodeSelector:
        fluentd: "true"
      securityContext: {}
      serviceAccount: fluentd-es
      serviceAccountName: fluentd-es
      volumes:
      - hostPath:
          path: /var/log
          type: ""
        name: varlog
      - hostPath:
          path: /var/lib/docker/containers
          type: ""
        name: varlibdockercontainers
      - configMap:
          defaultMode: 420
          name: fluentd-config
        name: config-volume
      - configMap:
          defaultMode: 420
          items:
          - key: fluent.conf
            path: fluent.conf
          name: fluentd-es-config-main
        name: config-volume-main
## label the nodes that should run the fluentd collector
$ kubectl label node k8s-slave1 fluentd=true
$ kubectl label node k8s-slave2 fluentd=true
# create the resources
$ kubectl create -f fluentd-es-config-main.yaml
configmap/fluentd-es-config-main created
$ kubectl create -f fluentd-configmap.yaml
configmap/fluentd-config created
$ kubectl create -f fluentd.yaml
serviceaccount/fluentd-es created
clusterrole.rbac.authorization.k8s.io/fluentd-es created
clusterrolebinding.rbac.authorization.k8s.io/fluentd-es created
daemonset.extensions/fluentd-es created
## verify the pods are running on the labeled nodes
$ kubectl -n logging get po -o wide
NAME                      READY   STATUS    RESTARTS   AGE
elasticsearch-logging-0   1/1     Running   0          123m
fluentd-es-246pl          1/1     Running   0          2m2s
kibana-944c57766-ftlcw    1/1     Running   0          50m
The above is a simplified version of the k8s log-collection setup; the full version is available at https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/fluentd-elasticsearch.
Start a pod on a slave node that prints test log lines to stdout, then check in kibana whether they are collected
efk/test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: counter
spec:
  nodeSelector:
    fluentd: "true"
  containers:
  - name: count
    image: alpine:3.6
    args: [/bin/sh, -c, 'i=0; while true; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done']
$ kubectl create -f test-pod.yaml
pod/counter created
$ kubectl get po
NAME      READY   STATUS    RESTARTS   AGE
counter   1/1     Running   0          6s
Log in to the Kibana UI, create an index pattern matching logstash-* (choosing @timestamp as the time field), then open Discover to browse the collected logs.
You can filter the log data by other metadata as well: click any log entry to view its metadata, such as container name, Kubernetes node, and namespace, and query on it, e.g. kubernetes.pod_name : counter
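A few more example queries against fields added by the kubernetes_metadata filter (field names come from that plugin; the values are from this walkthrough):

kubernetes.namespace_name : "default"
kubernetes.host : "k8s-slave1"
kubernetes.container_name : "count"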
With that, EFK is successfully deployed on the Kubernetes cluster. To learn how to analyze log data with Kibana, see the Kibana user guide: https://www.elastic.co/guide/en/kibana/current/index.html