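Preface
Our company has recently been running an internal technical upskilling initiative. I had read articles about SkyWalking before but had never set it up locally, so I took this opportunity to try it out as a first hands-on introduction.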
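SkyWalking is an APM (application performance monitoring) tool that originated in China, designed for microservices, cloud-native, and container-based architectures.
It provides distributed tracing, application and service dependency analysis, service mesh telemetry analysis, metric aggregation, and visualization as an all-in-one solution.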
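The architecture diagram from the official website
It is fairly abstract, so once I understood it I drew my own diagram
It looks ugly, but it is clear. A SkyWalking deployment really has just four parts: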
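1- Attach the probe (agent) to the application
2- Push the application's monitoring data to the OAP service
3- The OAP service processes and analyzes the incoming data, then persists it to storage
4- The visual UI presents the data for analysis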
That is the overall background. For details, see the official SkyWalking documentation.
Now let's get it running on Windows!
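This exercise needs a data store, and it should be storage middleware that both the application service and SkyWalking support, so I chose Elasticsearch.
Download the Windows build. The latest version at the time of writing is 7.14.1. I like to run the latest, so that is what I used for this exercise (Elasticsearch has plenty of version-compatibility pitfalls; if you don't share my fixation on the newest release, use whichever version you like).
Open PowerShell and run bin\elasticsearch.bat (bin/elasticsearch on other platforms).
Once it starts without errors, open http://localhost:9200 in a browser.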
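If you would rather check from a script than from the browser, the following is a minimal sketch of reading the JSON that Elasticsearch serves at its root URL. The sample body is hand-written and abbreviated, and the node name `node-1` is only an illustrative assumption; `fetch_es_info` needs a running Elasticsearch and is not called here.

```python
import json
import urllib.request


def parse_es_info(body: str) -> dict:
    """Extract the node name, cluster name and version from the root-URL JSON."""
    info = json.loads(body)
    return {
        "name": info.get("name"),
        "cluster_name": info.get("cluster_name"),
        "version": info.get("version", {}).get("number"),
    }


def fetch_es_info(url: str = "http://localhost:9200") -> dict:
    """Call the root endpoint (requires a running Elasticsearch) and parse the reply."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_es_info(resp.read().decode("utf-8"))


# Offline demonstration with a hand-written, abbreviated response body.
sample = '{"name": "node-1", "cluster_name": "elasticsearch", "version": {"number": "7.14.1"}}'
print(parse_es_info(sample))
```

Calling `fetch_es_info()` after startup should report the version you downloaded; a connection error usually means the node is not up yet.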
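Good, storage is done!
Download SkyWalking from a mirror site.
As before, I prefer the latest version, which is currently 8.7.0; for other versions, see the historical releases download page.
After downloading, extract the archive (files are omitted because there are too many; only the directories are shown):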
├─agent
│ ├─activations
│ ├─bootstrap-plugins
│ ├─config
│ ├─logs
│ ├─optional-plugins
│ ├─optional-reporter-plugins
│ └─plugins
├─bin
├─config
│ ├─envoy-metrics-rules
│ ├─fetcher-prom-rules
│ ├─lal
│ ├─log-mal-rules
│ ├─meter-analyzer-config
│ ├─oal
│ ├─otel-oc-rules
│ ├─ui-initialized-templates
│ └─zabbix-rules
├─config-examples
├─licenses
│ └─ui-licenses
├─oap-libs
├─tools
│ └─profile-exporter
└─webapp
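The bin directory holds the startup scripts, such as oapService.sh and webappService.sh (with matching .bat files for Windows).
config holds the OAP service configuration, including application.yml.
agent is the SkyWalking agent, which is attached to the business application and collects the monitoring data.
webapp holds the configuration for the SkyWalking front-end UI service.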
Before starting, adjust the configuration.
In application.yml under the config directory, the main change is the storage backend:
cluster:
  selector: ${SW_CLUSTER:standalone}
  standalone:
  ...
storage:
  selector: ${SW_STORAGE:elasticsearch7}
  elasticsearch7:
    nameSpace: ${SW_NAMESPACE:"my-application"}
    clusterNodes: ${SW_STORAGE_ES_CLUSTER_NODES:localhost:9200}
    protocol: ${SW_STORAGE_ES_HTTP_PROTOCOL:"http"}
    connectTimeout: ${SW_STORAGE_ES_CONNECT_TIMEOUT:500}
    socketTimeout: ${SW_STORAGE_ES_SOCKET_TIMEOUT:30000}
    trustStorePath: ${SW_STORAGE_ES_SSL_JKS_PATH:""}
    trustStorePass: ${SW_STORAGE_ES_SSL_JKS_PASS:""}
    dayStep: ${SW_STORAGE_DAY_STEP:1} # Represent the number of days in the one minute/hour/day index.
    indexShardsNumber: ${SW_STORAGE_ES_INDEX_SHARDS_NUMBER:1} # Shard number of new indexes
    indexReplicasNumber: ${SW_STORAGE_ES_INDEX_REPLICAS_NUMBER:1} # Replicas number of new indexes
    # Super data set has been defined in the codes, such as trace segments.The following 3 config would be improve es performance when storage super size data in es.
    superDatasetDayStep: ${SW_SUPERDATASET_STORAGE_DAY_STEP:-1} # Represent the number of days in the super size dataset record index, the default value is the same as dayStep when the value is less than 0
    superDatasetIndexShardsFactor: ${SW_STORAGE_ES_SUPER_DATASET_INDEX_SHARDS_FACTOR:5} # This factor provides more shards for the super data set, shards number = indexShardsNumber * superDatasetIndexShardsFactor. Also, this factor effects Zipkin and Jaeger traces.
    superDatasetIndexReplicasNumber: ${SW_STORAGE_ES_SUPER_DATASET_INDEX_REPLICAS_NUMBER:0} # Represent the replicas number in the super size dataset record index, the default value is 0.
    indexTemplateOrder: ${SW_STORAGE_ES_INDEX_TEMPLATE_ORDER:0} # the order of index template
    user: ${SW_ES_USER:""}
    password: ${SW_ES_PASSWORD:""}
    secretsManagementFile: ${SW_ES_SECRETS_MANAGEMENT_FILE:""} # Secrets management file in the properties format includes the username, password, which are managed by 3rd party tool.
    bulkActions: ${SW_STORAGE_ES_BULK_ACTIONS:5000} # Execute the async bulk record data every ${SW_STORAGE_ES_BULK_ACTIONS} requests
    # flush the bulk every 10 seconds whatever the number of requests
    # INT(flushInterval * 2/3) would be used for index refresh period.
    flushInterval: ${SW_STORAGE_ES_FLUSH_INTERVAL:15}
    concurrentRequests: ${SW_STORAGE_ES_CONCURRENT_REQUESTS:2} # the number of concurrent requests
    resultWindowMaxSize: ${SW_STORAGE_ES_QUERY_MAX_WINDOW_SIZE:10000}
    metadataQueryMaxSize: ${SW_STORAGE_ES_QUERY_MAX_SIZE:5000}
    segmentQueryMaxSize: ${SW_STORAGE_ES_QUERY_SEGMENT_SIZE:200}
    profileTaskQueryMaxSize: ${SW_STORAGE_ES_QUERY_PROFILE_TASK_SIZE:200}
    oapAnalyzer: ${SW_STORAGE_ES_OAP_ANALYZER:"{\"analyzer\":{\"oap_analyzer\":{\"type\":\"stop\"}}}"} # the oap analyzer.
    oapLogAnalyzer: ${SW_STORAGE_ES_OAP_LOG_ANALYZER:"{\"analyzer\":{\"oap_log_analyzer\":{\"type\":\"standard\"}}}"} # the oap log analyzer. It could be customized by the ES analyzer configuration to support more language log formats, such as Chinese log, Japanese log and etc.
    advanced: ${SW_STORAGE_ES_ADVANCED:""}
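Almost every value in this file has the form `${ENV_NAME:default}`: the environment variable wins if it is set, otherwise the text after the first colon is used. The sketch below is a rough Python model of that rule, not SkyWalking's own loader. The host name `es-host` is a made-up example, and the simple pattern deliberately does not handle defaults that contain braces (such as the `oapAnalyzer` JSON). It also works through the shard formula quoted in the comments.

```python
import re

# Matches ${NAME:default}. The default may contain colons (e.g. localhost:9200)
# but not braces, so the nested-JSON defaults of oapAnalyzer are out of scope.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z0-9_]+):([^{}]*)\}")


def resolve(value: str, env: dict) -> str:
    """Replace each ${NAME:default} with env[NAME] if present, else the default."""
    def pick(match):
        name, default = match.group(1), match.group(2)
        return env.get(name, default)
    # Strip surrounding double quotes the way a YAML parser would unquote them.
    return _PLACEHOLDER.sub(pick, value).strip('"')


settings = {
    "selector": "${SW_STORAGE:elasticsearch7}",
    "clusterNodes": "${SW_STORAGE_ES_CLUSTER_NODES:localhost:9200}",
    "indexShardsNumber": "${SW_STORAGE_ES_INDEX_SHARDS_NUMBER:1}",
    "superDatasetIndexShardsFactor": "${SW_STORAGE_ES_SUPER_DATASET_INDEX_SHARDS_FACTOR:5}",
}


def effective(env: dict) -> dict:
    """Resolve every setting against the given environment mapping."""
    return {key: resolve(raw, env) for key, raw in settings.items()}


defaults = effective({})
overridden = effective({"SW_STORAGE_ES_CLUSTER_NODES": "es-host:9200"})  # hypothetical host

# Shards for super datasets = indexShardsNumber * superDatasetIndexShardsFactor.
super_shards = int(defaults["indexShardsNumber"]) * int(defaults["superDatasetIndexShardsFactor"])
print(defaults["clusterNodes"], overridden["clusterNodes"], super_shards)
```

In practice this means you can leave application.yml untouched and set `SW_STORAGE=elasticsearch7` in the shell before running the startup script; editing the default in the file, as done here, has the same effect.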
Open PowerShell and change to SkyWalking's bin directory
Run .\oapService.bat
Startup succeeded if the output looks like the screenshot below
Likewise, adjust the configuration before starting, this time in webapp.yml under webapp:
server:
  port: 8080

spring:
  cloud:
    gateway:
      routes:
        - id: oap-route
          uri: lb://oap-service
          predicates:
            - Path=/graphql/**
    discovery:
      client:
        simple:
          instances:
            oap-service:
              - uri: http://127.0.0.1:12800
            # - uri: http://<oap-host-1>:<oap-port1>
            # - uri: http://<oap-host-2>:<oap-port2>

  mvc:
    throw-exception-if-no-handler-found: true

  web:
    resources:
      add-mappings: true

management:
  server:
    base-path: /manage
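Open another PowerShell window, again in the bin directory
Run .\webappService.bat
Startup succeeded if the output looks like the screenshot below
Open http://localhost:8080/ (the SkyWalking UI was just configured to listen on port 8080, so make sure the application services started later do not use the same port)
Because no application has been started yet, nothing has registered and the page shows no data.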
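With three processes now running, a quick way to rule out port problems is to test whether each port accepts TCP connections. This is a small sketch using only the Python standard library; the port numbers are the ones that appear in the configuration above, and the component labels are mine.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports mentioned so far in this walkthrough.
EXPECTED = {
    "Elasticsearch": 9200,
    "OAP (queried by the UI)": 12800,
    "SkyWalking UI": 8080,
}


def report(host: str = "127.0.0.1") -> dict:
    """Map each component name to whether its port currently accepts connections."""
    return {name: port_is_open(host, port) for name, port in EXPECTED.items()}
```

Running `print(report())` should show `True` for all three once everything is up. It only proves something is listening on the port, not that it is the right process.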
Copy the agent directory from the SkyWalking package into the sample application (the demo application is given directly here)
and edit agent/config/agent.config:
# The service name in UI
agent.service_name=${SW_AGENT_NAME:skyWalking-demo}
# Backend service addresses.
collector.backend_service=${SW_AGENT_COLLECTOR_BACKEND_SERVICES:127.0.0.1:11800}
# Logging file_name
logging.file_name=${SW_LOGGING_FILE_NAME:skywalking-api.log}
# Logging level
logging.level=${SW_LOGGING_LEVEL:DEBUG}
# Mount the specific folders of the plugins. Plugins in mounted folders would work.
plugin.mount=${SW_MOUNT_FOLDERS:plugins,activations}
server:
  port: 8500
spring:
  swagger:
    enabled: true
    title: elasticsearch-study\u7CFB\u7EDF
    description: skywalking-demo\u7CFB\u7EDF
    version: v1.0
    host: http://localhost:8500/swagger-ui.html
    terms-of-service-url: http://qrainly.top/
    contact:
      name: bj
  auto:
    openurl: true
  web:
    loginurl: http://localhost:8500/swagger-ui.html
    googleexcute: C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe
  elasticsearch:
    rest:
      uris: localhost:9200
      connection-timeout: 10s
      #username:
      #password:
logging:
  level:
    org.springframework.data.elasticsearch.core: debug
# -javaagent:D:\v_liuwen\code\skywalking-demo\agent\agent\skywalking-agent.jar
The rest of the code will be posted later
-javaagent:D:\v_liuwen\code\skywalking-demo\agent\skywalking-agent.jar
Start two instances of the sample service locally, one on port 8500 and the other on 8501
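If you launch the instances from a terminal rather than an IDE, the two command lines differ only in the port. The sketch below just builds the argument lists. The jar name `skywalking-demo.jar` is a hypothetical placeholder, and it assumes the app is a Spring Boot executable jar, so that `--server.port` overrides the value in application.yml. The `-Dskywalking.agent.service_name` property is the agent's system-property form of the `agent.service_name` setting.

```python
from typing import List


def launch_command(agent_jar: str, app_jar: str, port: int, service_name: str) -> List[str]:
    """Build the argument list for one instance of the demo application."""
    return [
        "java",
        f"-javaagent:{agent_jar}",
        # Agent settings can also be passed as system properties prefixed with "skywalking.".
        f"-Dskywalking.agent.service_name={service_name}",
        "-jar",
        app_jar,
        f"--server.port={port}",  # Spring Boot command-line override of server.port
    ]


AGENT_JAR = r"D:\v_liuwen\code\skywalking-demo\agent\skywalking-agent.jar"  # path used in this article
APP_JAR = "skywalking-demo.jar"  # hypothetical jar name

commands = [launch_command(AGENT_JAR, APP_JAR, port, "skyWalking-demo") for port in (8500, 8501)]
for cmd in commands:
    print(" ".join(cmd))
```

Each list can be handed to `subprocess.Popen` as-is. Both instances share one service name, so they should appear in the UI as two instances of the same service.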
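After calling the [Query all data] endpoint several times, look at the SkyWalking UI
The service has now registered with SkyWalking.
The topology view shows the dependencies between services
The full trace of the /all endpoint calls just made is displayed, which makes it easy to analyze each call chain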
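This module requires creating an analysis task first, so I won't demo it here!
For this part, I only ran a single service locally with no cross-service calls, so no logs were produced either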
Alarms require a configuration file:
config/alarm-settings.yml under the SkyWalking directory
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 20
    period: 1
    count: 3
    silence-period: 1
    message: Response time of service {name} is more than 20ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_resp_time_percentile_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_percentile
    op: ">"
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  database_access_resp_time_rule:
    metrics-name: database_access_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
  endpoint_relation_resp_time_rule:
    metrics-name: endpoint_relation_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes

webhooks:
  - http://localhost:8031/skywalking/alarm/pushData
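To get a feel for how `period`, `count` and `silence-period` interact, here is a deliberately simplified model based on my reading of the comments in the file: look at the last `period` minutes, fire when at least `count` of them match the condition, then stay quiet for `silence-period` checks. It is a sketch for intuition only, not the OAP's actual implementation, and the per-minute values are invented.

```python
import operator
from dataclasses import dataclass
from typing import List

# Only the two operators that appear in the rules above.
OPS = {">": operator.gt, "<": operator.lt}


@dataclass
class Rule:
    op: str
    threshold: float
    period: int          # window length in minutes
    count: int           # matching minutes needed inside the window
    silence_period: int  # checks to stay quiet after firing


def evaluate(rule: Rule, per_minute_values: List[float]) -> List[int]:
    """Return the minute indexes at which this simplified rule would fire."""
    matches = OPS[rule.op]
    fired: List[int] = []
    silent_left = 0
    for minute in range(len(per_minute_values)):
        start = max(0, minute - rule.period + 1)
        window = per_minute_values[start:minute + 1]
        hit = sum(1 for v in window if matches(v, rule.threshold)) >= rule.count
        if silent_left > 0:
            silent_left -= 1  # still inside the silence period, skip this check
        elif hit:
            fired.append(minute)
            silent_left = rule.silence_period
    return fired


# Settings of service_instance_resp_time_rule; the response times (ms) are made up.
instance_rule = Rule(op=">", threshold=1000, period=10, count=2, silence_period=5)
values = [200, 1500, 300, 1200] + [1500] * 8
fired_at = evaluate(instance_rule, values)
print(fired_at)
```

The silence period is what keeps a sustained slowdown from producing an alarm every minute.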
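I deliberately paused the endpoint call at a breakpoint to add latency
Alarms can also be pushed directly to DingTalk and other platforms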
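Forwarding to a chat platform starts with something listening on the webhook URL. Below is a minimal sketch of such a receiver using Python's standard library. It assumes the OAP posts a JSON array of alarm objects with `ruleName`, `name` and `alarmMessage` fields, which is how I recall the 8.x alarm documentation describing it; verify the field names against your version. The actual call to DingTalk (or any other platform) is left out.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_alarms(body: bytes) -> list:
    """Turn the JSON array posted by the OAP into one readable line per alarm."""
    alarms = json.loads(body.decode("utf-8"))
    return [
        f"[{a.get('ruleName', '?')}] {a.get('name', '?')}: {a.get('alarmMessage', '')}"
        for a in alarms
    ]


class AlarmHandler(BaseHTTPRequestHandler):
    """Accepts POSTs on any path and prints a summary of each alarm."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        for line in summarize_alarms(self.rfile.read(length)):
            print(line)  # replace with a call to your chat platform of choice
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8031) -> None:
    """Listen on the port used by the webhooks entry in alarm-settings.yml."""
    HTTPServer(("127.0.0.1", port), AlarmHandler).serve_forever()
```

Calling `serve()` blocks and prints each alarm as it arrives; the formatting lives in `summarize_alarms` so that it can be tested without starting a server.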
That's it for this exercise. I'll share more when I find new things to try.
References
https://www.fangzhipeng.com/architecture/2020/06/12/skywalking-test.html
https://www.jianshu.com/p/055e4223d054