Symptom: one etcd node in the cluster was removed from the cluster for some reason, and it now needs to rejoin.
The error log is as follows:
8月 27 16:40:17 binary-k8s-node1 etcd[30462]: {"level":"fatal","ts":"2021-08-27T16:40:17.603+0800","caller":"etcdmain/etcd.go:271","msg":"discovery failed","error":"open /data/etcd/ssl/server.pem: no such file or directory","stacktrace":"go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:271\ngo.etcd.io/etcd/etcdmain.Main\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/main.go:46\nmain.main\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/main.go:28\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"}
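This happens because this etcd node has already joined an etcd cluster before, so trying to join a cluster again fails. To fix it, either remove the node from its original cluster, or set the node's ETCD_INITIAL_CLUSTER_STATE parameter to "existing". The fatal entry above also reports that /data/etcd/ssl/server.pem could not be opened, so check that the certificate files exist on this node as well.
The error log is as follows: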
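For reference, a minimal sketch of where that setting lives, assuming the node reads its settings from an environment file loaded by its systemd unit (the path /opt/etcd/cfg/etcd.conf is hypothetical). The ETCD_* variables are etcd's environment-variable forms of its command-line flags; the member names etcd-1 to etcd-3 and their IPs are placeholders, and etcd-4 / 192.168.20.11 are borrowed from the logs in this note.

```shell
# Fragment of a hypothetical etcd environment file (e.g. /opt/etcd/cfg/etcd.conf).
# Names and IPs are placeholders; use the values of your own cluster.
ETCD_NAME="etcd-4"
ETCD_DATA_DIR="/data/etcd/data"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.20.11:2380"
# Every member, written as name=scheme://host:port, comma-separated.
ETCD_INITIAL_CLUSTER="etcd-1=https://192.168.20.10:2380,etcd-2=https://192.168.20.12:2380,etcd-3=https://192.168.20.13:2380,etcd-4=https://192.168.20.11:2380"
# "new" bootstraps a brand-new cluster; "existing" tells this node to join
# a cluster that is already running instead of creating its own.
ETCD_INITIAL_CLUSTER_STATE="existing"
```

With "existing", the running cluster generally also needs to know about this member (for example via etcdctl member add) before the node is started.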
9月 10 11:01:06 binary-k8s-master2 etcd[5213]: {"level":"info","ts":"2021-09-10T11:01:06.960+0800","caller":"embed/etcd.go:117","msg":"configuring peer listeners","listen-peer-urls":["http://192.168.20.11:2380"]}
9月 10 11:01:06 binary-k8s-master2 etcd[5213]: {"level":"info","ts":"2021-09-10T11:01:06.960+0800","caller":"embed/etcd.go:465","msg":"starting with peer TLS","tls-info":"cert = /data/etcd/ssl/server.pem, key = /data/etcd/ssl/server-key.pem, trusted-ca = /data/etcd/ssl/ca.pem, client-cert-auth = false, crl-file = ","cipher-suites":[]}
9月 10 11:01:06 binary-k8s-master2 etcd[5213]: {"level":"warn","ts":"2021-09-10T11:01:06.960+0800","caller":"embed/etcd.go:502","msg":"scheme is HTTP while key and cert files are present; ignoring key and cert files","peer-url":"http://192.168.20.11:2380"}
9月 10 11:01:06 binary-k8s-master2 etcd[5213]: {"level":"info","ts":"2021-09-10T11:01:06.960+0800","caller":"embed/etcd.go:127","msg":"configuring client listeners","listen-client-urls":["http://127.0.0.1:2379","http://192.168.20.11:2379"]}
9月 10 11:01:06 binary-k8s-master2 etcd[5213]: {"level":"warn","ts":"2021-09-10T11:01:06.961+0800","caller":"embed/etcd.go:614","msg":"scheme is HTTP while key and cert files are present; ignoring key and cert files","client-url":"http://127.0.0.1:2379"}
9月 10 11:01:06 binary-k8s-master2 etcd[5213]: {"level":"warn","ts":"2021-09-10T11:01:06.961+0800","caller":"embed/etcd.go:614","msg":"scheme is HTTP while key and cert files are present; ignoring key and cert files","client-url":"http://192.168.20.11:2379"}
9月 10 11:01:06 binary-k8s-master2 etcd[5213]: {"level":"info","ts":"2021-09-10T11:01:06.961+0800","caller":"embed/etcd.go:360","msg":"closing etcd server","name":"etcd-4","data-dir":"/data/etcd/data","advertise-peer-urls":["http://192.168.20.11:2380"],"advertise-client-urls":["http://127.0.0.1:2379","http://192.168.20.11:2379"]}
9月 10 11:01:06 binary-k8s-master2 etcd[5213]: {"level":"info","ts":"2021-09-10T11:01:06.961+0800","caller":"embed/etcd.go:364","msg":"closed etcd server","name":"etcd-4","data-dir":"/data/etcd/data","advertise-peer-urls":["http://192.168.20.11:2380"],"advertise-client-urls":["http://127.0.0.1:2379","http://192.168.20.11:2379"]}
9月 10 11:01:06 binary-k8s-master2 etcd[5213]: {"level":"warn","ts":"2021-09-10T11:01:06.961+0800","caller":"etcdmain/etcd.go:176","msg":"failed to start etcd","error":"error setting up initial cluster: URL address does not have the form \"host:port\": http://ip:192.168.20.11:2380"}
9月 10 11:01:06 binary-k8s-master2 etcd[5213]: {"level":"fatal","ts":"2021-09-10T11:01:06.961+0800","caller":"etcdmain/etcd.go:271","msg":"discovery failed","error":"error setting up initial cluster: URL address does not have the form \"host:port\": http://ip:192.168.20.11:2380","stacktrace":"go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:271\ngo.etcd.io/etcd/etcdmain.Main\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/main.go:46\nmain.main\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/main.go:28\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"}
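The key message in the error is "error setting up initial cluster": the URL http://ip:192.168.20.11:2380 is not in host:port form. This almost certainly means the configuration file is written incorrectly, so carefully checking the configuration syntax will reveal the problem.
The error log is as follows: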
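Since this kind of typo is easy to miss by eye, here is a small POSIX shell sketch (the function name check_initial_cluster is hypothetical) that checks every entry of an ETCD_INITIAL_CLUSTER-style string for the name=scheme://host:port shape before etcd is restarted. It is a rough pre-check, not etcd's own parser, and it does not handle bracketed IPv6 literals.

```shell
# Check that each comma-separated entry looks like name=http(s)://host:port.
# Prints offending entries to stderr and returns non-zero if any are bad.
check_initial_cluster() {
  rc=0
  old_ifs=$IFS
  IFS=','
  # Word-split the comma-separated list into name=url entries
  for entry in $1; do
    name=${entry%%=*}
    url=${entry#*=}
    # scheme://host:port, where host has no ':' or '/' and port is all digits
    if ! printf '%s\n' "$url" | grep -Eq '^https?://[^:/]+:[0-9]+$'; then
      printf "bad URL for member '%s': %s\n" "$name" "$url" >&2
      rc=1
    fi
  done
  IFS=$old_ifs
  return "$rc"
}
```

For example, sourcing the node's environment file and running check_initial_cluster "$ETCD_INITIAL_CLUSTER" would report an entry such as http://ip:192.168.20.11:2380 from the log above and return non-zero.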
9月 10 11:06:04 binary-k8s-master2 etcd[10971]: {"level":"warn","ts":"2021-09-10T11:06:04.981+0800","caller":"etcdmain/etcd.go:176","msg":"failed to start etcd","error":"error setting up initial cluster: URL address does not have the form \"host:port\": https://ip:192.168.20.11:2380"}
9月 10 11:06:04 binary-k8s-master2 etcd[10971]: {"level":"fatal","ts":"2021-09-10T11:06:04.981+0800","caller":"etcdmain/etcd.go:271","msg":"discovery failed","error":"error setting up initial cluster: URL address does not have the form \"host:port\": https://ip:192.168.20.11:2380","stacktrace":"go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:271\ngo.etcd.io/etcd/etcdmain.Main\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/main.go:46\nmain.main\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/main.go:28\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"}
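Reading the log carefully, it complains that the URL is not in host:port form. Looking at the value that follows, the configuration file is wrong again: there is a stray word "ip" right after https://. Problem found.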
Fix: remove the word "ip" after https:// in the configuration file.
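If the same typo may appear in several URL settings, one sed substitution can remove it everywhere. A minimal sketch, assuming GNU sed (for -i without a suffix) and that the file passed in is the one the etcd unit actually loads; the function name strip_ip_prefix is hypothetical. It keeps a .bak copy before editing.

```shell
# Remove a stray "ip:" typed between the scheme and the address,
# e.g. https://ip:192.168.20.11:2380 -> https://192.168.20.11:2380
strip_ip_prefix() {
  conf=$1
  cp -- "$conf" "$conf.bak" || return 1   # backup before editing in place
  sed -i 's#://ip:#://#g' "$conf"         # GNU sed in-place edit
}
```

Running check_initial_cluster-style validation or simply grepping for "://ip:" afterwards confirms nothing was left behind.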
Sure enough, that was the problem; removing it fixed it.
The service started successfully.
The error log is as follows:
9月 10 13:12:42 binary-k8s-master1 etcd[8832]: {"level":"warn","ts":"2021-09-10T13:12:42.386+0800","caller":"rafthttp/stream.go:682","msg":"request sent was ignored by remote peer due to cluster ID mismatch","remote-peer-id":"aae107adddd0d3d8","remote-peer-cluster-id":"2d72d2986bd93bc7","local-member-id":"51ae3f86f3783687","local-member-cluster-id":"20b119eb5f91aa4b","error":"cluster ID mismatch"}
9月 10 13:12:42 binary-k8s-master1 etcd[8832]: {"level":"warn","ts":"2021-09-10T13:12:42.386+0800","caller":"rafthttp/stream.go:682","msg":"request sent was ignored by remote peer due to cluster ID mismatch","remote-peer-id":"aae107adddd0d3d8","remote-peer-cluster-id":"2d72d2986bd93bc7","local-member-id":"51ae3f86f3783687","local-member-cluster-id":"20b119eb5f91aa4b","error":"cluster ID mismatch"}
9月 10 13:12:42 binary-k8s-master1 etcd[8832]: request sent was ignored (cluster ID mismatch: remote[aae107adddd0d3d8]=2d72d2986bd93bc7, local=20b119eb5f91aa4b)
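This error occurs because the new node was originally deployed as a standalone single-node etcd, and its data directory was not deleted before it joined the cluster, so it still carries the cluster ID of its old single-node cluster. To fix it, stop etcd on that node and delete its data directory before starting it again: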
rm -rf /data/etcd/data/*
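A more cautious variant of the same step, sketched under the assumption that etcd is stopped first (for example with systemctl stop etcd; the unit name is an assumption): move the stale data directory aside instead of deleting it, so it can be restored if needed. etcd normally creates the data directory again on its next start. The function name backup_etcd_data is hypothetical.

```shell
# Move a stale etcd data directory aside (instead of rm -rf) and print
# the backup path. Run only while etcd on this node is stopped.
backup_etcd_data() {
  data_dir=$1
  backup="${data_dir%/}.bak.$(date +%Y%m%d%H%M%S)"
  mv -- "$data_dir" "$backup" || return 1
  printf '%s\n' "$backup"
}
# Example (unit name "etcd" is an assumption):
#   systemctl stop etcd
#   backup_etcd_data /data/etcd/data
#   systemctl start etcd
```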
The error log is as follows:
9月 14 18:45:40 binary-k8s-master1 etcd[14881]: {"level":"warn","ts":"2021-09-14T18:45:40.932+0800","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"c8a24e337417915f","rtt":"0s","error":"x509: certificate is valid for 192.168.20.10, 192.168.20.11, 192.168.20.12, 192.168.20.13, not 192.168.20.8"}
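The error occurs because the new node's IP (192.168.20.8) is not in the etcd certificate, which is only valid for 192.168.20.10 through 192.168.20.13.
Fix: add the new node's IP to the certificate configuration file, regenerate the certificates, copy them to all nodes, and restart every etcd node.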
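Before copying the regenerated certificate to every node, it can be worth confirming that the new IP really appears in its Subject Alternative Name list. A sketch using openssl x509 -text; the function name cert_has_ip is hypothetical, and /data/etcd/ssl/server.pem is the certificate path seen in the logs above.

```shell
# Return 0 if the certificate's SAN list contains the given IP address.
cert_has_ip() {
  cert=$1
  ip=$2
  # Print the SAN extension, put one entry per line, trim whitespace,
  # then look for an exact "IP Address:<ip>" line.
  openssl x509 -in "$cert" -noout -text \
    | grep -A1 'Subject Alternative Name' \
    | tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -Fxq "IP Address:$ip"
}
# Example:
#   cert_has_ip /data/etcd/ssl/server.pem 192.168.20.8 && echo "SAN includes new node"
```

The exact-line match avoids false positives such as 192.168.20.1 matching 192.168.20.10.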