Background:
- etcd 3.2.23
- golang 1.9.1
- CentOS 6
- An etcd cluster of 7 nodes; members communicate with each other over HTTPS, and clients talk to the cluster over HTTPS as well
- A test environment of 20,000 hosts, each running our agent; every agent registers a key bound to a lease, with the key's TTL set to 10s
- Keepalive is enabled on the lease
- The product is waiting to ship, but we kept hitting cluster performance problems and couldn't find the cause for a long time; I felt like I was going to break down
- Until the performance problem is solved, we cannot roll out to more production servers; the boss is unhappy, and I kept thinking I'd be fired, or quit...
- We filed an issue on the etcd GitHub project, but the maintainers had no time to look at it (understandably: there are too many issues, and most turn out to be user error)...
```go
var err error
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TRANSPORT_TIMEOUT)
leaseResp, err := this.GetEtcdClient().Grant(ctx, 10) // lease TTL set to 10 seconds
cancel()
if err != nil {
	return err
}

ctx, cancel = context.WithTimeout(context.Background(), ETCD_TRANSPORT_TIMEOUT)
kvc := etcdclient.NewKV(this.kapi)
var txnResp *etcdclient.TxnResponse
txnResp, err = kvc.Txn(ctx).
	If(etcdclient.Compare(etcdclient.CreateRevision(keyName), "=", 0)).
	Then(etcdclient.OpPut(keyName, value.GetRegisterInfo(), etcdclient.WithLease(etcdclient.LeaseID(leaseResp.ID)))).
	Commit()
cancel()
if err != nil {
	return err
}
if !txnResp.Succeeded {
	return fmt.Errorf("failed to register myself: %s, the key already exists", keyName)
}

ctx, cancel = context.WithTimeout(context.Background(), ETCD_TRANSPORT_TIMEOUT)
_, err = this.GetEtcdClient().KeepAlive(context.TODO(), leaseResp.ID)
cancel()
if err != nil {
	return err
}
```
Symptoms:
- Keys carrying a lease ID disappear for no apparent reason
- The leader node has an extremely high TIME-WAIT count most of the time
- The leader's CLOSE-WAIT count is sometimes also very high
- The leader's CPU usage stays around 80-90%
- The leader's memory usage keeps climbing, which looks like a memory leak
The monitoring graphs at the time (not reproduced here):
Starting with the etcd process on the leader:
Let's first look at the suspected memory leak. etcd already exposes pprof endpoints, and the Go toolchain can read http pprof profiles directly, but it does not support HTTPS, so the pprof profile has to be downloaded to a local file first and analyzed from there:
```shell
curl --cacert /xxx/xxx/etcd/config/ssl/ca.pem --cert /xxx/xxx/etcd/config/ssl/client.pem --key /xxx/xxx/etcd/config/ssl/client-key.pem -L https://leader_ip:2379/debug/pprof/profile > leader_ip_cpu.txt
```
The result:
```
cfc4n@cnxct:~/Downloads$ go tool pprof leader_ip_cpu.txt
File: etcd
Type: cpu
Time: Jul 3, 2018 at 4:13pm (CST)
Duration: 2.64mins, Total samples = 34.54mins (1306.43%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 20
Showing nodes accounting for 23.98mins, 69.42% of 34.54mins total
Dropped 1702 nodes (cum <= 0.17mins)
Showing top 20 nodes out of 211
      flat  flat%   sum%        cum   cum%
  3.62mins 10.48% 10.48%   3.62mins 10.48%  runtime.futex /usr/local/go/src/runtime/sys_linux_amd64.s
  2.99mins  8.66% 19.13%   2.99mins  8.66%  math/big.divWVW /usr/local/go/src/math/big/arith_amd64.s
  2.34mins  6.78% 25.91%   2.34mins  6.78%  math/big.bitLen /usr/local/go/src/math/big/arith_amd64.s
  1.90mins  5.51% 31.42%  10.44mins 30.22%  math/big.nat.montgomery /usr/local/go/src/math/big/nat.go
  1.67mins  4.82% 36.24%   1.67mins  4.82%  runtime.memmove /usr/local/go/src/runtime/memmove_amd64.s
  1.46mins  4.22% 40.47%   3.57mins 10.33%  runtime.scanobject /usr/local/go/src/runtime/mgcmark.go
  1.36mins  3.93% 44.40%   1.44mins  4.17%  syscall.Syscall /usr/local/go/src/syscall/asm_linux_amd64.s
  1.30mins  3.78% 48.17%   1.30mins  3.78%  runtime.heapBitsForObject /usr/local/go/src/runtime/mbitmap.go
  1.04mins  3.01% 51.19%   1.04mins  3.02%  runtime.greyobject /usr/local/go/src/runtime/mgcmark.go
  0.94mins  2.71% 53.90%   0.94mins  2.71%  math/big.divWVW
  0.80mins  2.33% 56.23%   0.83mins  2.41%  runtime.siftdownTimer /usr/local/go/src/runtime/time.go
  0.80mins  2.32% 58.55%   3.17mins  9.17%  math/big.nat.divLarge /usr/local/go/src/math/big/nat.go
  0.78mins  2.27% 60.82%   0.78mins  2.27%  math/big.addMulVVW /usr/local/go/src/math/big/arith_amd64.s
  0.57mins  1.66% 62.48%   0.57mins  1.66%  runtime._ExternalCode /usr/local/go/src/runtime/proc.go
  0.57mins  1.64% 64.12%   0.57mins  1.64%  math/big.(*byteReader).Read <autogenerated>
  0.42mins  1.22% 65.33%   3.22mins  9.32%  runtime.mallocgc /usr/local/go/src/runtime/malloc.go
  0.40mins  1.17% 66.50%   0.40mins  1.17%  math/big.mulAddVWW /usr/local/go/src/math/big/arith_amd64.s
  0.34mins  0.99% 67.49%   0.34mins  0.99%  vendor/golang_org/x/crypto/curve25519.ladderstep /usr/local/go/src/vendor/golang_org/x/crypto/curve25519/ladderstep_amd64.s
  0.34mins  0.99% 68.49%   0.34mins  0.99%  math/big.bitLen
  0.32mins  0.93% 69.42%   0.41mins  1.18%  sync.(*Pool).pin /usr/local/go/src/sync/pool.go
```
A huge share of CPU is going to math/big functions. Our application logic involves no mathematical computation and never imports the math packages, so the only plausible source is the heavy big-integer arithmetic behind RSA verification of the TLS certificates used for HTTPS. Exporting the pprof profile as an SVG graph confirmed it: TLS certificate computation was consuming most of the CPU.
But shouldn't an HTTPS connection be established once and then reused? Where is all this handshake computation coming from? Could the huge TIME-WAIT count be related? Back to the TIME-WAIT issue: at this point the leader was xxxx-etcdserver-xxx02, and its TIME-WAIT count was extremely high:
```shell
[user@xxxx-etcdserver-xxx02 etcd]$ ss | grep 2380 | awk '{print $5}' | awk -F ":" '{print $1}' | sort | uniq -c | sort -rn
  53882 192.168.198.234
  53667 192.168.227.202
  53539 192.168.255.210
  53197 192.168.58.204
  53135 192.168.137.210
  51966 192.168.128.205
```
All the TIME-WAIT sockets on the leader came from the member nodes, averaging close to fifty thousand per member; a query a few seconds later still showed 50,000+. On each member, meanwhile, there were over a thousand ESTABLISHED connections to the leader's port 2380, so connections were indeed being created very frequently. Why were the members creating connections so frequently? Time to capture some TCP traffic with tcpdump.
Here are the interesting frames from the capture. What do they show?
- 120140 2018-06-22 20:37:30.219 10.72.198.234 74 10.20.127.136 TCP 12902 → etcd-server(2380) [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=3008487334 TSecr=0 WS=128
- 120141 2018-06-22 20:37:30.219 10.20.127.136 74 10.72.198.234 TCP etcd-server(2380) → 12902 [SYN, ACK] Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 SACK_PERM=1 TSval=2796204056 TSecr=3008487334 WS=128
- 120362 2018-06-22 20:37:30.242 10.72.198.234 66 10.20.127.136 TCP 12902 → etcd-server(2380) [ACK] Seq=1 Ack=1 Win=14720 Len=0 TSval=3008487358 TSecr=2796204056
- 120368 2018-06-22 20:37:30.242 10.72.198.234 204 10.20.127.136 TCP 12902 → etcd-server(2380) [PSH, ACK] Seq=1 Ack=1 Win=14720 Len=138 TSval=3008487358 TSecr=2796204056
- 120369 2018-06-22 20:37:30.242 10.20.127.136 66 10.72.198.234 TCP etcd-server(2380) → 12902 [ACK] Seq=1 Ack=139 Win=15616 Len=0 TSval=2796204080 TSecr=3008487358
- 174076 2018-06-22 20:37:40.242 10.72.198.234 66 10.20.127.136 TCP 12902 → etcd-server(2380) [FIN, ACK] Seq=139 Ack=1 Win=14720 Len=0 TSval=3008497358 TSecr=2796204080
- 174377 2018-06-22 20:37:40.282 10.20.127.136 66 10.72.198.234 TCP etcd-server(2380) → 12902 [ACK] Seq=1 Ack=140 Win=15616 Len=0 TSval=2796214120 TSecr=3008497358
What problems can you spot in these packets? Take the capture, think about it for five minutes, then read on...
- Frame 120140: the member at 10.72.198.234, acting as a client, initiates a TCP connection to the leader at 10.20.127.136:2380.
- Frame 120141 is the leader's SYN-ACK; 120362 is the member's ACK completing the three-way handshake. Straightforward so far.
- Frame 120368 is a data packet from the member to the leader; 120369 is the leader's ACK for it. Also clear.
- What about frame 174076? That is the member sending the leader a FIN to close the TCP connection. Why? Why would the member close the connection?
- Sampling a few more flows, picked from the 1,000+ SYNs listed under Wireshark's Expert Information...
In every case it is the member that sends the FIN to the leader. Why? And is that all you noticed? Look at the interval between the member's data packet (right after the leader's ACK) and the member's FIN: it is always 10 seconds. Why 10 seconds?
The leader ACKs the data but never sends its own FIN/ACK back. Why not? And if it doesn't, what state is the leader's side of that TCP five-tuple left in? Right: CLOSE-WAIT. Could this be the source of the massive CLOSE-WAIT counts on the leader? Hold that thought and keep reading...
As noted above, the member sends the leader a PSH data packet. What is in it, and why does the leader never reply?
The payload is 138 bytes long and looks opaque at first, but the leading bytes should look familiar: 0x16, 0x03, 0x01. If you know HTTPS, these are TLS magic numbers: 0x16 (decimal 22) is the Handshake record content type (a Client Hello is a handshake message), and the following 0x03, 0x01 encode the TLS version. Since member-to-leader traffic really is HTTPS, this must be a TLS Client Hello. Combined with the 10s gap observed earlier, the conclusion is that the TLS handshake between member and leader never completes: the leader never sends its Server Hello... The packet is short and yields no further clues.
Dropping HTTPS to lighten the CPU load:
Back to the CPU pprof: TLS certificate computation drives the high CPU usage, but the flood of new TCP connections is what triggers all that TLS computation in the first place, along with the memory growth and message backlog. Still, we tried removing TLS certificate verification between members and the leader to relieve the leader's CPU. The effect was small: CPU usage stayed high. Not surprising, since this was not the root cause. The new pprof:
```
cfc4n@cnxct:~/Downloads$ go tool pprof 111.111.111.111-0703_cpu.txt
File: etcd
Type: cpu
Time: Jul 3, 2018 at 4:55pm (CST)
Duration: 30.01s, Total samples = 6.16mins (1231.36%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 20
Showing nodes accounting for 4.29mins, 69.69% of 6.16mins total
Dropped 928 nodes (cum <= 0.03mins)
Showing top 20 nodes out of 181
      flat  flat%   sum%        cum   cum%
  2.13mins 34.62% 34.62%   2.13mins 34.62%  runtime.futex /usr/local/go/src/runtime/sys_linux_amd64.s
  0.59mins  9.51% 44.13%   0.62mins 10.06%  syscall.Syscall /usr/local/go/src/syscall/asm_linux_amd64.s
  0.33mins  5.42% 49.55%   0.33mins  5.42%  runtime._ExternalCode /usr/local/go/src/runtime/proc.go
  0.25mins  4.07% 53.62%   1.23mins 20.00%  runtime.lock /usr/local/go/src/runtime/lock_futex.go
  0.13mins  2.18% 55.80%   0.13mins  2.18%  runtime.procyield /usr/local/go/src/runtime/asm_amd64.s
  0.11mins  1.80% 57.60%   0.23mins  3.72%  runtime.selectgoImpl /usr/local/go/src/runtime/select.go
  0.10mins  1.57% 59.17%   0.10mins  1.57%  runtime.osyield /usr/local/go/src/runtime/sys_linux_amd64.s
  0.09mins  1.39% 60.56%   1.47mins 23.85%  runtime.unlock /usr/local/go/src/runtime/lock_futex.go
  0.07mins  1.12% 61.69%   0.33mins  5.30%  runtime.mallocgc /usr/local/go/src/runtime/malloc.go
  0.07mins  1.06% 62.74%   0.07mins  1.06%  runtime.heapBitsForObject /usr/local/go/src/runtime/mbitmap.go
  0.06mins  1.01% 63.76%   0.15mins  2.36%  runtime.scanobject /usr/local/go/src/runtime/mgcmark.go
  0.05mins  0.77% 64.53%   0.05mins  0.77%  runtime.siftupTimer /usr/local/go/src/runtime/time.go
  0.04mins  0.71% 65.24%   0.04mins  0.71%  runtime.greyobject /usr/local/go/src/runtime/mgcmark.go
  0.04mins  0.71% 65.95%   0.07mins  1.14%  runtime.deferreturn /usr/local/go/src/runtime/panic.go
  0.04mins   0.7% 66.65%   1.35mins 21.86%  runtime.deltimer /usr/local/go/src/runtime/time.go
  0.04mins  0.67% 67.32%   0.12mins  1.92%  runtime.pcvalue /usr/local/go/src/runtime/symtab.go
  0.04mins  0.61% 67.93%   0.04mins  0.64%  runtime.siftdownTimer /usr/local/go/src/runtime/time.go
  0.04mins   0.6% 68.53%   0.23mins  3.74%  runtime.gentraceback /usr/local/go/src/runtime/traceback.go
  0.04mins  0.59% 69.11%   0.04mins  0.59%  runtime.casgstatus /usr/local/go/src/runtime/proc.go
  0.04mins  0.57% 69.69%   0.04mins  0.57%  runtime.memmove /usr/local/go/src/runtime/memmove_amd64.s
```
The heaviest consumers are now runtime.futex and syscall.Syscall, plus functions under runtime: heapBitsForObject, scanobject and greyobject belong to the Go garbage collector, meaning GC itself was already burning a large share of CPU.
Cross-checking with the memory pprof:
```
cfc4n@cnxct:~/Desktop$ go tool pprof 10.3.5.52.txt
File: etcd
Type: inuse_space
Time: Jul 3, 2018 at 4:57pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 20
Showing nodes accounting for 10665.34MB, 94.72% of 11259.98MB total
Dropped 368 nodes (cum <= 56.30MB)
Showing top 20 nodes out of 57
      flat  flat%   sum%        cum   cum%
 7156.71MB 63.56% 63.56%  7156.71MB 63.56%  github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*recvBuffer).put .../grpc/transport/transport.go
 2258.28MB 20.06% 83.61%  9417.49MB 83.64%  github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*serverHandlerTransport).HandleStreams.func4 .../grpc/transport/handler_server.go
  663.28MB  5.89% 89.51%   663.28MB  5.89%  net/http.http2putRequestBodyBuf /usr/local/go/src/net/http/h2_bundle.go
  277.15MB  2.46% 91.97%   277.15MB  2.46%  runtime.makechan /usr/local/go/src/runtime/chan.go
   66.11MB  0.59% 92.55%    66.11MB  0.59%  crypto/tls.(*block).reserve /usr/local/go/src/crypto/tls/conn.go
   65.24MB  0.58% 93.13%    68.74MB  0.61%  net/http.(*Transport).dialConn /usr/local/go/src/net/http/transport.go
   56.52MB   0.5% 93.64%    56.52MB   0.5%  runtime.malg /usr/local/go/src/runtime/proc.go
   54.52MB  0.48% 94.12%    85.91MB  0.76%  runtime.mapassign /usr/local/go/src/runtime/hashmap.go
   23.53MB  0.21% 94.33%   113.68MB  1.01%  crypto/tls.(*Conn).readHandshake /usr/local/go/src/crypto/tls/conn.go
      15MB  0.13% 94.46%    74.58MB  0.66%  runtime.systemstack /usr/local/go/src/runtime/asm_amd64.s
      12MB  0.11% 94.57%      795MB  7.06%  net/http.(*http2Server).ServeConn /usr/local/go/src/net/http/h2_bundle.go
      11MB 0.098% 94.67%    65.23MB  0.58%  net/http.(*http2Framer).readMetaFrame /usr/local/go/src/net/http/h2_bundle.go
       2MB 0.018% 94.68%   990.71MB  8.80%  net/http.(*conn).serve /usr/local/go/src/net/http/server.go
       2MB 0.018% 94.70%    77.73MB  0.69%  net/http.(*http2serverConn).readFrames /usr/local/go/src/net/http/h2_bundle.go
       1MB 0.0089% 94.71%   721.45MB  6.41%  net/http.(*http2serverConn).newWriterAndRequest /usr/local/go/src/net/http/h2_bundle.go
       1MB 0.0089% 94.72%   755.99MB  6.71%  net/http.(*http2serverConn).serve /usr/local/go/src/net/http/h2_bundle.go
         0     0% 94.72%    64.32MB  0.57%  [etcd]
         0     0% 94.72%   190.21MB  1.69%  crypto/tls.(*Conn).Handshake /usr/local/go/src/crypto/tls/conn.go
         0     0% 94.72%    64.60MB  0.57%  crypto/tls.(*Conn).readRecord /usr/local/go/src/crypto/tls/conn.go
         0     0% 94.72%   190.21MB  1.69%  crypto/tls.(*Conn).serverHandshake /usr/local/go/src/crypto/tls/handshake_server.go
```
As the profile shows, roughly 64% of in-use memory is held by recvBuffer.put:
```go
// recvBuffer is an unbounded channel of recvMsg structs.
// Note recvBuffer differs from controlBuffer only in that recvBuffer
// holds a channel of only recvMsg structs instead of objects implementing "item" interface.
// recvBuffer is written to much more often than
// controlBuffer and using strict recvMsg structs helps avoid allocation in "recvBuffer.put"
type recvBuffer struct {
	c       chan recvMsg
	mu      sync.Mutex
	backlog []recvMsg
}

func newRecvBuffer() *recvBuffer {
	b := &recvBuffer{
		c: make(chan recvMsg, 1),
	}
	return b
}

func (b *recvBuffer) put(r recvMsg) {
	b.mu.Lock()
	if len(b.backlog) == 0 {
		select {
		case b.c <- r:
			b.mu.Unlock()
			return
		default:
		}
	}
	b.backlog = append(b.backlog, r)
	b.mu.Unlock()
}
```
The put method appends each received recvMsg to the b.backlog slice. Both func (b *recvBuffer) put(r recvMsg) and (*recvBuffer).load() read and write b.backlog, so every access to this slice is serialized under the mutex, which costs CPU and hurts throughput under load. Worse, if load cannot keep up with put, the slice keeps growing without bound, producing exactly what looks like a memory leak...
Which raises the question: why is func (b *recvBuffer) put(r recvMsg) being called so heavily? What is writing so frequently... and why???
Locating the suspected source of the memory blow-up:
- grpc/server.go: func (s *Server) serveStreams(st transport.ServerTransport)
- grpc/transport/handler_server.go, line 284: func (ht *serverHandlerTransport) HandleStreams, specifically line 334:
```go
s.trReader = &transportReader{
	reader:        &recvBufferReader{ctx: s.ctx, recv: s.buf},
	windowHandler: func(int) {},
}

// readerDone is closed when the Body.Read-ing goroutine exits.
readerDone := make(chan struct{})
go func() {
	defer close(readerDone)

	// TODO: minimize garbage, optimize recvBuffer code/ownership
	const readSize = 8196
	for buf := make([]byte, readSize); ; {
		n, err := req.Body.Read(buf)
		if n > 0 {
			s.buf.put(recvMsg{data: buf[:n:n]})
			buf = buf[n:]
		}
		if err != nil {
			s.buf.put(recvMsg{err: mapRecvMsgError(err)})
			return
		}
		if len(buf) == 0 {
			buf = make([]byte, readSize)
		}
	}
}()
```
From here the call reaches func (b *recvBuffer) put(r recvMsg). In other words, the message backlog builds up because the server is receiving a flood of HTTP requests. The next step is to find where those requests come from.
Reading the etcd source:
Given the huge number of TCP TIME-WAIT connections between members, the investigation shifted to the member nodes.
There are many places in the etcd source that establish TCP/HTTP connections, and for a while I had no better way to localize the bug than to walk through the code step by step:
- etcdmain/main.go, line 39: startEtcdOrProxyV2()
- etcdmain/etcd.go, line 103: startEtcd()
- etcdmain/etcd.go, line 186: embed.StartEtcd(cfg)
- embed/etcd.go, line 107: startPeerListeners() creates a Listener via rafthttp.NewListener, used for member-to-member communication; it relies on the github.com/coreos/etcd/pkg/transport package
- embed/etcd.go, line 110: startClientListeners() creates a Listener via net.Listen, used for client-to-member communication; keepalive uses the github.com/coreos/etcd/pkg/transport package
- line 157: etcdserver.NewServer() reads the local config and data directory, sets up the mvcc multi-version functions, the lease-renewal functions, and so on
- line 166: e.servePeers() starts Serve on the Listener to accept requests
- line 317: etcdhttp.NewPeerHandler(e.Server) installs the HTTP handlers listed below:
So communication between etcd nodes is served by these handlers:
- "/" → http.NotFound
- "/raft", "/raft/" → raftHandler
- "/members" → mh
- "/leases/internal", "/leases" → leaseHandler
- "/version" → versionHandler
That is, the bulk of member-to-member traffic must go through one of these handlers, unless I missed something in the source...
Given what this product does, what members mostly need to exchange in volume is data, so should the focus be on raftHandler? My head hurt...
raftHandler is what s.RaftHandler() returns, which in turn is the return value of EtcdServer.r.transport.Handler().
EtcdServer.r is assigned at line 414 of etcdserver/server.go:
```go
r: *newRaftNode(
	raftNodeConfig{
		isIDRemoved: func(id uint64) bool { return cl.IsIDRemoved(types.ID(id)) },
		Node:        n,
		heartbeat:   heartbeat,
		raftStorage: s,
		storage:     NewStorage(w, ss),
	},
),
```
At line 511 of etcdserver/server.go, EtcdServer.r.transport = tr assigns a value of type rafthttp.Transport, whose Handler is:
```go
func (t *Transport) Handler() http.Handler {
	pipelineHandler := newPipelineHandler(t, t.Raft, t.ClusterID)
	streamHandler := newStreamHandler(t, t, t.Raft, t.ID, t.ClusterID)
	snapHandler := newSnapshotHandler(t, t.Raft, t.Snapshotter, t.ClusterID)
	mux := http.NewServeMux()
	mux.Handle(RaftPrefix, pipelineHandler)
	mux.Handle(RaftStreamPrefix+"/", streamHandler)
	mux.Handle(RaftSnapshotPrefix, snapHandler)
	mux.Handle(ProbingPrefix, probing.NewHandler())
	return mux
}
```
Several more handlers, all raft-related; the code is dense and the intent is hard to follow.
Printing the http.dial call stack:
Finding the code that initiates these TCP connections by reading alone looked hopeless, so I changed tack and instrumented the TCP Dial path directly: if the addr is the etcd leader's address, print the call stack via debug.PrintStack() from runtime/debug, and also print remote.addr once the connection is established, for cross-checking. Note that this modifies Go's own runtime library (remember to revert it afterwards), so tread carefully. At line 882 of net/http/transport.go I added:
```go
func (t *Transport) dial(ctx context.Context, network, addr string) (net.Conn, error) {
	if t.DialContext != nil {
		return t.DialContext(ctx, network, addr)
	}
	if addr == "leader_ip:2380" {
		debug.PrintStack()
	}
	if t.Dial != nil {
		c, err := t.Dial(network, addr)
		if c == nil && err == nil {
			err = errors.New("net/http: Transport.Dial hook returned (nil, nil)")
		}
		return c, err
	}
	return zeroDialer.DialContext(ctx, network, addr)
}
```
I rebuilt the etcd server, deployed it to one member in the cluster, and load-tested that node until the TIME-WAIT flood reappeared. Once it did, the log contained the stack:
```
0. net.(*Dialer).Dial() /usr/local/Cellar/go/1.9.1/libexec/src/net/dial.go:323
1. github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/transport.(*rwTimeoutDialer).Dial() /Users/cfc4n/gopath/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/transport/timeout_dialer.go:29
2. github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/transport.(*rwTimeoutDialer).Dial-fm() /Users/cfc4n/gopath/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/transport/timeout_transport.go:51
3. net/http.(*Transport).dial() /usr/local/Cellar/go/1.9.1/libexec/src/net/http/transport.go:887
4. net/http.(*Transport).dialConn() /usr/local/Cellar/go/1.9.1/libexec/src/net/http/transport.go:1060
5. net/http.(*Transport).getConn.func4() /usr/local/Cellar/go/1.9.1/libexec/src/net/http/transport.go:943
6. runtime.goexit() /usr/local/Cellar/go/1.9.1/libexec/src/runtime/asm_amd64.s:2337
```
During the load test I also sampled the member's current TCP connections a few times with netstat, again for cross-checking against the log output.
The log printed this stack, plus the remote IP and port, for every TCP connection established. Comparing with the sampled netstat data, the five-tuples of the connections to the leader's port 2380 matched the remote.addr values printed above.
So this is the function creating the flood of member-to-leader TCP connections. With the dialing function and its call stack in hand, haven't we found the problem?
Not so fast: the stack above only reaches coreos/etcd/pkg/transport.(*rwTimeoutDialer).Dial and does not show the full calling chain. Why?
```go
// src/net/http/transport.go, lines 886-887: note the t.Dial branch
func (t *Transport) dial(ctx context.Context, network, addr string) (net.Conn, error) {
	if t.DialContext != nil {
		return t.DialContext(ctx, network, addr)
	}
	if t.Dial != nil {
		c, err := t.Dial(network, addr)
		if c == nil && err == nil {
			err = errors.New("net/http: Transport.Dial hook returned (nil, nil)")
		}
		return c, err
	}
	return zeroDialer.DialContext(ctx, network, addr)
}
```
Why etcd members close the TCP connection after 10s:
In net/http/transport.go, Transport.dial calls the Transport's custom Dial function when one is set (and since http.Transport dials on a fresh goroutine spawned in getConn, the printed stack begins at that goroutine, which is why it is so short). The custom Dial here is defined in github.com/coreos/etcd/pkg/transport/transport.go:
```go
func NewTransport(info TLSInfo, dialtimeoutd time.Duration) (*http.Transport, error) {
	cfg, err := info.ClientConfig()
	if err != nil {
		return nil, err
	}

	t := &http.Transport{
		Proxy: http.ProxyFromEnvironment,
		Dial: (&net.Dialer{
			Timeout: dialtimeoutd,
			// value taken from http.DefaultTransport
			KeepAlive: 30 * time.Second,
		}).Dial,
		// value taken from http.DefaultTransport
		TLSHandshakeTimeout: 10 * time.Second,
		TLSClientConfig:     cfg,
	}

	dialer := (&net.Dialer{
		Timeout:   dialtimeoutd,
		KeepAlive: 30 * time.Second,
	})
	dial := func(net, addr string) (net.Conn, error) {
		return dialer.Dial("unix", addr)
	}

	tu := &http.Transport{
		Proxy:               http.ProxyFromEnvironment,
		Dial:                dial,
		TLSHandshakeTimeout: 10 * time.Second, // note: 10s here, explaining the 10s TLS timeout seen in the capture
		TLSClientConfig:     cfg,
	}
	ut := &unixTransport{tu}

	t.RegisterProtocol("unix", ut)
	t.RegisterProtocol("unixs", ut)
	return t, nil
}
```
At dial := func(net, addr string) (net.Conn, error) ... a custom Dial function is defined and assigned to http.Transport.Dial, and the TLS handshake timeout is set with TLSHandshakeTimeout: 10 * time.Second. That is the 10 seconds we saw in the capture: the member closes the connection because the TLS handshake exceeded this configured timeout.
The only caller of NewTransport I could find is NewTimeoutTransport:
```go
func NewTimeoutTransport(info TLSInfo, dialtimeoutd, rdtimeoutd, wtimeoutd time.Duration) (*http.Transport, error) {
	tr, err := NewTransport(info, dialtimeoutd)
	if err != nil {
		return nil, err
	}

	if rdtimeoutd != 0 || wtimeoutd != 0 {
		// the timed out connection will timeout soon after it is idle.
		// it should not be put back to http transport as an idle connection for future usage.
		tr.MaxIdleConnsPerHost = -1
	} else {
		// allow more idle connections between peers to avoid unnecessary port allocation.
		tr.MaxIdleConnsPerHost = 1024
	}

	tr.Dial = (&rwTimeoutDialer{
		Dialer: net.Dialer{
			Timeout:   dialtimeoutd,
			KeepAlive: 30 * time.Second,
		},
		rdtimeoutd: rdtimeoutd,
		wtimeoutd:  wtimeoutd,
	}).Dial
	return tr, nil
}
```
So finding NewTimeoutTransport's callers should pinpoint the code issuing all these TCP connections... but the earlier stack trace never reached those callers, so another approach was needed.
NewTimeoutTransport also sets tr.MaxIdleConnsPerHost = 1024, the cap on idle HTTP keep-alive connections per host. This parameter controls how many idle TCP connections Go's http client may keep to a server for reuse; etcd raises it to 1024, so an idle member may hold up to 1024 connections to the leader for later reuse. (An aside: some say HTTP is the best RPC. I disagree, if only because of connection reuse. HTTP/1.1 keepalive reuses a connection in a degenerate way: the connection becomes reusable only after the previous request has been answered, so requests and responses are serialized. HTTP/2 does not wait for the previous request to complete; it has true multiplexing (see: Streams and Multiplexing). If you have ever implemented request/response matching for an RPC over raw TCP, the difference between HTTP/1.1 and HTTP/2 connection reuse will be immediately clear.)
Tracing the member-to-leader traffic:
Back to the main thread. I tried to locate the code by reading alone, but etcd's source is complex, and I got stuck around the raft data-synchronization path between nodes. Time for another angle: since we had already removed TLS verification between members to relieve the CPU, we could now capture plaintext on port 2380 on a member and see what was actually being sent.
The capture showed the member sending masses of POST /leases requests to the leader. Searching for that URI leads to the code issuing the HTTP request, at line 252 of etcdserver/v3_server.go:
```go
func (s *EtcdServer) LeaseRenew(ctx context.Context, id lease.LeaseID) (int64, error) {
	ttl, err := s.lessor.Renew(id)
	if err == nil { // already requested to primary lessor(leader)
		return ttl, nil
	}
	if err != lease.ErrNotPrimary {
		return -1, err
	}

	cctx, cancel := context.WithTimeout(ctx, s.Cfg.ReqTimeout())
	defer cancel()

	// renewals don't go through raft; forward to leader manually
	for cctx.Err() == nil && err != nil {
		leader, lerr := s.waitLeader(cctx)
		if lerr != nil {
			return -1, lerr
		}
		for _, url := range leader.PeerURLs {
			lurl := url + leasehttp.LeasePrefix
			ttl, err = leasehttp.RenewHTTP(cctx, id, lurl, s.peerRt)
			if err == nil || err == lease.ErrLeaseNotFound {
				return ttl, err
			}
		}
	}
	return -1, ErrTimeout
}
```
This function calls leasehttp.RenewHTTP to forward the renewal to leader.PeerURLs, i.e. to the leader node. Following into leasehttp.RenewHTTP:
```go
func RenewHTTP(ctx context.Context, id lease.LeaseID, url string, rt http.RoundTripper) (int64, error) {
	// will post lreq protobuf to leader
	lreq, err := (&pb.LeaseKeepAliveRequest{ID: int64(id)}).Marshal()
	if err != nil {
		return -1, err
	}

	cc := &http.Client{Transport: rt}
	req, err := http.NewRequest("POST", url, bytes.NewReader(lreq))
	if err != nil {
		return -1, err
	}
	req.Header.Set("Content-Type", "application/protobuf")
	req.Cancel = ctx.Done()

	resp, err := cc.Do(req)
	if err != nil {
		return -1, err
	}
	// ...
```
The request goes out through the standard library's http.Client.Do, with http.Client.Transport set to the rt http.RoundTripper passed in as a parameter.
Locating the functions behind the 10s TLS timeout:
Going back up one level:
- The s.peerRt argument comes from *EtcdServer.peerRt, assigned peerRt: prt at line 429 of etcdserver/server.go
- prt is assigned at line 288 of the same file: prt, err := rafthttp.NewRoundTripper(cfg.PeerTLSInfo, cfg.peerDialTimeout())
- rafthttp.NewRoundTripper calls transport.NewTimeoutTransport
- And indeed, NewTimeoutTransport calls NewTransport, exactly the code whose dialer showed up in the earlier stack traces creating masses of TCP connections, and the home of the 10s TLS handshake timeout
The chain is now closed: we have confirmed the call path that creates the flood of TCP connections, which confirms the direction. What remains is to follow the thread and find what triggers all these requests.
Back to the caller of LeaseRenew: func (s *EtcdServer) LeaseRenew is invoked from around line 88 of github.com/coreos/etcd/etcdserver/api/v3rpc/lease.go:
```go
func (ls *LeaseServer) leaseKeepAlive(stream pb.Lease_LeaseKeepAliveServer) error {
	for {
		req, err := stream.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			if isClientCtxErr(stream.Context().Err(), err) {
				plog.Debugf("failed to receive lease keepalive request from gRPC stream (%q)", err.Error())
			} else {
				plog.Warningf("failed to receive lease keepalive request from gRPC stream (%q)", err.Error())
			}
			return err
		}

		// Create header before we sent out the renew request.
		// This can make sure that the revision is strictly smaller or equal to
		// when the keepalive happened at the local server (when the local server is the leader)
		// or remote leader.
		// Without this, a lease might be revoked at rev 3 but client can see the keepalive succeeded
		// at rev 4.
		resp := &pb.LeaseKeepAliveResponse{ID: req.ID, Header: &pb.ResponseHeader{}}
		ls.hdr.fill(resp.Header)

		ttl, err := ls.le.LeaseRenew(stream.Context(), lease.LeaseID(req.ID))
		if err == lease.ErrLeaseNotFound {
			err = nil
			ttl = 0
		}
		if err != nil {
			return togRPCError(err)
		}

		resp.TTL = ttl
		err = stream.Send(resp)
		if err != nil {
			if isClientCtxErr(stream.Context().Err(), err) {
				plog.Debugf("failed to send lease keepalive response to gRPC stream (%q)", err.Error())
			} else {
				plog.Warningf("failed to send lease keepalive response to gRPC stream (%q)", err.Error())
			}
			return err
		}
	}
}
```
- Line 71: func (ls *LeaseServer) LeaseKeepAlive calls func (ls *LeaseServer) leaseKeepAlive
- Line 3417 of etcdserver/etcdserverpb/rpc.pb.go: func _Lease_LeaseKeepAlive_Handler(srv interface{}, stream grpc.ServerStream) calls func (ls *LeaseServer) LeaseKeepAlive
- Line 3480 of etcdserver/etcdserverpb/rpc.pb.go registers _Lease_LeaseKeepAlive_Handler as the handler for StreamName: "LeaseKeepAlive"; at startup this is registered with the listener via (*grpc.Server).RegisterService
- leaseKeepAlive is an event loop: whenever stream.Recv() returns a message, a renewal is forwarded to the leader node
- stream.Recv() lives in google.golang.org/grpc/stream.go, which is not etcd's own code but another open-source project, meaning yet more code to read and digest
After a long read, it turns out the messages received here all come from clients. The naming gives a hint: variables on the client-to-member path contain stream, while member-to-leader and member-to-member paths use peer. In fact, clients talk to etcd nodes over HTTP/2 with gRPC + protobuf message encoding, while etcd members talk to each other over HTTP/1.1, also with protobuf-encoded messages.
When does an etcd lease get renewed:
So the flood of lease-renewal requests comes from the clients. In our case, the host lease TTL is set to 10s; could that be the problem? Who sends the client's renewals, the client itself? How often? One second before expiry?
In the etcd client library, the renewal code sits at line 474 of clientv3/lease.go:
```go
// send update to all channels
nextKeepAlive := time.Now().Add((time.Duration(karesp.TTL) * time.Second) / 3.0)
ka.deadline = time.Now().Add(time.Duration(karesp.TTL) * time.Second)
for _, ch := range ka.chs {
	select {
	case ch <- karesp:
		ka.nextKeepAlive = nextKeepAlive
	default:
	}
}
```
nextKeepAlive is set to one third of the lease TTL and is the trigger time for the next renewal; the sendKeepAliveLoop function described below loops every 500ms checking whether a renewal is due. So with our 10s TTL, a renewal goes out roughly every 3 seconds. With 30,000 clients that is about 10,000 write requests per second, while our load tests (and the official benchmark numbers) show the cluster sustaining 40,000+ qps, so it shouldn't collapse under just this. A loose end. Still, by the earlier analysis, masses of renewals drive masses of TCP connections, which drive heavy object allocation and memory growth on the leader and heavy GC CPU, so it was worth optimizing anyway: we raised the TTL to 600s, i.e. 10 minutes, which means a renewal every 200 seconds.
Raising the lease TTL to reduce cluster load:
We rolled the fixed version out to 6,000+ hosts in the test environment and watched. Sure enough, memory dropped noticeably after the rollout, and so did CPU, but about an hour later both metrics climbed back up...
The problem persists:
Why? Was the hour of lower load just an artifact of the rolling deploy? Did the lease change help at all? I had tested it before shipping, test report and all (agent etcd lease TTL adjustment): renewals were one per 200s. So what was going on? To confirm, I captured packets on one member again.
The capture indeed showed masses of lease packets. But what is the interval per lease ID? Is it the expected 200 seconds? Are these packets all for the same lease ID? How many distinct lease IDs are there?
Analyzing the tcpdump pcapng file:
The earlier source reading showed that etcd nodes exchange protobuf-encoded messages, so we can copy the HTTP body bytes out of the capture and decode them with the corresponding protobuf struct. But between 2018-07-05 16:38:31.015981 and 2018-07-05 16:43:20.420401, an interval of five minutes, there were 1.03 million packets; copying by hand was hopeless. So I wrote a program on top of a pcap package to walk the whole pcapng capture, filter messages with content type application/protobuf, take the HTTP body, parse it as a pb.LeaseKeepAliveRequest, and count the total number of lease keepalives and the number of unique lease IDs.
The result: 77,550 lease keepalives in total, across only 1,382 unique lease IDs.
In other words, the same lease IDs were sending renewal requests over and over. Another quick program pulled those lease IDs out and queried their TTLs on the cluster, expecting remaining TTLs somewhere in the 400-600s range. The answer was always 599 seconds. Taking a single lease ID and fetching its TTL once per second: TTL 599 every time, with grantTTL 600, meaning the etcd members were renewing these leases non-stop. Since the earlier analysis established that member-side renewals are driven by client lease keepalive requests, the focus had to move to the clients.
A packet capture on one sampled client confirmed it was indeed sending /etcdserverpb.Lease/LeaseKeepAlive requests frequently: one every second. Very strange, and completely unlike my earlier test, where a renewal went out every 200s. Why so frequent now?
Back to the etcd clientv3 code:
With a symptom this odd, the only way forward was the source again, starting at /etcdserverpb.Lease/LeaseKeepAlive. The sendKeepAliveLoop function is where the client loops and decides whether each lease needs renewal:
```go
// sendKeepAliveLoop sends keep alive requests for the lifetime of the given stream.
func (l *lessor) sendKeepAliveLoop(stream pb.Lease_LeaseKeepAliveClient) {
	for {
		var tosend []LeaseID

		now := time.Now()
		l.mu.Lock()
		for id, ka := range l.keepAlives {
			if ka.nextKeepAlive.Before(now) {
				tosend = append(tosend, id)
			}
		}
		l.mu.Unlock()

		for _, id := range tosend {
			r := &pb.LeaseKeepAliveRequest{ID: int64(id)}
			if err := stream.Send(r); err != nil {
				// TODO do something with this error?
				return
			}
		}

		select {
		case <-time.After(500 * time.Millisecond):
		case <-stream.Context().Done():
			return
		case <-l.donec:
			return
		case <-l.stopCtx.Done():
			return
		}
	}
}
```
As shown, every 500ms the loop scans the l.keepAlives map; for each LeaseID whose nextKeepAlive time has passed, the client sends a LeaseKeepAliveRequest. nextKeepAlive is assigned in the recvKeepAlive function, clientv3/lease.go line 473:
```go
// send update to all channels
nextKeepAlive := time.Now().Add((time.Duration(karesp.TTL) * time.Second) / 3.0)
ka.deadline = time.Now().Add(time.Duration(karesp.TTL) * time.Second)
for _, ch := range ka.chs {
	select {
	case ch <- karesp:
		ka.nextKeepAlive = nextKeepAlive
	default:
	}
}
```
Yes, we looked at this code before, but only at the TTL arithmetic. Look closely at ka.nextKeepAlive = nextKeepAlive: the next renewal time is assigned only inside the select case guarded by ch <- karesp. In Go, this non-blocking send means: if the channel can accept the value, take the case and update ka.nextKeepAlive; otherwise, do not assign it. Otherwise... do not assign it. Otherwise, do not assign it.
So how long is this channel? Could writes outpace reads until it fills, so the send fails and the next renewal time never gets updated? Following the trail to the KeepAlive function at line 222 of clientv3/lease.go:
```go
func (l *lessor) KeepAlive(ctx context.Context, id LeaseID) (<-chan *LeaseKeepAliveResponse, error) {
	ch := make(chan *LeaseKeepAliveResponse, leaseResponseChSize)

	l.mu.Lock()
	// ensure that recvKeepAliveLoop is still running
	select {
	case <-l.donec:
		err := l.loopErr
		l.mu.Unlock()
		close(ch)
		return ch, ErrKeepAliveHalted{Reason: err}
	default:
	}
	ka, ok := l.keepAlives[id]
	if !ok {
		// create fresh keep alive
		ka = &keepAlive{
			chs: []chan<- *LeaseKeepAliveResponse{ch},
			// ...
```
The channel is created with ch := make(chan *LeaseKeepAliveResponse, leaseResponseChSize), where leaseResponseChSize is a constant equal to 16. chs is a slice of length 1 holding just this channel. Next step: find the code that reads from ch. But reading all of lease.go, I found nowhere that drains this channel, only a few close calls during cleanup. So the question became: if nothing reads ch, it fills up, every subsequent non-blocking send fails, ka.nextKeepAlive = nextKeepAlive is never reached, ka.nextKeepAlive stays stuck at a time already in the past, and the client fires a renewal on every pass of the loop.
Found the clue: two lease keepalive requests per second:
Time to test. In the default branch of the ch send I logged a "failed to write" message, and after the for loop printed the keepAlive's TTL bookkeeping for cross-checking:

```go
log.Println(fmt.Sprintf("karesp.TTL:%d, nextKeepAlive:%s, ka.nextKeepAlive:%s, ka.deadline:%s",
	karesp.TTL, nextKeepAlive.String(), ka.nextKeepAlive.String(), ka.deadline.String()))
```

I rebuilt with this change, deployed it to one client with DEBUG mode off, and watched the log output (see /xxxx/xxx/core/nohup.out.1). The program started at 2018/07/05 21:10:42; after a bit over an hour, at 2018/07/05 22:04:05, lease renewals did indeed flip from one per 200s to two per second. Querying the corresponding lease IDs on the cluster again showed TTL 599 and grantTTL 600, confirming the client was re-sending renewal requests continuously. The first build didn't print ka.nextKeepAlive; after adjusting the logging, and changing the 600s TTL to 60s to shorten the wait, I got the following log:
2018/07/05 22:40:02 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:40:02 karesp.TTL:60, nextKeepAlive:2018-07-05 22:40:22.811626819 +0800 CST m=+40.135091508,ka.nextKeepAlive:2018-07-05 22:40:22.811626819 +0800 CST m=+40.135091508, ka.deadline:2018-07-05 22:41:02.811627601 +0800 CST m=+80.135092217
2018/07/05 22:40:22 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:40:22 karesp.TTL:60, nextKeepAlive:2018-07-05 22:40:42.922698966 +0800 CST m=+60.246163715,ka.nextKeepAlive:2018-07-05 22:40:42.922698966 +0800 CST m=+60.246163715, ka.deadline:2018-07-05 22:41:22.922699917 +0800 CST m=+100.246164470
2018/07/05 22:40:43 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:40:43 karesp.TTL:60, nextKeepAlive:2018-07-05 22:41:03.330494852 +0800 CST m=+80.653959932,ka.nextKeepAlive:2018-07-05 22:41:03.330494852 +0800 CST m=+80.653959932, ka.deadline:2018-07-05 22:41:43.330496602 +0800 CST m=+120.653961489
2018/07/05 22:41:03 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:41:03 karesp.TTL:60, nextKeepAlive:2018-07-05 22:41:23.341112484 +0800 CST m=+100.664577164,ka.nextKeepAlive:2018-07-05 22:41:23.341112484 +0800 CST m=+100.664577164, ka.deadline:2018-07-05 22:42:03.341113662 +0800 CST m=+140.664578212
2018/07/05 22:41:23 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:41:23 karesp.TTL:60, nextKeepAlive:2018-07-05 22:41:43.350264635 +0800 CST m=+120.673729313,ka.nextKeepAlive:2018-07-05 22:41:43.350264635 +0800 CST m=+120.673729313, ka.deadline:2018-07-05 22:42:23.350265498 +0800 CST m=+160.673730050
2018/07/05 22:41:43 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:41:43 karesp.TTL:60, nextKeepAlive:2018-07-05 22:42:03.572888276 +0800 CST m=+140.896353032,ka.nextKeepAlive:2018-07-05 22:42:03.572888276 +0800 CST m=+140.896353032, ka.deadline:2018-07-05 22:42:43.572889271 +0800 CST m=+180.896353840
2018/07/05 22:42:03 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:42:03 karesp.TTL:60, nextKeepAlive:2018-07-05 22:42:23.86879775 +0800 CST m=+161.192262488,ka.nextKeepAlive:2018-07-05 22:42:23.86879775 +0800 CST m=+161.192262488, ka.deadline:2018-07-05 22:43:03.868798687 +0800 CST m=+201.192263243
2018/07/05 22:42:23 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:42:23 karesp.TTL:60, nextKeepAlive:2018-07-05 22:42:43.877613966 +0800 CST m=+181.201079020,ka.nextKeepAlive:2018-07-05 22:42:43.877613966 +0800 CST m=+181.201079020, ka.deadline:2018-07-05 22:43:23.877615637 +0800 CST m=+221.201080417
2018/07/05 22:42:43 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:42:43 karesp.TTL:60, nextKeepAlive:2018-07-05 22:43:03.884576518 +0800 CST m=+201.208041122,ka.nextKeepAlive:2018-07-05 22:43:03.884576518 +0800 CST m=+201.208041122, ka.deadline:2018-07-05 22:43:43.884577361 +0800 CST m=+241.208041904
2018/07/05 22:43:03 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:43:03 karesp.TTL:60, nextKeepAlive:2018-07-05 22:43:23.893274193 +0800 CST m=+221.216738996,ka.nextKeepAlive:2018-07-05 22:43:23.893274193 +0800 CST m=+221.216738996, ka.deadline:2018-07-05 22:44:03.893275148 +0800 CST m=+261.216739702
2018/07/05 22:43:23 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:43:24 karesp.TTL:60, nextKeepAlive:2018-07-05 22:43:44.058766258 +0800 CST m=+241.382230998,ka.nextKeepAlive:2018-07-05 22:43:44.058766258 +0800 CST m=+241.382230998, ka.deadline:2018-07-05 22:44:24.058767168 +0800 CST m=+281.382231711
2018/07/05 22:43:44 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:43:44 karesp.TTL:60, nextKeepAlive:2018-07-05 22:44:04.413012991 +0800 CST m=+261.736477737,ka.nextKeepAlive:2018-07-05 22:44:04.413012991 +0800 CST m=+261.736477737, ka.deadline:2018-07-05 22:44:44.41301386 +0800 CST m=+301.736478391
2018/07/05 22:44:04 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:44:04 karesp.TTL:60, nextKeepAlive:2018-07-05 22:44:24.420796492 +0800 CST m=+281.744261106,ka.nextKeepAlive:2018-07-05 22:44:24.420796492 +0800 CST m=+281.744261106, ka.deadline:2018-07-05 22:45:04.420797607 +0800 CST m=+321.744262161
2018/07/05 22:44:24 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:44:24 karesp.TTL:60, nextKeepAlive:2018-07-05 22:44:44.428145652 +0800 CST m=+301.751610589,ka.nextKeepAlive:2018-07-05 22:44:44.428145652 +0800 CST m=+301.751610589, ka.deadline:2018-07-05 22:45:24.428147239 +0800 CST m=+341.751612018
2018/07/05 22:44:44 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:44:44 karesp.TTL:60, nextKeepAlive:2018-07-05 22:45:04.436151288 +0800 CST m=+321.759615961,ka.nextKeepAlive:2018-07-05 22:45:04.436151288 +0800 CST m=+321.759615961, ka.deadline:2018-07-05 22:45:44.436152168 +0800 CST m=+361.759616723
2018/07/05 22:45:04 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:45:04 cannot write
2018/07/05 22:45:04 karesp.TTL:60, nextKeepAlive:2018-07-05 22:45:24.441563525 +0800 CST m=+341.765028110,ka.nextKeepAlive:2018-07-05 22:45:04.436151288 +0800 CST m=+321.759615961, ka.deadline:2018-07-05 22:46:04.44156426 +0800 CST m=+381.765028789
2018/07/05 22:45:04 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:45:04 cannot write
2018/07/05 22:45:04 karesp.TTL:60, nextKeepAlive:2018-07-05 22:45:24.941897369 +0800 CST m=+342.265361977,ka.nextKeepAlive:2018-07-05 22:45:04.436151288 +0800 CST m=+321.759615961, ka.deadline:2018-07-05 22:46:04.941898136 +0800 CST m=+382.265362694
2018/07/05 22:45:05 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:45:05 cannot write
After 16 responses went unconsumed, i.e., starting with the 17th, the lease renewal requests settled into 2 per second... which confirms the bug.
In other words: no matter how long the granted lease TTL is, the client eventually ends up sending a lease renewal every 0.5 seconds!!!
One puzzle remained: after the earlier canary rollout of version 134, CPU dropped for about an hour before climbing again. Why?
Explaining the one-hour CPU dip:
Do the math: with TTL 600, the renewal interval is 200s (TTL/3); ka.chs is a slice of length 1, and its one chan *LeaseKeepAliveResponse has capacity 16, so filling it completely takes 200 * 16 / 60 ≈ 53 minutes. That matches the CPU graph in Falcon very well.
Cross-checking against my first instrumented run: 2018/07/05 22:04:05 minus 2018/07/05 21:10:42 is also about 53 minutes. So the "nearly one hour" dip lines up with Falcon's statistics.
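For reference, the arithmetic above can be sketched out (constants as used in this post: TTL 600s, renewal at TTL/3, channel capacity 16):

```go
package main

import "fmt"

// fillMinutes returns roughly how long (in minutes) it takes for the
// keepalive response channel to fill when nobody consumes it: one
// response lands every ttlSeconds/3 seconds, and the channel holds
// chanSize responses before writes start being dropped.
func fillMinutes(ttlSeconds, chanSize int) float64 {
	interval := ttlSeconds / 3 // renewal interval in seconds
	return float64(interval*chanSize) / 60
}

func main() {
	// TTL 600s, channel capacity 16 -> 200 * 16 / 60 ≈ 53.3 minutes
	fmt.Printf("channel fills after ~%.1f minutes\n", fillMinutes(600, 16))
}
```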
Finding the real culprit:
With the problem identified, the next step was the fix. I never did find where the ch channel is read, and maybe I'm missing the author's intent, but in my view, whether or not the write to ch succeeds, the next renewal time must be updated to the latest value: moving ka.nextKeepAlive = nextKeepAlive outside the for/select does it.
That said, our product was under pressure to ship, so to be safe we sacrificed the lease requirement and removed the feature altogether. After deploying, load improved dramatically: CPU fell from 40-50% to roughly 2-5%.
Retrospective:
Looking back at everything the etcd cluster exhibited, every symptom is explained once this bug is in the picture. In the order events unfolded:
- The etcd clientv3 lease keepalive library has a bug: no matter what the TTL is, after 16 unconsumed responses the client drops to renewing every 0.5s
- After the agent was deployed to the full fleet (20K hosts), each agent held one lease id
- Each agent (a client, from etcd's point of view) connects to an etcd member and sends its 2 lease renewal requests per second to that member
- Between a member and the leader, lease renewal uses HTTP/1.1; a TCP connection cannot be reused before the previous request is answered, so nearly every renewal rode a newly created TCP connection
- The etcd leader therefore received 20K * 2 TCP connection requests per second
- etcd nodes speak HTTPS to each other, so every new TCP connection triggers a TLS handshake; RSA verification is expensive math, and it burned enough CPU to drive the leader's usage up
- The leader's gRPC-based HTTP server allocates objects, structs, and channels for every incoming client connection, consuming more and more memory
- Lease keepalive requests forwarded by members accumulate in a slice on the leader (this part is actually inside gRPC)
- The leader wrote recvMsg entries into b.backlog faster than it drained them, so the slice kept growing and memory kept climbing, giving the appearance of a memory leak
- The leader processed requests more slowly than they arrived; the backlog outlived the lease TTL, so leases were judged expired and deleted, which also explains the hids server keys vanishing for no apparent reason
- Members kept opening new TCP connections to the leader, which the leader closed after handling, leaving massive numbers of TIME-WAIT connections on the leader.
- When the leader handled a member's lease request too slowly (past the 10s TLS/transport timeout), the member timed out and closed the connection first; the leader still couldn't keep up, and those connections sat in CLOSE-WAIT
In the end, the last straw broke the camel's back. The bug was reported upstream to etcd: clientv3 sdk will send /etcdserverpb.Lease/LeaseKeepAlive twice every second for lease id after over time
Some thoughts on overload protection for etcd:
- Would it be better to send the lease keepalive at the 2/3 point of the TTL instead?
- Should gRPC get overload protection too, e.g., replace the unbounded slice that accepts client requests with a fixed-length queue and drop what overflows, to keep the cluster available?
- Would HTTP/2 be a better fit for inter-node communication?
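The fixed-length-queue idea in the second bullet can be sketched with a bounded channel that sheds load instead of growing a slice. This is purely illustrative; real gRPC internals differ:

```go
package main

import "fmt"

// boundedBacklog sketches a drop-on-overflow policy: instead of an
// unbounded slice of pending messages, keep a fixed-capacity queue and
// discard new work when it is full. Illustrative only.
type boundedBacklog struct {
	q chan string
}

func newBoundedBacklog(capacity int) *boundedBacklog {
	return &boundedBacklog{q: make(chan string, capacity)}
}

// offer enqueues msg, or reports false if the backlog is full (dropped).
func (b *boundedBacklog) offer(msg string) bool {
	select {
	case b.q <- msg:
		return true
	default:
		return false // overloaded: shed load instead of growing memory
	}
}

func main() {
	b := newBoundedBacklog(2)
	fmt.Println(b.offer("a"), b.offer("b"), b.offer("c")) // true true false
}
```

Dropping keepalive traffic under overload is relatively safe, since clients retry renewals anyway; losing one renewal only matters if the lease is already near expiry.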
Disclaimer
- This article may be reposted with attribution.
- Original: www.cnxct.com/etcd-lease-keepalive-debug-note/
Does this affect your etcd deployment?
- You use etcd clientv3 (the golang client library); all v3 client versions are affected (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x).
- You use lease grants with keepalive enabled
- The effect only shows at scale; with a few hundred clients, the performance impact is small.
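If you are affected, the practical client-side mitigation is to always consume the channel returned by KeepAlive (e.g. `ch, _ := cli.KeepAlive(ctx, leaseID)` followed by a goroutine that ranges over `ch`). A runnable sketch of the drain pattern, with the response type simplified to plain TTL values so it needs no etcd server:

```go
package main

import "fmt"

// drain consumes keepalive responses until the channel closes and returns
// how many were received. In real code the channel would be the
// <-chan *clientv3.LeaseKeepAliveResponse returned by clientv3's
// Lease.KeepAlive; it is simplified to int64 TTLs here so the pattern
// runs standalone.
func drain(ch <-chan int64) (count int) {
	for range ch {
		count++ // a real consumer might also inspect resp.TTL here
	}
	return
}

func main() {
	ch := make(chan int64, 4)
	for i := 0; i < 4; i++ {
		ch <- 10 // pretend: four keepalive responses with TTL 10
	}
	close(ch)
	fmt.Println("consumed:", drain(ch)) // consumed: 4
}
```

Draining keeps the library's internal 16-slot channel from ever filling, which is what triggered the 500ms renewal storm in the first place.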
Fixed upstream
coreos etcd fixed this bug on July 24, 2018, with an impressively fast turnaround, across three etcd clientv3 release branches. Details:
- https://github.com/coreos/etcd/blob/master/CHANGELOG-3.1.md
- https://github.com/coreos/etcd/blob/master/CHANGELOG-3.2.md
- https://github.com/coreos/etcd/blob/master/CHANGELOG-3.3.md
Fix lease keepalive interval updates when response queue is full.
If the <-chan *clientv3.LeaseKeepAliveResponse from clientv3.Lease.KeepAlive was never consumed or the channel was full, the client sent a keepalive request every 500ms instead of the expected rate of every "TTL / 3" duration.
Written by CFC4N of CFC4N's blog, licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0, based on work at https://www.cnxct.com. When reposting, please credit the source: a debugging note on the etcd clientv3 lease keepalive frequent-renewal bug.
The write to ch blocks because you never read the chan returned by KeepAlive, right?
Right, we never read the keepAliveResponse chan; nothing in the docs or example code said it had to be read. That said, I still think that whether or not the responses are consumed, it shouldn't affect the next TTL renewal time.
// KeepAlive keeps the given lease alive forever. If the keepalive response
// posted to the channel is not consumed immediately, the lease client will
// continue sending keep alive requests to the etcd server at least every
// second until latest response is consumed.
KeepAlive's comment clearly states that you should read the responses; otherwise it sends at least once per second…
True, you're quite right; I didn't read the docs carefully enough at the time..
Still, as in the reply above, I don't think whether the responses are read should affect the next TTL renewal time, no?
The chan returned by KeepAlive keeps getting closed on me; does anyone know how to solve that?
Impressive work!
Very nice.
One question: the upstream fix doesn't actually address the item below, does it?
> The etcd leader processed requests more slowly than they arrived; the backlog outlived the lease TTL, so leases were judged expired and deleted, which also explains the hids server keys vanishing for no apparent reason