etcd clientv3 lease keepalive: notes from debugging a runaway lease-renewal bug

Background:

  • etcd 3.2.23
  • golang 1.9.1
  • CentOS 6
  • a 7-node etcd cluster; members talk to each other over HTTPS, and clients talk to the cluster over HTTPS as well
  • ~20,000 hosts in the test environment, each running our agent; every agent registers a key attached to a lease, with the lease TTL set to 10s
  • keepalive enabled on the lease
  • the product was waiting to launch, the cluster kept hitting performance problems, and after a long time without a root cause I felt close to a breakdown
  • with the performance problem unsolved, no more production servers could be deployed; management was not happy, and I was starting to fear for my job...
  • we filed an issue on the etcd GitHub repo, but the maintainers had no time to look at it (understandable — they get far too many issues, and most turn out to be user error)...

var err error
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TRANSPORT_TIMEOUT)
leaseResp, err := this.GetEtcdClient().Grant(ctx, 10)	// lease TTL set to 10 seconds
cancel()
if err != nil {
	return err
}

ctx, cancel = context.WithTimeout(context.Background(), ETCD_TRANSPORT_TIMEOUT)
kvc := etcdclient.NewKV(this.kapi)
var txnResp *etcdclient.TxnResponse
txnResp, err = kvc.Txn(ctx).
	If(etcdclient.Compare(etcdclient.CreateRevision(keyName), "=", 0)).
	Then(etcdclient.OpPut(keyName, value.GetRegisterInfo(),etcdclient.WithLease(etcdclient.LeaseID(leaseResp.ID)))).
	Commit()
cancel()

if err != nil {
	return err
}

if !txnResp.Succeeded {
	return fmt.Errorf("failed to register self: %s, the key already exists", keyName)
}

// KeepAlive runs for the lifetime of the lease, so it must not use a timeout
// context; note that the returned response channel is discarded here
_, err = this.GetEtcdClient().KeepAlive(context.TODO(), leaseResp.ID)
if err != nil {
	return err
}

Symptoms:

  • keys carrying a lease ID disappear for no apparent reason
  • the leader node's TIME-WAIT count is extremely high most of the time
  • the leader's CLOSE-WAIT count is sometimes very high as well
  • leader CPU usage sits around 80-90%
  • leader memory usage climbs continuously — it looks like a memory leak

The monitoring graphs:

Starting with the etcd process on the leader:

First, the suspected memory leak. etcd already exposes pprof endpoints, and the Go toolchain ships a tool that reads http pprof profiles directly — but it does not speak HTTPS, so the profile has to be downloaded first and then loaded locally:

curl --cacert /xxx/xxx/etcd/config/ssl/ca.pem --cert /xxx/xxx/etcd/config/ssl/client.pem --key /xxx/xxx/etcd/config/ssl/client-key.pem   -L https://leader_ip:2379/debug/pprof/profile > leader_ip_cpu.txt

The result:

cfc4n@cnxct:~/Downloads$ go tool pprof leader_ip_cpu.txt
File: etcd
Type: cpu
Time: Jul 3, 2018 at 4:13pm (CST)
Duration: 2.64mins, Total samples = 34.54mins (1306.43%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 20
Showing nodes accounting for 23.98mins, 69.42% of 34.54mins total
Dropped 1702 nodes (cum <= 0.17mins)
Showing top 20 nodes out of 211
      flat  flat%   sum%        cum   cum%
  3.62mins 10.48% 10.48%   3.62mins 10.48%  runtime.futex /usr/local/go/src/runtime/sys_linux_amd64.s
  2.99mins  8.66% 19.13%   2.99mins  8.66%  math/big.divWVW /usr/local/go/src/math/big/arith_amd64.s
  2.34mins  6.78% 25.91%   2.34mins  6.78%  math/big.bitLen /usr/local/go/src/math/big/arith_amd64.s
  1.90mins  5.51% 31.42%  10.44mins 30.22%  math/big.nat.montgomery /usr/local/go/src/math/big/nat.go
  1.67mins  4.82% 36.24%   1.67mins  4.82%  runtime.memmove /usr/local/go/src/runtime/memmove_amd64.s
  1.46mins  4.22% 40.47%   3.57mins 10.33%  runtime.scanobject /usr/local/go/src/runtime/mgcmark.go
  1.36mins  3.93% 44.40%   1.44mins  4.17%  syscall.Syscall /usr/local/go/src/syscall/asm_linux_amd64.s
  1.30mins  3.78% 48.17%   1.30mins  3.78%  runtime.heapBitsForObject /usr/local/go/src/runtime/mbitmap.go
  1.04mins  3.01% 51.19%   1.04mins  3.02%  runtime.greyobject /usr/local/go/src/runtime/mgcmark.go
  0.94mins  2.71% 53.90%   0.94mins  2.71%  math/big.divWVW
  0.80mins  2.33% 56.23%   0.83mins  2.41%  runtime.siftdownTimer /usr/local/go/src/runtime/time.go
  0.80mins  2.32% 58.55%   3.17mins  9.17%  math/big.nat.divLarge /usr/local/go/src/math/big/nat.go
  0.78mins  2.27% 60.82%   0.78mins  2.27%  math/big.addMulVVW /usr/local/go/src/math/big/arith_amd64.s
  0.57mins  1.66% 62.48%   0.57mins  1.66%  runtime._ExternalCode /usr/local/go/src/runtime/proc.go
  0.57mins  1.64% 64.12%   0.57mins  1.64%  math/big.(*byteReader).Read <autogenerated>
  0.42mins  1.22% 65.33%   3.22mins  9.32%  runtime.mallocgc /usr/local/go/src/runtime/malloc.go
  0.40mins  1.17% 66.50%   0.40mins  1.17%  math/big.mulAddVWW /usr/local/go/src/math/big/arith_amd64.s
  0.34mins  0.99% 67.49%   0.34mins  0.99%  vendor/golang_org/x/crypto/curve25519.ladderstep /usr/local/go/src/vendor/golang_org/x/crypto/curve25519/ladderstep_amd64.s
  0.34mins  0.99% 68.49%   0.34mins  0.99%  math/big.bitLen
  0.32mins  0.93% 69.42%   0.41mins  1.18%  sync.(*Pool).pin /usr/local/go/src/sync/pool.go

Note how much CPU goes to math/big functions. Our business logic involves no mathematical computation and never imports math; the only plausible source is the heavy arithmetic behind RSA verification of TLS certificates on the HTTPS connections. Exporting the profile as an SVG confirmed it: TLS certificate computation was consuming most of the CPU.

But shouldn't an HTTPS connection be established once and then reused? Where does all this handshake arithmetic come from? Could the huge TIME-WAIT count be related? Back to TIME-WAIT, then: at this point the leader is xxxx-etcdserver-xxx02, and its TIME_WAIT count is enormous:

[user@xxxx-etcdserver-xxx02 etcd]$ ss |grep 2380|awk '{print $5}'|awk -F ":" '{print $1}'|sort |uniq -c|sort -rn
  53882 192.168.198.234
  53667 192.168.227.202
  53539 192.168.255.210
  53197 192.168.58.204
  53135 192.168.137.210
  51966 192.168.128.205

All of the leader's TIME-WAIT connections come from the member nodes — tens of thousands per member on average, and querying again a few seconds later still shows 50,000+. On each member, meanwhile, there are over a thousand connections to the leader's port 2380 in ESTABLISHED state, and connections really are being created constantly. Why do the members keep creating new connections? Let's capture some TCP traffic with tcpdump:

From the capture (key packets listed below), what stands out?

  • 120140 2018-06-22 20:37:30.219 10.72.198.234 74 10.20.127.136 TCP 12902 → etcd-server(2380) [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=3008487334 TSecr=0 WS=128
  • 120141 2018-06-22 20:37:30.219 10.20.127.136 74 10.72.198.234 TCP etcd-server(2380) → 12902 [SYN, ACK] Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 SACK_PERM=1 TSval=2796204056 TSecr=3008487334 WS=128
  • 120362 2018-06-22 20:37:30.242 10.72.198.234 66 10.20.127.136 TCP 12902 → etcd-server(2380) [ACK] Seq=1 Ack=1 Win=14720 Len=0 TSval=3008487358 TSecr=2796204056
  • 120368 2018-06-22 20:37:30.242 10.72.198.234 204 10.20.127.136 TCP 12902 → etcd-server(2380) [PSH, ACK] Seq=1 Ack=1 Win=14720 Len=138 TSval=3008487358 TSecr=2796204056
  • 120369 2018-06-22 20:37:30.242 10.20.127.136 66 10.72.198.234 TCP etcd-server(2380) → 12902 [ACK] Seq=1 Ack=139 Win=15616 Len=0 TSval=2796204080 TSecr=3008487358
  • 174076 2018-06-22 20:37:40.242 10.72.198.234 66 10.20.127.136 TCP 12902 → etcd-server(2380) [FIN, ACK] Seq=139 Ack=1 Win=14720 Len=0 TSval=3008497358 TSecr=2796204080
  • 174377 2018-06-22 20:37:40.282 10.20.127.136 66 10.72.198.234 TCP etcd-server(2380) → 12902 [ACK] Seq=1 Ack=140 Win=15616 Len=0 TSval=2796214120 TSecr=3008497358

What can you read from these packets? Take five minutes with the trace before reading on...

  • 120140: the member at 10.72.198.234, acting as the client, initiates a TCP connection to the leader at 10.20.127.136:2380.
  • 120141 is the leader's SYN-ACK; 120362 is the member's ACK completing the handshake. Straightforward.
  • 120368 is a data packet from member to leader; 120369 is the leader's ACK for it. Also clear.
  • But 174076? That is a FIN from the member, closing the TCP connection. Why? Why would the member close the connection?
  • Sampling a few more flows, picked from the 1,000+ SYNs listed in the "Expert Information" pane...

All of them show the member sending the FIN to the leader. Why? And is that all there is to see? Notice the gap between the member's data packet (and the leader's ACK for it) and the member's FIN: it is always 10 seconds. Why 10 seconds?

The leader ACKs but never sends its own FIN,ACK. If it never does, what state is the leader's side of that TCP five-tuple left in? Right — CLOSE-WAIT. Could this be what piles up all the CLOSE-WAIT connections on the leader? Keep that suspicion in mind and read on...

As noted, the member sent the leader a single PSH data packet. What was in it, and why did the leader never answer?

The payload is 138 bytes and looks opaque at first — but the first three bytes should ring a bell: 0x16, 0x03, 0x01. Anyone familiar with HTTPS will recognize the start of a TLS record: 0x16 (decimal 22) is the Handshake content type, which carries the Client Hello, and the following 0x03, 0x01 is the TLS record version. Since our members and the leader really do speak HTTPS to each other, this must be a TLS Client Hello. Combined with the 10s gap found earlier, it means the TLS handshake between member and leader never completed — the leader never replied with a Server Hello. The packet is short and yields nothing else useful...

Dropping HTTPS to relieve the CPU:

Back to the CPU profile. TLS certificate arithmetic is what drives CPU usage up, but the flood of freshly created TCP connections is what triggers all that TLS work in the first place — and, with it, the memory growth and message backlog. We nevertheless tried removing TLS certificate verification between members and the leader to lighten the leader's load. It barely helped; CPU stayed high. Not surprising, since this was never the root cause. The new profile:

cfc4n@cnxct:~/Downloads$ go tool pprof 111.111.111.111-0703_cpu.txt
File: etcd
Type: cpu
Time: Jul 3, 2018 at 4:55pm (CST)
Duration: 30.01s, Total samples = 6.16mins (1231.36%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 20
Showing nodes accounting for 4.29mins, 69.69% of 6.16mins total
Dropped 928 nodes (cum <= 0.03mins)
Showing top 20 nodes out of 181
      flat  flat%   sum%        cum   cum%
  2.13mins 34.62% 34.62%   2.13mins 34.62%  runtime.futex /usr/local/go/src/runtime/sys_linux_amd64.s
  0.59mins  9.51% 44.13%   0.62mins 10.06%  syscall.Syscall /usr/local/go/src/syscall/asm_linux_amd64.s
  0.33mins  5.42% 49.55%   0.33mins  5.42%  runtime._ExternalCode /usr/local/go/src/runtime/proc.go
  0.25mins  4.07% 53.62%   1.23mins 20.00%  runtime.lock /usr/local/go/src/runtime/lock_futex.go
  0.13mins  2.18% 55.80%   0.13mins  2.18%  runtime.procyield /usr/local/go/src/runtime/asm_amd64.s
  0.11mins  1.80% 57.60%   0.23mins  3.72%  runtime.selectgoImpl /usr/local/go/src/runtime/select.go
  0.10mins  1.57% 59.17%   0.10mins  1.57%  runtime.osyield /usr/local/go/src/runtime/sys_linux_amd64.s
  0.09mins  1.39% 60.56%   1.47mins 23.85%  runtime.unlock /usr/local/go/src/runtime/lock_futex.go
  0.07mins  1.12% 61.69%   0.33mins  5.30%  runtime.mallocgc /usr/local/go/src/runtime/malloc.go
  0.07mins  1.06% 62.74%   0.07mins  1.06%  runtime.heapBitsForObject /usr/local/go/src/runtime/mbitmap.go
  0.06mins  1.01% 63.76%   0.15mins  2.36%  runtime.scanobject /usr/local/go/src/runtime/mgcmark.go
  0.05mins  0.77% 64.53%   0.05mins  0.77%  runtime.siftupTimer /usr/local/go/src/runtime/time.go
  0.04mins  0.71% 65.24%   0.04mins  0.71%  runtime.greyobject /usr/local/go/src/runtime/mgcmark.go
  0.04mins  0.71% 65.95%   0.07mins  1.14%  runtime.deferreturn /usr/local/go/src/runtime/panic.go
  0.04mins   0.7% 66.65%   1.35mins 21.86%  runtime.deltimer /usr/local/go/src/runtime/time.go
  0.04mins  0.67% 67.32%   0.12mins  1.92%  runtime.pcvalue /usr/local/go/src/runtime/symtab.go
  0.04mins  0.61% 67.93%   0.04mins  0.64%  runtime.siftdownTimer /usr/local/go/src/runtime/time.go
  0.04mins   0.6% 68.53%   0.23mins  3.74%  runtime.gentraceback /usr/local/go/src/runtime/traceback.go
  0.04mins  0.59% 69.11%   0.04mins  0.59%  runtime.casgstatus /usr/local/go/src/runtime/proc.go
  0.04mins  0.57% 69.69%   0.04mins  0.57%  runtime.memmove /usr/local/go/src/runtime/memmove_amd64.s

The heaviest CPU consumers are now runtime.futex, the syscall.Syscall system-call path, and runtime internals: heapBitsForObject, scanobject, and greyobject are all part of Go's garbage collector, meaning GC alone is already taking a large share of the CPU.

Now cross-check with the memory profile:

cfc4n@cnxct:~/Desktop$ go tool pprof 10.3.5.52.txt
File: etcd
Type: inuse_space
Time: Jul 3, 2018 at 4:57pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 20
Showing nodes accounting for 10665.34MB, 94.72% of 11259.98MB total
Dropped 368 nodes (cum <= 56.30MB)
Showing top 20 nodes out of 57
      flat  flat%   sum%        cum   cum%
 7156.71MB 63.56% 63.56%  7156.71MB 63.56%  github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*recvBuffer).put /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport/transport.go
 2258.28MB 20.06% 83.61%  9417.49MB 83.64%  github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*serverHandlerTransport).HandleStreams.func4 /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport/handler_server.go
  663.28MB  5.89% 89.51%   663.28MB  5.89%  net/http.http2putRequestBodyBuf /usr/local/go/src/net/http/h2_bundle.go
  277.15MB  2.46% 91.97%   277.15MB  2.46%  runtime.makechan /usr/local/go/src/runtime/chan.go
   66.11MB  0.59% 92.55%    66.11MB  0.59%  crypto/tls.(*block).reserve /usr/local/go/src/crypto/tls/conn.go
   65.24MB  0.58% 93.13%    68.74MB  0.61%  net/http.(*Transport).dialConn /usr/local/go/src/net/http/transport.go
   56.52MB   0.5% 93.64%    56.52MB   0.5%  runtime.malg /usr/local/go/src/runtime/proc.go
   54.52MB  0.48% 94.12%    85.91MB  0.76%  runtime.mapassign /usr/local/go/src/runtime/hashmap.go
   23.53MB  0.21% 94.33%   113.68MB  1.01%  crypto/tls.(*Conn).readHandshake /usr/local/go/src/crypto/tls/conn.go
      15MB  0.13% 94.46%    74.58MB  0.66%  runtime.systemstack /usr/local/go/src/runtime/asm_amd64.s
      12MB  0.11% 94.57%      795MB  7.06%  net/http.(*http2Server).ServeConn /usr/local/go/src/net/http/h2_bundle.go
      11MB 0.098% 94.67%    65.23MB  0.58%  net/http.(*http2Framer).readMetaFrame /usr/local/go/src/net/http/h2_bundle.go
       2MB 0.018% 94.68%   990.71MB  8.80%  net/http.(*conn).serve /usr/local/go/src/net/http/server.go
       2MB 0.018% 94.70%    77.73MB  0.69%  net/http.(*http2serverConn).readFrames /usr/local/go/src/net/http/h2_bundle.go
       1MB 0.0089% 94.71%   721.45MB  6.41%  net/http.(*http2serverConn).newWriterAndRequest /usr/local/go/src/net/http/h2_bundle.go
       1MB 0.0089% 94.72%   755.99MB  6.71%  net/http.(*http2serverConn).serve /usr/local/go/src/net/http/h2_bundle.go
         0     0% 94.72%    64.32MB  0.57%  [etcd]
         0     0% 94.72%   190.21MB  1.69%  crypto/tls.(*Conn).Handshake /usr/local/go/src/crypto/tls/conn.go
         0     0% 94.72%    64.60MB  0.57%  crypto/tls.(*Conn).readRecord /usr/local/go/src/crypto/tls/conn.go
         0     0% 94.72%   190.21MB  1.69%  crypto/tls.(*Conn).serverHandshake /usr/local/go/src/crypto/tls/handshake_server.go

As shown, roughly 65% of the in-use memory sits in recvBuffer.put:

// recvBuffer is an unbounded channel of recvMsg structs.
// Note recvBuffer differs from controlBuffer only in that recvBuffer
// holds a channel of only recvMsg structs instead of objects implementing "item" interface.
// recvBuffer is written to much more often than
// controlBuffer and using strict recvMsg structs helps avoid allocation in "recvBuffer.put"
type recvBuffer struct {
	c       chan recvMsg
	mu      sync.Mutex
	backlog []recvMsg
}

func newRecvBuffer() *recvBuffer {
	b := &recvBuffer{
		c: make(chan recvMsg, 1),
	}
	return b
}

func (b *recvBuffer) put(r recvMsg) {
	b.mu.Lock()
	if len(b.backlog) == 0 {
		select {
		case b.c <- r:
			b.mu.Unlock()
			return
		default:
		}
	}
	b.backlog = append(b.backlog, r)
	b.mu.Unlock()
}

The put method appends each received recvMsg to the b.backlog slice. Both (b *recvBuffer) put(r recvMsg) and (b *recvBuffer) load() read and write b.backlog, so every access takes the mutex — heavy lock contention that burns CPU and drags down throughput. Worse: if load() cannot drain as fast as put() fills, the slice grows without bound, producing exactly the look of a memory leak. Which raises the question: why is put() being called so much? Why the constant writes?

The suspected source of the memory blow-up is located:

grpc/server.go: func (s *Server) serveStreams(st transport.ServerTransport)
grpc/transport/handler_server.go, line 284: func (ht *serverHandlerTransport) HandleStreams, around line 334:

	s.trReader = &transportReader{
		reader:        &recvBufferReader{ctx: s.ctx, recv: s.buf},
		windowHandler: func(int) {},
	}

	// readerDone is closed when the Body.Read-ing goroutine exits.
	readerDone := make(chan struct{})
	go func() {
		defer close(readerDone)

		// TODO: minimize garbage, optimize recvBuffer code/ownership
		const readSize = 8196
		for buf := make([]byte, readSize); ; {
			n, err := req.Body.Read(buf)
			if n > 0 {
				s.buf.put(recvMsg{data: buf[:n:n]})
				buf = buf[n:]
			}
			if err != nil {
				s.buf.put(recvMsg{err: mapRecvMsgError(err)})
				return
			}
			if len(buf) == 0 {
				buf = make([]byte, readSize)
			}
		}
	}()

Execution then reaches func (b *recvBuffer) put(r recvMsg). In other words, the message backlog builds up because the server is receiving a huge volume of HTTP requests. Next step: find out who is sending them.

Reading the etcd source:

Given the mass of TCP TIME-WAIT connections between members, the search moves to the member nodes.

There are many places in the etcd source where a member can establish TCP/HTTP connections, and with no better way to localize the bug at hand, the only option was to work through the source step by step:

  • etcdmain/main.go, line 39: startEtcdOrProxyV2()
  • etcdmain/etcd.go, line 103: startEtcd()
  • etcdmain/etcd.go, line 186: embed.StartEtcd(cfg)
  • embed/etcd.go, line 107: startPeerListeners() creates a Listener via rafthttp.NewListener, used for member-to-member communication; it relies on the github.com/coreos/etcd/pkg/transport package
  • embed/etcd.go, line 110: startClientListeners() creates a Listener via net.Listen, used for client-to-member communication; keepalive also uses github.com/coreos/etcd/pkg/transport
  • line 157: etcdserver.NewServer() reads local config and the data directory, and sets up the mvcc multi-version concurrency functions, the lease renewal functions, and so on
  • line 166: e.servePeers() starts Serve on the listener to accept requests
  • line 317: etcdhttp.NewPeerHandler(e.Server) installs the HTTP handlers listed below:

So communication between etcd peers is routed through these handlers:

  • "/" → http.NotFound
  • "/raft", "/raft/" → raftHandler
  • "/members" → mh
  • "/leases/internal", "/leases" → the lease handler
  • "/version" → versionHandler

That is, the bulk of member-to-member traffic has to pass through one of these handlers — unless I missed something in the source.
Judging by what this product does, the thing members need to exchange in volume should be data, so should the focus go on raftHandler? My head hurt...

raftHandler is returned by s.RaftHandler(), whose value is the return of EtcdServer.r.transport.Handler().
EtcdServer.r is assigned at etcdserver/server.go line 414:

r: *newRaftNode(
			raftNodeConfig{
				isIDRemoved: func(id uint64) bool { return cl.IsIDRemoved(types.ID(id)) },
				Node:        n,
				heartbeat:   heartbeat,
				raftStorage: s,
				storage:     NewStorage(w, ss),
			},
		),

At etcdserver/server.go line 511, EtcdServer.r.transport = tr assigns the transport, of type rafthttp.Transport, whose Handler looks like this:

func (t *Transport) Handler() http.Handler {
	pipelineHandler := newPipelineHandler(t, t.Raft, t.ClusterID)
	streamHandler := newStreamHandler(t, t, t.Raft, t.ID, t.ClusterID)
	snapHandler := newSnapshotHandler(t, t.Raft, t.Snapshotter, t.ClusterID)
	mux := http.NewServeMux()
	mux.Handle(RaftPrefix, pipelineHandler)
	mux.Handle(RaftStreamPrefix+"/", streamHandler)
	mux.Handle(RaftSnapshotPrefix, snapHandler)
	mux.Handle(ProbingPrefix, probing.NewHandler())
	return mux
}

Yet more handlers, all raft-related. The code is heavy going, and the author's intent is hard to follow.

Printing the http.dial call stack:

Hunting for the connection-initiating code by reading alone looked hopeless, so I changed tack: inside the TCP Dial itself, check whether addr is the etcd leader's address and, if so, print the call stack with debug.PrintStack() from runtime/debug. I also printed remote.addr after each connection completed, for cross-checking. Note that this patches Go's own standard library (remember to revert it afterwards!), so tread carefully. In net/http/transport.go, around line 882, I added:

func (t *Transport) dial(ctx context.Context, network, addr string) (net.Conn, error) {
	if t.DialContext != nil {
		return t.DialContext(ctx, network, addr)
	}
	if addr == "leader_ip:2380" {
		debug.PrintStack() // requires adding "runtime/debug" to this file's imports
	}
	if t.Dial != nil {
		c, err := t.Dial(network, addr)
		if c == nil && err == nil {
			err = errors.New("net/http: Transport.Dial hook returned (nil, nil)")
		}
		return c, err
	}
	return zeroDialer.DialContext(ctx, network, addr)
}

I rebuilt the etcd server, swapped the binary onto one member of the cluster, and load-tested that node until the TIME-WAIT flood reproduced. Once it did, the logs carried the stack:

  0. net.(*Dialer).Dial() /usr/local/Cellar/go/1.9.1/libexec/src/net/dial.go:323
  1. github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/transport.(*rwTimeoutDialer).Dial() /Users/cfc4n/gopath/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/transport/timeout_dialer.go:29
  2. github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/transport.(*rwTimeoutDialer).Dial-fm() /Users/cfc4n/gopath/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/transport/timeout_transport.go:51
  3. net/http.(*Transport).dial() /usr/local/Cellar/go/1.9.1/libexec/src/net/http/transport.go:887
  4. net/http.(*Transport).dialConn() /usr/local/Cellar/go/1.9.1/libexec/src/net/http/transport.go:1060
  5. net/http.(*Transport).getConn.func4() /usr/local/Cellar/go/1.9.1/libexec/src/net/http/transport.go:943
  6. runtime.goexit() /usr/local/Cellar/go/1.9.1/libexec/src/runtime/asm_amd64.s:2337

During the load test I sampled the member's TCP connections with netstat a few times, again for cross-checking against the log output.
The log printed this stack, plus the remote IP and port, for every TCP connection created. Matching against the netstat samples, the five-tuples of several connections to the leader's port 2380 lined up exactly with the remote.addr values printed above.
So this is the function creating the mass of member-to-leader TCP connections. With the dialing function and its call stack both found, surely the problem is as good as solved?

Not quite. The stack above only reaches coreos/etcd/pkg/transport.(*rwTimeoutDialer).Dial and no deeper — the full caller chain never prints. Why?

// src/net/http/transport.go, lines 886-887 — note the t.Dial branch below
func (t *Transport) dial(ctx context.Context, network, addr string) (net.Conn, error) {
	if t.DialContext != nil {
		return t.DialContext(ctx, network, addr)
	}
	if t.Dial != nil {
		c, err := t.Dial(network, addr)
		if c == nil && err == nil {
			err = errors.New("net/http: Transport.Dial hook returned (nil, nil)")
		}
		return c, err
	}
	return zeroDialer.DialContext(ctx, network, addr)
}

Why the etcd member closes the TCP connection after 10s:

In the standard library's net/http/transport.go, Transport.dial invokes the custom Dial if the Transport defines one... and that custom Dial is defined in github.com/coreos/etcd/pkg/transport/transport.go:

func NewTransport(info TLSInfo, dialtimeoutd time.Duration) (*http.Transport, error) {
	cfg, err := info.ClientConfig()
	if err != nil {
		return nil, err
	}

	t := &http.Transport{
		Proxy: http.ProxyFromEnvironment,
		Dial: (&net.Dialer{
			Timeout: dialtimeoutd,
			// value taken from http.DefaultTransport
			KeepAlive: 30 * time.Second,
		}).Dial,
		// value taken from http.DefaultTransport
		TLSHandshakeTimeout: 10 * time.Second,
		TLSClientConfig:     cfg,
	}

	dialer := (&net.Dialer{
		Timeout:   dialtimeoutd,
		KeepAlive: 30 * time.Second,
	})
	dial := func(net, addr string) (net.Conn, error) {
		return dialer.Dial("unix", addr)
	}

	tu := &http.Transport{
		Proxy:               http.ProxyFromEnvironment,
		Dial:                dial,
		TLSHandshakeTimeout: 10 * time.Second,	// 10s — this explains the 10s TLS timeout observed at the start
		TLSClientConfig:     cfg,
	}
	ut := &unixTransport{tu}

	t.RegisterProtocol("unix", ut)
	t.RegisterProtocol("unixs", ut)

	return t, nil
}

Here dial := func(net, addr string) (net.Conn, error) ... defines a custom Dial assigned to the http.Transport.Dial field, and the TLS handshake timeout is set with TLSHandshakeTimeout: 10 * time.Second. That is exactly why the earlier capture shows the member tearing the connection down after 10s — the handshake exceeded the configured TLS timeout.
The only caller of NewTransport turns out to be NewTimeoutTransport:

func NewTimeoutTransport(info TLSInfo, dialtimeoutd, rdtimeoutd, wtimeoutd time.Duration) (*http.Transport, error) {
	tr, err := NewTransport(info, dialtimeoutd)
	if err != nil {
		return nil, err
	}

	if rdtimeoutd != 0 || wtimeoutd != 0 {
		// the timed out connection will timeout soon after it is idle.
		// it should not be put back to http transport as an idle connection for future usage.
		tr.MaxIdleConnsPerHost = -1
	} else {
		// allow more idle connections between peers to avoid unnecessary port allocation.
		tr.MaxIdleConnsPerHost = 1024
	}

	tr.Dial = (&rwTimeoutDialer{
		Dialer: net.Dialer{
			Timeout:   dialtimeoutd,
			KeepAlive: 30 * time.Second,
		},
		rdtimeoutd: rdtimeoutd,
		wtimeoutd:  wtimeoutd,
	}).Dial
	return tr, nil
}

So finding the callers of NewTimeoutTransport should pinpoint the code firing off all these TCP connections... but the earlier stack printing never reached those functions, so another approach was needed.

NewTimeoutTransport also contains a setting for HTTP keepalive, tr.MaxIdleConnsPerHost = 1024: the maximum number of idle TCP connections Go's http client keeps per host for reuse. etcd raises it to 1024, so an idle member may hold up to 1024 TCP connections to the leader, ready for the next request. (An aside: some say HTTP is the best RPC. I disagree, and connection reuse alone shows why. HTTP/1.1 keepalive reuses a connection only after the previous response has completed — requests and responses are serialized. HTTP/2 does not wait for the previous request to return, giving true multiplexing; see Streams and Multiplexing. If you have ever implemented request-response matching for an RPC over raw TCP, the difference between HTTP/1.1 and HTTP/2 connection reuse is immediately clear.)

Examining member-to-leader traffic:

Back to the main thread. I tried to pin the code down by more reading, but the etcd source is complex and I got stuck in the raft inter-node data-synchronization path. Another angle, then: since TLS verification between members had already been removed for the CPU experiment, I could capture plaintext on a member's port 2380 and inspect what is actually being sent.

The capture shows the member POSTing a large number of /leases requests to the leader. Searching for that URI leads to the code issuing the HTTP request, at etcdserver/v3_server.go line 252:

func (s *EtcdServer) LeaseRenew(ctx context.Context, id lease.LeaseID) (int64, error) {
	ttl, err := s.lessor.Renew(id)
	if err == nil { // already requested to primary lessor(leader)
		return ttl, nil
	}
	if err != lease.ErrNotPrimary {
		return -1, err
	}

	cctx, cancel := context.WithTimeout(ctx, s.Cfg.ReqTimeout())
	defer cancel()

	// renewals don't go through raft; forward to leader manually
	for cctx.Err() == nil && err != nil {
		leader, lerr := s.waitLeader(cctx)
		if lerr != nil {
			return -1, lerr
		}
		for _, url := range leader.PeerURLs {
			lurl := url + leasehttp.LeasePrefix
			ttl, err = leasehttp.RenewHTTP(cctx, id, lurl, s.peerRt)
			if err == nil || err == lease.ErrLeaseNotFound {
				return ttl, err
			}
		}
	}
	return -1, ErrTimeout
}

Reading this, the call to leasehttp.RenewHTTP sends the request to leader.PeerURLs — that is, straight to the leader node. Following leasehttp.RenewHTTP:

func RenewHTTP(ctx context.Context, id lease.LeaseID, url string, rt http.RoundTripper) (int64, error) {
	// will post lreq protobuf to leader
	lreq, err := (&pb.LeaseKeepAliveRequest{ID: int64(id)}).Marshal()
	if err != nil {
		return -1, err
	}

	cc := &http.Client{Transport: rt}
	req, err := http.NewRequest("POST", url, bytes.NewReader(lreq))
	if err != nil {
		return -1, err
	}
	req.Header.Set("Content-Type", "application/protobuf")
	req.Cancel = ctx.Done()

	resp, err := cc.Do(req)
	if err != nil {
		return -1, err
	}

The request goes out through the standard library's http.Client.Do, and http.Client.Transport is the rt http.RoundTripper passed in as a parameter.

Locating the call chain behind the 10s TLS timeout:

One level back up:

  • the s.peerRt argument comes from *EtcdServer.peerRt, assigned at etcdserver/server.go line 429: peerRt: prt
  • prt is assigned at line 288 of the same file: prt, err := rafthttp.NewRoundTripper(cfg.PeerTLSInfo, cfg.peerDialTimeout())
  • rafthttp.NewRoundTripper calls rafthttp.NewTimeoutTransport
  • and rafthttp.NewTimeoutTransport calls rafthttp.NewTransport — the very function the stack traces showed creating all those TCP connections, and the home of the 10s TLS timeout.

Now the pieces connect: the call chain that creates the mass of TCP connections is confirmed, so the direction is right. All that remains is to follow the thread to whatever sends so many requests in the first place.
Back to where LeaseRenew is invoked: func (s *EtcdServer) LeaseRenew is called from github.com/coreos/etcd/etcdserver/api/v3rpc/lease.go, around line 88:

func (ls *LeaseServer) leaseKeepAlive(stream pb.Lease_LeaseKeepAliveServer) error {
	for {
		req, err := stream.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			if isClientCtxErr(stream.Context().Err(), err) {
				plog.Debugf("failed to receive lease keepalive request from gRPC stream (%q)", err.Error())
			} else {
				plog.Warningf("failed to receive lease keepalive request from gRPC stream (%q)", err.Error())
			}
			return err
		}

		// Create header before we sent out the renew request.
		// This can make sure that the revision is strictly smaller or equal to
		// when the keepalive happened at the local server (when the local server is the leader)
		// or remote leader.
		// Without this, a lease might be revoked at rev 3 but client can see the keepalive succeeded
		// at rev 4.
		resp := &pb.LeaseKeepAliveResponse{ID: req.ID, Header: &pb.ResponseHeader{}}
		ls.hdr.fill(resp.Header)

		ttl, err := ls.le.LeaseRenew(stream.Context(), lease.LeaseID(req.ID))
		if err == lease.ErrLeaseNotFound {
			err = nil
			ttl = 0
		}

		if err != nil {
			return togRPCError(err)
		}

		resp.TTL = ttl
		err = stream.Send(resp)
		if err != nil {
			if isClientCtxErr(stream.Context().Err(), err) {
				plog.Debugf("failed to send lease keepalive response to gRPC stream (%q)", err.Error())
			} else {
				plog.Warningf("failed to send lease keepalive response to gRPC stream (%q)", err.Error())
			}
			return err
		}
	}
}

  • line 71: func (ls *LeaseServer) LeaseKeepAlive calls func (ls *LeaseServer) leaseKeepAlive()
  • etcdserver/etcdserverpb/rpc.pb.go, line 3417: func _Lease_LeaseKeepAlive_Handler(srv interface{}, stream grpc.ServerStream) calls func (ls *LeaseServer) LeaseKeepAlive
  • etcdserver/etcdserverpb/rpc.pb.go, line 3480, registers _Lease_LeaseKeepAlive_Handler as the handler for StreamName: "LeaseKeepAlive", wired into the listener at startup via (*grpc.Server).RegisterService
  • leaseKeepAlive is an event loop: whatever stream.Recv() yields gets forwarded to the leader
  • stream.Recv() lives in google.golang.org/grpc/stream.go — not etcd's own code but another open-source project, which meant yet more code to read and digest.

After a long slog through the code: the messages received here all come from clients. The naming actually gives it away — variables on the client-to-member path contain stream, while member-to-leader (and member-to-member) paths contain peer. Clients talk to etcd nodes over HTTP/2 with grpc+protobuf message encoding; etcd members talk to each other over HTTP/1.1, also with protobuf encoding.

When etcd lease renewals happen:

So the flood of lease renewal requests comes from clients. In my case the host lease TTL is set to 10s — could that be a problem? Who sends the renewal: the client itself? How often? One second before expiry?

In the etcd client library, the renewal code is at clientv3/lease.go line 474:

	// send update to all channels
	nextKeepAlive := time.Now().Add((time.Duration(karesp.TTL) * time.Second) / 3.0)
	ka.deadline = time.Now().Add(time.Duration(karesp.TTL) * time.Second)
	for _, ch := range ka.chs {
		select {
		case ch <- karesp:
			ka.nextKeepAlive = nextKeepAlive
		default:
		}
	}

nextKeepAlive is set to one third of the lease TTL and marks the next renewal time; the sendKeepAliveLoop function then wakes every 500ms to check whether a renewal is due. With a 10s TTL, that means a renewal roughly every 3s — so 30,000 clients generate about 10,000 write requests per second. Our earlier benchmarks (and the official ones) put the cluster at 40,000+ qps, so this alone should not sink it — a loose end worth remembering. Still, the earlier analysis showed that mass renewals drive mass TCP connection creation, which drives heavy object allocation on the leader, heavy memory use, and heavy GC CPU. Worth optimizing regardless: we raised the TTL to 600s (10 minutes), so renewals should happen once every 200 seconds.

Raising the lease TTL to reduce cluster load:

We rolled the adjusted version out to 6,000+ hosts in the test environment and watched. Sure enough, memory dropped visibly after the rollout, and CPU improved too — but an hour later the metrics climbed right back up...

The problem persists:

Why? Was that hour of lower load just an illusion created by the rolling deployment? Did the lease change do anything at all? I had tested it before rolling out — there was even a test report for the agent's etcd lease TTL adjustment showing renewals every 200s. To settle it, I captured packets on one of the members again.

The capture indeed shows plenty of lease packets. So: what is the interval between renewals for each lease ID? Is it the expected 200 seconds? Are these packets all for the same lease ID? How many distinct lease IDs are there?

Analyzing the tcpdump pcapng file:

From the earlier source reading we know messages between etcd nodes are protobuf-encoded, so in principle the HTTP body bytes can be copied out and decoded with the matching protobuf struct. But between 2018-07-05 16:38:31.015981 and 2018-07-05 16:43:20.420401 — a window of about five minutes — there were 1.03 million packets; copying by hand was out of the question. Using the pcap package, I walked the entire pcapng capture, filtered for bodies of type application/protobuf, took each HTTP body, unmarshaled it as a pb.LeaseKeepAliveRequest, and counted total versus unique lease IDs.


The result: 77,550 lease ID occurrences in total; 1,382 unique lease IDs.

These numbers mean the same lease IDs are sending renewal requests over and over. Another small program pulled those lease IDs out and queried the cluster for their remaining TTLs, expecting values in the 400-600s range — but the answer was always 599. Taking one lease ID and polling its TTL once per second: 599 every single time, with grantTTL 600, meaning the etcd members are renewing these leases nonstop. Since the earlier analysis established that member-side renewals are driven by client lease keepalive requests, the focus has to move to the client.

Sampling one client and capturing its traffic confirmed it: the client really is sending /etcdserverpb.Lease/LeaseKeepAlive requests constantly, every second. Very strange — completely unlike my earlier test, where renewal requests went out every 200s. Why so frequent now?

Back to the etcd clientv3 code:

With behavior this odd, the only way forward was the source again, focused on the path that emits /etcdserverpb.Lease/LeaseKeepAlive. The loop that decides whether a lease needs renewing starts in sendKeepAliveLoop:

// sendKeepAliveLoop sends keep alive requests for the lifetime of the given stream.
func (l *lessor) sendKeepAliveLoop(stream pb.Lease_LeaseKeepAliveClient) {
	for {
		var tosend []LeaseID

		now := time.Now()
		l.mu.Lock()
		for id, ka := range l.keepAlives {
			if ka.nextKeepAlive.Before(now) {
				tosend = append(tosend, id)
			}
		}
		l.mu.Unlock()

		for _, id := range tosend {
			r := &pb.LeaseKeepAliveRequest{ID: int64(id)}
			if err := stream.Send(r); err != nil {
				// TODO do something with this error?
				return
			}
		}

		select {
		case <-time.After(500 * time.Millisecond):
		case <-stream.Context().Done():
			return
		case <-l.donec:
			return
		case <-l.stopCtx.Done():
			return
		}
	}
}

As the code shows, every 500ms the loop iterates over the l.keepAlives map and, for each LeaseID whose nextKeepAlive time has passed, the client sends a LeaseKeepAliveRequest. nextKeepAlive is assigned in the recvKeepAlive function of clientv3/lease.go, at line 473:

// send update to all channels
nextKeepAlive := time.Now().Add((time.Duration(karesp.TTL) * time.Second) / 3.0)
ka.deadline = time.Now().Add(time.Duration(karesp.TTL) * time.Second)
for _, ch := range ka.chs {
	select {
	case ch <- karesp:
		ka.nextKeepAlive = nextKeepAlive
	default:
	}
}

Yes — this is the code we looked at before, but back then only the TTL arithmetic got attention. Look closely at ka.nextKeepAlive = nextKeepAlive: the next renewal time is assigned inside a select case guarded by ch <- karesp. In Go this is a non-blocking send: the case runs only if the channel can accept the write; otherwise the default branch runs and nextKeepAlive is not assigned. Not assigned. Not. Assigned.

So how long is this channel? Could writes outrun reads, so the send fails and the next renewal time is never advanced? Following the thread to the KeepAlive function at clientv3/lease.go line 222:

func (l *lessor) KeepAlive(ctx context.Context, id LeaseID) (<-chan *LeaseKeepAliveResponse, error) {
	ch := make(chan *LeaseKeepAliveResponse, leaseResponseChSize)

	l.mu.Lock()
	// ensure that recvKeepAliveLoop is still running
	select {
	case <-l.donec:
		err := l.loopErr
		l.mu.Unlock()
		close(ch)
		return ch, ErrKeepAliveHalted{Reason: err}
	default:
	}
	ka, ok := l.keepAlives[id]
	if !ok {
		// create fresh keep alive
		ka = &keepAlive{
			chs:           []chan<- *LeaseKeepAliveResponse{ch},

ch := make(chan *LeaseKeepAliveResponse, leaseResponseChSize) allocates the channel; leaseResponseChSize is a constant with value 16. chs is a slice of length 1, containing just this one channel. Next, find the code that reads from ch — except that after reading all of lease.go I could not find any reader, only a few close() calls during cleanup. Which raises the question: once ch is full, can nothing ever be written again, so that ka.nextKeepAlive = nextKeepAlive is never reached, ka.nextKeepAlive stays stuck at a renewal time already in the past, and the client renews frantically forever?

找到疑点,每秒2次 lease keepalive 的请求:

开始测试,在 ch channel 写入的 default 分支里,打印“无法写入”等字样,并在for 后面打印这个ka的 TTL 等信息,用于核对。log.Println(fmt.Sprintf("karesp.TTL:%d, nextKeepAlive:%s,ka.nextKeepAlive:%s, ka.deadline:%s",karesp.TTL, nextKeepAlive.String(),ka.nextKeepAlive.String(), ka.deadline.String()))

After the change, I recompiled, uploaded the binary to one client host, turned off DEBUG mode, and watched the log output (file at /xxxx/xxx/core/nohup.out.1). The program started at 2018/07/05 21:10:42, and after a bit over an hour of watching, sure enough, at 2018/07/05 22:04:05 the lease renewal requests went from once every 200 seconds to twice per second. Querying the corresponding lease id on the etcd cluster at the same time showed a TTL of 599 with a grantTTL of 600, which confirms that the client was frequently re-sending lease requests. The first build did not print ka.nextKeepAlive; after adjusting it to do so, and changing the 600s TTL to 60s to shorten the wait, I got the following log:

2018/07/05 22:40:02 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:40:02 karesp.TTL:60, nextKeepAlive:2018-07-05 22:40:22.811626819 +0800 CST m=+40.135091508,ka.nextKeepAlive:2018-07-05 22:40:22.811626819 +0800 CST m=+40.135091508, ka.deadline:2018-07-05 22:41:02.811627601 +0800 CST m=+80.135092217

2018/07/05 22:40:22 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:40:22 karesp.TTL:60, nextKeepAlive:2018-07-05 22:40:42.922698966 +0800 CST m=+60.246163715,ka.nextKeepAlive:2018-07-05 22:40:42.922698966 +0800 CST m=+60.246163715, ka.deadline:2018-07-05 22:41:22.922699917 +0800 CST m=+100.246164470

2018/07/05 22:40:43 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:40:43 karesp.TTL:60, nextKeepAlive:2018-07-05 22:41:03.330494852 +0800 CST m=+80.653959932,ka.nextKeepAlive:2018-07-05 22:41:03.330494852 +0800 CST m=+80.653959932, ka.deadline:2018-07-05 22:41:43.330496602 +0800 CST m=+120.653961489

2018/07/05 22:41:03 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:41:03 karesp.TTL:60, nextKeepAlive:2018-07-05 22:41:23.341112484 +0800 CST m=+100.664577164,ka.nextKeepAlive:2018-07-05 22:41:23.341112484 +0800 CST m=+100.664577164, ka.deadline:2018-07-05 22:42:03.341113662 +0800 CST m=+140.664578212

2018/07/05 22:41:23 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:41:23 karesp.TTL:60, nextKeepAlive:2018-07-05 22:41:43.350264635 +0800 CST m=+120.673729313,ka.nextKeepAlive:2018-07-05 22:41:43.350264635 +0800 CST m=+120.673729313, ka.deadline:2018-07-05 22:42:23.350265498 +0800 CST m=+160.673730050

2018/07/05 22:41:43 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:41:43 karesp.TTL:60, nextKeepAlive:2018-07-05 22:42:03.572888276 +0800 CST m=+140.896353032,ka.nextKeepAlive:2018-07-05 22:42:03.572888276 +0800 CST m=+140.896353032, ka.deadline:2018-07-05 22:42:43.572889271 +0800 CST m=+180.896353840
2018/07/05 22:42:03 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:42:03 karesp.TTL:60, nextKeepAlive:2018-07-05 22:42:23.86879775 +0800 CST m=+161.192262488,ka.nextKeepAlive:2018-07-05 22:42:23.86879775 +0800 CST m=+161.192262488, ka.deadline:2018-07-05 22:43:03.868798687 +0800 CST m=+201.192263243
2018/07/05 22:42:23 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:42:23 karesp.TTL:60, nextKeepAlive:2018-07-05 22:42:43.877613966 +0800 CST m=+181.201079020,ka.nextKeepAlive:2018-07-05 22:42:43.877613966 +0800 CST m=+181.201079020, ka.deadline:2018-07-05 22:43:23.877615637 +0800 CST m=+221.201080417

2018/07/05 22:42:43 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:42:43 karesp.TTL:60, nextKeepAlive:2018-07-05 22:43:03.884576518 +0800 CST m=+201.208041122,ka.nextKeepAlive:2018-07-05 22:43:03.884576518 +0800 CST m=+201.208041122, ka.deadline:2018-07-05 22:43:43.884577361 +0800 CST m=+241.208041904

2018/07/05 22:43:03 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:43:03 karesp.TTL:60, nextKeepAlive:2018-07-05 22:43:23.893274193 +0800 CST m=+221.216738996,ka.nextKeepAlive:2018-07-05 22:43:23.893274193 +0800 CST m=+221.216738996, ka.deadline:2018-07-05 22:44:03.893275148 +0800 CST m=+261.216739702

2018/07/05 22:43:23 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:43:24 karesp.TTL:60, nextKeepAlive:2018-07-05 22:43:44.058766258 +0800 CST m=+241.382230998,ka.nextKeepAlive:2018-07-05 22:43:44.058766258 +0800 CST m=+241.382230998, ka.deadline:2018-07-05 22:44:24.058767168 +0800 CST m=+281.382231711
2018/07/05 22:43:44 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:43:44 karesp.TTL:60, nextKeepAlive:2018-07-05 22:44:04.413012991 +0800 CST m=+261.736477737,ka.nextKeepAlive:2018-07-05 22:44:04.413012991 +0800 CST m=+261.736477737, ka.deadline:2018-07-05 22:44:44.41301386 +0800 CST m=+301.736478391

2018/07/05 22:44:04 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:44:04 karesp.TTL:60, nextKeepAlive:2018-07-05 22:44:24.420796492 +0800 CST m=+281.744261106,ka.nextKeepAlive:2018-07-05 22:44:24.420796492 +0800 CST m=+281.744261106, ka.deadline:2018-07-05 22:45:04.420797607 +0800 CST m=+321.744262161

2018/07/05 22:44:24 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:44:24 karesp.TTL:60, nextKeepAlive:2018-07-05 22:44:44.428145652 +0800 CST m=+301.751610589,ka.nextKeepAlive:2018-07-05 22:44:44.428145652 +0800 CST m=+301.751610589, ka.deadline:2018-07-05 22:45:24.428147239 +0800 CST m=+341.751612018

2018/07/05 22:44:44 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:44:44 karesp.TTL:60, nextKeepAlive:2018-07-05 22:45:04.436151288 +0800 CST m=+321.759615961,ka.nextKeepAlive:2018-07-05 22:45:04.436151288 +0800 CST m=+321.759615961, ka.deadline:2018-07-05 22:45:44.436152168 +0800 CST m=+361.759616723

2018/07/05 22:45:04 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:45:04 无法写入
2018/07/05 22:45:04 karesp.TTL:60, nextKeepAlive:2018-07-05 22:45:24.441563525 +0800 CST m=+341.765028110,ka.nextKeepAlive:2018-07-05 22:45:04.436151288 +0800 CST m=+321.759615961, ka.deadline:2018-07-05 22:46:04.44156426 +0800 CST m=+381.765028789
2018/07/05 22:45:04 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:45:04 无法写入
2018/07/05 22:45:04 karesp.TTL:60, nextKeepAlive:2018-07-05 22:45:24.941897369 +0800 CST m=+342.265361977,ka.nextKeepAlive:2018-07-05 22:45:04.436151288 +0800 CST m=+321.759615961, ka.deadline:2018-07-05 22:46:04.941898136 +0800 CST m=+382.265362694
2018/07/05 22:45:05 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:45:05 无法写入

After another 16 renewals, that is, on the 17th, the two-requests-per-second lease renewals started. So the bug is confirmed.
In other words: no matter how long a TTL you grant the lease, it eventually degenerates into one lease renewal every 0.5 seconds!!!!!
In other words: no matter how long a TTL you grant the lease, it eventually degenerates into one lease renewal every 0.5 seconds!!!!!
In other words: no matter how long a TTL you grant the lease, it eventually degenerates into one lease renewal every 0.5 seconds!!!!!

After the earlier canary rollout of build 134, CPU usage dropped for roughly an hour in between. Why?

Explaining the one-hour CPU dip:

Let's do the math. The TTL is 600s, so the renewal interval is 200s; ka.chs has length 1, and that single chan *LeaseKeepAliveResponse has a buffer of 16, so the time to fill it completely is 200 × 16 / 60 ≈ 53 minutes. That matches the CPU graph on falcon very well.

Counting from my first instrumented test run: 2018/07/05 22:04:05 minus 2018/07/05 21:10:42 is about 53 minutes. That is, just under an hour, consistent with falcon's statistics...

The culprit found:

With the problem pinned down, on to the fix. I never found any code that consumes the ch channel, and perhaps I am missing the author's intent, but in my view, whether or not the write to ch succeeds, the next nextKeepAlive must be updated to the latest time. Moving ka.nextKeepAlive = nextKeepAlive outside the for/select does the trick.

That said, with the product pressing to ship, we played it safe: we sacrificed the lease requirement and removed the feature. After deploying to the servers, load improved dramatically, with CPU dropping from 40-50% to around 2-5%.

Retrospective:

Looking back at every symptom the etcd cluster showed, everything falls into place once you account for this bug. Here is the chain of cause and effect, in the order things happened:

  • The etcd clientv3 lease keepalive library has a bug: no matter how long the TTL, after 16 renewals the client starts renewing every 0.5s
  • Once the agent was deployed across the fleet (20K hosts), each agent held one lease id
  • Each agent (a client, from etcd's perspective) connected to an etcd member and sent two lease renewal requests per second to it
  • For lease renewals, etcd members forward to the leader over HTTP/1.1; a TCP connection cannot be reused before the previous request gets its response, so nearly every request rode a freshly created TCP connection
  • The etcd leader therefore received 20K × 2 TCP connection requests per second
  • In this early setup, etcd nodes talked to each other over https, so every new TCP connection was followed by a TLS handshake; TLS verification is RSA math that burns a lot of CPU, so the leader's CPU spiked
  • The etcd leader's grpc-based http server creates objects, structs, and channels for every connecting client, consuming ever more memory
  • The etcd leader placed the lease keepalive requests forwarded by members into a slice (in fact grpc does this part)
  • The leader wrote recvMsg entries into b.backlog faster than it consumed them, so the slice grew without bound, memory climbed steadily, and it looked like a memory leak
  • The leader processed requests more slowly than they arrived, so requests queued past the lease id's TTL; the lease was judged expired and deleted, which also explains the hids server keys that vanished mysteriously
  • Members kept opening new TCP connections to the leader, which the leader closed after handling each request, leaving huge numbers of TIME-WAIT connections on the leader
  • The leader handled member lease requests so slowly that the 10s TLS timeout expired; members timed out and actively closed the connection, but the leader still couldn't keep up, leaving connections in CLOSE-WAIT

In the end, the straw broke the camel's back. The bug has been reported upstream to etcd: clientv3 sdk will send /etcdserverpb.Lease/LeaseKeepAlive twice every second for lease id after over time

This bug tormented me for over a month. I will not forget this moment.

Thoughts on overload protection for etcd:

  • Wouldn't it be more appropriate to schedule lease keepalives at the 2/3 point of the TTL?
  • Shouldn't grpc get some overload protection too, e.g. switching the client-request backlog from a growable array to a bounded queue that drops on overflow, to keep the cluster available?
  • Would it be better for etcd nodes to talk to each other over HTTP/2?

Notice

Does this affect your etcd deployment?

  • You use etcd clientv3 (the Go client library); all v3 client versions are affected (including 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x).
  • You use leases and have keepalive enabled
  • The effect shows at large client counts; with a few hundred clients the performance impact is minor.

Fixed upstream

The coreos etcd maintainers fixed the bug on July 24, 2018, covering three etcd client v3 release branches, a very fast turnaround. Details:

Fix lease keepalive interval updates when response queue is full.
If <-chan *clientv3.LeaseKeepAliveResponse from clientv3.Lease.KeepAlive was never consumed or channel is full, client was sending keepalive request every 500ms instead of expected rate of every "TTL / 3" duration.

This post was created by CFC4N of CFC4N's blog and is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0, based on work at https://www.cnxct.com. When republishing, please credit the source: etcd clientv3的lease keepalive租约频繁续期bug 排查小记

8 thoughts on "etcd clientv3的lease keepalive租约频繁续期bug 排查小记"

    • Right. We weren't reading the keepaliveResponse channel; nothing in the docs or the sample code I saw said you must read it. Still, I'd argue that whether or not the response is consumed, it shouldn't affect the next TTL renewal time.

  1. // KeepAlive keeps the given lease alive forever. If the keepalive response
    // posted to the channel is not consumed immediately, the lease client will
    // continue sending keep alive requests to the etcd server at least every
    // second until latest response is consumed.
    The doc comment on KeepAlive explicitly says you should read the responses; otherwise it sends at least once per second...

    • Quite right, you have a point; I didn't read the comment carefully back then...
      Still, as in the reply above, I think that whether or not you read the channel, it shouldn't affect the next TTL renewal, should it?

  2. The channel returned by KeepAlive keeps getting closed on me. Does anyone know how to solve this?

  3. A question: the upstream fix doesn't actually address the following issue from the post, does it?
    > The leader processed requests more slowly than they arrived, so requests queued past the lease id's TTL; the lease was judged expired and deleted, which also explains the hids server keys that vanished mysteriously
