In Kubernetes, the kube-apiserver component is the entry point for operating on cluster resources from the outside. As a mature API component, it naturally needs rate limiting to keep itself running smoothly. This article analyzes kube-apiserver's rate limiting through simple experiments combined with source reading, aiming at a better understanding of how kube-apiserver works. The outline is as follows (based on the kubernetes 1.22.2 code):
- Which metrics does kube-apiserver itself expose for operators to observe?
- Which metrics show that kube-apiserver is dropping requests because of rate limiting?
- Which metrics show how many requests kube-apiserver is currently handling and how many are waiting?
- When a request is throttled, does kube-apiserver send a response to the client, and with which status code?
- Which rate-limiting options does kube-apiserver provide, and what does each of them mean?
- Are all requests subject to rate limiting?
- At which stage of kube-apiserver's processing pipeline does rate limiting happen?
- Which requests count as long-running requests?
- Why kube-apiserver's built-in metrics matter.
- Which problems does APF, Kubernetes' more advanced native rate limiting, solve?
Rate-limiting flags
According to the official documentation, kube-apiserver has three flags related to rate limiting.
- max-mutating-requests-inflight int    Default: 200
  This and --max-requests-inflight are summed to determine the server's total concurrency limit (which must be positive) if --enable-priority-and-fairness is true. Otherwise, this flag limits the maximum number of mutating requests in flight, or a zero value disables the limit completely.
- max-requests-inflight int    Default: 400
  This and --max-mutating-requests-inflight are summed to determine the server's total concurrency limit (which must be positive) if --enable-priority-and-fairness is true. Otherwise, this flag limits the maximum number of non-mutating requests in flight, or a zero value disables the limit completely.
- enable-priority-and-fairness    Default: true
  If true and the APIPriorityAndFairness feature gate is enabled, replace the max-in-flight handler with an enhanced one that queues and dispatches with priority and fairness.
From this description, the two integer flags cap the number of requests kube-apiserver handles concurrently: one limits mutating requests (default 200), the other non-mutating requests (default 400). But how exactly is the limit enforced — is it a per-second figure? And how can it be observed? The documentation gives no details, so let's answer those questions with an experiment.
Load-testing kube-apiserver
Setting up the test environment
Watching kube-apiserver through Prometheus is the simplest and most direct approach (deploying Prometheus itself is not covered here).
The Prometheus scrape configuration for kube-apiserver's metrics is the standard one found everywhere online; nothing tricky. To observe the experiment at a finer granularity, the scrape interval is set to 1s.
[root@localhost ~]# kubectl describe cm -n monitoring prometheus-server-conf
Name: prometheus-server-conf
Namespace: monitoring
Data
====
prometheus.yml:
----
global:
  scrape_interval: 1s
  scrape_timeout: 1s
scrape_configs:
- job_name: 'kubernetes-apiservers'
  kubernetes_sd_configs:
  - role: endpoints
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
Running the load test
We are used to reaching kube-apiserver through kubectl, but knowing how to talk to kube-apiserver in the most primitive way is a useful skill too.
The simplest option is curl without a client certificate: access kube-apiserver with a TOKEN that has the appropriate permissions.
# curl -k skips certificate verification
export TOKEN=`kubectl get secrets node-controller-token-qgqqs -n kube-system -o=jsonpath='{.data.token}' | base64 -d`
curl https://9.9.9.134:6443/api/v1/nodes --header "Authorization: bearer $TOKEN" -k
{
  "kind": "NodeList",
  "apiVersion": "v1",
  "metadata": {
    "resourceVersion": "5400924"
  },
  "items": [
    {
      .....
    }
  ]
}
Recent kube-apiserver versions serve HTTPS only, on port 6443, with a certificate signed by the cluster's private CA. wrk can only hit plain HTTP, or HTTPS endpoints whose certificates chain to a trusted root, so simply pointing wrk at kube-apiserver is the first obstacle in exploring its rate limiting.
On CentOS, copy the kubernetes CA certificate ca.crt into the system's default trust store and refresh it:
[root@localhost ~]# cp -f /etc/kubernetes/pki/ca.crt /etc/pki/ca-trust/source/anchors/
[root@localhost ~]# update-ca-trust
[root@localhost ~]#
With that, the environment for load-testing kube-apiserver with wrk is ready; now we can tune wrk's parameters to try to trigger kube-apiserver's rate limiting.
# TOKEN is the variable exported in the curl step above.
# The test environment has only a single kube-apiserver instance.
[root@localhost ~]# kubectl get pods -n kube-system -o wide | grep api
kube-apiserver-localhost.localdomain 1/1 Running 0 24m 9.9.9.134 localhost.localdomain <none> <none>
[root@localhost ~]# wrk -d 5s -c 4 -t 1 --header "Authorization: bearer $TOKEN" https://9.9.9.134:6443/api/v1/nodes --latency
Running 5s test @ https://9.9.9.134:6443/api/v1/nodes
1 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 4.09ms 2.10ms 36.77ms 87.13%
Req/Sec 1.01k 170.39 1.41k 70.00%
Latency Distribution
50% 3.59ms
75% 4.60ms
90% 6.10ms
99% 12.26ms
5039 requests in 5.03s, 69.68MB read
Requests/sec: 1001.54
Transfer/sec: 13.85MB
[root@localhost ~]#
Rate-limiting source code analysis
In the experiment, none of kube-apiserver's rate-limiting flags were changed; everything runs with the defaults. From the results above, a single kube-apiserver handled roughly 1000 requests per second, all of them returning 200 — well above the sum of the two default limits (200 + 400 = 600) — yet no throttling was triggered. Do these two flags simply not take effect?
Let's look at the source.
// Unrelated code omitted
func DefaultBuildHandlerChain(apiHandler http.Handler, c *Config) http.Handler {
	if c.FlowControl != nil {
		handler = filterlatency.TrackCompleted(handler)
		handler = genericfilters.WithPriorityAndFairness(handler, c.LongRunningFunc, c.FlowControl, c.RequestWidthEstimator)
		handler = filterlatency.TrackStarted(handler, "priorityandfairness")
	} else {
		// Below we analyze this simpler max-in-flight path.
		handler = genericfilters.WithMaxInFlightLimit(handler, c.MaxRequestsInFlight, c.MaxMutatingRequestsInFlight, c.LongRunningFunc)
	}
}
// WithMaxInFlightLimit limits the number of in-flight requests to buffer size of the passed in channel.
func WithMaxInFlightLimit(
	handler http.Handler,
	nonMutatingLimit int,
	mutatingLimit int,
	longRunningRequestCheck apirequest.LongRunningRequestCheck,
) http.Handler {
	// If both limits are configured as 0, return directly without any check.
	if nonMutatingLimit == 0 && mutatingLimit == 0 {
		return handler
	}
	var nonMutatingChan chan bool
	var mutatingChan chan bool
	// Create buffered channels for the mutating and non-mutating request types;
	// the channel capacity is the corresponding limit.
	if nonMutatingLimit != 0 {
		nonMutatingChan = make(chan bool, nonMutatingLimit)
		watermark.readOnlyObserver.SetX1(float64(nonMutatingLimit))
	}
	if mutatingLimit != 0 {
		mutatingChan = make(chan bool, mutatingLimit)
		watermark.mutatingObserver.SetX1(float64(mutatingLimit))
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx := r.Context()
		requestInfo, ok := apirequest.RequestInfoFrom(ctx)
		// Skip tracking long running events.
		// Long-running requests bypass the limit; what counts as long-running is explained below.
		if longRunningRequestCheck != nil && longRunningRequestCheck(r, requestInfo) {
			handler.ServeHTTP(w, r)
			return
		}
		var c chan bool
		isMutatingRequest := !nonMutatingRequestVerbs.Has(requestInfo.Verb)
		if isMutatingRequest {
			c = mutatingChan
		} else {
			c = nonMutatingChan
		}
		if c == nil {
			handler.ServeHTTP(w, r)
		} else {
			select {
			// If the send succeeds, the channel still has free capacity and the request is served.
			case c <- true:
				// We note the concurrency level both while the
				// request is being served and after it is done being
				// served, because both states contribute to the
				// sampled stats on concurrency.
				if isMutatingRequest {
					watermark.recordMutating(len(c))
				} else {
					watermark.recordReadOnly(len(c))
				}
				// The deferred receive frees one slot in the buffered channel once the request finishes.
				defer func() {
					<-c
					if isMutatingRequest {
						watermark.recordMutating(len(c))
					} else {
						// Records the current channel occupancy; this is the value reported
						// by the apiserver_current_inflight_requests metric.
						watermark.recordReadOnly(len(c))
					}
				}()
				handler.ServeHTTP(w, r)
			default:
				// Reaching here means the buffered channel is full and the request will be throttled.
				// at this point we're about to return a 429, BUT not all actors should be rate limited. A system:master is so powerful
				// that they should always get an answer. It's a super-admin or a loopback connection.
				if currUser, ok := apirequest.UserFrom(ctx); ok {
					// Requests from the privileged system group are never throttled.
					for _, group := range currUser.GetGroups() {
						if group == user.SystemPrivilegedGroup {
							handler.ServeHTTP(w, r)
							return
						}
					}
				}
				// We need to split this data between buckets used for throttling.
				// Record the dropped request in metrics.
				if isMutatingRequest {
					metrics.DroppedRequests.WithContext(ctx).WithLabelValues(metrics.MutatingKind).Inc()
				} else {
					metrics.DroppedRequests.WithContext(ctx).WithLabelValues(metrics.ReadOnlyKind).Inc()
				}
				// Record the termination; this feeds apiserver_request_terminations_total, which is the
				// more reliable metric for telling whether the apiserver has started throttling.
				// The metric also carries the 429 status code as a label.
				metrics.RecordRequestTermination(r, requestInfo, metrics.APIServerComponent, http.StatusTooManyRequests)
				// tooManyRequests sets the Retry-After header and the 429 status code,
				// telling the client to retry after 1 second.
				tooManyRequests(r, w)
			}
		}
	})
}
func tooManyRequests(req *http.Request, w http.ResponseWriter) {
	// Return a 429 status indicating "Too Many Requests"
	w.Header().Set("Retry-After", retryAfter)
	http.Error(w, "Too many requests, please try again later.", http.StatusTooManyRequests)
}
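On the client side, the Retry-After header set here can be honored directly. Below is a minimal, illustrative sketch of such a client; the apiserver address and the KUBE_TOKEN environment variable are placeholders for this article's test cluster, not anything defined by kube-apiserver itself.
// retry429.go: a client-side sketch that honors the 429 / Retry-After behavior
// implemented by tooManyRequests above. The endpoint and the KUBE_TOKEN
// environment variable are placeholders for the test cluster in this article.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
	"strconv"
	"time"
)

func getWithRetry(client *http.Client, url, token string, maxAttempts int) (*http.Response, error) {
	for attempt := 1; ; attempt++ {
		req, err := http.NewRequest("GET", url, nil)
		if err != nil {
			return nil, err
		}
		req.Header.Set("Authorization", "Bearer "+token)
		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		// Anything other than 429 (or running out of attempts) is returned to the caller.
		if resp.StatusCode != http.StatusTooManyRequests || attempt >= maxAttempts {
			return resp, nil
		}
		// The max-in-flight filter sets Retry-After in seconds; fall back to 1s.
		delay := 1
		if v, err := strconv.Atoi(resp.Header.Get("Retry-After")); err == nil && v > 0 {
			delay = v
		}
		resp.Body.Close()
		fmt.Printf("throttled (429), retrying in %ds (attempt %d)\n", delay, attempt)
		time.Sleep(time.Duration(delay) * time.Second)
	}
}

func main() {
	// Like curl -k: skip verification of the cluster's private CA.
	client := &http.Client{
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}
	resp, err := getWithRetry(client, "https://9.9.9.134:6443/api/v1/nodes", os.Getenv("KUBE_TOKEN"), 5)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("final status:", resp.StatusCode)
}
Real clients built on client-go already apply client-side rate limiting (QPS/Burst), so hand-rolling this is rarely necessary; the sketch only shows what the 429 path looks like on the wire.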
So what is a long-running request? BasicLongRunningRequestCheck gives the precise definition; different callers pass in different verb and subresource sets, so the definition varies with the use case.
For the max-in-flight limiter, watch requests and pprof requests count as long-running.
LongRunningFunc = genericfilters.BasicLongRunningRequestCheck(sets.NewString("watch"), sets.NewString())

// BasicLongRunningRequestCheck returns true if the given request has one of the specified verbs or one of the specified subresources, or is a profiler request.
func BasicLongRunningRequestCheck(longRunningVerbs, longRunningSubresources sets.String) apirequest.LongRunningRequestCheck {
	return func(r *http.Request, requestInfo *apirequest.RequestInfo) bool {
		if longRunningVerbs.Has(requestInfo.Verb) {
			return true
		}
		if requestInfo.IsResourceRequest && longRunningSubresources.Has(requestInfo.Subresource) {
			return true
		}
		if !requestInfo.IsResourceRequest && strings.HasPrefix(requestInfo.Path, "/debug/pprof/") {
			return true
		}
		return false
	}
}
Which verbs count as mutating?
Read-only verbs such as get, list, and watch are non-mutating; everything else is mutating.
var (
	nonMutatingRequestVerbs = sets.NewString("get", "list", "watch")
	watchVerbs              = sets.NewString("watch")
)
At this point the two flags, max-mutating-requests-inflight and max-requests-inflight, are fully understood, along with what mutating and long-running requests mean.
The source is straightforward, but it is still worth summarizing kube-apiserver's throttling flow:
- In the kube-apiserver handler chain, the throttling check happens after authentication and before authorization.
- Of the two flags, one applies to mutating requests and the other to non-mutating requests. Roughly speaking, non-mutating requests read resources and mutating requests modify them.
- The implementation, taking non-mutating requests as the example: a buffered channel with capacity max-requests-inflight is created; an arriving request occupies one slot and frees it once it has been handled. So the flag values are a limit on concurrent in-flight requests, not on requests per second (a stripped-down sketch of this pattern follows the list).
- Long-running requests and requests from privileged users are not subject to throttling.
- Precisely because this scheme is so blunt, the community built the more sophisticated APF (API Priority and Fairness) mechanism.
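To make the buffered-channel mechanism concrete, here is a stripped-down, standalone sketch of the same pattern. It is not kube-apiserver code: the limit of 3 and the deliberately slow handler are made up purely for illustration.
// maxinflight_sketch.go: a toy version of the buffered-channel limiter used by
// WithMaxInFlightLimit. The limit of 3 and the slow handler are illustrative only.
package main

import (
	"fmt"
	"net/http"
	"time"
)

// withMaxInFlight returns 429 once `limit` requests are already in flight.
func withMaxInFlight(next http.Handler, limit int) http.Handler {
	slots := make(chan struct{}, limit) // channel capacity == concurrency limit
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case slots <- struct{}{}: // a slot is free: serve the request
			defer func() { <-slots }() // release the slot when the request finishes
			next.ServeHTTP(w, r)
		default: // channel full: throttle, mirroring tooManyRequests above
			w.Header().Set("Retry-After", "1")
			http.Error(w, "Too many requests, please try again later.", http.StatusTooManyRequests)
		}
	})
}

func main() {
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(2 * time.Second) // simulate a slow backend so requests pile up
		fmt.Fprintln(w, "ok")
	})
	http.ListenAndServe(":8080", withMaxInFlight(slow, 3))
}
Hitting this toy server with more than 3 concurrent connections (for example wrk -c 10) makes the surplus requests come back as 429, which is exactly the behavior we will observe against kube-apiserver below.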
Observing kube-apiserver throttling
Seeing is believing. With the theory above, we can raise wrk's concurrency to stress kube-apiserver and use the Prometheus dashboard to quantify the throttling.
The test cluster runs kubernetes 1.22. APF was introduced as an alpha feature in 1.18 and has been enabled by default since 1.20, so for this experiment kube-apiserver is reconfigured with APF turned off, which makes the plain max-in-flight throttling easier to observe.
# Note that enable-priority-and-fairness is set to false
root 74107 25.5 10.0 1172420 377168 ? Ssl 08:56 0:07 kube-apiserver --advertise-address=9.9.9.134 --allow-privileged=true --authorization-mode=Node,RBAC --client-ca-file=/etc/kubernetes/pki/ca.crt --enable-admission-plugins=NodeRestriction --enable-bootstrap-token-auth=true --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key --etcd-servers=https://127.0.0.1:2379 --feature-gates=IPv6DualStack=true --kubelet-client-certificate=/etc/kubernetes/pki/apiserver-kubelet-client.crt --kubelet-client-key=/etc/kubernetes/pki/apiserver-kubelet-client.key --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt --proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key --requestheader-allowed-names=front-proxy-client --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6443 --service-account-issuer=https://kubernetes.default.svc.cluster.local --service-account-key-file=/etc/kubernetes/pki/sa.pub --service-account-signing-key-file=/etc/kubernetes/pki/sa.key --service-cluster-ip-range=10.96.0.0/24,2001:db8:42:1::/112 --tls-cert-file=/etc/kubernetes/pki/apiserver.crt --tls-private-key-file=/etc/kubernetes/pki/apiserver.key --enable-priority-and-fairness=false
# -c is set to 700 here, above kube-apiserver's non-mutating default of 400.
[root@localhost ~]# wrk -d 30s -c 700 -t 2 --header "Authorization: bearer $TOKEN" https://9.9.9.134:6443/api/v1/nodes --latency
Running 30s test @ https://9.9.9.134:6443/api/v1/nodes
2 threads and 700 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 96.19ms 236.12ms 2.00s 90.64%
Req/Sec 8.77k 2.73k 15.65k 67.43%
Latency Distribution
50% 14.79ms
75% 31.67ms
90% 301.99ms
99% 1.23s
511597 requests in 30.02s, 251.76MB read
Socket errors: connect 0, read 0, write 0, timeout 521
Non-2xx or 3xx responses: 503952
Requests/sec: 17044.57
Transfer/sec: 8.39MB
Open the Prometheus dashboard and check apiserver_request_terminations_total, apiserver_dropped_requests_total, and apiserver_current_inflight_requests. The graphs show that once concurrency climbs past 400, kube-apiserver starts throttling requests.
(Metric names may differ in older kubernetes versions.)
In wrk's output, the timeouts and non-2xx/3xx responses add up to 503952 + 521 = 504473, which lines up closely with the sum of the two values shown on the Prometheus dashboard, 504086 + 297 = 504383.
Next, look at apiserver_dropped_requests_total.
Then apiserver_current_inflight_requests: its maximum is 400, which matches kube-apiserver's default.
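For readers who want these numbers outside the dashboard, the same metrics can be pulled through Prometheus' instant-query HTTP API. The sketch below assumes Prometheus is reachable at localhost:9090 (for example via a port-forward of the monitoring-namespace service); that address and the service name are assumptions, while the query expressions use the metrics discussed above.
// query_throttling.go: pull the throttling-related metrics through the Prometheus
// instant-query API. The Prometheus address is an assumption for this test setup
// (e.g. kubectl port-forward -n monitoring svc/prometheus 9090).
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func query(promAddr, expr string) (string, error) {
	resp, err := http.Get(promAddr + "/api/v1/query?query=" + url.QueryEscape(expr))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	exprs := []string{
		// requests rejected with 429 per second
		`sum(rate(apiserver_request_terminations_total{code="429"}[1m]))`,
		// dropped requests per second, split into mutating / readOnly
		`sum(rate(apiserver_dropped_requests_total[1m])) by (request_kind)`,
		// requests currently in flight (capped at 400 / 200 by default)
		`apiserver_current_inflight_requests`,
	}
	for _, e := range exprs {
		out, err := query("http://localhost:9090", e)
		if err != nil {
			fmt.Println("query failed:", err)
			continue
		}
		fmt.Println(e, "=>", out)
	}
}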
The test environment also has audit logging enabled; the audit log records the details of every request, including the status code.
When a cluster with only a handful of nodes still triggers kube-apiserver throttling, the most likely cause is a client using the API badly. Identifying that client from kube-apiserver's metrics alone is hard, so the audit log is the tool for analyzing abnormal requests. For example, if configmap requests dominate, the audit policy can be narrowed to record only configmap resources.
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Request","auditID":"5ebb71a2-9fd4-4527-8c7b-4f45c182a067","stage":"ResponseComplete","requestURI":"/api/v1/nodes","verb":"list","user":{"username":"system:serviceaccount:kube-system:node-controller","uid":"79d53bde-776f-4494-a2c7-09b98e153a3f","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["9.9.9.134"],"objectRef":{"resource":"nodes","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":429},"requestReceivedTimestamp":"2022-04-14T03:26:42.099782Z","stageTimestamp":"2022-04-14T03:26:42.099821Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:kube-system:node-controller"}}
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Request","auditID":"331ac4a2-1a6d-4063-908d-6588ca8a92c7","stage":"ResponseComplete","requestURI":"/api/v1/nodes","verb":"list","user":{"username":"system:serviceaccount:kube-system:node-controller","uid":"79d53bde-776f-4494-a2c7-09b98e153a3f","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["9.9.9.134"],"objectRef":{"resource":"nodes","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":429},"requestReceivedTimestamp":"2022-04-14T03:26:42.094597Z","stageTimestamp":"2022-04-14T03:26:42.099842Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:kube-system:node-controller"}}
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Request","auditID":"1afc35c1-1da2-4a98-bcac-f22f45e22905","stage":"ResponseComplete","requestURI":"/api/v1/nodes","verb":"list","user":{"username":"system:serviceaccount:kube-system:node-controller","uid":"79d53bde-776f-4494-a2c7-09b98e153a3f","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["9.9.9.134"],"objectRef":{"resource":"nodes","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":429},"requestReceivedTimestamp":"2022-04-14T03:26:42.099792Z","stageTimestamp":"2022-04-14T03:26:42.099849Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:kube-system:node-controller"}}
APF (API Priority and Fairness)
As the analysis above shows, this minimal rate limiting is very coarse: requests are split only into mutating and non-mutating (readonly). When a single client mistakenly floods kube-apiserver with requests, throttling kicks in and hurts every other client as well. That is why the more sophisticated APF was created.
APF's basic idea is to classify requests at a much finer granularity, so that high-priority requests are served preferentially. The name carries two keywords:
- Priority: requests have priorities, and higher-priority requests get a larger share of the processing capacity.
- Fairness: requests at the same priority level are treated fairly.
A simple example: if kube-apiserver's concurrency budget is 600, it can be split per User or per namespace, so that even if the requests from one User or namespace go haywire, requests from the others are unaffected.
The official documentation introduces many APF concepts and many metrics (more metrics make it easier to see how requests are distributed), and the implementation is fairly involved, so only a brief introduction is given here. The main goal is still to show how to use kube-apiserver's metrics to observe its behavior, so that when throttling happens it is easier to tell whether a client is misbehaving or kube-apiserver itself has hit its limit.
Below are some of APF's default rules; note that system-leader-election (leader election) requests are given a fairly high priority.
[root@localhost ~]# kubectl get prioritylevelconfigurations.flowcontrol.apiserver.k8s.io
NAME              TYPE      ASSUREDCONCURRENCYSHARES   QUEUES   HANDSIZE   QUEUELENGTHLIMIT   AGE
catch-all         Limited   5                          <none>   <none>     <none>             115d
exempt            Exempt    <none>                     <none>   <none>     <none>             115d
global-default    Limited   20                         128      6          50                 115d
leader-election   Limited   10                         16       4          50                 115d
node-high         Limited   40                         64       6          50                 115d
system            Limited   30                         64       6          50                 115d
workload-high     Limited   40                         128      6          50                 115d
workload-low      Limited   100                        128      6          50                 115d
[root@localhost ~]# kubectl get flowschemas.flowcontrol.apiserver.k8s.io
NAME                           PRIORITYLEVEL     MATCHINGPRECEDENCE   DISTINGUISHERMETHOD   AGE    MISSINGPL
exempt                         exempt            1                    <none>                115d   False
probes                         exempt            2                    <none>                115d   False
system-leader-election         leader-election   100                  ByUser                115d   False
workload-leader-election       leader-election   200                  ByUser                115d   False
system-node-high               node-high         400                  ByUser                115d   False
system-nodes                   system            500                  ByUser                115d   False
kube-controller-manager        workload-high     800                  ByNamespace           115d   False
kube-scheduler                 workload-high     800                  ByNamespace           115d   False
kube-system-service-accounts   workload-high     900                  ByNamespace           115d   False
service-accounts               workload-low      9000                 ByUser                115d   False
global-default                 global-default    9900                 ByUser                115d   False
catch-all                      catch-all         10000                ByUser                115d   False
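A convenient way to see this classification in action: when APF is enabled (the default), kube-apiserver adds the X-Kubernetes-PF-FlowSchema-UID and X-Kubernetes-PF-PriorityLevel-UID headers to every response, identifying which FlowSchema and PriorityLevelConfiguration the request was matched to. Here is a minimal sketch that prints them; the apiserver address and the KUBE_TOKEN environment variable are the same placeholders used in the curl example earlier.
// apf_headers.go: print which FlowSchema / PriorityLevel a request was classified
// into, using the response headers added by APF. The address and KUBE_TOKEN are
// placeholders for this article's test cluster.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
)

func main() {
	client := &http.Client{
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}
	req, err := http.NewRequest("GET", "https://9.9.9.134:6443/api/v1/nodes", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("KUBE_TOKEN"))

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	fmt.Println("status:            ", resp.StatusCode)
	fmt.Println("FlowSchema UID:    ", resp.Header.Get("X-Kubernetes-PF-FlowSchema-UID"))
	fmt.Println("PriorityLevel UID: ", resp.Header.Get("X-Kubernetes-PF-PriorityLevel-UID"))
}
The printed UIDs can be matched against the metadata.uid of the FlowSchema and PriorityLevelConfiguration objects listed above.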
Definitions of some apiserver metrics
Only a subset of kube-apiserver's metric definitions is listed here.
The names roughly convey what each metric does — requestCounter is the total number of requests, RegisteredWatchers is the number of watch requests currently connected to kube-apiserver — but for a deeper understanding of the metrics, read the corresponding source comments and implementation.
metrics = []resettableCollector{
	deprecatedRequestGauge,
	requestCounter,
	longRunningRequestGauge,
	requestLatencies,
	responseSizes,
	DroppedRequests,
	TLSHandshakeErrors,
	RegisteredWatchers,
	WatchEvents,
	WatchEventsSizes,
	currentInflightRequests,
	currentInqueueRequests,
	requestTerminationsTotal,
	apiSelfRequestCounter,
	requestFilterDuration,
	requestAbortsTotal,
	requestPostTimeoutTotal,
}
References
https://github.com/kubernetes/kubernetes/commit/73614ddd4e42728a36c7ac6b7b20f27c8032cafb
APF was introduced as an alpha feature in kubernetes 1.18 and has been enabled by default since 1.20.