Zero-downtime ingress controller updates with Consul and upsync

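Background

An ingress controller load-balances traffic to the services inside a cluster. In a public cloud, the ingress controller itself can be load-balanced through a Service of type LoadBalancer. Private clouds offer limited load balancer support. Open-source solutions such as MetalLB exist, but few of them are actually running in production. In addition, companies that already have a complete architecture rarely expose the ingress controller directly to public traffic or internal callers. In most cases the ingress controller is attached as an upstream, serving as the backend of an existing load-balancing layer (nginx). Updating the ingress controller without dropping traffic therefore comes down to removing a server from an upstream without dropping traffic. Here we solve that with Weibo's upsync module and Consul.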

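Implementation

Overview

Create dedicated nodes in the cluster (modest specs are enough) that run only the ingress controller (as a DaemonSet). Node labels and taints enforce this. The ingress controller uses hostNetwork to bind ports on the host, and a ClusterIP Service is created for it.
Deploy the consul-sync-catalog service in the cluster. consul-sync-catalog syncs Kubernetes Services into the Consul cluster and registers them as Consul services.
nginx uses Weibo's upsync module to fetch the IP and port of each instance of the service from Consul dynamically. When the Kubernetes Endpoints change, consul-sync-catalog syncs the change to Consul, and upsync updates the upstream's server list automatically, with no nginx reload.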

Topology diagram

Steps

Cluster operations

Set the node role and taint the nodes

$ kubectl label  node pg-k8s-node-01   node-role.kubernetes.io/edge=edge
$ kubectl label  node pg-k8s-node-02   node-role.kubernetes.io/edge=edge
$ kubectl taint node pg-k8s-node-01 node-role.kubernetes.io/edge:NoSchedule
$ kubectl taint node pg-k8s-node-02 node-role.kubernetes.io/edge:NoSchedule

Change the ingress controller to a DaemonSet (we use Istio's ingressgateway). The modified parts are shown in the YAML below

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: istio-ingressgateway
    istio: ingressgateway
    operator.istio.io/component: IngressGateways
    operator.istio.io/managed: Reconcile
    operator.istio.io/version: 1.5.0
    release: istio
  name: istio-ingressgateway
  namespace: istio-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: istio-ingressgateway
      istio: ingressgateway
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2020-09-10T10:50:07+08:00"
        sidecar.istio.io/inject: "false"
      creationTimestamp: null
      labels:
        app: istio-ingressgateway
        chart: gateways
        heritage: Tiller
        istio: ingressgateway
        release: istio
        service.istio.io/canonical-name: istio-ingressgateway
        service.istio.io/canonical-revision: "1.5"
    spec:
      containers:
      - args:
        - proxy
        - router
        - --domain
        - $(POD_NAMESPACE).svc.cluster.local
        - --proxyLogLevel=warning
        - --proxyComponentLogLevel=misc:error
        - --log_output_level=default:info
        - --drainDuration
        - 45s
        - --parentShutdownDuration
        - 1m0s
        - --connectTimeout
        - 10s
        - --serviceCluster
        - istio-ingressgateway
        - --zipkinAddress
        - zipkin.istio-system:9411
        - --proxyAdminPort
        - "15000"
        - --statusPort
        - "15020"
        - --controlPlaneAuthPolicy
        - NONE
        - --discoveryAddress
        - istio-pilot.istio-system.svc:15012
        - --trust-domain=cluster.local
        env:
        - name: SERVICE_NAME
          value: ingress-test
        - name: JWT_POLICY
          value: first-party-jwt
        - name: PILOT_CERT_PROVIDER
          value: istiod
        - name: ISTIO_META_USER_SDS
          value: "true"
        - name: CA_ADDR
          value: istio-pilot.istio-system.svc:15012
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: INSTANCE_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: HOST_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: SERVICE_ACCOUNT
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.serviceAccountName
        - name: ISTIO_META_WORKLOAD_NAME
          value: istio-ingressgateway
        - name: ISTIO_META_OWNER
          value: kubernetes://apis/apps/v1/namespaces/istio-system/deployments/istio-ingressgateway
        - name: ISTIO_META_MESH_ID
          value: cluster.local
        - name: ISTIO_AUTO_MTLS_ENABLED
          value: "true"
        - name: ISTIO_META_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: ISTIO_META_CONFIG_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: ISTIO_META_ROUTER_MODE
          value: sni-dnat
        - name: ISTIO_META_CLUSTER_ID
          value: Kubernetes
        image: dockerhub.piggy.xiaozhu.com/istio/proxyv2:1.5.0
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - sleep 40
        name: istio-proxy
        ports:
        - containerPort: 15020
          hostPort: 15020
          protocol: TCP
        - containerPort: 80
          hostPort: 80
          protocol: TCP
        - containerPort: 443
          hostPort: 443
          protocol: TCP
        - containerPort: 15029
          hostPort: 15029
          protocol: TCP
        - containerPort: 15030
          hostPort: 15030
          protocol: TCP
        - containerPort: 15031
          hostPort: 15031
          protocol: TCP
        - containerPort: 15032
          hostPort: 15032
          protocol: TCP
        - containerPort: 15443
          hostPort: 15443
          protocol: TCP
        - containerPort: 15011
          hostPort: 15011
          protocol: TCP
        - containerPort: 8060
          hostPort: 8060
          protocol: TCP
        - containerPort: 853
          hostPort: 853
          protocol: TCP
        - containerPort: 15090
          hostPort: 15090
          name: http-envoy-prom
          protocol: TCP
        readinessProbe:
          failureThreshold: 30
          httpGet:
            path: /healthz/ready
            port: 15020
            scheme: HTTP
          initialDelaySeconds: 1
          periodSeconds: 2
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "2"
            memory: 1Gi
          requests:
            cpu: 100m
            memory: 128Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/run/secrets/istio
          name: istiod-ca-cert
        - mountPath: /var/run/ingress_gateway
          name: ingressgatewaysdsudspath
        - mountPath: /etc/istio/pod
          name: podinfo
        - mountPath: /etc/istio/ingressgateway-certs
          name: ingressgateway-certs
          readOnly: true
        - mountPath: /etc/istio/ingressgateway-ca-certs
          name: ingressgateway-ca-certs
          readOnly: true
      dnsPolicy: ClusterFirstWithHostNet  # without this, the pod uses the host's DNS servers and cannot reach in-cluster Services
      hostNetwork: true # the pod uses the host network
      nodeSelector:
        node-role.kubernetes.io/edge: edge   # deploy only on edge nodes
      tolerations:
      - key: "node-role.kubernetes.io/edge"
        operator: "Exists"
        effect: "NoSchedule"  # tolerate the edge taint
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: istio-ingressgateway-service-account
      serviceAccountName: istio-ingressgateway-service-account
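      terminationGracePeriodSeconds: 30  # note: the preStop hook runs inside this grace period, so a 40s sleep is cut short at about 30s; raise this above the sleep to get the full 40s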
      volumes:
      - configMap:
          defaultMode: 420
          name: istio-ca-root-cert
        name: istiod-ca-cert
      - emptyDir: {}
        name: data
      - configMap:
          defaultMode: 420
          name: consul-client-config
        name: config
      - downwardAPI:
          defaultMode: 420
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.labels
            path: labels
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.annotations
            path: annotations
        name: podinfo
      - emptyDir: {}
        name: ingressgatewaysdsudspath
      - name: ingressgateway-certs
        secret:
          defaultMode: 420
          optional: true
          secretName: istio-ingressgateway-certs
      - name: ingressgateway-ca-certs
        secret:
          defaultMode: 420
          optional: true
          secretName: istio-ingressgateway-ca-certs
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate

Deploy the consul-sync-catalog service

$ helm repo add hashicorp https://helm.releases.hashicorp.com # add the Consul chart repository to helm
$ cat config.yaml
syncCatalog:
  enabled: true  # sync is not installed by default and must be enabled explicitly
  toConsul: true # enable k8s -> Consul sync
  toK8S: false # disable Consul -> k8s sync
  default: false # If true, all valid services in K8S are synced by default. If false, the service must be annotated properly to sync. In either case an annotation can override the default
$ helm install consul hashicorp/consul --set global.name=consul -n consul -f config.yaml # installs consul, consul-server and consul-sync-catalog

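The consul-server pod installed by helm stays Pending because it needs a PVC, which my local environment does not provide. We work around this by changing the data directory that needs persistence to an emptyDir. We also change the consul-server Service type to NodePort so that it can be reached from outside the cluster.

Add annotations to the ingressgateway Service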

apiVersion: v1
kind: Service
metadata:
  annotations:
    consul.hashicorp.com/service-name: ingressgateway
    consul.hashicorp.com/service-port: http2
    consul.hashicorp.com/service-sync: "true"
...

In the Consul UI, a service named ingressgateway is now registered, with two instances.

Load balancer configuration

Install the upsync module

$ cd /root
$ yum install git pcre-devel openssl-devel # install the packages OpenResty depends on
$ git clone https://github.com/weibocom/nginx-upsync-module.git # download the upsync module
$ wget https://openresty.org/download/openresty-1.17.8.1.tar.gz -O openresty-1.17.8.1.tar.gz && tar zxf openresty-1.17.8.1.tar.gz && cd openresty-1.17.8.1 # -O names the output file (lowercase -o would write wget's log there instead)
$ ./configure --prefix=/usr/local/openresty --add-module=../nginx-upsync-module/
$ gmake -j 4 && gmake install

Configure the upstream to fetch its server list from Consul

upstream app {
   upsync 10.4.10.176:8500/v1/catalog/service/ingressgateway upsync_timeout=6m upsync_interval=1000ms upsync_type=consul_services strong_dependency=off;
   upsync_dump_path /tmp/servers_app.conf;
   include /tmp/servers_app.conf;
   server 0.0.0.0:80 down;   # without this line, the first reload fails because the upstream has no server
}
server {
  listen       80;
  server_name  api.itanony.com;
  charset utf-8;

  location /upstream_list {
      upstream_show;
  }
  location = /api/v1/products {
      proxy_pass http://app;
      proxy_http_version 1.1;
  }

}
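To confirm that upsync is populating the upstream, the dump file can be inspected directly. The helper below is a minimal sketch: the function name upsync_dump_servers is hypothetical, and it assumes the dump file holds ordinary nginx "server ADDRESS [params];" lines, which seems reasonable because the file is pulled into the upstream block with include.

```shell
# Hypothetical helper: print the address of every server entry in an upsync dump file.
# Assumption: the dump contains plain nginx "server ADDRESS [params];" lines.
upsync_dump_servers() {
    dump_file="${1:-/tmp/servers_app.conf}"
    # Keep only "server" directives and strip a trailing semicolon from the address.
    awk '$1 == "server" { addr = $2; sub(/;$/, "", addr); print addr }' "$dump_file"
}
```

Running upsync_dump_servers /tmp/servers_app.conf on the nginx host should list one address per Consul instance. The /upstream_list location configured above offers a similar view over HTTP.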

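Issues

502

We load-tested with the command below while ingressgateway was going through a rolling update. The log still showed some 502 responses, and we suspected that consul-sync-catalog was not syncing promptly.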

for i in `seq 1 1000`; do  curl -o /dev/null -s -w "%{time_total}:%{http_code}\n"  http://api.xiaozhu.com/api/v1/products| tee -a 1.log; done
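The log written by the loop above has one "time_total:http_code" pair per line, so the failures can be tallied by status code. This is a small sketch. The function name summarize_codes is hypothetical, and the default file name 1.log is taken from the command above.

```shell
# Hypothetical helper: count responses per HTTP status code.
# Expects lines shaped like "0.012:200", as produced by the curl -w format above.
summarize_codes() {
    log_file="${1:-1.log}"
    # Split on ":", count by the second field, then sort for stable output.
    awk -F: 'NF == 2 { count[$2]++ } END { for (code in count) print code, count[code] }' "$log_file" | sort
}
```

Each output line is a status code followed by its count, which makes it easy to compare the number of 502 responses before and after a configuration change.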

The official documentation describes a consulWriteInterval setting:

consulWriteInterval (string: null) - Override the default interval to perform syncing operations creating Consul services.

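This parameter appears to control how often consul-sync-catalog syncs to the Consul cluster (so consul-sync-catalog is evidently not real-time). In that case we can shorten the interval, or add a preStop hook to ingressgateway. Note that the sleep must be longer than consulWriteInterval.

Final fix: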
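The timing rule can be written down as a quick sanity check. This is a sketch under stated assumptions: the function name check_drain_budget is hypothetical, all values are whole seconds, and the second condition is an addition not stated in the text above. It reflects that Kubernetes runs the preStop hook inside terminationGracePeriodSeconds, so the sleep only completes if the grace period is longer.

```shell
# Hypothetical helper: check that the preStop sleep covers the Consul sync delay
# and fits inside the pod's termination grace period. Arguments are whole seconds.
check_drain_budget() {
    prestop_sleep="$1"
    write_interval="$2"
    grace_period="$3"
    # Rule from the text: the sleep must be longer than consulWriteInterval.
    if [ "$prestop_sleep" -le "$write_interval" ]; then
        echo "preStop sleep (${prestop_sleep}s) must exceed consulWriteInterval (${write_interval}s)"
        return 1
    fi
    # Added assumption: the hook runs inside the grace period, so the sleep must fit in it.
    if [ "$grace_period" -le "$prestop_sleep" ]; then
        echo "terminationGracePeriodSeconds (${grace_period}s) should exceed the preStop sleep (${prestop_sleep}s)"
        return 1
    fi
    echo "ok"
}
```

For example, check_drain_budget 40 10 60 passes, while a 30 second grace period with the same 40 second sleep is reported as too short.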

$ cat config.yaml
syncCatalog:
  enabled: true
  toConsul: true
  toK8S: false
  default: false
  consulWriteInterval: 10s
$ helm upgrade consul hashicorp/consul --set global.name=consul -n consul -f config.yaml
$ kubectl get daemonsets.apps  -n istio-system  istio-ingressgateway  -o yaml
...
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - sleep 40
...

Some thoughts

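The advantage of this approach is that the ingress controller sends and receives traffic directly over the host network. This performs better than NodePort and avoids some of NodePort's drawbacks. Compared with an LVS setup, it also avoids the traffic lost during a VRRP failover.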
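Why not expose the ingress through NodePort? In our tests, with the NodePort external traffic policy set to either Local or Cluster, consul-sync-catalog followed the pod distribution and removed from the instance list any node that had no pod or whose pod was unavailable. NodePort, however, adds one destination address translation compared with using the host network directly, so it is somewhat less efficient than host networking.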
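Drawback: rolling updates of the pods are fully lossless, but if a pod fails its readiness probe for some reason and is removed from the Service, that case still cannot be made fully lossless. Ignoring the network latency of the sync, the failure window is at most consulWriteInterval. The impact can be reduced with nginx_upstream_check_module, which runs active health checks at the load balancer layer and removes failing servers, or avoided with retries at the application and load balancer layers.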
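If a company already uses Consul for service discovery, a non-cluster network plugin such as macvlan can help migrate internal services to Kubernetes seamlessly. Kubernetes then provides only scheduling and resource orchestration, while service discovery and registration continue to go through Consul.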
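For a Service that defines several ports, consul-sync-catalog by default uses the first port as the instance port and writes the other ports to Consul as metadata. The consul.hashicorp.com/service-port annotation can also name explicitly which port to use as the instance port. To sync the other ports as well, the only option is to define a few more Services.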

References

https://www.consul.io/docs/k8s/service-sync#syncing-kubernetes-and-consul-services

https://www.consul.io/docs/k8s/helm