Sharing our experience with Chaos Monkey in our products

As you may know, Chaos Monkey is a tool for chaos engineering, which is a popular topic in software engineering. Chaos engineering aims to simulate unexpected situations, such as lost connections, heavy network traffic and network latency, in order to reduce risk in production. We live in the information era, and everything moves faster and faster. If a broken product reaches the production environment, it can cost us revenue, market share and clients, and nobody will show us any sympathy. Therefore, we should stop being passive and actively find out what could go wrong before a product is released. Of course, there are many kinds of monkeys, such as Security Monkey (scans the product for security problems) and a "heavy" monkey (overloads the system with traffic). Chaos Monkey is a tool that injects failures on purpose to help engineers detect potential problems.
Our team has also implemented a simple chaos monkey: a tool that tests the robustness of the whole system by randomly breaking parts of it. The environment is deployed on VMs running Kubernetes, and the cluster contains many pods in the active state. The tool helps us analyze the system's health: if some pods become inactive, meaning that some backend services are no longer available, can the whole system still recover by itself or not?
The Kubernetes project provides official client libraries (kubernetes-client), which give us a way to monitor the cluster and perform operations at the pod level. One of them is a Python module, so we can use Python to run the operations we want to simulate.
To list all pods in the cluster, we can use the code below:

from kubernetes import client, config

# Configs can be set in Configuration class directly or using helper utility
config.load_kube_config()

v1 = client.CoreV1Api()
print("Listing pods with their IPs:")
ret = v1.list_pod_for_all_namespaces(watch=False)
for i in ret.items:
    print("%s\t%s\t%s" % (i.status.pod_ip, i.metadata.namespace, i.metadata.name))

As you can see, the code is very short. Deleting a pod is just as simple:

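# Excerpt from a method of the chaos monkey class: pod_list holds the matched
# pods, self.client is a CoreV1Api instance, and delete_pod_name_list records
# what has been deleted. All three are defined elsewhere in the tool.
body = client.V1DeleteOptions()
one_pod = pod_list[0]
namespace = one_pod.metadata.namespace
pod_name = one_pod.metadata.name
delete_pod_name_list.append(pod_name)
logging.info("start deleting the pod %s in namespace %s.", pod_name, namespace)
result = self.client.delete_namespaced_pod(name=pod_name, namespace=namespace, body=body)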
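
After a pod is deleted, the question raised earlier is whether the system recovers by itself. Below is a minimal sketch of such a check, not the code of our tool. It assumes that "recovered" means every pod in a namespace reports the phase Running or Succeeded, which is a simplification (a Running pod is not necessarily ready). The function names, the timeout and the polling interval are made up for illustration; api is expected to be a CoreV1Api instance like v1 above.

```python
import time


def unhealthy_pods(api, namespace):
    """Return the names of pods that are neither Running nor Succeeded."""
    pods = api.list_namespaced_pod(namespace=namespace).items
    return [
        pod.metadata.name
        for pod in pods
        if pod.status.phase not in ("Running", "Succeeded")
    ]


def wait_for_recovery(api, namespace, timeout_s=120, interval_s=5, sleep=time.sleep):
    """Poll the namespace until no pod is unhealthy.

    Returns True if the namespace recovered, False if timeout_s elapsed first.
    The sleep parameter exists only so the loop can be tested without waiting.
    """
    waited = 0
    while True:
        if not unhealthy_pods(api, namespace):
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

With a real cluster this could be called as wait_for_recovery(client.CoreV1Api(), "default") right after the delete call; the namespace name here is only an example.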

We define the pod list in an Excel file. Each entry describes the pods on which we want to run the random delete action, and contains a cron expression (how often the action runs), a label selector (identifies the pods to operate on) and a strategy (one, random or all), which decides which of the matched pods are operated on. If the strategy is "one" and more than one pod matches, only the first matched pod is deleted.
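
The strategy logic described above can be sketched roughly as follows. This is an illustration under assumptions, not the code of our tool: the rule is shown as a plain dict instead of an Excel row, the key names are hypothetical, and "random" is assumed to mean one pod picked at random from the matches. The cron schedule is left out; run_once stands for what would happen at each scheduled tick.

```python
import random


def select_targets(pods, strategy, rng=random):
    """Choose which of the matched pods to delete in one chaos run."""
    if not pods:
        return []
    if strategy == "one":
        return [pods[0]]           # only the first matched pod
    if strategy == "random":
        return [rng.choice(pods)]  # assumption: a single random pod
    if strategy == "all":
        return list(pods)
    raise ValueError(f"unknown strategy: {strategy!r}")


def run_once(api, rule):
    """Delete the pods selected by one rule; return the deleted pod names.

    api is expected to behave like kubernetes.client.CoreV1Api.
    """
    matched = api.list_pod_for_all_namespaces(
        label_selector=rule["label_selector"]
    ).items
    deleted = []
    for pod in select_targets(matched, rule["strategy"]):
        api.delete_namespaced_pod(
            name=pod.metadata.name, namespace=pod.metadata.namespace
        )
        deleted.append(pod.metadata.name)
    return deleted


# Hypothetical rule: every 10 minutes, delete one random pod labelled app=web.
example_rule = {
    "label_selector": "app=web",
    "cron": "*/10 * * * *",
    "strategy": "random",
}
```

Keeping the selection separate from the delete call makes the strategies easy to check without touching a real cluster.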


To run the chaos monkey test in any environment, we also provide a simple way to set it up. We wrote a Dockerfile and built an image that contains the core features of our chaos monkey. To monitor the status of the cluster and present the test results, we use Flask. If someone asks you to test an environment, you only need to copy the deployment file, replace the value of the placeholder "hostname" with the master node of the current cluster, and then run kubectl create -f chaos-monkey.yaml (or kubectl apply -f chaos-monkey.yaml). The pod will start, you will see some pods in the cluster restart at random, and the report will be shown on the web page.
If you want to disable the job, run kubectl delete -f chaos-monkey.yaml and the job will be stopped.
Currently, we have only implemented automatic pod killing. Even so, it is very useful for making our released products more reliable, and it boosts our confidence. We will continue to enhance the tool in the future (network latency, traffic overload, ...).
