Preface
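ElasticDL is a deep learning system built on TensorFlow 2.0 that supports elastic scheduling. It can be thought of as an upgraded Kubeflow. More importantly, ElasticDL is written in Python by developers in China. Because ElasticDL starts and stops processes by calling the Kubernetes API, Kubernetes must be installed first. For well-known reasons, installing Kubernetes on a local machine tends to fail at the image-pull step, so a cloud host in Alibaba Cloud's Hong Kong region or an overseas region is recommended. A pay-as-you-go instance with 4 cores and 16 GB of memory stops incurring charges once it is shut down, which makes it the cheapest way to try out and learn AI.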
Install Python 3
ElasticDL requires Python >= 3.6.
Ubuntu 18.04 ships with Python 3.6, which meets the requirement.
Ubuntu 16.04 ships with Python 3.5, which must be upgraded to Python 3.6.
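The version requirement above can be checked before going further. The following is a minimal sketch, not part of ElasticDL: the helper names `version_ge` and `check_python` are made up for this example, and it assumes `sort -V` (version sort, available in GNU coreutils on Ubuntu) is present.

```shell
# Succeed when version $1 is greater than or equal to version $2.
# Uses `sort -V` so that 3.10 is correctly treated as newer than 3.6.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Report whether the local python3 meets the minimum stated above (3.6).
check_python() {
    required="3.6"
    if ! command -v python3 >/dev/null 2>&1; then
        echo "python3 not found"
        return 1
    fi
    # `python3 --version` prints a line such as "Python 3.6.9";
    # the second field is the version number.
    found="$(python3 --version 2>&1 | awk '{print $2}')"
    if version_ge "$found" "$required"; then
        echo "Python $found is new enough"
    else
        echo "Python $found is older than $required"
        return 1
    fi
}
```

Calling `check_python` before installing the ElasticDL client shows right away whether the host needs a Python upgrade.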
Install Docker
$ sudo apt-get update
# Install the prerequisite packages
$ sudo apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common
# Add Docker's official GPG key
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
# Verify that you now have the key with the expected fingerprint
$ sudo apt-key fingerprint 0EBFCD88
# Set up the stable repository
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
Install Docker Engine - Community
# Refresh the package index
$ sudo apt-get update
# Install the latest docker-ce
$ sudo apt-get install docker-ce
# Enable Docker at boot, then start it now
$ sudo systemctl enable docker
$ sudo systemctl start docker
Install kubectl
$ curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl && chmod +x kubectl && sudo mv kubectl /usr/local/bin
Install minikube
$ curl -Lo minikube https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 && chmod +x minikube && sudo mv minikube /usr/local/bin/
After installation, verify the version:
$ minikube version
minikube version: v1.11.0
commit: 57e2f55f47effe9ce396cea42a1e0eb4f611ebbd
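With Docker, kubectl and minikube installed, a quick preflight check can confirm that all three are reachable on PATH before the cluster is created. This is only a sketch; `missing_tools` is a helper name invented for this example.

```shell
# Print every command from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check the three tools installed in the steps above.
missing="$(missing_tools docker kubectl minikube)"
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not installed yet:" $missing
else
    echo "docker, kubectl and minikube are all on PATH"
fi
```

The check only looks for the executables; it does not verify that the Docker daemon is running.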
Install the ElasticDL client and download the source code
$ pip install elasticdl_client
$ git clone https://github.com/sql-machine-learning/elasticdl.git
Create the Kubernetes cluster
$ sudo mkdir /data
$ minikube start --vm-driver=none --cpus 2 --memory 6144 --disk-size=50gb --mount=true --mount-string="/data:/data"
$ cd elasticdl
$ kubectl apply -f elasticdl/manifests/elasticdl-rbac.yaml
Build the Docker image for distributed training
$ cd model_zoo
$ elasticdl zoo init
$ elasticdl zoo build --image=elasticdl:mnist .
Prepare the MNIST data
$ docker pull elasticdl/elasticdl:dev
$ cd ..
$ docker run --rm -it \
-v $HOME/.keras/datasets:/root/.keras/datasets \
-v $PWD:/work \
-w /work elasticdl/elasticdl:dev \
bash -c "scripts/gen_dataset.sh data"
$ sudo cp -r data/* /data
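Before launching the job, it may be worth confirming that the copy produced the two directories the training command reads from, `/data/mnist/train` and `/data/mnist/test`. The sketch below uses hypothetical helper names (`dir_has_files`, `check_mnist_data`) and only checks that each directory exists and is non-empty; it makes no assumption about the file format inside.

```shell
# Succeed when directory $1 exists and contains at least one entry.
dir_has_files() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Check both dataset directories under the given root (default: /data).
check_mnist_data() {
    root="${1:-/data}"
    rc=0
    for sub in mnist/train mnist/test; do
        if dir_has_files "$root/$sub"; then
            echo "ok: $root/$sub"
        else
            echo "missing or empty: $root/$sub"
            rc=1
        fi
    done
    return "$rc"
}
```

Running `check_mnist_data` with no argument inspects `/data`; passing another root, for example `check_mnist_data "$PWD/data"`, inspects the freshly generated copy instead.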
Start training
$ elasticdl train \
--image_name=elasticdl:mnist \
--model_zoo=model_zoo \
--model_def=mnist_functional_api.mnist_functional_api.custom_model \
--training_data=/data/mnist/train \
--validation_data=/data/mnist/test \
--num_epochs=2 \
--master_resource_request="cpu=0.2,memory=1024Mi" \
--master_resource_limit="cpu=1,memory=2048Mi" \
--worker_resource_request="cpu=0.4,memory=1024Mi" \
--worker_resource_limit="cpu=1,memory=2048Mi" \
--ps_resource_request="cpu=0.2,memory=1024Mi" \
--ps_resource_limit="cpu=1,memory=2048Mi" \
--minibatch_size=64 \
--num_minibatches_per_task=2 \
--num_ps_pods=1 \
--num_workers=1 \
--evaluation_steps=50 \
--grads_to_wait=1 \
--job_name=test-mnist \
--log_level=INFO \
--image_pull_policy=Never \
--volume="host_path=/data,mount_path=/data" \
--distribution_strategy=ParameterServerStrategy
# Check the job status and logs
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
elasticdl-test-mnist-master 1/1 Running 0 33s
elasticdl-test-mnist-ps-0 1/1 Running 0 30s
elasticdl-test-mnist-worker-0 1/1 Running 0 30s
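The status table above can also be checked from a script rather than by eye. The sketch below assumes the default `kubectl get pods` column layout shown here (NAME, READY, STATUS, ...) and the `elasticdl-<job_name>-` pod naming visible in the output; `job_pods_running` is a helper name chosen for this example.

```shell
# Read `kubectl get pods` output on stdin and succeed only if at least one
# pod name starts with the prefix $1 and every such pod has STATUS "Running".
job_pods_running() {
    awk -v prefix="$1" '
        $1 == "NAME" { next }            # skip the header row
        index($1, prefix) == 1 {         # pod name begins with the prefix
            seen++
            if ($3 != "Running") bad++
        }
        END { exit (seen > 0 && bad == 0) ? 0 : 1 }
    '
}
```

A typical call while the job is training would be `kubectl get pods | job_pods_running elasticdl-test-mnist && echo "all pods running"`.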
$ kubectl logs elasticdl-test-mnist-worker-0 | grep "Loss"
[2020-04-14 02:46:28,535] [INFO] [worker.py:879:_process_minibatch] Loss is 3.07190203666687
[2020-04-14 02:46:28,920] [INFO] [worker.py:879:_process_minibatch] Loss is 9.413976669311523
[2020-04-14 02:46:29,120] [INFO] [worker.py:879:_process_minibatch] Loss is 3.9641590118408203
[2020-04-14 02:46:29,344] [INFO] [worker.py:879:_process_minibatch] Loss is 15.329755783081055
[2020-04-14 02:46:29,551] [INFO] [worker.py:879:_process_minibatch] Loss is 3.8414430618286133
[2020-04-14 02:46:29,817] [INFO] [worker.py:879:_process_minibatch] Loss is 2.7703640460968018
[2020-04-14 02:46:30,041] [INFO] [worker.py:879:_process_minibatch] Loss is 6.920175075531006
[2020-04-14 02:46:30,242] [INFO] [worker.py:879:_process_minibatch] Loss is 4.37514925003051
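Per-minibatch loss values jump around a lot, as the lines above show, so a running summary can be easier to read than the raw log. The following sketch assumes the log format shown here, where the loss is the last field of each "Loss is" line; `loss_summary` is a name invented for this example.

```shell
# Read worker log lines on stdin and print "<count> <mean loss>" for all
# lines containing "Loss is"; prints nothing if no such line is found.
loss_summary() {
    awk '/Loss is/ { sum += $NF; n++ }
         END { if (n > 0) printf "%d %.4f\n", n, sum / n }'
}
```

It can be fed directly from the worker log, for example `kubectl logs elasticdl-test-mnist-worker-0 | loss_summary`.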
$ kubectl logs elasticdl-test-mnist-master | grep "Evaluation"
[2020-04-14 02:46:21,836] [INFO] [master.py:192:prepare] Evaluation service started
[2020-04-14 02:46:40,750] [INFO] [evaluation_service.py:214:complete_task] Evaluation metrics[v=50]: {'accuracy': 0.21933334}
[2020-04-14 02:46:53,827] [INFO] [evaluation_service.py:214:complete_task] Evaluation metrics[v=100]: {'accuracy': 0.5173333}
[2020-04-14 02:47:07,529] [INFO] [evaluation_service.py:214:complete_task] Evaluation metrics[v=150]: {'accuracy': 0.6253333}
[2020-04-14 02:47:23,251] [INFO] [evaluation_service.py:214:complete_task] Evaluation metrics[v=200]: {'accuracy': 0.752}
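To follow how accuracy evolves, the evaluation lines can be reduced to two columns. This sketch assumes the exact log format shown above, with a single `accuracy` metric; the `v=` value is printed as-is (it appears to advance in steps of 50, matching `--evaluation_steps=50`, though that reading is an assumption). `eval_accuracy` is a hypothetical helper name.

```shell
# Read master log lines on stdin and print "<v> <accuracy>" for each
# "Evaluation metrics" line; other lines are ignored.
eval_accuracy() {
    sed -n "s/.*Evaluation metrics\[v=\([0-9][0-9]*\)\]: {'accuracy': \([0-9.][0-9.]*\)}.*/\1 \2/p"
}
```

For example, `kubectl logs elasticdl-test-mnist-master | eval_accuracy` prints one line per evaluation round, which is convenient to redirect into a file for plotting.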