klever

Problems it solves:

Model management and distribution
Model parsing and conversion
Deployment and management of online model services

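Components

ormb: a tool for packaging, unpacking, uploading, and downloading models
model-registry: the model repository and the API management layer for model services. Note that model-registry has a bug when uploading files: the uploaded file is not saved. A small code change fixes it.
modeljob-operator: the ModelJob controller, which manages model parsing and model conversion jobs
klever-web: the front-end component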

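Dependencies

Istio: an open-source service mesh. Model services expose their addresses through Istio, which also provides content-based and ratio-based traffic splitting between model services
Harbor: the underlying model storage, which stores the model config and the model files as separate layers
Seldon Core: the open-source controller for the SeldonDeployment CRD. Model services are managed through SeldonDeployment CRs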
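To make the two traffic-splitting modes concrete, here is a minimal Python sketch of the idea. It is not how Istio is implemented or configured; the header name x-model-version and the backend names are hypothetical and only serve the illustration.

```python
import hashlib
from typing import Mapping


def pick_by_ratio(request_id: str, weights: Mapping[str, int]) -> str:
    """Map a request to a backend so that traffic follows the integer weights."""
    total = sum(weights.values())
    # Hash the request id into a bucket in [0, total); the same id always
    # lands in the same bucket, so routing is deterministic per request.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % total
    for backend, weight in weights.items():
        if bucket < weight:
            return backend
        bucket -= weight
    raise AssertionError("unreachable when all weights are positive")


def route(headers: Mapping[str, str], request_id: str,
          weights: Mapping[str, int]) -> str:
    """Content-based rule first, ratio-based split as the fallback."""
    pinned = headers.get("x-model-version")  # hypothetical header
    if pinned in weights:
        return pinned
    return pick_by_ratio(request_id, weights)
```

With weights such as {"v1": 90, "v2": 10}, roughly nine out of ten requests without the header go to v1, while a request carrying x-model-version: v2 always reaches v2.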

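Using ORMB

The model definition (config) currently uses the mediaType application/vnd.caicloud.model.config.v1alpha1+json.

Model files are hard to store in layers, so by design ormb compresses and archives them with the mediaType application/tar+gzip before uploading them to the image registry.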
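The archive-then-upload idea can be sketched with the Python standard library. This is an illustration only, not ormb's actual code: the function name pack_model_dir and the descriptor fields (modeled loosely on an OCI layer descriptor) are assumptions.

```python
import hashlib
import io
import tarfile
import tempfile
from pathlib import Path
from typing import Tuple

MODEL_LAYER_MEDIA_TYPE = "application/tar+gzip"  # media type named in the text


def pack_model_dir(model_dir: Path) -> Tuple[bytes, dict]:
    """Archive model_dir as a gzip-compressed tarball and describe the blob."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        # arcname="." keeps member paths relative to the model directory
        tar.add(str(model_dir), arcname=".")
    blob = buf.getvalue()
    descriptor = {
        "mediaType": MODEL_LAYER_MEDIA_TYPE,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }
    return blob, descriptor


def demo() -> dict:
    """Pack a tiny fake model directory and list the archived files."""
    with tempfile.TemporaryDirectory() as tmp:
        model_dir = Path(tmp) / "model"
        (model_dir / "1").mkdir(parents=True)
        (model_dir / "1" / "saved_model.pb").write_bytes(b"fake weights")
        (model_dir / "ormbfile.yaml").write_text("format: SavedModel\n")
        blob, descriptor = pack_model_dir(model_dir)
        with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
            descriptor["files"] = sorted(
                m.name for m in tar.getmembers() if m.isfile()
            )
        return descriptor


if __name__ == "__main__":
    print(demo())
```

The whole model directory becomes a single blob, which is why the registry sees one model layer next to the config rather than many fine-grained layers.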

Note: add --plain-http to ormb pull or ormb push to use plain HTTP instead of HTTPS.

ormb login creates a config.json in the current directory and stores the token in it.

$ ormb login  --insecure 192.168.194.129:30022 -u admin -p Ormb123456

ormb save (stores the files from the model directory in the local file system cache)

$ ormb save <model directory> 192.168.194.129:30022/lgy/fashion-model:v1

ormb push (pushes the model saved in the cache to the remote registry)

$ ormb push 192.168.194.129:30022/lgy/fashion-model:v1

ormb tag

ormb tag 192.168.194.129:30022/lgy/fashion-model:v1 192.168.194.129:30022/lgy/fashion-model:v2

ormb pull

ormb pull --plain-http 192.168.194.129:30022/lgy/fashion-model:v1

ormb export

ormb export 192.168.194.129:30022/lgy/fashion-model:v1

Using ormb-storage-initializer

This command combines ormb login, pull, and export. It requires the environment variables ORMB_USERNAME (the Harbor account) and ORMB_PASSWORD (the Harbor password).

ormb-storage-initializer pull-and-export 192.168.194.129:30022/lgy/fashion-model:v1 ./model

The pulled model files are saved under ./model with the following structure:

[root@localhost model]# tree
.
├── 1
│   ├── saved_model.pb
│   └── variables
│       ├── variables.data-00000-of-00001
│       └── variables.index
└── ormbfile.yaml
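When the initializer is launched from a script, the two environment variables and the arguments can be prepared as below. This is a sketch: the helper name build_pull_and_export is hypothetical, and the command line simply mirrors the one shown above.

```python
import os
from typing import Dict, List, Mapping, Optional, Tuple


def build_pull_and_export(
    ref: str,
    dest: str,
    username: str,
    password: str,
    base_env: Optional[Mapping[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return the argv and environment for ormb-storage-initializer."""
    env = dict(os.environ if base_env is None else base_env)
    env["ORMB_USERNAME"] = username  # Harbor account
    env["ORMB_PASSWORD"] = password  # Harbor password
    argv = ["ormb-storage-initializer", "pull-and-export", ref, dest]
    return argv, env
```

The result can be passed to subprocess.run(argv, env=env, check=True), which keeps the credentials out of the command line itself.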

modeljob-operator

Model parsing (extraction):

apiVersion: kleveross.io/v1alpha1
kind: ModelJob
metadata:
  name: modeljob-savedmodel-extract
  namespace: default
  labels:
    modeljob/extract: "true"
spec:
  # Add fields here
  model: "192.168.194.129:30022/lgy/fashion-model:v1"
  extraction:
    format: "SavedModel"

Model conversion:

apiVersion: kleveross.io/v1alpha1
kind: ModelJob
metadata:
  name: modeljob-caffe-convert
  namespace: default
  labels:
    modeljob/convert: "true"
spec:
  # Add fields here
  model: "192.168.194.129:30022/lgy/caff-model:v1"
  desiredTag: "192.168.194.129:30022/lgy/caff-model:v2"
  conversion:
    mmdnn:
      from: "CaffeModel"
      to: "NetDef"
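If many ModelJobs need to be created, the two manifests above can be generated programmatically. The sketch below only reproduces the fields that appear in the YAML above; the function name model_job is hypothetical, and the output is JSON, which kubectl apply -f also accepts.

```python
import json
from typing import Optional, Tuple


def model_job(
    name: str,
    model: str,
    *,
    extraction_format: Optional[str] = None,
    conversion: Optional[Tuple[str, str]] = None,
    desired_tag: Optional[str] = None,
    namespace: str = "default",
) -> dict:
    """Build a ModelJob manifest for either extraction or conversion."""
    if (extraction_format is None) == (conversion is None):
        raise ValueError("specify exactly one of extraction_format or conversion")
    spec = {"model": model}
    if extraction_format is not None:
        label = "modeljob/extract"
        spec["extraction"] = {"format": extraction_format}
    else:
        if desired_tag is None:
            raise ValueError("conversion requires desired_tag")
        label = "modeljob/convert"
        source_format, target_format = conversion
        spec["desiredTag"] = desired_tag
        spec["conversion"] = {
            "mmdnn": {"from": source_format, "to": target_format}
        }
    return {
        "apiVersion": "kleveross.io/v1alpha1",
        "kind": "ModelJob",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {label: "true"},
        },
        "spec": spec,
    }


if __name__ == "__main__":
    job = model_job(
        "modeljob-savedmodel-extract",
        "192.168.194.129:30022/lgy/fashion-model:v1",
        extraction_format="SavedModel",
    )
    print(json.dumps(job, indent=2))
```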

MLflow

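Module	Function
Tracking	An API for recording experiment parameters and comparing results, with a visual UI
Projects	A standard directory format, including a descriptor file, for packaging machine learning code into reusable, reproducible, shareable projects
Models	General-purpose model file management and deployment, supporting multiple frameworks and deployment platforms
Model Registry	A central hub for full model lifecycle management and collaboration, including versioning, environment migration, and more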

Installation

pip install mlflow
pip install conda

Tracking

import os
from random import random, randint

from mlflow import log_metric, log_param, log_artifacts

if __name__ == "__main__":
    print("Running mlflow_tracking.py")

    log_param("param1", randint(0, 100))

    log_metric("foo", random())
    log_metric("foo", random() + 1)
    log_metric("foo", random() + 2)

    if not os.path.exists("outputs"):
        os.makedirs("outputs")
    with open("outputs/test.txt", "w") as f:
        f.write("hello world!")

    log_artifacts("outputs")

python mlflow_tracking.py

The tracking results are recorded in the current directory, in a newly created mlruns directory.
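The mlruns directory can be inspected without the UI. The sketch below assumes the local file-store layout of mlruns/<experiment_id>/<run_id>/params/<name> and metrics/<name>, with one "timestamp value step" line per logged metric value; that layout is an assumption that may differ between MLflow versions, and the demo builds a fake run (with a hypothetical run id) rather than reading a real one.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def read_run(run_dir: Path) -> Dict[str, dict]:
    """Read params and metric histories from one run directory."""
    params = {
        p.name: p.read_text().strip()
        for p in sorted((run_dir / "params").iterdir())
    }
    metrics: Dict[str, List[float]] = {}
    for metric_file in sorted((run_dir / "metrics").iterdir()):
        # Assumed line format: "<timestamp> <value> <step>"
        metrics[metric_file.name] = [
            float(line.split()[1])
            for line in metric_file.read_text().splitlines()
            if line.strip()
        ]
    return {"params": params, "metrics": metrics}


def demo() -> Dict[str, dict]:
    """Create a fake run directory in the assumed layout and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = Path(tmp) / "mlruns" / "0" / "hypothetical_run_id"
        (run_dir / "params").mkdir(parents=True)
        (run_dir / "metrics").mkdir()
        (run_dir / "params" / "param1").write_text("42")
        (run_dir / "metrics" / "foo").write_text(
            "1631171684000 0.25 0\n"
            "1631171684001 1.25 0\n"
            "1631171684002 2.25 0\n"
        )
        return read_run(run_dir)


if __name__ == "__main__":
    print(demo())
```

Each call to log_metric with the same key appends to that key's history, which is why foo ends up with three values in the script above.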

Result

mlflow ui

Open the UI in a browser (by default, mlflow ui listens on http://127.0.0.1:5000).

Projects

Projects package the code together with its execution environment.

Create an MLproject file and a conda.yaml file.

conda.yaml

name: tutorial
channels:
  - conda-forge
dependencies:
  - python=3.6
  - pip
  - pip:
    - scikit-learn==0.23.2
    - mlflow>=1.0
    - pandas

MLproject

name: tutorial

conda_env: conda.yaml

entry_points:
  main:
    parameters:
      alpha: {type: float, default: 0.5}
      l1_ratio: {type: float, default: 0.1}
    command: "python train.py {alpha} {l1_ratio}"

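This records the environment the model needs, so running the command below reproduces the model's results. To skip conda, make sure the required dependencies are already installed in the runtime environment and add --no-conda to the command.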

mlflow run sklearn_elasticnet_wine -P alpha=0.5 -P l1_ratio=0.1
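The relationship between the parameters section, the -P overrides, and the command template can be illustrated with a few lines of Python. This is a simplified model of the substitution, not MLflow's implementation; in particular, rejecting undeclared parameters is a choice made for this sketch.

```python
from typing import Any, Dict, Mapping

# The entry point from the MLproject file above, written as a Python dict
ENTRY_POINT: Dict[str, Any] = {
    "parameters": {
        "alpha": {"type": "float", "default": 0.5},
        "l1_ratio": {"type": "float", "default": 0.1},
    },
    "command": "python train.py {alpha} {l1_ratio}",
}


def render_command(entry_point: Mapping[str, Any],
                   overrides: Mapping[str, str]) -> str:
    """Fill the command template with overrides, falling back to defaults."""
    declared = entry_point["parameters"]
    unknown = set(overrides) - set(declared)
    if unknown:
        # Simplification for this sketch: only declared parameters are allowed
        raise ValueError(f"undeclared parameters: {sorted(unknown)}")
    values = {}
    for name, spec in declared.items():
        raw = overrides.get(name, spec.get("default"))
        if raw is None:
            raise ValueError(f"missing required parameter: {name}")
        values[name] = float(raw) if spec["type"] == "float" else raw
    return entry_point["command"].format(**values)


if __name__ == "__main__":
    print(render_command(ENTRY_POINT, {"alpha": "0.7"}))
```

Here -P alpha=0.7 would replace only alpha, while l1_ratio keeps its default of 0.1.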


Models

$ python sklearn_logistic_regression/train.py
Score: 0.6666666666666666
Model saved in run 96f95c78fe7d4de88199a89f87a89762

Start a web server that serves a model saved with MLflow (as a prediction service):

$ mlflow models serve -m runs:/c19192871687493b940100db7c461fd3/model
2021/09/09 15:14:44 INFO mlflow.models.cli: Selected backend for flavor 'python_function'
2021/09/09 15:14:44 INFO mlflow.pyfunc.backend: === Running command 'source /root/miniconda3/bin/../etc/profile.d/conda.sh && conda activate mlflow-258677fee9248770821ae816e559134654b19176 1>&2 && gunicorn --timeout=60 -b 127.0.0.1:5000 -w 1 ${GUNICORN_CMD_ARGS} -- mlflow.pyfunc.scoring_server.wsgi:app'
[2021-09-09 15:14:44 +0800] [90727] [INFO] Starting gunicorn 20.1.0
[2021-09-09 15:14:44 +0800] [90727] [INFO] Listening at: http://127.0.0.1:5000 (90727)
[2021-09-09 15:14:44 +0800] [90727] [INFO] Using worker: sync
[2021-09-09 15:14:44 +0800] [90757] [INFO] Booting worker with pid: 90757

Request:

curl -d '{"columns":["x"], "data":[[1], [-1]]}' -H 'Content-Type: application/json; format=pandas-split' -X POST 127.0.0.1:5000/invocations
[1, 0]
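The same request can be built from Python with only the standard library. The sketch mirrors the curl call above (pandas-split payload and content type, as used by the MLflow version in this walkthrough; newer MLflow releases expect a different payload shape, so treat the format as version-dependent). The helper name is hypothetical.

```python
import json
import urllib.request
from typing import List


def build_invocation_request(url: str, columns: List[str],
                             rows: List[list]) -> urllib.request.Request:
    """Build a POST request carrying a pandas-split JSON payload."""
    payload = json.dumps({"columns": columns, "data": rows}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json; format=pandas-split"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_invocation_request(
        "http://127.0.0.1:5000/invocations", ["x"], [[1], [-1]]
    )
    # Requires the model server from the previous step to be running
    with urllib.request.urlopen(request) as response:
        print(json.loads(response.read()))
```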

model registry

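The Model Registry is a centralized model store with APIs and a UI for managing a model across its full lifecycle. It provides model lineage, model versioning, and stage transitions.

Here MinIO is the storage backend for model data, and SQLite stores the model metadata.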

export AWS_ACCESS_KEY_ID=admin      
export AWS_SECRET_ACCESS_KEY=liguoyu3564      
export MLFLOW_S3_ENDPOINT_URL=http://localhost:9000     
mlflow server \
--host 0.0.0.0 -p 5002 \
--default-artifact-root s3://mlflow \
--backend-store-uri sqlite:///mlflow.db


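A model version can be moved between stages. The default stage is None; Staging means the version is being prepared; Production means it is live in the production environment; Archived means it has been archived, that is, retired.

Serving changes accordingly, since a registered model is referenced by its name and stage: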

mlflow models serve -m "models:/newmodel/Production" -p 12346 -h 0.0.0.0 --no-conda
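The models:/<name>/<stage> URI used above is easy to build programmatically. A small sketch, with hypothetical helper names, that encodes the four stages and reproduces the serve command:

```python
from enum import Enum
from typing import List


class Stage(str, Enum):
    """The four registry stages described in the text."""
    NONE = "None"
    STAGING = "Staging"
    PRODUCTION = "Production"
    ARCHIVED = "Archived"


def model_uri(name: str, stage: Stage) -> str:
    """Build a registry URI of the form models:/<name>/<stage>."""
    return f"models:/{name}/{stage.value}"


def serve_command(name: str, stage: Stage, port: int,
                  host: str = "0.0.0.0") -> List[str]:
    """Return the argv for serving a registered model without conda."""
    return [
        "mlflow", "models", "serve",
        "-m", model_uri(name, stage),
        "-p", str(port),
        "-h", host,
        "--no-conda",
    ]


if __name__ == "__main__":
    print(" ".join(serve_command("newmodel", Stage.PRODUCTION, 12346)))
```

Because the URI names a stage rather than a run id, promoting a new version to Production changes what gets served the next time the command is started, without editing the command.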

Hub

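Problem it solves: data management and preprocessing

Hub stores a dataset as a single NumPy-like array. The data can reach petabyte scale and live in the cloud, and it can be accessed and used seamlessly from any machine.

Hub makes any kind of data stored in the cloud, including images, audio, and video, usable as quickly as if it were stored locally. It integrates with PyTorch and TensorFlow.

Installation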

$ pip3 install hub

Creating a dataset

(base) [root@localhost hub]# tree animals/ -C   # the existing dataset directory
animals/
├── cats
│   ├── image_1.jpg
│   └── image_2.jpg
└── dogs
    ├── image_3.jpg
    └── image_4.jpg

Creating the Hub dataset

Manual creation

import hub
from PIL import Image
import numpy as np
import os

# Create an empty dataset
ds = hub.empty('./animals_hub')

# Walk the source folder and collect the files to upload
dataset_folder = './animals'

# Each sub-folder name is a class name
class_names = os.listdir(dataset_folder)

files_list = []
for dirpath, dirnames, filenames in os.walk(dataset_folder):
    for filename in filenames:
        files_list.append(os.path.join(dirpath, filename))

# Create the tensors and the metadata
with ds:
    ds.create_tensor('images', htype = 'image', sample_compression = 'jpeg')
    ds.create_tensor('labels', htype = 'class_label', class_names = class_names)
    ds.info.update(description = 'My first Hub dataset')
    ds.images.info.update(camera_type = 'SLR')

# Populate the dataset
with ds:
    # Iterate through the files and append to hub dataset
    for file in files_list:
        label_text = os.path.basename(os.path.dirname(file))
        label_num = class_names.index(label_text)

        ds.images.append(hub.read(file))  # Append to images tensor using hub.read
        ds.labels.append(np.uint32(label_num)) # Append to labels tensor
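One detail of the script above is worth isolating: labels are the index of the parent folder in class_names, and os.listdir returns entries in arbitrary order, so the label numbers can differ between machines. A standard-library sketch of a deterministic variant (sorting the class names is a suggestion of this sketch, not something the original script does):

```python
import os
import tempfile
from pathlib import Path
from typing import List, Tuple


def index_dataset(root: Path) -> Tuple[List[str], List[Tuple[str, int]]]:
    """Return sorted class names and (file name, label index) pairs."""
    class_names = sorted(
        entry.name for entry in os.scandir(root) if entry.is_dir()
    )
    samples = []
    for label, class_name in enumerate(class_names):
        for path in sorted((Path(root) / class_name).iterdir()):
            if path.is_file():
                samples.append((path.name, label))
    return class_names, samples


def demo() -> Tuple[List[str], List[Tuple[str, int]]]:
    """Recreate the animals/ layout from the text and index it."""
    layout = {
        "dogs": ["image_3.jpg", "image_4.jpg"],
        "cats": ["image_1.jpg", "image_2.jpg"],
    }
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "animals"
        for class_name, files in layout.items():
            (root / class_name).mkdir(parents=True)
            for file_name in files:
                (root / class_name / file_name).write_bytes(b"")
        return index_dataset(root)


if __name__ == "__main__":
    print(demo())
```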

Automatic creation

import hub

src = "./animals"
dest = './animals_hub_auto'  # destination dataset; local storage is used here

ds = hub.ingest(src, dest)

Data compression

ds.create_tensor('images', htype = 'image', sample_compression = 'jpeg')

Data access

import hub

# Local Filepath
ds = hub.load('./my_dataset_path')

# S3
ds = hub.load('s3://my_dataset_bucket', creds={...})

## Activeloop Storage - See Step 6
# Public Dataset hosted by Activeloop
ds = hub.load('hub://activeloop/public_dataset_name')

# Dataset in another workspace on Activeloop Platform
ds = hub.load('hub://workspace_name/dataset_name')

### NO HIERARCHY ###
ds.images # is equivalent to
ds['images']

ds.labels # is equivalent to
ds['labels']

### WITH HIERARCHY - COMING SOON ###
ds.localization.boxes # is equivalent to
ds['localization/boxes']

ds.localization.labels # is equivalent to
ds['localization/labels']

# Indexing
W = ds.images[0].numpy() # Fetch an image and return a NumPy array
X = ds.labels[0].numpy(aslist=True) # Fetch a label and store it as a 
                                    # list of NumPy arrays

# Slicing
Y = ds.images[0:100].numpy() # Fetch 100 images and return a NumPy array
                             # The method above produces an exception if 
                             # the images are not all the same size

Z = ds.labels[0:100].numpy(aslist=True) # Fetch 100 labels and store 
                                         # them as a list of NumPy arrays

DVC

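Module	Function
Data Versioning	Version management for large data files, datasets, and machine learning models. The data is stored separately, like a "Git for data".
Data Access	Access to DVC-managed data artifacts from outside the project, for example downloading a specific version of a model file and deploying it.
Data Pipelines	Defines the pipelines that produce models and other data artifacts, similar to a traditional Makefile.
Metrics, parameters, and plots	Information recorded at each stage of the pipeline
Experiments	A visual tool for browsing and comparing experiments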


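DVC works together with git to version data, models, and code.
Easy to install: pip3 install dvc
Easy to use: dvc push, dvc pull, and so on
Fast: dvc add generates a small new file. For example, dvc add data.sql creates data.sql.dvc (a few KB). git tracks and uploads data.sql.dvc, and DVC uses that .dvc file to pull the corresponding data file. To get a specific version of the data, model, and code, run git checkout <version> and then dvc pull.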
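The pointer-file mechanism described above can be modeled in a few lines. This is a conceptual sketch, not DVC's code: real .dvc files are YAML with more fields, and the real cache layout depends on the DVC version. The function names and the simplified two-line pointer format are assumptions.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def add_to_cache(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a pointer."""
    md5 = hashlib.md5(data_file.read_bytes()).hexdigest()
    cached = cache_dir / md5[:2] / md5[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, cached)
    # The small pointer file is what git would track
    pointer = data_file.with_name(data_file.name + ".dvc")
    pointer.write_text(f"md5: {md5}\npath: {data_file.name}\n")
    return pointer


def checkout(pointer: Path, cache_dir: Path) -> Path:
    """Restore the data file that the pointer refers to from the cache."""
    fields = dict(
        line.split(": ", 1) for line in pointer.read_text().splitlines()
    )
    md5 = fields["md5"]
    target = pointer.with_name(fields["path"])
    shutil.copyfile(cache_dir / md5[:2] / md5[2:], target)
    return target


def demo() -> str:
    """Add two versions, then go back to the first via its pointer."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "cache"
        data = root / "data.xml"

        data.write_text("version 1")
        pointer = add_to_cache(data, cache)
        pointer_v1 = pointer.read_text()  # what git stores for version 1

        data.write_text("version 2")
        add_to_cache(data, cache)

        pointer.write_text(pointer_v1)  # stands in for a git checkout
        return checkout(pointer, cache).read_text()


if __name__ == "__main__":
    print(demo())
```

Only the tiny pointer changes between commits, so git history stays small while the cache (and the remote it is pushed to) holds every version of the large file.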

Installation (requires Python 3.6+)

pip3 install dvc

Usage

DVC currently supports the following seven remote types:

local - local directory
s3 - Amazon S3
gs - Google Cloud Storage
azure - Azure Blob Storage
ssh - SSH
hdfs - Hadoop Distributed File System
http - HTTP and HTTPS

Versioning data and models

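git init  # initialize git

dvc init  # initialize DVC

dvc init creates a .dvc directory containing .dvc/.gitignore, .dvc/cache/, and .dvc/config. The most important of these is .dvc/cache/, where DVC keeps the cached copies of tracked files; these are what eventually get pushed to remote storage.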


dvc add data/data.xml  # put data.xml under DVC version control


git add data/data.xml.dvc data/.gitignore
git commit -m "Add raw data"

Storage and sharing

DVC supports many remote storage types, including Amazon S3, SSH, Google Drive, Azure Blob Storage, and HDFS.

dvc remote add -d storage s3://mybucket/dvcstore
dvc remote add -d localstorage /tmp/dvc-storage
git add .dvc/config
git commit -m "Configure remote storage"
dvc push  # upload the data to the default remote configured above
git push -u origin master  # push the .dvc pointer files to git


S3

$ dvc remote add -d s3 s3://dvc
$ dvc remote modify s3 access_key_id admin
$ dvc remote modify s3 secret_access_key liguoyu3564
$ dvc remote modify s3 endpointurl http://127.0.0.1:9000
$ cat .dvc/config 
[core]
    remote = s3
['remote "storage"']
    url = s3://mybucket/dvcstore
['remote "localstorage"']
    url = /tmp/dvc-storage
['remote "s3"']
    url = s3://dvc
    access_key_id = admin
    secret_access_key = liguoyu3564
    endpointurl = http://127.0.0.1:9000

Pulling data

dvc pull


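Modifying data

After the data changes, run dvc add again, commit the updated .dvc file with git, and run dvc push to upload the new version.

Switching versions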

git checkout HEAD~1 data/data.xml.dvc
dvc checkout

Listing files

(base) [root@localhost data]# dvc list http://192.168.194.129:3000/liguoyu/test data
.gitignore                                                                    
data.xml
data.xml.dvc

Downloading files

This command downloads only the data file itself:

(base) [root@localhost test]# dvc get http://192.168.194.129:3000/liguoyu/test data/data.xml -o datatest/data.xml
(base) [root@localhost test]# tree datatest/                                                                                                                                      
datatest/
└── data.xml

0 directories, 1 file

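Importing files

Unlike dvc get, dvc import also creates a data.xml.dvc file that records where the data came from: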

(base) [root@localhost test]# dvc import http://192.168.194.129:3000/liguoyu/test data/data.xml -o datatest/data.xml
Importing 'data/data.xml (http://192.168.194.129:3000/liguoyu/test)' -> 'datatest/data.xml'
                                                                                                                                                                                  
To track the changes with git, run:

    git add datatest/.gitignore datatest/data.xml.dvc
(base) [root@localhost test]# tree datatest/ -C
datatest/
├── data.xml
└── data.xml.dvc

0 directories, 2 files
(base) [root@localhost test]# ls -al datatest/
total 37012
drwxr-xr-x. 2 root root       60 Sep  8 11:43 .
drwxr-xr-x. 6 root root       76 Sep  8 11:43 ..
-rw-r--r--. 1 root root 37891863 Sep  8 11:43 data.xml
-rw-r--r--. 1 root root      272 Sep  8 11:43 data.xml.dvc
-rw-r--r--. 1 root root       10 Sep  8 11:43 .gitignore

API usage

The API lets an application read the data content directly at runtime.

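import dvc.api

# Open data/data.xml directly from the remote repository
with dvc.api.open(
    'data/data.xml',
    repo='http://192.168.194.129:3000/liguoyu/test'
) as fd:
    # The original snippet ended at the "with" line; reading the file is one possible body
    content = fd.read()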

Preparing the data: creating the training and validation sets

$ wget https://code.dvc.org/get-started/code.zip
$ unzip code.zip
$ rm -f code.zip
$ tree
.
├── params.yaml
└── src
    ├── evaluate.py
    ├── featurization.py
    ├── prepare.py
    ├── requirements.txt
    └── train.py  
$ pip3 install -r src/requirements.txt
$ dvc run -n prepare \
          -p prepare.seed,prepare.split \
          -d src/prepare.py -d data/data.xml \
          -o data/prepared \
          python src/prepare.py data/data.xml

DAG: the output of one stage is declared as a dependency of another stage

$ dvc run -n featurize \
          -p featurize.max_features,featurize.ngrams \
          -d src/featurization.py -d data/prepared \
          -o data/features \
          python src/featurization.py data/prepared data/features
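How the -d and -o options imply an execution order can be shown with a topological sort. The sketch below copies the deps and outs of the two stages defined so far and uses Python's graphlib (3.9+); it illustrates the dependency idea only and is not how DVC itself resolves the pipeline.

```python
from graphlib import TopologicalSorter
from typing import Dict, List

# deps (-d) and outs (-o) of the stages defined above
STAGES: Dict[str, Dict[str, List[str]]] = {
    "featurize": {
        "deps": ["src/featurization.py", "data/prepared"],
        "outs": ["data/features"],
    },
    "prepare": {
        "deps": ["src/prepare.py", "data/data.xml"],
        "outs": ["data/prepared"],
    },
}


def execution_order(stages: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Order stages so that every stage runs after the producers of its deps."""
    producer = {
        out: name for name, stage in stages.items() for out in stage["outs"]
    }
    graph = {
        name: {producer[dep] for dep in stage["deps"] if dep in producer}
        for name, stage in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(STAGES))
```

featurize depends on data/prepared, which prepare produces, so prepare always comes first regardless of the order in which the stages were declared. Dependencies with no producer, such as source files, are treated as external inputs.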

Collecting metrics

$ dvc run -n evaluate \
          -d src/evaluate.py -d model.pkl -d data/features \
          -M scores.json \
          --plots-no-cache prc.json \
          --plots-no-cache roc.json \
          python src/evaluate.py model.pkl \
                 data/features scores.json prc.json roc.json
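Once a metrics file such as scores.json exists for two versions, comparing them is a small dictionary operation, in the spirit of dvc metrics diff. The sketch below is not DVC's implementation, and the metric names used in the example (auc, f1) are hypothetical since the contents of scores.json are not shown here.

```python
import json
from typing import Dict, Optional


def metrics_diff(
    old: Dict[str, float], new: Dict[str, float]
) -> Dict[str, Dict[str, Optional[float]]]:
    """Compare two flat metric dictionaries key by key."""
    diff = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        change = (
            after - before
            if before is not None and after is not None
            else None
        )
        diff[name] = {"old": before, "new": after, "change": change}
    return diff


if __name__ == "__main__":
    previous = json.loads('{"auc": 0.5}')             # hypothetical old scores
    current = json.loads('{"auc": 0.75, "f1": 0.5}')  # hypothetical new scores
    print(json.dumps(metrics_diff(previous, current), indent=2))
```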

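Conclusion:

Deeply integrated with git-style workflows, with very similar commands, so the learning curve looks low.
Rich storage options for large data files, from cloud storage to private big-data storage, including Google Drive, Amazon S3, Azure Blob Storage, Google Cloud Storage, Aliyun OSS, SSH, HDFS, and HTTP.
Supports many programming languages and deep learning frameworks, such as Python, R, Julia, Scala Spark, custom binary, Notebooks, flatfiles/TensorFlow, and PyTorch.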
