Summary of "A Berkeley View of Systems Challenges for AI"

I. Earlier work in the series

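The "A Berkeley View of" series has two papers so far. Besides the one summarized here, there is the 2009 paper "Above the Clouds: A Berkeley View of Cloud Computing", which has 12,042 citations on Google Scholar.
A quick look back at the 2009 predictions about cloud computing: seen from today, they turned out to be fairly accurate.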

1. Can cloud providers make a profit?

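Yes: with mid-range servers in remote locations, at sufficient scale and utilization, a provider can be profitable (cheaper power, and deploying close to users saves external bandwidth).
Years in which the major public clouds appeared:
AWS 2006 (now the largest in the world)
Alibaba Cloud 2009 (now the largest in China)
Baidu Cloud 2015
Tencent Cloud 2016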

2. Is it feasible and cost-effective for enterprise R&D users to adopt the public cloud?

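The conclusion is that it is both feasible and cost-effective.
Hardware prices fall over time, and different kinds of hardware get cheaper at different rates.
Moore's Law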

Price per GB of disk

Price per GB of memory

A story of business change driven by Moore's Law

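Other early Google products, such as Gmail (why not an “unlimited” inbox?), grew out of the same thinking. As Marissa Mayer put it: we firmly believed in Moore's Law, so we made this bold bet. Gmail spread by word of mouth, with everyone waiting for the luck of an invitation. A typical user consumes little storage in the first year or two, and by the time the accumulated data grows large, the price per GB of storage has already dropped by an order of magnitude. So once the mindset shifts and you dare to think what others will not, your options open up, your approach changes abruptly, and the cost structure ends up completely different. When Yahoo Mail, free below 4 MB and paid above, saw users flooding into Gmail and hurried to follow, it found that the cost structure of its IOE (IBM, Oracle, EMC) stack simply could not compete. That left an awkward choice: grit its teeth and bleed money to keep up, or cut its losses and rebuild the system?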

3. Challenges that cloud computing needed to solve

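1) User programs must adapt to virtual machines (performance loss: CPU and RAM 4%, disk I/O 26%)

  Provide application services and storage services separately

2) Fast start and stop must be supported

  Virtual machines should be as small as possible

3) How to achieve higher availability (distribution, microservices)

  Adopt microservices and deploy across regions

4) Avoiding cloud lock-in

  Enterprise users will try to deploy on multiple clouds and migrate flexibly

5) Access to and protection of private data

  Data can be synced to a private cloud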

II. Background of this paper

1. The authors and their labs

1) Most of the 14 authors are leading figures in academia and industry, and several are startup founders
https://people.eecs.berkeley.edu/~istoica/
https://people.eecs.berkeley.edu/~dawnsong/
http://people.eecs.berkeley.edu/~jordan/
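2) The authors work in many research areas
Distributed systems, AI, security, statistical algorithms, networking, database systems, embedded systems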

III. Main questions discussed

Four application scenarios and the nine corresponding open technical problems

challenges

1. Four trends in AI applications

1) Mission-critical AI
Assists people with specific tasks and may do them better than people can: surgical robots, self-driving cars, robot vacuum cleaners

Challenges: Design AI systems that learn continually by interacting with a dynamic environment, while making decisions that are timely, robust, and secure.

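2) Personalized AI
Provides more personalized AI services by collecting user data, e.g. personal assistants (助理来也), the personalized face unlock on iPhone X, and self-driving systems that adapt to different driving styles

Challenges: Design AI systems that enable personalized applications and services yet do not compromise users’ privacy and security.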

3) AI across organizations
Share training data while preserving data ownership, e.g. among hospitals or among banks (where organizations within the same industry compete with one another)

Challenges: Design AI systems that can train on datasets owned by different organizations without compromising their confidentiality, and in the process provide AI capabilities that span the boundaries of potentially competing organizations.

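4) AI demands outpacing the Moore’s Law
An estimated 400 ZB of data is expected to be generated in 2018 (1 ZB = 1024 × 1024 × 1024 × 1024 GB), with exponential growth expected through 2025. Moore's Law has broken down and can no longer keep up with AI's demands, whether in compute, storage, or networking.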
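
To make the unit conversion concrete, here is a small Python sketch of the arithmetic in the note above. It assumes binary (1024-based) prefixes, as the note does; the name `zb_to_gb` is made up for illustration.

```python
# Binary-prefix conversion used in the note: 1 ZB = 1024^4 GB.
GB_PER_ZB = 1024 ** 4


def zb_to_gb(zb):
    """Convert zettabytes to gigabytes using 1024-based prefixes."""
    return zb * GB_PER_ZB


projected_2018_gb = zb_to_gb(400)
print(f"1 ZB   = {GB_PER_ZB:,} GB")          # 1,099,511,627,776 GB
print(f"400 ZB = {projected_2018_gb:,} GB")  # 439,804,651,110,400 GB
```

With decimal (1000-based) prefixes the figures would be somewhat smaller, but the order of magnitude is the same.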

Challenges: Develop domain-specific architectures and software systems to address the performance needs of future AI applications in the post-Moore’s Law era, including custom chips for AI workloads, edge-cloud systems to efficiently process data at the edge, and techniques for abstracting and sampling data.

2. The nine open technical problems

Acting in dynamic environments

The patrol-robot example
consider a group of robots providing security for a building. When one robot breaks or a new one is added, the other robots must update their strategies for navigation, planning, and control in a coordinated manner.
Similarly, when the environment changes, either due to the robots’ own actions or to external conditions (e.g., an elevator going out of service, or a malicious intruder), all robots must re-calibrate their actions in light of the change.

R1: Continual learning.

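Today models are trained offline -> optimized -> served for online prediction. Even in the best case the turnaround is on the order of hours.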

ML pipeline

To improve adaptability, more automated pipelines are introduced, which brings the security problems discussed later

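Online learning: train and update the model online
RL is expected to be the direction: train thoroughly in a simulated environment, though this needs many systems-level optimizations
requiring millions or even billions of simulations to explore the solution space and “solve” complex tasks. No suitable system exists yet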
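
As a rough illustration of what "train and update the model online" means, here is a minimal Python sketch of a linear model updated one example at a time with stochastic gradient descent. The class name, the learning rate, and the synthetic data stream are all assumptions made for this example; it is not the paper's system.

```python
import random


class OnlineLinearModel:
    """Linear model updated one example at a time with SGD on squared loss."""

    def __init__(self, n_features, lr=0.05):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def update(self, x, y):
        """One SGD step on a single (x, y) pair; returns the error before the step."""
        err = self.predict(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err
        return err


# Hypothetical data stream generated by y = 2*x0 - 1*x1 + 0.5.
rng = random.Random(0)
model = OnlineLinearModel(n_features=2)
for _ in range(5000):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    y = 2.0 * x[0] - 1.0 * x[1] + 0.5
    model.update(x, y)  # the model can serve predictions between updates
```

Unlike the offline pipeline, there is no separate retraining job here: every new observation moves the weights a little, so the model tracks the stream with per-example latency.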

Simulated reality (SR).
SR enables an agent to learn not only much faster but also much more safely.
Consider a robot cleaning an environment that encounters an object it has not seen before, e.g., a new cellphone. The robot could physically experiment with the cellphone to determine how to grasp it, but this may require a long time and might damage the phone. In contrast, the robot could scan the 3D shape of the phone into a simulator, perform a few physical experiments to determine rigidity, texture, and weight distribution, and then use SR to learn how to successfully grasp it without damage.

“A test task that takes 30 minutes on the Apollo 1.5 simulation system takes only 30 seconds on the optimized simulation system.” (王京傲, Baidu, CES 2018)

Open technical points:
(1) Build systems for RL that fully exploit parallelism, while allowing dynamic task graphs, providing millisecond-level latencies, and running on heterogeneous hardware under stringent deadlines.
(2) Build systems that can faithfully simulate the real-world environment, as the environment changes continually and unexpectedly, and run faster than real time.
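
A toy Python sketch of the "parallelism plus dynamic task graph" idea in point (1): rollouts run concurrently, and a promising result spawns a follow-up task that was not known up front. Everything here (the `rollout` stand-in simulator, the spawning rule, the use of threads) is an assumption for illustration; a real system would need processes or a distributed scheduler, and this sketch says nothing about millisecond latencies or heterogeneous hardware.

```python
import random
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def rollout(seed, steps=100):
    """Toy simulation: total reward of a random policy (stand-in for a real simulator)."""
    rng = random.Random(seed)
    return sum(rng.uniform(-1.0, 1.0) for _ in range(steps))


def explore(initial_seeds, max_tasks=32, workers=4):
    """Run rollouts in parallel; good results dynamically spawn follow-up rollouts."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(rollout, s): s for s in initial_seeds}
        submitted = len(pending)
        while pending:
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                seed = pending.pop(fut)
                results[seed] = fut.result()
                # Dynamic task graph: this task only exists because of the result above.
                if results[seed] > 0 and submitted < max_tasks:
                    pending[pool.submit(rollout, seed + 1000)] = seed + 1000
                    submitted += 1
    return results


results = explore(range(8))
best_seed = max(results, key=results.get)
```

The point of the sketch is the control flow: the set of tasks is not fixed in advance, which is what makes static dataflow-style schedulers a poor fit for this kind of workload.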

R2: Robust decisions.

An example
the Microsoft Tay chat bot relied heavily on human interaction to develop rich natural dialogue capabilities. However, when exposed to Twitter messages, Tay quickly took on a dark personality.

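Once online learning is in production, the AI system may encounter harmful data or highly uncertain data. In that case it should either make no decision or take only a predefined safe action (for example, a self-driving car slowing down and stopping).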

Open technical points:
(1) Build fine-grained provenance support into AI systems to connect outcome changes (e.g., reward or state) to the data sources that caused these changes, and automatically learn causal, source-specific noise models.
(2) Design API and language support for developing systems that maintain confidence intervals for decision-making, and in particular can process unforeseen inputs.
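
One way to picture "maintain confidence intervals for decision-making" together with the safe-fallback idea above is the following Python sketch. It assumes the system has several model scores for the same input (for example from an ensemble) and uses a normal-approximation interval; the action names, the threshold, and the width limit are all hypothetical.

```python
import statistics

SAFE_ACTION = "slow_down_and_stop"  # hypothetical predefined fallback


def decide(ensemble_scores, threshold=0.5, z=1.96, max_half_width=0.1):
    """Act only when the confidence interval is tight and clear of the threshold."""
    mean = statistics.mean(ensemble_scores)
    sem = statistics.stdev(ensemble_scores) / len(ensemble_scores) ** 0.5
    low, high = mean - z * sem, mean + z * sem
    if high - low > 2 * max_half_width:
        return SAFE_ACTION, (low, high)  # too uncertain: do not act
    if low > threshold:
        return "proceed", (low, high)
    if high < threshold:
        return "yield", (low, high)
    return SAFE_ACTION, (low, high)      # interval straddles the threshold


action, interval = decide([0.90, 0.92, 0.88, 0.91])  # tight, well above 0.5
```

Scores that disagree widely, such as `[0.1, 0.9, 0.2, 0.8]`, produce a wide interval and fall through to the safe action rather than a guess.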

R3: Explainable decisions.

Especially important in medical AI
Which parts of the input data led to the conclusion

For example, one may wish to know what features of a particular organ in an X-ray (e.g., size, color, position, form) led to a particular diagnosis and how the diagnosis would change under minor perturbations of those features.

Open technical points:
(1) Build AI systems that can support interactive diagnostic analysis, that faithfully replay past executions, and that can help to determine the features of the input that are responsible for a particular decision, possibly by replaying the decision task against past perturbed inputs. More generally, provide systems support for causal inference.
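
The "replay against perturbed inputs" idea can be sketched in a few lines of Python: nudge one feature at a time, rerun the decision, and see which nudge moves the output most. The `model` function and its coefficients are invented stand-ins for a trained diagnostic model, and the feature names simply echo the X-ray example above.

```python
def model(features):
    """Hypothetical diagnostic score; stands in for a trained model."""
    return 0.8 * features["size"] + 0.1 * features["position"] - 0.3 * features["color"]


def sensitivity(predict, features, delta=0.01):
    """Change in the prediction when each feature alone is nudged by `delta`."""
    base = predict(features)
    effects = {}
    for name, value in features.items():
        perturbed = dict(features, **{name: value + delta})
        effects[name] = predict(perturbed) - base
    return effects


scan = {"size": 1.2, "position": 0.4, "color": 0.7}
effects = sensitivity(model, scan)
most_influential = max(effects, key=lambda k: abs(effects[k]))
```

This only measures local sensitivity; it is not causal inference, which is why the paper asks for broader systems support.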

Secure AI
Direct attacks that take control of the system
Disclosed TensorFlow vulnerability

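“The flaw lies in how AI models are processed. One attack scenario: an attacker publishes an AI model online for others to use, and anyone who downloads and runs it is compromised. Alternatively, an attacker who can control a system's AI model can carry out the attack. So systems using TensorFlow must take care not to use AI models that are faulty or have been modified by attackers.
Two publicly disclosed AI framework vulnerabilities are known so far: one is the set of vulnerabilities, found earlier by 360, in third-party components pulled in by three AI frameworks; the other is the vulnerability in the framework itself that we found this time.”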

R4: Secure enclaves.

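For example, when deploying on a cluster such as a public cloud, code is isolated at run time: code inside the enclave can access the data, while other processes cannot access the code running inside the enclave, and the isolation is enforced in hardware. In practice it is advisable to split the code into a confidential part and a non-confidential part and run them in different runtimes.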

Intel SGX

Intel’s Software Guard Extensions (SGX) [5], which provides a hardware-enforced isolated execution environment. Code inside SGX can compute on data, while even a compromised operating system or hypervisor (running outside the enclave) cannot see this code or data. SGX also provides remote attestation [6], a protocol enabling a remote client to verify that the enclave is running the expected code.

ARM TrustZone

Open technical points:
(1) Build AI systems that leverage secure enclaves to ensure data confidentiality, user privacy and decision integrity, possibly by splitting the AI system’s code between a minimal code base running within the enclave, and code running outside the enclave. Ensure the code inside the enclave does not leak information, or compromise decision integrity.

R5: Adversarial learning.

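evasion attacks
Inference stage: modify an image so that it is misclassified
There is no good defense at present
data poisoning attacks
Training stage: mix mislabeled data into the training set. Especially when the AI system learns continually, untrusted training data is more likely to cause errors
Replay and explainability can be used to weed out some of the offending data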

Open technical points:
(1) Build AI systems that are robust against adversarial inputs both during training and prediction (e.g., decision making), possibly by designing new machine learning models and network architectures, leveraging provenance to track down fraudulent data sources, and replaying to redo decisions after eliminating the fraudulent sources.
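
To show why evasion attacks are hard to defend against, here is a minimal Python sketch on a linear classifier. It uses a sign-of-gradient perturbation (in the spirit of fast-gradient-sign attacks, which the notes above do not name); the weights and the input are invented for the example.

```python
def sign(v):
    return (v > 0) - (v < 0)


def classify(w, b, x):
    """Linear classifier: returns +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


def evade(w, b, x, eps):
    """Move each input component by at most eps in the direction that lowers the
    score of the current class (for a linear model the gradient w.r.t. x is w)."""
    label = classify(w, b, x)
    return [xi - eps * label * sign(wi) for xi, wi in zip(x, w)]


w, b = [0.6, -0.4, 0.2], -0.1     # hypothetical trained weights
x = [0.5, 0.3, 0.2]               # score = 0.12, classified as +1
x_adv = evade(w, b, x, eps=0.15)  # score drops by 0.15 * sum(|w|) = 0.18
```

No component of the input moves by more than 0.15, yet the predicted class flips from +1 to -1. With image-sized inputs the per-pixel change needed is far smaller, which is what makes such perturbations hard to notice.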

R6: Shared learning on confidential data.

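Example: organizations that both compete and cooperate
Banks share fraud-detection models and data
Hospitals share flu-detection data and models

Keep the data private while training the model
One approach is to run everything in the secure hardware enclaves described in R4
Another is to use special algorithms, at a considerable cost to training performance

multi-party computation (MPC): several parties jointly carry out one computation:
(1) local computation: compute gradients locally
(2) computation using MPC: combine the gradients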
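
The two steps above can be sketched with additive secret sharing, one of the simplest MPC building blocks. This is an illustrative Python sketch, not the protocol from the paper: the modulus, the fixed-point scale, and the gradient values are assumptions, and a real deployment would use cryptographically secure randomness instead of `random.Random`.

```python
import random

PRIME = 2**61 - 1  # field modulus (assumed; any sufficiently large prime works)
SCALE = 10**6      # fixed-point scale for encoding floats as integers


def share(value, n_parties, rng):
    """Split a value into n additive shares that sum to it modulo PRIME."""
    encoded = round(value * SCALE) % PRIME
    shares = [rng.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((encoded - sum(shares)) % PRIME)
    return shares


def reconstruct(total):
    """Decode a field element back to a signed fixed-point number."""
    if total > PRIME // 2:
        total -= PRIME
    return total / SCALE


rng = random.Random(42)               # seeded only to make the demo repeatable
local_gradients = [0.25, -0.75, 1.5]  # (1) each party computes its own gradient
n = len(local_gradients)
# Party i sends its j-th share to party j, so party j only ever sees column j.
all_shares = [share(g, n, rng) for g in local_gradients]
partial_sums = [sum(column) % PRIME for column in zip(*all_shares)]
# (2) only the partial sums are published and combined.
aggregate = reconstruct(sum(partial_sums) % PRIME)
```

Each individual share is a uniformly random field element, so no party learns another party's gradient; only the sum (here 1.0) is revealed.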

Open technical points:
Build AI systems that (1) can learn across multiple data sources without leaking information from a data source during training or serving, and (2) provide incentives to potentially competing organizations to share their data or models.

AI-specific architectures

R7: Domain specific hardware.

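Moore's Law has broken down, while AI demands ever more compute and memory access
Advances beyond the CPU:
TPU, FPGA
Advances beyond DRAM and SSD:
3D XPoint from Intel and Micron aims to provide 10× storage capacity with DRAM-like performance. (better memory)
STT MRAM aims to succeed Flash, which may hit similar scaling limits as DRAM. (better SSD)
Server configurations will become more varied and more heterogeneous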

Data center architecture

Architecture design reference:
https://bar.eecs.berkeley.edu/projects/2015-firebox.html

Open technical points:
(1) Design domain-specific hardware architectures to improve the performance and reduce power consumption of AI applications by orders of magnitude, or enhance the security of these applications. (faster, lower power, more secure)
(2) Design AI software systems to take advantage of these domain-specific architectures, resource disaggregation architectures, and future non-volatile storage technologies. (systems that schedule across more heterogeneous hardware)

R8: Composable AI systems

Model composition

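Modularity and the importance of reuse: analogous to today's microservice architectures
AI systems are expected to offer layered API services in the future as well
Ways to compose: for example, our own design runs detection once and recognition/classification many times
A sequence of models ordered from low to high accuracy, queried in series (balancing latency against accuracy)
Small models on the device, large models in the cloud
Open technical points: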
(1) designing a declarative language to capture the topology of these components and specifying performance targets of the applications,
(2) providing accurate performance models for each component, including resource demands, latency and throughput, and
(3) scheduling and optimization algorithms to compute the execution plan across components, and map components to the available resources to satisfy latency and throughput requirements while minimizing costs.

This is similar to the job of a SQL query parser: make full use of resources, batch with configurable latency controls.
Reference architectures (server side):
TensorFlow Serving
Clipper
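
The serial cascade mentioned above (models ordered from low to high accuracy, queried in turn) can be sketched as follows. This is a minimal Python illustration, not how TensorFlow Serving or Clipper implement it; the two model functions and the confidence threshold are hypothetical stand-ins.

```python
def cascade(models, x, min_confidence=0.9):
    """Query models from cheapest to most accurate; stop at the first confident answer."""
    label, confidence, used = None, 0.0, 0
    for predict in models:
        label, confidence = predict(x)
        used += 1
        if confidence >= min_confidence:
            break
    return label, confidence, used


# Hypothetical stand-ins for a small on-device model and a large cloud model.
def small_model(x):
    return ("cat", 0.95) if x == "clear photo" else ("cat", 0.55)


def large_model(x):
    return ("dog", 0.97)


easy = cascade([small_model, large_model], "clear photo")   # answered by the small model
hard = cascade([small_model, large_model], "blurry photo")  # escalated to the large model
```

Easy inputs pay only the small model's latency, and only uncertain ones pay for the large model, which is the latency versus accuracy trade-off the notes describe.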

Action composition
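Compose finer-grained actions into higher-level options
Higher-level options mean fewer choices and faster training

Example:
In autonomous driving, an abstracted option: change lanes = (accelerate or decelerate, steer left or right, turn on the lane-change signal)
Open technical points: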
(1) Design AI systems and APIs that allow the composition of models and actions in a modular and flexible manner, and develop rich libraries of models and options using these APIs to dramatically simplify the development of AI applications.

R9: Cloud-edge systems

Advantages of the edge
edge devices to improve security, privacy, latency and safety

Technical difficulties
The difficulty of supporting many kinds of devices and software systems

compilers and just-in-time (JIT) technologies to efficiently compile on-the-fly complex algorithms and run them on edge devices. This approach can leverage recent code generation tools, such as TensorFlow’s XLA [107], Halide [50], and Weld [83].

NNVM + TVM

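Pairing a small edge model with a large cloud model is already used in video recognition systems; the load needs to switch flexibly between edge and cloud
Edge model: small, lower accuracy, updated infrequently
Cloud model: large, higher accuracy, updated frequently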

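Even with 5G and a powerful cloud, the capacity and cost of networks and storage mean we cannot store all the data that devices generate. So data on the edge has to be reduced to samples and sketches (upload summary statistics and store a sample).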
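
A minimal Python sketch of what "samples and sketches" could look like on an edge device: reservoir sampling keeps a fixed-size uniform sample of the stream, and a small running summary holds the statistics to upload. The class name and the synthetic sensor readings are assumptions for illustration; the paper does not prescribe these particular techniques.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


class RunningSummary:
    """Constant-size statistics to upload in place of the raw readings."""

    def __init__(self):
        self.count, self.total = 0, 0.0
        self.low, self.high = float("inf"), float("-inf")

    def add(self, value):
        self.count += 1
        self.total += value
        self.low, self.high = min(self.low, value), max(self.high, value)

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0


readings = [20.0 + (i % 7) * 0.5 for i in range(10_000)]  # hypothetical sensor stream
summary = RunningSummary()
for r in readings:
    summary.add(r)
kept = reservoir_sample(readings, k=100, rng=random.Random(7))
```

Both structures use memory that does not grow with the stream, so the device uploads 100 readings plus four numbers instead of 10,000 readings.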

Open technical points:
Design cloud-edge AI systems that
(1) leverage the edge to reduce latency, improve safety and security, and implement intelligent data retention techniques,
(2) leverage the cloud to share data and models across edge devices, train sophisticated computation-intensive models, and take high quality decisions.
