Horovod 0.16.0 released: PySpark + Horovod

Horovod 0.16.0 was released a couple of days ago! See the official release blog post: Horovod Adds Support for PySpark and Apache MXNet and Additional Features for Faster Training

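Anyone working in AI is probably already familiar with Horovod, which originated at Uber. Horovod is a distributed deep learning framework that aims to greatly simplify and improve distributed model training. It supports today's most popular frameworks, such as TensorFlow, PyTorch, Keras, and Apache MXNet.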

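With this release, Horovod supports PySpark! The integration with PySpark is a natural fit. Previously, you had to maintain multiple clusters to support data processing and distributed training. Now a single cluster can handle data preparation, model training, and model evaluation. Life just got a lot better!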

Capable of handling a massive volume of data, Apache Spark is used across many machine learning environments. Its ease of use, in-memory processing capabilities, near real-time analytics, and rich set of integration options, like Spark MLlib and Spark SQL, have made Spark a popular choice.

Given its scalability and ease of use, Horovod has received interest from broader, Python-based machine learning communities, including Apache Spark. With the release of PySpark support and integration, Horovod becomes useful to a wider set of users.

A typical workflow for PySpark before Horovod was to do data preparation in PySpark, save the results to intermediate storage, run a separate deep learning training job on a different cluster solution, export the trained model, and run evaluation in PySpark. Horovod’s integration with PySpark allows performing all these steps in the same environment.
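
As a rough sketch of what running the training step inside the same Spark environment can look like, the snippet below uses `horovod.spark.run`, which launches a Python function as Horovod workers on Spark executors and returns one result per rank. The names `train_fn` and `run_training` and the `magic_number` argument are illustrative and not from the release post; a real job would build and train a model inside `train_fn`. It assumes Horovod (with PyTorch support) and PySpark are installed and that it is called from an active PySpark session.

```python
def train_fn(magic_number):
    """Runs once per Horovod worker on a Spark executor (illustrative)."""
    # Imported inside the function so the import happens on the executor.
    import horovod.torch as hvd

    hvd.init()
    # Placeholder for real work: build a model, wrap the optimizer with
    # hvd.DistributedOptimizer, and train on this worker's share of the data.
    return {"rank": hvd.rank(), "size": hvd.size(), "magic_number": magic_number}


def run_training(num_proc=2, magic_number=42):
    """Launches train_fn on num_proc Spark tasks; needs an active SparkContext."""
    # Imported lazily so this module can be loaded without Horovod installed.
    import horovod.spark

    # Returns a list with the return value of train_fn for each rank.
    return horovod.spark.run(train_fn, args=(magic_number,), num_proc=num_proc)
```

Calling `run_training(num_proc=4)` from a PySpark session would then start four Horovod workers on the cluster that already holds the prepared data, so no second training cluster is involved.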

In order to smooth out data transfer between PySpark and Horovod in Spark clusters, Horovod relies on Petastorm, an open source data access library for deep learning developed by Uber Advanced Technologies Group (ATG). Petastorm, open sourced in September 2018, enables single machine or distributed training and evaluation of deep learning models directly from multi-terabyte datasets.

A typical Petastorm use case entails preprocessing the data in PySpark, writing it out to storage in Apache Parquet, a highly efficient columnar storage format, and reading the data in TensorFlow or PyTorch using Petastorm.
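
The use case above can be sketched roughly as follows. This is a minimal example under assumptions, not the setup from the release post: the column names, the helpers `to_file_url`, `write_features`, and `read_batches`, and the local output path are all hypothetical. It writes a plain Parquet store with Spark and reads it back with Petastorm's `make_batch_reader`, which is intended for Parquet stores that were not created with Petastorm's own schema tooling.

```python
from pathlib import Path


def to_file_url(path):
    """Petastorm expects a dataset URL with a scheme, e.g. file:// or hdfs://."""
    return Path(path).absolute().as_uri()


def write_features(spark, dataset_url, num_rows=1000):
    """Preprocess in PySpark and write the result as Parquet (toy data)."""
    from pyspark.sql.functions import col, rand

    df = (
        spark.range(num_rows)                       # column "id"
        .withColumn("feature", rand(seed=0))        # hypothetical feature
        .withColumn("label", (col("id") % 2).cast("int"))
    )
    df.write.mode("overwrite").parquet(dataset_url)


def read_batches(dataset_url):
    """Yield batches from the Parquet store using Petastorm."""
    from petastorm import make_batch_reader

    # Each batch is a namedtuple whose fields are arrays, one per column.
    with make_batch_reader(dataset_url) as reader:
        for batch in reader:
            yield batch
```

In a training function, the batches from `read_batches(to_file_url("/tmp/features"))` could be converted to tensors directly, or the reader could be wrapped with Petastorm's TensorFlow or PyTorch adapters.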

Both Apache Spark and Petastorm are also used in some applications internally at Uber, so extending Horovod’s support to include PySpark and Petastorm has been a natural step in the process of making Horovod a more versatile tool.
