Horovod 0.16.0 released: PySpark + Horovod

Horovod 0.16.0 was released a couple of days ago! See the official release blog post: Horovod Adds Support for PySpark and Apache MXNet and Additional Features for Faster Training

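If you work in AI, you are probably already familiar with Horovod, which originated at Uber. Horovod is a distributed deep learning framework that aims to greatly simplify and improve distributed model training. It supports today's most popular frameworks, such as TensorFlow, PyTorch, Keras, and Apache MXNet.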

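With this release, Horovod supports PySpark! Integrating with PySpark makes perfect sense. Previously, you had to maintain multiple clusters to support data processing and distributed training; now a single cluster can handle data preparation, model training, and model evaluation. Life just got a lot better!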

Capable of handling a massive volume of data, Apache Spark is used across many machine learning environments. Its ease of use, in-memory processing capabilities, near real-time analytics, and rich set of integration options, like Spark MLlib and Spark SQL, have made Spark a popular choice.

Given its scalability and ease-of-use, Horovod has received interest from broader, Python-based machine learning communities, including Apache Spark. With the release of PySpark support and integration, Horovod becomes useful to a wider set of users.

A typical workflow for PySpark before Horovod was to do data preparation in PySpark, save the results in intermediate storage, run a separate deep learning training job on a different cluster solution, export the trained model, and run evaluation in PySpark. Horovod’s integration with PySpark allows performing all these steps in the same environment.
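To make the single-environment workflow a little more concrete, here is a minimal sketch of launching a Horovod function from inside a PySpark session with horovod.spark.run. It assumes Horovod is installed with Spark and PyTorch support on the cluster; the function names train_fn and run_on_spark, the app name, and the dummy return value are made up for illustration, and a real job would put actual training code where the placeholder is.

```python
def train_fn(scale):
    """Runs once on every Horovod rank, each inside its own Spark task."""
    # Imported lazily so that only the Spark executors need Horovod + PyTorch.
    import horovod.torch as hvd

    hvd.init()
    # Placeholder for real training code: each rank just reports who it is.
    return {"rank": hvd.rank(), "size": hvd.size(), "value": scale * hvd.rank()}


def run_on_spark(num_proc=2):
    """Starts a Spark session and runs train_fn on num_proc Horovod ranks."""
    from pyspark.sql import SparkSession
    import horovod.spark

    # horovod.spark.run expects an active Spark context.
    spark = SparkSession.builder.appName("horovod-spark-sketch").getOrCreate()
    try:
        # Returns a list with one entry per rank: whatever train_fn returned.
        return horovod.spark.run(train_fn, args=(10,), num_proc=num_proc)
    finally:
        spark.stop()
```

Calling run_on_spark(num_proc=2) from a PySpark driver should give back one result per rank, so data preparation, training, and evaluation can all live in the same program.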

In order to smooth out data transfer between PySpark and Horovod in Spark clusters, Horovod relies on Petastorm, an open source data access library for deep learning developed by Uber Advanced Technologies Group (ATG). Petastorm, open sourced in September 2018, enables single machine or distributed training and evaluation of deep learning models directly from multi-terabyte datasets.

A typical Petastorm use case entails preprocessing the data in PySpark, writing it out to storage in Apache Parquet, a highly efficient columnar storage format, and reading the data in TensorFlow or PyTorch using Petastorm.
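The use case above can be sketched roughly as follows. This is only an illustration under a few assumptions: it uses make_batch_reader, which more recent Petastorm releases provide for reading plain Parquet stores, the columns x and y and the function names are invented for the example, and dataset_url is a hypothetical location such as a file:// or hdfs:// URL. Feeding the batches into TensorFlow or PyTorch is left out to keep the sketch short.

```python
def write_parquet(spark, dataset_url):
    """Preprocess a toy dataset in PySpark and write it out as Parquet."""
    # Toy "preprocessing": a feature column x and a derived label column y.
    df = spark.range(0, 1000).withColumnRenamed("id", "x")
    df = df.withColumn("y", df["x"] * 2.0)
    df.write.mode("overwrite").parquet(dataset_url)


def count_rows_with_petastorm(dataset_url):
    """Read the Parquet data back through Petastorm and count the rows."""
    # Imported lazily so that only the training side needs Petastorm.
    from petastorm import make_batch_reader

    total = 0
    # Each batch is a namedtuple with one NumPy array per column.
    with make_batch_reader(dataset_url, num_epochs=1) as reader:
        for batch in reader:
            total += len(batch.x)
    return total
```

In a training job, the loop body would hand batch.x and batch.y to the model instead of just counting rows.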

Both Apache Spark and Petastorm are also used in some applications internally at Uber, so extending Horovod’s support to include PySpark and Petastorm has been a natural step in the process of making Horovod a more versatile tool.
