1. Abstract
- study stochastic optimization problems when the data is sparse. 研究数据稀疏的随机优化问题
- we derive matching upper and lower bounds on the minimax rate for optimization
and learning with sparse data. 给出了稀疏优化问题minimax rate的上界和下界 - show how leveraging sparsity leads to (still minimax optimal) parallel and asynchronous algorithms. 提出了一些并行异步的算法
- async AdaGrad, async Dual Averaging
2. Intro
- In this paper, we take a two-pronged approach. 分两方面分析问题
- First, we investigate the fundamental limits of optimization and learning algorithms in sparse data regimes. In doing so, we derive lower bounds on the optimization error of any algorithm for problems of the form (1) with sparsity condition. 首先,分析了稀疏优化和学习问题的根本限制。得出了任何稀疏条件下优化算法的优化误差的下界
- As the second facet of our two-pronged approach, we study how sparsity may be leveraged in parallel computing frameworks to give substantially faster algorithms that still achieve optimal sample complexity. 另一方面,研究怎么在并行计算中利用稀疏性质来给出更快的算法
- We develop two new algorithms, asynchronous dual averaging (ASYNCDA) and asynchronous ADAGRAD (ASYNCADAGRAD), which allow asynchronous parallel solution of the problem (1) for general convex f and X. 提出了两个新算法,异步的dual averaging和异步的AdaGrad
3. Minimax rates for sparse optimization
- 给出了任何这类问题算法的minimax convergence rate的bound
4. Parallel and asynchronous optimization with sparsity
- we first revisit Niu et al.’s HOGWILD! [12]. HOGWILD! is an asynchronous (parallelized) stochastic
gradient algorithm for optimization over product-space domains, meaning that X in problem (1)
decomposes as X = X1 × · · · × Xd, where Xj ⊂ R. 首先回顾HOGWILD,异步并行随机梯度算法,数据的domain是product space - The key of HOGWILD! is that in step 2, the parameter x is allowed to be inconsistent—it may have received partial gradient updates from many processors—and for appropriate problems, this inconsistency is negligible. Hogwild的参数是inconsistent,是不一致的.
4.1. Asynchronous dual averaging
- Hogwild的缺点,好像需要域是product space点积空间(笛卡尔积)
- ASYNCDA maintains and upates a centralized dual vector z instead of a parameter x, and a pool of
processors perform asynchronous updates to z, where each processor independently iterates:
- 提出了一种异步的dual averaging算法,AsyncDA,保存和更新一个中心的dual向量z,线程池异步地更新z
- The only communication point between any of the processors is the addition operation in step 3.
Since addition is commutative and associative, forcing all asynchrony to this point of the algorithm
is a natural strategy for avoiding synchronization problems. 线程之间的唯一通信是addition操作。因为addition是可交换和可结合的,让这里成为为异步的点是很自然的策略。
4.2. Asynchronous AdaGrad
- AdaGrad扩展到异步方式
5. Experiments
- URL dataset. The dataset in this case consists of an anonymized collection of URLs labeled as malicious (e.g., spam, phishing, etc.) or benign over a span of 120 days.
- We also experiment on a proprietary datasets consisting of search ad impressions. Each example
corresponds to showing a search-engine user a particular text ad in response to a query string. From this, we construct a very sparse feature vector based on the text of the ad displayed and the query string (no user-specific data is used). The target label is 1 if the user clicked the ad and -1 otherwise. 转悠的广告数据集. - 算法是LR