Fundamentals of Unconstrained Optimization Theory

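In unconstrained optimization, we look for values of real variables that minimize an objective function, with no restrictions on those values. Many problems, especially in machine learning, are analyzed with unconstrained optimization theory, for example linear regression and logistic regression. Typical algorithms for unconstrained problems include gradient descent (GD), stochastic gradient descent (SGD), trust-region (TR) methods, conjugate gradient (CG), Newton's method, and (L-)BFGS.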

The unconstrained optimization problem is written mathematically as:

$\min _{x} f(x), \quad x \in \mathbb{R}^{n}$

where $x$ is a real vector with $n \geq 1$ components and $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$ is a smooth function.

Global solution & Local solution

Global minimizer:

A point $x^*$ is a global minimizer if $f(x^*) \leq f(x)$ for all $x$

Local minimizer (weak local minimizer):

A point $x^*$ is a local minimizer (also called a weak local minimizer) if there is a neighborhood $\mathcal{N}$ of $x^*$ such that $f(x^*) \leq f(x)$ for all $x \in \mathcal{N}$.

Strict local minimizer (strong local minimizer):

A point $x^*$ is a strict local minimizer (also called a strong local minimizer) if there is a neighborhood $\mathcal{N}$ of $x^*$ such that $f(x^*) < f(x)$ for all $x \in \mathcal{N}$ with $x \neq x^*$.

Recognizing a local minimum

When $f$ is smooth, and in particular when it is twice continuously differentiable, we can tell whether $x^*$ is a local minimizer (or even a strict local minimizer) by examining the gradient $\nabla f(x^*)$ and the Hessian matrix $\nabla^2 f(x^*)$.

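1. Taylor's Theorem: Taylor's formula describes the values of a function near a point using information about the function at that point. If the function satisfies certain conditions, its derivatives at that point serve as the coefficients of a polynomial that approximates the function. To first order, $f(x)=f\left(x_{0}\right)+f^{\prime}\left(x_{0}\right)\left(x-x_{0}\right)+o\left(x-x_{0}\right)$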

If $f(x)$ has derivatives up to order $(n+1)$ on an open interval $(a,b)$ containing $x_0$, then for any $x \in (a,b)$:

$f(x)=\frac{f\left(x_{0}\right)}{0 !}+\frac{f^{\prime}\left(x_{0}\right)}{1 !}\left(x-x_{0}\right)+\frac{f^{\prime \prime}\left(x_{0}\right)}{2 !}\left(x-x_{0}\right)^{2}+\ldots+\frac{f^{(n)}\left(x_{0}\right)}{n !}\left(x-x_{0}\right)^{n}+R_{n}(x)$
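
As a quick numerical illustration of the expansion above, the sketch below builds the Taylor polynomial of $e^x$ around $x_0 = 0$ and prints the size of the remainder $R_n(x)$ for a few values of $n$. The choice of $e^x$ (every derivative at $x_0$ equals $e^{x_0}$) and the helper name `taylor_exp` are assumptions made only for this example.

```python
import math

def taylor_exp(x, n, x0=0.0):
    """Degree-n Taylor polynomial of exp around x0, evaluated at x."""
    # Every derivative of exp at x0 equals exp(x0).
    return sum(math.exp(x0) * (x - x0) ** k / math.factorial(k) for k in range(n + 1))

x = 0.5
degrees = (1, 2, 4, 8)
errors = [abs(math.exp(x) - taylor_exp(x, n)) for n in degrees]
for n, err in zip(degrees, errors):
    print(f"n = {n}: |R_n(x)| = {err:.2e}")
```

The remainder shrinks as more terms are included, which is what makes low-order Taylor models useful as local approximations in the methods discussed later.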

For optimization problems, the multivariate form is used:

Suppose $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$ is continuously differentiable and $p \in \mathbb{R}^{n}$. Then, for some $t \in (0,1)$,

$f(x+p)=f(x)+\nabla f(x+t p)^{T} p$

If $f$ is twice continuously differentiable, the following two identities hold, the second one for some $t \in (0,1)$:

$\nabla f(x+p)=\nabla f(x)+\int_{0}^{1} \nabla^{2} f(x+t p) p d t$

$f(x+p)=f(x)+\nabla f(x)^{T} p+\frac{1}{2} p^{T} \nabla^{2} f(x+tp) p$

2. First-order Necessary Conditions for a local minimum

If $x^*$ is a local minimizer and $f$ is continuously differentiable in an open neighborhood of $x^*$, then $\nabla f\left(x^{*}\right)=0$.

In other words, a local minimizer of a continuously differentiable function must be a stationary point, that is, a point where the gradient vanishes.

3. Second-order Necessary Conditions for a local minimum

If $x^*$ is a local minimizer of $f$ and $\nabla^2 f$ exists and is continuous in an open neighborhood of $x^*$, then $\nabla f\left(x^{*}\right)=0$ and $\nabla^{2} f\left(x^{*}\right)$ is positive semidefinite.

In other words, at a local minimizer of a twice continuously differentiable function the gradient is zero and the Hessian is positive semidefinite. Positive definiteness is not guaranteed by these necessary conditions.

4. Second-order Sufficient Conditions for a local minimum

Suppose that $\nabla^2 f$ is continuous in an open neighborhood of $x^*$ and that $\nabla f\left(x^{*}\right)=0$ and $\nabla^{2} f\left(x^{*}\right)$ is positive definite. Then $x^*$ is a strict local minimizer of $f$ .

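In other words, if $\nabla f\left(x^{*}\right)=0$ and $\nabla^{2} f\left(x^{*}\right)$ is positive definite, then $x^*$ is a strict local minimizer. These conditions are sufficient but not necessary: $f(x)=x^4$ has a strict local minimizer at $x^*=0$ even though its second derivative vanishes there.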
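
The conditions in items 2 to 4 can be turned into a simple numerical test. The following is a minimal sketch, assuming the gradient and Hessian at the candidate point are already available as NumPy arrays; the helper name `classify_stationary_point` and the tolerance `tol` are hypothetical choices for this example.

```python
import numpy as np

def classify_stationary_point(grad, hess, tol=1e-8):
    """Classify a point from its gradient and Hessian (hypothetical helper)."""
    if np.linalg.norm(grad) > tol:
        return "not stationary"          # first-order necessary condition fails
    eigvals = np.linalg.eigvalsh(hess)   # symmetric Hessian, so eigenvalues are real
    if eigvals.min() > tol:
        return "strict local minimizer"  # second-order sufficient conditions hold
    if eigvals.min() < -tol:
        return "not a local minimizer"   # a negative eigenvalue rules out a minimum
    return "inconclusive"                # semidefinite case: higher-order terms decide

# f(x) = x1^2 + 2*x2^2 at the origin: gradient 0, Hessian diag(2, 4)
result_bowl = classify_stationary_point(np.zeros(2), np.diag([2.0, 4.0]))
# f(x) = x1^2 - x2^2 at the origin (a saddle): Hessian diag(2, -2)
result_saddle = classify_stationary_point(np.zeros(2), np.diag([2.0, -2.0]))
# f(x) = x1^4 + x2^4 at the origin: the Hessian is zero, so the test cannot decide
result_flat = classify_stationary_point(np.zeros(2), np.zeros((2, 2)))
print(result_bowl, result_saddle, result_flat, sep="\n")
```

The third case is a reminder that a Hessian-based test can be inconclusive even when the point is in fact a minimizer.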

5. Minimizers of convex functions

When $f$ is convex, any local minimizer $x^*$ is a global minimizer of $f$ . If in addition $f$ is differentiable, then any stationary point $x^*$ is a global minimizer of $f$ .

In other words, if $f$ is convex, every local minimizer of $f$ is also a global minimizer.

Nonsmooth Problems

If the function is continuous but not differentiable at some points, we can use a subgradient or a generalized gradient in place of the gradient.

Overview of Algorithms

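1. Starting from an initial point $x_{0}$, the algorithm generates a sequence of iterates $x_{0}$, $x_{1}$, $x_{2}$, ... There are two basic strategies for moving from the current iterate $x_{k}$ to the next iterate $x_{k+1}$.

2. Line search and trust region

Line search: at the point $x_{k}$, the algorithm chooses a direction $p_{k}$ and searches along it for a step length $\alpha$ that solves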

$\min _{\alpha>0} f\left(x_{k}+\alpha p_{k}\right)$

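In practice, a line search generates a small number of trial step lengths until it finds an acceptable one. At the next point, a new search direction and step length are computed, and the process repeats.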
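
One common way to generate trial step lengths is backtracking with a sufficient-decrease (Armijo) condition. The text above does not prescribe a particular rule, so the sketch below is only one possible choice; the function name, the parameters `alpha0`, `rho`, `c`, and the test function are assumptions for illustration.

```python
import numpy as np

def backtracking_line_search(f, grad_fk, xk, pk, alpha0=1.0, rho=0.5, c=1e-4, max_trials=50):
    """Shrink alpha until f decreases sufficiently along pk (Armijo condition)."""
    fk = f(xk)
    slope = grad_fk @ pk            # directional derivative; negative for a descent direction
    alpha = alpha0
    for _ in range(max_trials):
        if f(xk + alpha * pk) <= fk + c * alpha * slope:
            return alpha            # trial step accepted
        alpha *= rho                # otherwise try a shorter step
    return alpha                    # fall back to the last (smallest) trial step

# Assumed test problem: f(x) = x1^2 + 10*x2^2
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])

xk = np.array([1.0, 1.0])
pk = -grad(xk)                      # steepest descent direction
alpha = backtracking_line_search(f, grad(xk), xk, pk)
x_next = xk + alpha * pk
print(alpha, f(xk), f(x_next))
```

The accepted step is not the exact minimizer along $p_k$; it is merely short enough to guarantee a decrease in $f$, which is usually all that is needed before a new direction is computed.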

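Trust region: the algorithm builds a model function $m_{k}$ that approximates $f$ near $x_{k}$, and finds a candidate step $p$ by solving

$\min _{p} m_{k}\left(x_{k}+p\right), \quad \text{where } x_{k}+p \text{ lies inside the trust region}$

Here $p$ is the candidate step. Because $m_{k}$ is a good approximation of $f$ only near $x_{k}$, $x_{k}+p$ is required to stay inside a trust region. If the candidate solution does not give a sufficient decrease in $f$, the trust region is shrunk and the problem is solved again. The trust region is usually the ball defined by $||p||_{2} \leq \Delta$, where the scalar $\Delta >0$ is called the trust-region radius. Elliptical and box-shaped trust regions can also be used in some cases.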

The model $m_{k}$ is usually defined as a quadratic function based on the Taylor expansion:

$m_{k}\left(x_{k}+p\right)=f_{k}+p^{T} \nabla f_{k}+\frac{1}{2} p^{T} B_{k} p$

where $f_{k}$ is a scalar, $\nabla f_{k}$ is a vector, and $B_{k}$ is a matrix. $B_{k}$ is usually the Hessian matrix $\nabla^2 f_{k}$ or some approximation to it.

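3. The two strategies differ in the order in which they choose the direction and the length of the step. Line search (LS) starts from a fixed direction $p_{k}$ and then determines a suitable step length $\alpha_{k}$. Trust region (TR) first chooses a maximum step length, the trust-region radius $\Delta_{k}$, and then looks for the best direction and step within this constraint. If the result is unsatisfactory, $\Delta_{k}$ is reduced and the step is recomputed.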
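
The trust-region loop described above can be sketched as follows. This is a minimal illustration, not a production method: it solves the subproblem only approximately with the Cauchy point (the minimizer of the quadratic model along $-\nabla f_k$ inside the ball), and the thresholds 0.25 and 0.75, the acceptance parameter `eta`, and the test function are conventional values assumed for this example.

```python
import numpy as np

def cauchy_point(g, B, delta):
    """Minimizer of the quadratic model along -g inside the ball ||p|| <= delta."""
    gBg = g @ B @ g
    g_norm = np.linalg.norm(g)
    tau = 1.0 if gBg <= 0 else min(g_norm ** 3 / (delta * gBg), 1.0)
    return -tau * (delta / g_norm) * g

def trust_region(f, grad, hess, x0, delta0=1.0, delta_max=10.0, eta=0.1, tol=1e-6, max_iter=500):
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(max_iter):
        g, B = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        p = cauchy_point(g, B, delta)
        predicted = -(g @ p + 0.5 * p @ B @ p)   # m_k(x_k) - m_k(x_k + p), positive
        actual = f(x) - f(x + p)                 # true reduction in f
        rho = actual / predicted
        if rho < 0.25:
            delta *= 0.25                        # model was poor: shrink the region
        elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
            delta = min(2.0 * delta, delta_max)  # model was good and the step hit the boundary
        if rho > eta:
            x = x + p                            # accept the candidate step
    return x

# Assumed test problem: f(x) = x1^2 + 10*x2^2 with its exact Hessian as B_k
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
hess = lambda x: np.diag([2.0, 20.0])

x_star = trust_region(f, grad, hess, [1.0, 1.0])
print(x_star)
```

Because the test function is quadratic and $B_k$ is its exact Hessian, the model is exact here and every step is accepted; on a general function the ratio `rho` is what triggers shrinking of $\Delta_k$.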

Search Directions for Line Search Methods

1. Steepest descent direction: $-\nabla f_{k}$ is one of the most widely used search directions. It is justified by Taylor's theorem:

$f\left(x_{k}+\alpha p\right)=f\left(x_{k}\right)+\alpha p^{T} \nabla f_{k}+\frac{1}{2} \alpha^{2} p^{T} \nabla^{2} f\left(x_{k}+t p\right) p, \quad \text { for some } t \in(0, \alpha)$

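Along the direction $p$, the rate of change of $f$ is the coefficient of $\alpha$, namely $p^{T} \nabla f_{k}$, so we look for the unit direction of most rapid decrease: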

$\min_{p} \; p^{T} \nabla f_{k} \quad \text{s.t. } \|p\|=1$

$\implies p^{T} \nabla f_{k}=\|p\|\left\|\nabla f_{k}\right\| \cos \theta=\left\|\nabla f_{k}\right\| \cos \theta \quad (\theta \text{ is the angle between } p \text{ and } \nabla f_{k})$

$\implies \cos \theta=-1, \quad p=-\nabla f_{k} /\left\|\nabla f_{k}\right\|$

The steepest descent method can be very slow on difficult problems.
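
The slowdown can be seen on a simple quadratic $f(x)=\frac{1}{2} x^{T} A x$, for which the exact line-search step along $-\nabla f_k$ has a closed form. The sketch below is an assumed toy setup: the matrices, the tolerance, and the starting points $(\kappa, 1)$, which are chosen deliberately to produce the zigzag behaviour for $A=\mathrm{diag}(1,\kappa)$, are not taken from the text.

```python
import numpy as np

def steepest_descent_quadratic(A, x0, tol=1e-8, max_iter=10_000):
    """Minimize 0.5 * x^T A x with steepest descent and exact line search."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = A @ x                          # gradient of the quadratic
        if np.linalg.norm(g) < tol:
            return x, k
        alpha = (g @ g) / (g @ A @ g)      # exact minimizer of f(x - alpha * g)
        x = x - alpha * g
    return x, max_iter

# Well-conditioned problem: eigenvalues 1 and 2
x_easy, iters_easy = steepest_descent_quadratic(np.diag([1.0, 2.0]), [2.0, 1.0])
# Ill-conditioned problem: eigenvalues 1 and 100
x_hard, iters_hard = steepest_descent_quadratic(np.diag([1.0, 100.0]), [100.0, 1.0])
print(iters_easy, iters_hard)
```

Both runs reach the minimizer at the origin, but the ill-conditioned problem needs far more iterations because successive steps zigzag across the narrow valley.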

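2. General descent directions

The formula above shows that any direction satisfying $p^{T} \nabla f_{k} <0$ can serve as a search direction, since $f$ decreases along it for a sufficiently small step, although the initial rate of decrease is lower than along the steepest descent direction.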

3. Newton direction, derived from the second-order Taylor series approximation.

$f\left(x_{k}+p\right) \approx f_{k}+p^{T} \nabla f_{k}+\frac{1}{2} p^{T} \nabla^{2} f_{k} p \stackrel{\text { def }}{=} m_{k}(p)$

$\nabla^{2} f_{k}$ is assumed to be positive definite. The Newton direction is obtained by minimizing $m_{k}(p)$, that is, by setting the derivative of $m_{k}(p)$ to zero:

$\min m_{k}(p) \implies \nabla m_{k}(p)=0 \implies \nabla f_{k}+\nabla^{2} f_{k} \, p=0 \implies p_{k}^{N}=-\left(\nabla^{2} f_{k}\right)^{-1} \nabla f_{k}$

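When $\nabla^{2} f_{k}$ is not positive definite, $(\nabla^{2} f_{k})^{-1}$ may not exist, and even when it does, the Newton direction may fail to satisfy the descent condition $\nabla f_{k}^{T} p_{k}^{N}<0$.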
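
A small numerical sketch of the Newton direction follows. It solves the linear system $\nabla^{2} f_{k} \, p=-\nabla f_{k}$ instead of forming the inverse, and then repeats the computation with an indefinite matrix to show the descent condition failing. The example function and the helper name `newton_direction` are assumptions for illustration.

```python
import numpy as np

def newton_direction(grad_fk, hess_fk):
    """Solve hess_fk @ p = -grad_fk rather than forming the inverse explicitly."""
    return np.linalg.solve(hess_fk, -grad_fk)

# Assumed example: f(x) = x1^2 + 10*x2^2, with minimizer at the origin
xk = np.array([1.0, 1.0])
g = np.array([2.0 * xk[0], 20.0 * xk[1]])
H = np.diag([2.0, 20.0])

p_newton = newton_direction(g, H)
print(xk + p_newton)          # a full Newton step on a quadratic lands on its minimizer
print(g @ p_newton < 0)       # descent condition holds because H is positive definite

# With an indefinite matrix in place of the Hessian, the direction can point uphill
H_indef = np.diag([2.0, -20.0])
p_bad = newton_direction(g, H_indef)
print(g @ p_bad < 0)
```

For a non-quadratic $f$ the full step $x_k + p_k^N$ is generally not the minimizer, so the direction is normally combined with a line search or a trust region.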

4. Quasi-Newton direction: because computing the Hessian is expensive, an approximation $B_{k}$ is used in place of the Hessian matrix $\nabla^{2} f_{k}$. In outline, the derivation starts from Taylor's theorem:

$\nabla f(x+p)=\nabla f(x)+\int_{0}^{1} \nabla^{2} f(x+t p) p d t$

$\Rightarrow \nabla f(x+p)=\nabla f(x)+\nabla^{2} f(x) p+\int_{0}^{1}\left[\nabla^{2} f(x+t p)-\nabla^{2} f(x)\right] p d t$

$\Rightarrow \nabla f_{k+1}=\nabla f_{k}+\nabla^{2} f_{k}\left(x_{k+1}-x_{k}\right)+o\left(\left\|x_{k+1}-x_{k}\right\|\right)$ $(p=x_{k+1}-x_{k})$

$\Rightarrow \nabla^{2} f_{k}\left(x_{k+1}-x_{k}\right) \approx \nabla f_{k+1}-\nabla f_{k}$

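Now let $s_{k}=x_{k+1}-x_{k}, \quad y_{k}=\nabla f_{k+1}-\nabla f_{k}$, and let $B_{k+1}$ be the new Hessian approximation.

$\Rightarrow B_{k+1} s_{k}=y_{k}$ (known as the secant equation)

Additional conditions are usually imposed on $B_{k+1}$, such as symmetry and a low-rank difference between $B_{k+1}$ and $B_{k}$.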
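
The approximation $\nabla^{2} f_{k} s_{k} \approx y_{k}$ behind the secant equation can be checked numerically. The sketch below uses an assumed non-quadratic example, $f(x)=x_1^4+x_2^2$, and measures the mismatch relative to the step size for shrinking steps; the function and the step sizes are illustrative choices only.

```python
import numpy as np

# Assumed example: f(x) = x1^4 + x2^2, whose Hessian varies with x
grad = lambda x: np.array([4.0 * x[0] ** 3, 2.0 * x[1]])
hess = lambda x: np.diag([12.0 * x[0] ** 2, 2.0])

x_k = np.array([1.0, 1.0])
ratios = []
for h in (1e-1, 1e-2, 1e-3):
    s_k = h * np.array([1.0, 1.0])              # step x_{k+1} - x_k
    y_k = grad(x_k + s_k) - grad(x_k)           # change in the gradient
    error = np.linalg.norm(hess(x_k) @ s_k - y_k)
    ratios.append(error / np.linalg.norm(s_k))  # should tend to 0 with the step
    print(f"||s_k|| = {np.linalg.norm(s_k):.1e}, error / ||s_k|| = {ratios[-1]:.2e}")
```

The ratio shrinks with the step, consistent with the $o\left(\left\|x_{k+1}-x_{k}\right\|\right)$ term in the derivation, so gradient differences carry usable curvature information without any second derivatives being evaluated.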

5. Two quasi-Newton update formulas: symmetric-rank-one (SR1) and BFGS

Symmetric-rank-one(SR1) : $B_{k+1}=B_{k}+ \frac{\left(y_{k}-B_{k} s_{k}\right)\left(y_{k}-B_{k} s_{k}\right)^{T}}{\left(y_{k}-B_{k} s_{k}\right)^{T} s_{k}}$
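
The SR1 formula translates almost directly into code. The following sketch applies one update and checks that the result satisfies the secant equation $B_{k+1} s_{k}=y_{k}$ and stays symmetric. The skip rule for a tiny denominator is a commonly used safeguard added here as an assumption, and the helper name `sr1_update`, its parameter `r`, and the example data are hypothetical.

```python
import numpy as np

def sr1_update(B, s, y, r=1e-8):
    """Return the SR1 update of B; skip it when the denominator is too small."""
    v = y - B @ s
    denom = v @ s
    # Assumed safeguard (not part of the formula itself): avoid dividing by ~0.
    if abs(denom) <= r * np.linalg.norm(s) * np.linalg.norm(v):
        return B
    return B + np.outer(v, v) / denom

# Assumed example: gradients of f(x) = x1^2 + 10*x2^2 at two points
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
x_k, x_next = np.array([1.0, 1.0]), np.array([0.5, 0.8])
s_k, y_k = x_next - x_k, grad(x_next) - grad(x_k)

B_next = sr1_update(np.eye(2), s_k, y_k)
print(np.allclose(B_next @ s_k, y_k))      # secant equation B_{k+1} s_k = y_k
print(np.allclose(B_next, B_next.T))       # symmetry is preserved
```

Note that the correction added to $B_{k}$ is a rank-one matrix, which is the low-rank property mentioned in item 4.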