Multiple features
Notation:
- n = number of features
- x^(i) = input (features) of the i-th training example.
- x_j^(i) = value of feature j in the i-th training example.
For convenience of notation, define x_0 = 1.
The hypothesis is then a vector inner product: h_θ(x) = θᵀx = θ_0·x_0 + θ_1·x_1 + … + θ_n·x_n
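The inner product above can be sketched in a few lines of plain Python; the θ and x values below are purely illustrative:

```python
# Hypothesis h_theta(x) = theta^T x, the inner product of the parameter
# vector and the feature vector (with x_0 = 1 prepended for the intercept).
def hypothesis(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))

theta = [1.0, 2.0, 3.0]       # theta_0, theta_1, theta_2 (illustrative values)
x = [1.0, 5.0, 10.0]          # x_0 = 1 by convention, then two features
print(hypothesis(theta, x))   # 1*1 + 2*5 + 3*10 = 41
```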
Multivariate linear regression
Gradient descent for multiple variables
Hypothesis: h_θ(x) = θᵀx = θ_0·x_0 + θ_1·x_1 + … + θ_n·x_n
Parameters: θ (an (n+1)-dimensional vector)
Cost function: J(θ) = (1/(2m)) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i))²
Gradient descent:
Repeat {
    θ_j := θ_j − α · (1/m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) · x_j^(i)
    (simultaneously update θ_j for j = 0, …, n)
}
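The update rule above can be sketched in pure Python; the data, α, and iteration count are illustrative choices, not part of the original notes:

```python
# Batch gradient descent for multivariate linear regression.
# X rows already include x_0 = 1 as the first entry.
def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(num_iters):
        # errors_i = h_theta(x^(i)) - y^(i) for every training example
        errors = [sum(t * xj for t, xj in zip(theta, X[i])) - y[i]
                  for i in range(m)]
        # simultaneous update: theta_j := theta_j - alpha*(1/m)*sum_i errors_i * x_j^(i)
        theta = [theta[j] - alpha / m * sum(errors[i] * X[i][j] for i in range(m))
                 for j in range(n)]
    return theta

# Toy data generated from y = 1 + 2*x1 (hypothetical), so theta should
# converge to approximately [1, 2]:
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = gradient_descent(X, y)
```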
Feature Scaling
Idea: Make sure features are on a similar scale.
Feature Scaling: Get every feature into approximately a −1 ≤ x_i ≤ 1 range. In practice, ranges up to roughly −3 to +3 (or as narrow as about −1/3 to +1/3) are generally considered acceptable.
Mean normalization: Replace x_i with x_i − μ_i to make features have approximately zero mean. (Do not apply to x_0 = 1.)
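Mean normalization per feature column can be sketched as below; the house-size numbers are hypothetical, and dividing by the range (max − min) is one common choice — the standard deviation works too:

```python
# Mean normalization: replace x_i with (x_i - mu_i) / range_i, per feature
# column. Do NOT apply this to the x_0 = 1 intercept column.
def mean_normalize(column):
    mu = sum(column) / len(column)
    spread = max(column) - min(column)   # could also use the standard deviation
    return [(v - mu) / spread for v in column]

sizes = [2104.0, 1416.0, 1534.0, 852.0]  # hypothetical house sizes (sq ft)
scaled = mean_normalize(sizes)           # zero mean, values within [-1, 1]
```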
Learning rate
- Make sure gradient descent is working correctly: J(θ) should decrease on every iteration. If α is too large, J(θ) may not decrease on every iteration and may fail to converge; if α is too small, gradient descent can be slow to converge.
Features and Polynomial Regression
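This section is about constructing new features — e.g. mapping a single feature x to its powers so that linear regression fits a polynomial. A minimal sketch (the degree and sample value are illustrative):

```python
# Polynomial features: map x -> [1, x, x^2, x^3] so that *linear* regression
# in the new features fits a cubic in the original one. Feature scaling
# matters here, since x, x^2, and x^3 live on very different scales.
def poly_features(x, degree=3):
    return [x ** d for d in range(degree + 1)]   # x^0 = 1 is the intercept term

row = poly_features(10.0)   # one transformed training row
```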
Normal Equation
Normal equation: Method to solve for θ analytically: θ = (XᵀX)⁻¹Xᵀy
Octave: pinv(X'*X)*X'*y
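For the one-feature case, the Octave line above can be sketched in pure Python with the 2×2 inverse written out by hand (real code would use a pseudo-inverse like pinv; the data below is hypothetical):

```python
# Normal equation theta = (X^T X)^(-1) X^T y for one feature plus intercept.
# For X = [[1, x_i]], X^T X = [[m, Sx], [Sx, Sxx]], inverted explicitly.
def normal_equation_1feature(x, y):
    m = len(x)
    Sx, Sxx = sum(x), sum(v * v for v in x)
    Sy, Sxy = sum(y), sum(v * w for v, w in zip(x, y))
    det = m * Sxx - Sx * Sx                 # fails if X^T X is non-invertible
    # (X^T X)^(-1) X^T y, multiplied out by hand:
    theta0 = (Sxx * Sy - Sx * Sxy) / det
    theta1 = (m * Sxy - Sx * Sy) / det
    return [theta0, theta1]

# Data lying exactly on y = 1 + 2x, so the solution is theta = [1, 2]:
theta = normal_equation_1feature([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```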
m training examples, n features
Gradient Descent
- Need to choose α
- Needs many iterations
- Works well even when n is large
Normal Equation
- No need to choose α
- Don't need to iterate
- Need to compute (XᵀX)⁻¹
- Slow if n is very large (fine up to roughly n = 10,000)
Normal Equation Noninvertibility (Optional)
What if XᵀX is non-invertible?
- Redundant features (linearly dependent)
- Too many features (e.g. m ≤ n)
This situation rarely occurs in practice.