Course syllabus

Week 1:

Introduction
Model and Cost Function
Gradient Descent
Linear regression with one variable
Linear algebra review

Week 2:

Linear regression with multiple variables
Octave/Matlab Tutorial

Week 3:

Classification and Hypothesis Representation
Multiclass Classification
Logistic regression
Regularization

Week 4:

Neural networks: representation

Week 5:

Neural networks: learning
Cost Function and Backpropagation
Gradient Checking and Random Initialization

Week 6:

Advice for applying machine learning
Evaluating a learning algorithm
Bias vs. Variance
Handling Skewed Data

Week 7:

Support vector machines
Large Margin Classification
Kernels

Week 8:

Unsupervised learning
K-Means
Dimensionality reduction

Week 9:

Density Estimation
Anomaly detection
Recommender systems

Week 10:

Gradient Descent with Large Datasets

Week 11:

Application example: photo OCR

Software

Octave/MatLab

Week 1

Course content

Machine learning definition ( Field of study that gives computers the ability to learn without being explicitly programmed. )
Machine learning algorithms: Supervised learning and unsupervised learning, reinforcement learning, recommender systems.
Supervised learning (e.g., regression, classification, naive bayesian model, decision tree, random forest model, neural networks, support vector machines)
Unsupervised learning (clustering)
Model representation
m = Number of training examples
x = "input" variable/features
y = "output" variable/"target" variable
h = maps from x's to y's
$h_{\theta}(x) = \theta_{0} + \theta_{1}x$
Cost function
$\min_{\theta_{0},\theta_{1}} \quad \frac{1}{2m} \sum_{i=1}^{m}{(h_{\theta}(x^{(i)}) - y^{(i)})^2}$
$Cost function(squared error function): J(\theta_{0},\theta_{1}) = \frac{1}{2m} \sum_{i=1}^{m}{(h_{\theta}(x^{(i)}) - y^{(i)})^2}$
$J(\theta_{1})$ Bowl shape in 2D
$J(\theta_{1}, \theta_{2})$ Bowl shape in 3D

Capture.PNG
Gradient descent algorithm
Repeat until convergence:
$\theta_{j} := \theta_{j} - \alpha\frac{\partial}{\partial {\theta_{j}}}J(\theta_{0},\theta_{1})\quad (for \quad j = 0 \quad and \quad j = 1)$
$\alpha$ is called learning rate. A smaller $\alpha$ would result in a smaller step and a larger $\alpha$ results in a larger step. The direction in which the step is taken is determined by the partial derivative of $J(\theta_0,\theta_1)$ . Depending on where one starts on the graph, one could end up at different points.
If $\alpha$ is too small, gradient descent can be slow, while if $\alpha$ is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
As we approach a local minimum, gradient descent will automatically take smaller steps. So no need to decrease $\alpha$ over time.
Repeat until convergence
$\theta_{0} := \theta_{0} - \alpha \frac{1}{m} \sum_{i=1}^{m}{(h_{\theta}(x^{(i)}) - y^{(i)})} \\ \theta_{1} := \theta_{1} - \alpha \frac{1}{m} \sum_{i=1}^{m}{((h_{\theta}(x^{(i)}) - y^{(i)})x^{(i)})}$
"Batch": Each step of gradient descent uses all the training examples.
Vector is an n $\times$ 1 matrix
Matrices are usually denoted by uppercase names while vectors are lowercase.
"Scalar" means that an object is a single value, not a vector or matrix.

Week 2

Course content

Multiple features
$x^{(i)}_{j}$ - value of feature j in the $i^{th}$ training example
$x^{(i)}$ - the input (features) of the $i^{th}$ training example
$m$ - the number of training examples
$n$ - the number of features
Hypothesis(or called Prediction): $\begin{equation} h_{\theta}(x) = \begin{bmatrix} \theta_{0} & \theta_{1} & \dots & \theta_{n} \end{bmatrix} \begin{bmatrix} x_{0}\\ x_{1}\\ \vdots\\ x_{n}\\ \end{bmatrix} = \theta^{T}x \end{equation}$
Cost function
$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2$
Vectorized implementation:
$J(\theta) = \frac{1}{2m} (X \theta-\vec{y})^T (X \theta - \vec{y})$
Gradient descent
Repeat until convergence
$\theta_{j} := \theta_{j} - \alpha\frac{1}{m}\sum_{i=1}^{m}{(h_{\theta}(x^{(i)}) - y^{(i)})}{x^{(i)}_{j}}$
simultaneously update $\theta_{j}$ for $j = 0, ..., n$

Gradient descent implementation steps

Vectorized implementation:
$\theta = \theta-\alpha \times \frac{1}{m} \times X^T \times (X \times \theta - \vec{y})$
Feature scaling (get every feature into approximately a $-1 \leq x_{i} \leq 1$ range)
*Feature scaling speeds up gradient descent by making it require fewer iterations to get to a good solution. *
Mean normalization (Replace $x_{i}$ with $x_{i}-\mu_{i}$ to make features have approximately zero mean (Do not apply to $x_{0}=1$ ))
$x_{i}=\frac{x_{i}-\mu_{i}}{s_{i}}$
With:
$\mu_{i}$ - averaged value of x in training set
$s_{i}$ - range(max-min) or standard deviation
Normal equation

Normal equation example and vectorized implementation
Comparison of gradient descent and normal equation

Gradient Descent	Normal equation
Need to choose $\alpha$	No need to choose $\alpha$
Needs many iterations	Don't need to iterate
Works well even when n (features) is large	Slow if n (features) is very large

What if $X^TX$ is non-invertible?

Redundant features (linearly dependent)
Too many features (e.g., $m \leq n$ ), then delete some features, or use regularisation.

Unvectorized implementation: $h_{\theta}(x)=\sum_{i=0}^{n}{\theta_j}{x_j}$
Vectorized implementation: $h_{\theta}(x)={\theta}^Tx$

Programming Assignment: Linear regression

Week 3

Course content

Classification (Logistic regression):
$y \in \{0,1\}$
$0 \leq h_{\theta}(x) \leq 1$
Hypothesis representation
logistic function (or sigmoid function):
$h_{\theta}(x)=g(\theta^T x)$
$z=\theta^T x$
$g(z)=\frac{1}{1+e^{-z}}$
Decision boundary
$\theta^T x=0$
Predict $y=1$ if $\theta^T x \geq 0$ or $h_{\theta}(x) \geq 0.5$ ;
Predict $y=0$ if $\theta^T x < 0$ or $h_{\theta}(x) < 0.5$ .
Cost function
Convex problem has one global optimum, while non convex problem has multiple local optimum and it is hard to find which one is global optimum.
$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}Cost{(h_{\theta}(x^{(i)}), y^{(i)})}$
$Cost(h_{\theta}(x),y)=-log(h_{\theta}(x))$ if $y=1$
$Cost(h_{\theta}(x),y)=-log(1-h_{\theta}(x))$ if $y=0$
When $y=1$ , we get the following plot for $J(\theta)$ vs $h_{\theta}(x)$

When $y=0$ , we get the following plot for $J(\theta)$ vs $h_{\theta}(x)$

$Cost(h_{\theta}(x), y)=0$ if $h_{\theta}(x)=y$
$Cost(h_{\theta}(x), y) \to \infty$ if $y=0$ and $h_{\theta}(x) \to 1$
$Cost(h_{\theta}(x), y) \to \infty$ if $y=1$ and $h_{\theta}(x) \to 0$
Note that writing the cost function in this way guarantees that $J(\theta)$ is convex for logistic regression.
Simplified logistic cost function:
$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}Cost(h_{\theta}(x^{(i)}), y{(i)})=-\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}log \ h_{\theta}(x^{(i)})+(1-y^{(i)})log(1-h_{\theta}(x^{(i)}))]$
To fit parameter $\theta$ :
$\min_{\theta}J(\theta)$
To make a prediction given new $x$ :
output $h_{\theta}(x)=\frac{1}{1+e^{-\theta^T x}}$
Gradient descent with vectorized implementation is:
Vectorized implementation:
$\theta = \theta - \alpha \times \frac{1}{m} \times (X^T) \times (g(X \theta)-\vec{y})$
Advanced optimization algorithms:
1. Gradient descent
2. Conjugate gradient
3. BFGS
4. L-BFGS
  With algorithms No.2,3 and 4, $\alpha$ doesn't need to be picked manually, and they are often faster than gradient descent, but they are more complex.

function [J_value, gradient] = costFunction(theta)
  J_value = [code to compute cost function]
  gradient (1) = [code to compute partial derivative of cost function with respect to theta 0]
  .
  .
  .
  gradient (n+1) = [code to compute partial derivative of cost function with respect to theta n]

Multiclass classification
one-vs-all (one-vs_rest):
We are basically choosing one class and lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returns the highest value as our prediction.
$h_{\theta}^{(i)}(x)=P(y=i | x;\theta), i = 1,2,3$
Overfitting
1. underfitting, high bias
2. just right
3. overfitting, high variance
  Definition of overfitting: if we have too many features, the learned hypothesis may fit the training set very well ( $J(\theta) \approx 0$ ), but fail to generalize to new examples.

Note: If we have a lot of features, and , very little training data, then, overfitting can become a problem.

How to address overfitting:
1. Reduce number of features
  - Manually select which features to keep
  - Model selection algorithm
2. Regularization
  - Keep all the features, but reduce magnitude/values of parameters $\theta_{j}$
  - Works well when we have a lot of features, each of which contributes a bit to predicting $y$ .
In regularized linear regression, we choose $\theta$ to minimize:
$J(\theta) = \frac{1}{2m}[\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda\sum_{j=1}^{n} \theta_{j}^2]$
If $\lambda$ is set to an extremely large value, algorithm results in underfitting (fails to fit even the training set)
Regularized linear regression
gradient descent:
$\theta_{j} := \theta_{j}(1-\alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i=1}^{m}(h_{\theta} (x^{(i)})-y^{(i)})x_j^{(i)}$
where $1-\alpha \frac{\lambda}{m}$ will always be less than 1.
Normal equation:
$\theta = (X^TX+\lambda \cdot L)^{-1} X^T y$
where
$L = \begin{bmatrix} 0 & & &\\ & 1 & &\\ & & \ddots &\\ & & & 1 \end{bmatrix}$
L is a matrix with 0 at the top left and 1 is down the diagonal with 0's everywhere else. It should have dimension $(n+1)\times(n+1)$
Recall that is $m<n$ then $X^TX$ is non-invertible. However, when we add the term $\lambda \cdot L$ , then $X^T X+\lambda \cdot L$ becomes invertible.
Regularized logistic regression- Gradient descent
Repeat until convergence
$\theta_{j} := \theta_{j} - \alpha[\frac{1}{m}\sum_{i=1}^{m}{(h_{\theta}(x^{(i)}) - y^{(i)})}{x^{(i)}_{j}}+\frac{\lambda}{m}\theta_{j}]$
simultaneously update $\theta_{j}$ for $j = 1, ..., n$

Programming Assignment: Logistic regression

Feature mapping
One way to fit the data better is to create more features from each data point. In the provided function mapFeature.m, we will map the features into all polynomial terms of $x_1$ and $x_2$ up to the sixth power.
$mapFeature(x) = \begin{bmatrix} 1\\ x_1\\ x_2\\ x_1^2\\ x_1x_2\\ x_2^2\\ x_1^3\\ \vdots\\ x_1x_2^5\\ x_2^6 \end{bmatrix}$
Different regularization parameter $\lambda$

With a small $\lambda$ , you should find that the classifier gets almost every training example correct, but draws a very complicated boundary, thus overfitting the data.
If $\lambda$ is set to too high a value, you will not get a good fit and the decision boundary will not follow the data so well, thus underfitting the data.

Week 4

Course content

Suppose that you are learning to recognize cars from 2×2 pixel images (grayscale, not RGB). Let the features be pixel intensity values. If you train logistic regression including all the quadratic terms $(x_ix_j)$ as features. you will have 10 features. If there are more pixels, then there will be millions of features. Therefore, you need neural network to perform.
Model representation
Neuron model: logistic unit
sigmoid(logistic) activation function: $h_{\theta}(x)=\frac{1}{1+e^{-\theta^Tx}}$
weights means parameters in $\theta$
$a_i^{(j)}$ = "activation" of unit i in layer j
$\Theta^{(j)}$ = matrix of weights controlling function mapping from layer j to layer j+1
if network has $s_j$ units in layer $j$ , $s_{j+1}$ units in layer $j+1$ , then $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j+1)$
Forward propagation vectorized implementation
$z^{(j)}=\Theta^{(j-1)} a^{(j-1)}$
$a^{(j)}=g(z^{(j)})$
$z^{(j+1)}=\Theta^{(j)} a^{(j)}$
$h_{\Theta}(x)=a^{(j+1)}=g(z^{(j+1)})$
Examples and intuitions
A simple example of applying neural networks is by predicting $x_1$ AND $x_2$

XNOR operator using a hidden layer with two nodes
Multiclass classification

Programming Assignment: Multi-class Classification and Neural Networks

You will be using multiple one-vs-all logistic regression models to build a multi-class classifier. Since there are 10 classes, you will need to train 10 separate logistic regression classifiers.
Regularized logistic regression
Cost function:
$J(\theta) =\frac{1}{m} \sum_{i=1}^m [-y^{(i)} \log(h_{\theta}(x^{(i)}))- (1 -y^{(i)}) \log(1-h_{\theta}(x^{(i)}))]+ \frac{\lambda}{2m} \sum_{j=1}^n{\theta_j^2}$
Partial derivative of regularized logistic regression cost for $\theta_{j}$ is defined as:
$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m ( h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}\qquad \mathrm{for}\;j=0$
$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m ( h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+\frac{\lambda}{m}\theta_j\qquad \mathrm{for}\;j\geq1$
For vectorized implementation, please check Matlab code uploaded in github.
One-vs-all classification
In this part of the exercise, you will implement one-vs-all classification by training multiple regularized logistic regression classifiers, one for each of the classes in our dataset.
One-vs-all prediction
*logistic regression cannot form more complex hypotheses as it is only a linear classier. (You could add more features such as polynomial features to logistic regression, but that can be very expensive to train.). The neural network will be able to represent complex models that form non-linear hypotheses. *
Your goal is to implement the feedforward propagation algorithm to use our trained weights for prediction. In next week's exercise, you will write the backpropagation algorithm for learning the neural network parameters.

Week 5

Course content

Neural network (classification)
$L$ - total no. of layers in network
$s_l$ - no. of units (not counting bias unit) in layer $l$
$K$ - number of output units/classes
Cost function of neural networks
$J(\Theta)=-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}[y_k^{(i)}log \ (h_{\theta}(x^{(i)}))_k + (1-y_k^{(i)})log(1 - (h_{\theta}(x^{(i)}))_k)] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}}(\Theta_{ji}^{(l)})^2$
backpropagation:
"Backpropagation" is to minimizing cost function: $\min_{\Theta}J(\Theta)$

backpropagation algorithm

Note that $D_{ij}^{(l)}$ is wrong in picture above, it should be as follows
$D_{ij}^{(l)}=\frac{1}{m} \Delta_{ij}^{(l)} \quad for \ j=0\\ D_{ij}^{(l)}=\frac{1}{m} \Delta_{ij}^{(l)}+\frac{\lambda}{m} \Theta_{ij}^{(l)} \quad for \ j \geq 1$

backpropagation algorithm.png
Unrolling parameters
If the dimensions of Theta 1 is $10\times11$ , Theta 2 is $10 \times 11$ and Theta 3 is $1\times11$ , then we "unroll" all the elements and put them into one long vecotr:

thetaVector = [Theta1(:); Theta2(:); Theta3(:);]
deltaVector = [D1(:); D2(:); D3(:)]

We can also get back our original matrices from the "unrolled" versions as follows:

Theta1 = reshape(thetaVector(1:100), 10, 11)
Theta2 = reshape(thetaVector(111:220), 10, 11)
Theta3 = reshape(thetaVector(221:231), 1, 11)

Gradient checking
$\frac{\partial}{\partial \Theta_j} J(\Theta) \approx \frac{J(\Theta_1,...,\Theta_j+\epsilon,...,\Theta_n)-J(\Theta1,...,\Theta_j-\epsilon,...,\Theta_n)}{2\epsilon}$
A small value for {\epsilon}ϵ (epsilon) such as ${\epsilon = 10^{-4}}$ , guarantees that the math works out properly.

epsilon = 1e-4;
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) += epsilon;
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*epsilon)
end;

Once we compute gradApprox vector, we can check that gradApprox $\approx$ deltaVector.

Random initialization (symmetry breaking)
Initializing each $\Theta_{ij}^{(l)}$ to a random value between $[-\epsilon, \epsilon]$ .

If the dimensions of Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11.

Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;

Training a neural network

Randomly initialize the weights
Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$
Implement the cost function
Implement backpropagation to compute partial derivatives
Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.

Programming Assignment: Neural Networks Learning

Feedforward and regularized cost function
Backpropagation with random initialization
Adding regularization to gradient
Gradient checking
Learning a good set of parameters (Theta1, Theta2...) using fmincg()
Optional exercise: To see how the performance of the neural network varies with the regularization parameter and number of training steps (the MaxIter option when using fmincg)

Week 6

Course content

Debugging a learning algorithm
1. Get more training examples
2. Try smaller sets of features
3. Try getting additional features
4. Try adding polynomial features ( $x_1^2, x_2^2, x_1x_2$ etc.)
5. Try decreasing $\lambda$
6. Try increasing $\lambda$
Typical split of dataset into training set and test set is 70% and 30% respectively.
The procedure using training set and test set:
1. Learn $\Theta$ and minimize $J_{train}(\Theta)$ using the training set
2. Compute the test set error $J_{test}(\Theta)$
The test set error
1. For linear regression mean square error stuff.
  $J_{test}(\Theta)=\frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} (h_{\Theta}(x_{test}^{(i)}) - y_{test}^{(i)})^2$
2. For classification 0/1 misclassification error
  $J_{test}(\Theta)=- \frac{1}{m_{test}} \sum_{i=1}^{m_{test}} y_{test}^{(i)} log(h_{\Theta}(x_{test}^{(i)}) + (1-y_{test}^{(i)}) log(h_{\Theta}(x_{test}^{(i)}))$
Model selection (e.g., degree of polynomial)
Train/(Cross) validation/Test sets (typically 60%/20%/20%)
We can now calculate three separate error values for the three different sets using the following method:
1. Optimize the parameters in Θ using the training set for each polynomial degree.
2. Find the polynomial degree d with the least error using the cross validation set.
3. Estimate the generalization error using the test set with $J_{test}(\Theta^{(d)})$ , (d = theta from polynomial with lower error);
High bias problem (underfitting problem) or high variance problem (overfitting problem)

The relationship between degree of the polynomial d and the underfitting or overfitting of our hypothesis
Choosing the regularization parameter $\lambda$

Screenshot 2020-12-08 at 20.31.32.png
In order to choose the model and the regularization term λ, we need to
1. Create a list of lambdas (i.e. λ∈{0,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10.24});
2. Create a set of models with different degrees or any other variants.
3. Iterate through the $\lambda$ s and for each $\lambda$ go through all the models to learn some $\Theta$ .
4. Compute the cross validation error using the learned Θ (computed with λ) on the $J_{CV}(\Theta)$ without regularization or λ = 0.
5. Select the best combo that produces the lowest error on the cross validation set.
6. Using the best combo Θ and λ, apply it on $J_{test}(\Theta)$ to see if it has a good generalization of the problem.
Learning curve
1. If a learning algorithm is suffering from high bias (underfitting), adding polynomial features, or obtaining additional features will help, but getting more training data will not (by itself) help much.
  
  Typical learning curve for high bias
2. If a learning algorithm is suffering from high variance (overfitting), reducing feature set or increasing the regularization parameter or getting more training data is likely to help.
  
  Typical learning curve for high variance
"Small" neural network is prone to underfitting, while "large" neural network is more prone to overfitting (use regularization, e.g., increase $\lambda$ to address the overfitting)
How to improve the accuracy of the classifier:
1. Collect lots of data (for example "honeypot" project but doesn't always work)
2. Develop sophisticated features (for example: using email header data in spam emails)
3. Develop algorithms to process your input in different ways (recognizing misspellings in spam).
Error analysis
The recommended approach to solving machine learning problems is to:
1. Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
2. Plot learning curves to decide if more data, more features, etc. are likely to help.
3. Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.
$Accuracy = \frac{(True \quad positives+true \quad negatives)}{total \quad examples}$
Error Metrics for Skewed Classes
1. Precision/Recall
  
  Precision/Recall
Trading off precision and recall
F score
$2\frac{Precision*Recall}{Precision+Recall}$

Programming Assignment: Regularized linear regression and bias vs. variance

Regularized linear regression
1. Visualizing the dataset
2. Regularized linear regression cost function
3. Regularized linear regression gradient
4. Fitting linear regression
Bias-variance
1. Learning curve
Polynomial regression
1. Learning polynomial regression
2. Selecting lambda using a cross validation set
Large data rationale
1. Use a learning algorithm with many parameters (e.g., logistic regression/linear regression with many features; neural network with many hidden units). - low bias algorithms
2. Using a very large training set (unlikely to overfit) - low variance
  A combination of both hopefully will deliver a small $J_{train}(\Theta) \approx J_{test}{(\Theta)}$
  Remember to normalizing polynomial features before using them
  If you are using cost function (linearRegCostFunction) to compute the training, cross validation and test error, you should call the function with the lambda argument set to 0. Do note that you will still need to use lambda when running the training to obtain the theta parameters.
  Add an error_metrics.m function to calculate error metrics.

Week 7

Course content

SVM hypothesis
$\min_{\theta} C \sum_{i=1}^{m}[y^{(i)} cost_1(\theta^Tx^{(i)}) + (1-y^{(i)})cost_0(\theta^Tx^{(i)})] + \frac{1}{2} \sum_{i=1}^{n} \theta^2_j$
Hypothesis:
$h_{\theta}(x)=1$ if $\theta^Tx \geq 0$
$h_{\theta}(x)=0$ otherwise
SVM Decision Boundary
$\min_{\theta}$ ~~C*0~~ $+\frac{1}{2} \sum_{i=1}^{n} \theta^2_j$
s.t. $\theta^Tx^{(i)} \geq 1$ if $y^{(i)}=1$
$\theta^Tx^{(i)} \leq -1$ if $y^{(i)}=0$
Large margin intuition
C is $\frac{1}{\lambda}$
Kernels and similarity:
Gaussian kernel function:
$f_1=similarity(x,l^{(1)})=exp(-\frac{||x-l^{(1)}||^2}{2\sigma^2})=exp(-\frac{\sum_{j=1}^{n}(x_j - l_j^{(1)})^2}{2\sigma^2})$
If $x \approx l^{(1)}$ :
$f_1 \approx exp(-\frac{0^2}{2\sigma^2}) \approx 1$
If $x$ is far from $l^{(1)}$ :
$f_1 = exp(-\frac{(large number)^2}{2 \sigma^2}) \approx 0$
SVM with Kernels
SVM parameters
Large C( $=\frac{1}{\lambda}$ ): lower bias, high variance
Small C( $=\frac{1}{\lambda}$ ): higher bias, low variance
Large $\sigma^2$ : Features $f_i$ vary more smoothly, higher bias, lower variance
Small $\sigma^2$ : Feature $f_i$ vary less smoothly, lower bias, higher variance
Use SVM software package (e.g., liblinear, libsvm)
Need to specify:
- Choice of parameter C
- choice of kernel (similarity function)
Kernel (similarity) function
Do perform feature scaling before using the Gaussian kernel
Many off-the-shelf kernels available
- Polynomial kernel
- String kernel
n=number of features, m=number of training examples
If n is large, use logistic regression or SVM without a kernel ("linear kernel")
If n is small, m is intermediate, use SVM with Gaussian kernel
If n is small, m is large, create/add more features, then use logistic regression or SVM without a kernel.
Neural network likely to work well for most of these settings, but may be slower to train.

Programming Assignment: Support Vector Machines

Part 1: using a Gaussian kernel with SVMs, 2D datasets
The C parameter is a positive value that controls the penalty for misclassified training examples.
Implementation Note:Most SVM software packages (including svmTrain.m) automatically add the extra feature $x_0=1$ for you and automatically take care of learning the intercept term $\theta_0$ . So when passing your training data to the SVM software, there is no need to add this extra feature $x_0=1$ yourself.
In the last section of first part, Optimal $C$ and $\sigma$ parameter is found by using the cross validation set.
Recall that for classification, the error is defined as the fraction of the cross validation examples that were classified incorrectly. In MATLAB, you can compute this error using mean(double(predictions ~= yval)), where predictions is a vector containing all the predictions from the SVM, and yval are the true labels from the cross validation set.
Part 2: spam classification

Week 8

Course content

Unsupervised learning
While supervised learning algorithms need labeled examples (x,y), unsupervised learning algorithms need only the input (x).
K-means algorithm
input:
- K (number of clusters)
- Training set $\{x^{(1)}, x^{(2)}, ..., x^{(m)}\}$
  $c^{(i)}$ := index (from 1 to K) of cluster centroid closest to $x^{(i)}$
  $\mu_k$ := average (mean) of points assigned to cluster $k$
Optimization objective
$J(c^{(1)}, ..., c^{(m)}, \mu_1, ..., \mu_K) = \frac{1}{m} \sum_{i=1}^m ||x^{(i)}-\mu_{c^{(i)}}||^2 \\ \min_{c^{(1)}, ..., c^{(m)}, \mu_1, ..., \mu_K} J(c^{(1)}, ..., c^{(m)}, \mu_1, ..., \mu_K)$
K-means algorithm
Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, ..., \mu_K \in R^n$
Repeat{
for i=1 to m
$c^{(i)} :=$ index (from 1 to K) of cluster centroid closest to $x^{(i)}$ .
for k=1 to K
$\mu_k :=$ average (mean) of points assigned to cluster k.
}
Random initialization
Should have $K<m$ m is the number of samples
Randomly pick $K$ training examples
Set $\mu_1, ..., \mu_K$ equal to these $K$ examples.
Try multiple random initialization to avoid that k-mean is stuck in local optimum.
Choosing the number of clusters
Elbow method
Evaluate K-means based on a metric for how well it performs for that later purpose.
Dimensionality reduction
1. Data compression (Maybe for batteries, this is not necessary, since there is lower dimensional data)
2. Visualization
  After dimensionality reduction, k=2 or 3 since we can plot 2D or 3D data but don't have ways to visualize higher dimensional data.
Principal Component Analysis (PCA) problem formulation
Reduce from n-dimension to k-dimension: Find k vectors ( $u^{(1)}, u^{(2)}, ..., u^{(k)} \in R^n$ ) onto which to project the data so as to minimize the projection error.
PCA is not linear regression
Principal Component Analysis algorithm
Reduce data from n-dimensions to k-dimensions
1. Compute "covariance matrix":
  $\Sigma = \frac{1}{m} \sum_{i=1}^n (x^{(i)}) (x^{(i)})^T$
2. Compute "eigen vectors" of matrix $\Sigma$
  $[U, S, V] = svd(Sigma);\\ Ureduce = U(:, 1:k);\\ z = Ureduce'*x;$
  svd - singular value decomposition
Obtaining $z_j=(u^{(j)})^Tx$
Reconstruction from compressed representation
Choosing the number of principal components k
Typically, choose k to be smallest value so that
$\frac{\frac{1}{m} \sum_{i=1}^m || x^{(i)} - x_{approx}^{(i)} ||^2}{\frac{1}{m} \sum_{i=1}^{m} || x^{(i)} ||^2} \leq 0.01$
or
$\frac{\sum_{i=1}^{k} S_{ii}} {\sum_{i=1}^{m} S_{ii}} \geq 0.99$
"99% of variance is retained"
Advice for applying PCA
1. Supervised learning speedup
  Bad use of PCA: To prevent overfitting. Use regularization instead
Before implementing PCA, first try running whatever you want to do with the original/raw data $x^{(i)}$ . Only if that doesn't do what you want, then implement PCA and consider using $z^{(i)}$ .

Programming Assignment: K-Means clustering and principal component analysis

K-Means clustering
K-means algorithm is as follows:

% Initialize centroids
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
    % Cluster assignment step: Assign each data point to the
    % closest centroid. idx(i) corresponds to c^(i), the index
    % of the centroid assigned to example i
    idx = findClosestCentroids(X, centroids);

    % Move centroid step: Compute means based on centroid
    % assignments
    centroids = computeMeans(X, idx, K);
end

1.1 Implementing K-Means
1.1.1 finding closest centroids
1.1.2 Computing centroid means

Week 9

Course content

Anomaly detection
Check which have $p(x) < \epsilon$
Density estimation
Training set: $\{x^{(1)}, ..., x^{(m)} \}$
Each example is $x \in R^n$
$\begin{align} p(x) &= p(x_1;\mu_1, \sigma_1^2)p(x_2; \mu_2, \sigma_2^2)...p(x_n;\mu_n, \sigma_n^2)\\ &=\prod_{j=1}^n p(x_j;\mu_j, \sigma_j^2) \end{align}$
Anomaly detection algorithm

Choose features $x_i={x_i^{(1)}, ..., x_i^{(m)}}$ that you think might be indicative of anomalous examples, each example is $x_i \in R^n$ .
Fit parameters $\mu_1, ..., \mu_n, \sigma_1^2, ..., \sigma_n^2$
$\begin{align} \mu_j = \frac{1}{m} \sum_{i=1}^m x_j^{(i)};\\ \sigma_j^2 = \frac{1}{m} \sum_{i=1}^m (x_j^{(i)}-\mu_j)^2 \end{align}$
Given new example $x$ , compute $p(x)$ :
$\begin{align} p(x) &= \prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2)\\ &= \prod_{j=1}^n \frac{1}{\sqrt {2\pi}\sigma_j} \exp (- \frac{(x_j-\mu_j)^2}{2\sigma_j^2}) \end{align}$
Anomaly if $p(x) < \epsilon$

Developing and evaluating an anomaly detection system.

Comparison of anomaly detection and supervised learning
For non-gaussian features, some transformation on original data is needed in order to fit gaussian to it:

log(x)
log(x+c)
$\sqrt{x}$
$x^{(1/3)}$

Choosing features that might take on unusually large or small values in the event of an anomaly
Multivariate Gaussian (Normal) distribution
$x \in R^n$ , don't model $p(x_1), p(x_2), ...$ etc. separately. Model $p(x)$ all in one go.
Parameters: $\mu \in R^n, \Sigma \in R^{(n \times n)}(covariance matrix)$

Model expression	Original model	Multivariate Gaussian
expression	$p(x_1;\mu_1, \sigma_1^2 \times ... \times p(x_n; \mu_n, \sigma_n^2))$	p(x; \mu, \Sigma)
Correlation	Manually create features to capture anomalies where $x_1, x_2$ take unusual combinations of values	Automatically captures correlations between features
Computation power	cheaper	expensive
Application	OK even if m(training set size) is small	must have $m>n$ or else $\Sigma$ is non-invertible

Note: If $\Sigma$ is non-invertible, check 1. $m>n$ , 2. if there are redundant features.

Recommender systems
Collaborative filtering
vectorization: low rank matrix factorization
Mean normalization

This week's content is so similar to sensor fusion course.

Programming Assignment: Anomaly detection and recommender system

Anomaly detection
1.1 Gaussian distribution
1.2 Estimating parameters for a Gaussian
1.3 Selecting the threshold, $\epsilon$
1.4 High dimensional dataset
Recommender systems
2.1 Collaborative filtering cost
2.2 Collaborative filtering gradient
2.3 Regularized cost
2.4 Regularized gradient

Week 10

Course content

Learning with large datasets
"It is not who has the best algorithm that wins, it is who has the most data."
Stochastic gradient descent
1. Randomly shuffle dataset
2. Repeat{
  for $i := 1, ..., m$ {
  $\theta_j := \theta_j - \alpha(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}$
  (for every $j=0,...,n)$
  }
  }
  Note:

When the training set size $m$ is very large, stochastic gradient descent can be much faster than gradient descent.
The cost function should go down with every literation of batch gradient descent but not necessarily with stochastic gradient descent.

Mini-batch gradient descent
Batch gradient descent: Use all $m$ examples in each iteration.
Stochastic gradient descent: Use 1 example in each iteration.
Mini-batch gradient descent: Use $b (\leq m)$ examples in each iteration.
Checking for convergence
Plot $cost(\theta, (x^{(i)}, y^{(i)}))$ , averaged over the last 1000 examples. If curve is too noisy, try to increase number of examples or reducing learning rate $\alpha$ .
Learning rate $\alpha$ is typically held constant can slowly decrease $\alpha$ over time if we want $\theta$ to converge. (E.g., $\alpha = \frac{const1}{iterationNumber + const2}$ )
Online learning for large scale data
Learn $p(y=1|x;\theta)$ , which is called likelihood in sensor fusion course. $x$ stands for features here.
Map reduce and data parallelism

Map-reduce

Multi-core machines

This week may be used in my project later in the future when we have gigabytes data available from cloud.

Quiz

As for linear regression with stochastic gradient descent, there is no need to apply map reduce techniques to them again, since computational power has already been decreased.
suppose that you are training a logistic regression classifier using stochastic gradient descent. It is found that the cost( $cost(\theta, (x^{(i)}, y^{(i)}))$ , averaged over the last 500 examples), plotted as a function of the number of iterations, is slowly increasing over time. Then you should try decrease learning rate $\alpha$
Before running stochastic gradient descent, you should randomly shuffle the training set.

Week 11

Course content

Photo (Optical character recognition) OCR pipeline
1. Text detection
2. Character segmentation
3. Character recognition
Sliding windows for text detection
Make sure you have a low bias classifier before getting more data.
Estimating the errors due to each component (ceiling analysis)
What part of the pipeline should you spend the most time trying to improve?

Ceiling analysis example

Questions arise from this course:

Why don't go directly to deep learning methods instead of machine learning.
Because feature matters more especially for batteries with less amount of data!
Possible research ideas inspired by this course.

Machine Learning - Stanford University

Course syllabus

Week 1:

Week 2:

Week 3:

Week 4:

Week 5:

Week 6:

Week 7:

Week 8:

Week 9:

Week 10:

Week 11:

Software

Week 1

Course content

Week 2

Course content

Programming Assignment: Linear regression

Week 3

Course content

Programming Assignment: Logistic regression

Week 4

Course content

Programming Assignment: Multi-class Classification and Neural Networks

Week 5

Course content

Programming Assignment: Neural Networks Learning

Week 6

Course content

Programming Assignment: Regularized linear regression and bias vs. variance

Week 7

Course content

Programming Assignment: Support Vector Machines

Week 8

Course content

Programming Assignment: K-Means clustering and principal component analysis

Week 9

Course content

Programming Assignment: Anomaly detection and recommender system

Week 10

Course content

Quiz

Week 11

Course content

Questions arise from this course:

推荐阅读更多精彩内容