Machine Learning - Stanford University

Course syllabus

Week 1:

  • Introduction
  • Model and Cost Function
  • Gradient Descent
  • Linear regression with one variable
  • Linear algebra review

Week 2:

  • Linear regression with multiple variables
  • Octave/Matlab Tutorial

Week 3:

  • Classification and Hypothesis Representation
  • Multiclass Classification
  • Logistic regression
  • Regularization

Week 4:

  • Neural networks: representation

Week 5:

  • Neural networks: learning
  • Cost Function and Backpropagation
  • Gradient Checking and Random Initialization

Week 6:

  • Advice for applying machine learning
  • Evaluating a learning algorithm
  • Bias vs. Variance
  • Handling Skewed Data

Week 7:

  • Support vector machines
  • Large Margin Classification
  • Kernels

Week 8:

  • Unsupervised learning
  • K-Means
  • Dimensionality reduction

Week 9:

  • Density Estimation
  • Anomaly detection
  • Recommender systems

Week 10:

  • Gradient Descent with Large Datasets

Week 11:

  • Application example: photo OCR

Software

  • Octave/MatLab

Week 1

Course content

  • Machine learning definition ( Field of study that gives computers the ability to learn without being explicitly programmed. )
  • Machine learning algorithms: Supervised learning and unsupervised learning, reinforcement learning, recommender systems.
  • Supervised learning (e.g., regression, classification, naive bayesian model, decision tree, random forest model, neural networks, support vector machines)
  • Unsupervised learning (clustering)
  • Model representation
    m = Number of training examples
    x = "input" variable/features
    y = "output" variable/"target" variable
    h = maps from x's to y's
    h_{\theta}(x) = \theta_{0} + \theta_{1}x
  • Cost function
    \min_{\theta_{0},\theta_{1}} \quad \frac{1}{2m} \sum_{i=1}^{m}{(h_{\theta}(x^{(i)}) - y^{(i)})^2}
    Cost function(squared error function): J(\theta_{0},\theta_{1}) = \frac{1}{2m} \sum_{i=1}^{m}{(h_{\theta}(x^{(i)}) - y^{(i)})^2}
  • J(\theta_{1}) Bowl shape in 2D
  • J(\theta_{1}, \theta_{2}) Bowl shape in 3D
    Capture.PNG
  • Gradient descent algorithm
    Repeat until convergence:
    \theta_{j} := \theta_{j} - \alpha\frac{\partial}{\partial {\theta_{j}}}J(\theta_{0},\theta_{1})\quad (for \quad j = 0 \quad and \quad j = 1)
    \alpha is called learning rate. A smaller \alpha would result in a smaller step and a larger \alpha results in a larger step. The direction in which the step is taken is determined by the partial derivative of J(\theta_0,\theta_1). Depending on where one starts on the graph, one could end up at different points.
  • If \alpha is too small, gradient descent can be slow, while if \alpha is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
  • As we approach a local minimum, gradient descent will automatically take smaller steps. So no need to decrease \alpha over time.
  • Repeat until convergence
    \theta_{0} := \theta_{0} - \alpha \frac{1}{m} \sum_{i=1}^{m}{(h_{\theta}(x^{(i)}) - y^{(i)})} \\ \theta_{1} := \theta_{1} - \alpha \frac{1}{m} \sum_{i=1}^{m}{((h_{\theta}(x^{(i)}) - y^{(i)})x^{(i)})}
  • "Batch": Each step of gradient descent uses all the training examples.
  • Vector is an n \times 1 matrix
  • Matrices are usually denoted by uppercase names while vectors are lowercase.
  • "Scalar" means that an object is a single value, not a vector or matrix.

Week 2

Course content

  • Multiple features
    x^{(i)}_{j} - value of feature j in the i^{th} training example
    x^{(i)} - the input (features) of the i^{th} training example
    m - the number of training examples
    n - the number of features
    Hypothesis(or called Prediction):\begin{equation} h_{\theta}(x) = \begin{bmatrix} \theta_{0} & \theta_{1} & \dots & \theta_{n} \end{bmatrix} \begin{bmatrix} x_{0}\\ x_{1}\\ \vdots\\ x_{n}\\ \end{bmatrix} = \theta^{T}x \end{equation}
  • Cost function
    J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2
    Vectorized implementation:
    J(\theta) = \frac{1}{2m} (X \theta-\vec{y})^T (X \theta - \vec{y})
  • Gradient descent
    Repeat until convergence
    \theta_{j} := \theta_{j} - \alpha\frac{1}{m}\sum_{i=1}^{m}{(h_{\theta}(x^{(i)}) - y^{(i)})}{x^{(i)}_{j}}
    simultaneously update \theta_{j} for j = 0, ..., n
    Gradient descent implementation steps

    Vectorized implementation:
    \theta = \theta-\alpha \times \frac{1}{m} \times X^T \times (X \times \theta - \vec{y})
  • Feature scaling (get every feature into approximately a -1 \leq x_{i} \leq 1 range)
    *Feature scaling speeds up gradient descent by making it require fewer iterations to get to a good solution. *
  • Mean normalization (Replace x_{i} with x_{i}-\mu_{i} to make features have approximately zero mean (Do not apply to x_{0}=1))
    x_{i}=\frac{x_{i}-\mu_{i}}{s_{i}}
    With:
    \mu_{i} - averaged value of x in training set
    s_{i} - range(max-min) or standard deviation
  • Normal equation


    Normal equation example and vectorized implementation
  • Comparison of gradient descent and normal equation
Gradient Descent Normal equation
Need to choose \alpha No need to choose \alpha
Needs many iterations Don't need to iterate
Works well even when n (features) is large Slow if n (features) is very large
  • What if X^TX is non-invertible?
  1. Redundant features (linearly dependent)
  2. Too many features (e.g., m \leq n), then delete some features, or use regularisation.
  • Unvectorized implementation: h_{\theta}(x)=\sum_{i=0}^{n}{\theta_j}{x_j}
  • Vectorized implementation: h_{\theta}(x)={\theta}^Tx

Programming Assignment: Linear regression


Week 3

Course content

  • Classification (Logistic regression):
    y \in \{0,1\}
    0 \leq h_{\theta}(x) \leq 1
  • Hypothesis representation
    logistic function (or sigmoid function):
    h_{\theta}(x)=g(\theta^T x)
    z=\theta^T x
    g(z)=\frac{1}{1+e^{-z}}
  • Decision boundary
    \theta^T x=0
    Predict y=1 if \theta^T x \geq 0 or h_{\theta}(x) \geq 0.5;
    Predict y=0 if \theta^T x < 0 or h_{\theta}(x) < 0.5.
  • Cost function
    Convex problem has one global optimum, while non convex problem has multiple local optimum and it is hard to find which one is global optimum.
    J(\theta)=\frac{1}{m}\sum_{i=1}^{m}Cost{(h_{\theta}(x^{(i)}), y^{(i)})}
    Cost(h_{\theta}(x),y)=-log(h_{\theta}(x)) if y=1
    Cost(h_{\theta}(x),y)=-log(1-h_{\theta}(x)) if y=0
    When y=1, we get the following plot for J(\theta) vs h_{\theta}(x)

    When y=0, we get the following plot for J(\theta) vs h_{\theta}(x)

    Cost(h_{\theta}(x), y)=0 if h_{\theta}(x)=y
    Cost(h_{\theta}(x), y) \to \infty if y=0 and h_{\theta}(x) \to 1
    Cost(h_{\theta}(x), y) \to \infty if y=1 and h_{\theta}(x) \to 0
    Note that writing the cost function in this way guarantees that J(\theta) is convex for logistic regression.
  • Simplified logistic cost function:
    J(\theta)=\frac{1}{m}\sum_{i=1}^{m}Cost(h_{\theta}(x^{(i)}), y{(i)})=-\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}log \ h_{\theta}(x^{(i)})+(1-y^{(i)})log(1-h_{\theta}(x^{(i)}))]
    To fit parameter \theta:
    \min_{\theta}J(\theta)
    To make a prediction given new x:
    outputh_{\theta}(x)=\frac{1}{1+e^{-\theta^T x}}
    Gradient descent with vectorized implementation is:
    Vectorized implementation:
    \theta = \theta - \alpha \times \frac{1}{m} \times (X^T) \times (g(X \theta)-\vec{y})
  • Advanced optimization algorithms:
    1. Gradient descent
    2. Conjugate gradient
    3. BFGS
    4. L-BFGS
      With algorithms No.2,3 and 4, \alpha doesn't need to be picked manually, and they are often faster than gradient descent, but they are more complex.
function [J_value, gradient] = costFunction(theta)
  J_value = [code to compute cost function]
  gradient (1) = [code to compute partial derivative of cost function with respect to theta 0]
  .
  .
  .
  gradient (n+1) = [code to compute partial derivative of cost function with respect to theta n]
  • Multiclass classification
    one-vs-all (one-vs_rest):
    We are basically choosing one class and lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returns the highest value as our prediction.
    h_{\theta}^{(i)}(x)=P(y=i | x;\theta), i = 1,2,3
  • Overfitting
    1. underfitting, high bias
    2. just right
    3. overfitting, high variance
      Definition of overfitting: if we have too many features, the learned hypothesis may fit the training set very well (J(\theta) \approx 0), but fail to generalize to new examples.

Note: If we have a lot of features, and , very little training data, then, overfitting can become a problem.

  • How to address overfitting:
    1. Reduce number of features
      • Manually select which features to keep
      • Model selection algorithm
    2. Regularization
      • Keep all the features, but reduce magnitude/values of parameters \theta_{j}
      • Works well when we have a lot of features, each of which contributes a bit to predicting y.
  • In regularized linear regression, we choose \theta to minimize:
    J(\theta) = \frac{1}{2m}[\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda\sum_{j=1}^{n} \theta_{j}^2]
    If \lambda is set to an extremely large value, algorithm results in underfitting (fails to fit even the training set)
  • Regularized linear regression
    gradient descent:
    \theta_{j} := \theta_{j}(1-\alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i=1}^{m}(h_{\theta} (x^{(i)})-y^{(i)})x_j^{(i)}
    where 1-\alpha \frac{\lambda}{m} will always be less than 1.
    Normal equation:
    \theta = (X^TX+\lambda \cdot L)^{-1} X^T y
    where
    L = \begin{bmatrix} 0 & & &\\ & 1 & &\\ & & \ddots &\\ & & & 1 \end{bmatrix}
    L is a matrix with 0 at the top left and 1 is down the diagonal with 0's everywhere else. It should have dimension (n+1)\times(n+1)
    Recall that is m<n then X^TX is non-invertible. However, when we add the term \lambda \cdot L, then X^T X+\lambda \cdot L becomes invertible.
  • Regularized logistic regression- Gradient descent
    Repeat until convergence
    \theta_{j} := \theta_{j} - \alpha[\frac{1}{m}\sum_{i=1}^{m}{(h_{\theta}(x^{(i)}) - y^{(i)})}{x^{(i)}_{j}}+\frac{\lambda}{m}\theta_{j}]
    simultaneously update \theta_{j} for j = 1, ..., n

Programming Assignment: Logistic regression

  • Feature mapping
    One way to fit the data better is to create more features from each data point. In the provided function mapFeature.m, we will map the features into all polynomial terms of x_1 and x_2 up to the sixth power.
    mapFeature(x) = \begin{bmatrix} 1\\ x_1\\ x_2\\ x_1^2\\ x_1x_2\\ x_2^2\\ x_1^3\\ \vdots\\ x_1x_2^5\\ x_2^6 \end{bmatrix}
  • Different regularization parameter \lambda
  1. With a small \lambda, you should find that the classifier gets almost every training example correct, but draws a very complicated boundary, thus overfitting the data.
  2. If \lambda is set to too high a value, you will not get a good fit and the decision boundary will not follow the data so well, thus underfitting the data.

Week 4

Course content

  • Suppose that you are learning to recognize cars from 2×2 pixel images (grayscale, not RGB). Let the features be pixel intensity values. If you train logistic regression including all the quadratic terms (x_ix_j) as features. you will have 10 features. If there are more pixels, then there will be millions of features. Therefore, you need neural network to perform.
  • Model representation
    Neuron model: logistic unit
    sigmoid(logistic) activation function: h_{\theta}(x)=\frac{1}{1+e^{-\theta^Tx}}
    weights means parameters in \theta
    a_i^{(j)} = "activation" of unit i in layer j
    \Theta^{(j)} = matrix of weights controlling function mapping from layer j to layer j+1
    if network has s_j units in layer j, s_{j+1} units in layer j+1, then \Theta^{(j)} will be of dimension s_{j+1} \times (s_j+1)
  • Forward propagation vectorized implementation
    z^{(j)}=\Theta^{(j-1)} a^{(j-1)}
    a^{(j)}=g(z^{(j)})
    z^{(j+1)}=\Theta^{(j)} a^{(j)}
    h_{\Theta}(x)=a^{(j+1)}=g(z^{(j+1)})
  • Examples and intuitions
    A simple example of applying neural networks is by predicting x_1 AND x_2
    XNOR operator using a hidden layer with two nodes
  • Multiclass classification

Programming Assignment: Multi-class Classification and Neural Networks

  • You will be using multiple one-vs-all logistic regression models to build a multi-class classifier. Since there are 10 classes, you will need to train 10 separate logistic regression classifiers.
  • Regularized logistic regression
    Cost function:
    J(\theta) =\frac{1}{m} \sum_{i=1}^m [-y^{(i)} \log(h_{\theta}(x^{(i)}))- (1 -y^{(i)}) \log(1-h_{\theta}(x^{(i)}))]+ \frac{\lambda}{2m} \sum_{j=1}^n{\theta_j^2}
    Partial derivative of regularized logistic regression cost for \theta_{j} is defined as:
    \frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m ( h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}\qquad \mathrm{for}\;j=0
    \frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m ( h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+\frac{\lambda}{m}\theta_j\qquad \mathrm{for}\;j\geq1
    For vectorized implementation, please check Matlab code uploaded in github.
  • One-vs-all classification
    In this part of the exercise, you will implement one-vs-all classification by training multiple regularized logistic regression classifiers, one for each of the classes in our dataset.
  • One-vs-all prediction
    *logistic regression cannot form more complex hypotheses as it is only a linear classier. (You could add more features such as polynomial features to logistic regression, but that can be very expensive to train.). The neural network will be able to represent complex models that form non-linear hypotheses. *
  • Your goal is to implement the feedforward propagation algorithm to use our trained weights for prediction. In next week's exercise, you will write the backpropagation algorithm for learning the neural network parameters.

Week 5

Course content

  • Neural network (classification)
    L - total no. of layers in network
    s_l - no. of units (not counting bias unit) in layer l
    K - number of output units/classes
  • Cost function of neural networks
    J(\Theta)=-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}[y_k^{(i)}log \ (h_{\theta}(x^{(i)}))_k + (1-y_k^{(i)})log(1 - (h_{\theta}(x^{(i)}))_k)] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}}(\Theta_{ji}^{(l)})^2
  • backpropagation:
    "Backpropagation" is to minimizing cost function: \min_{\Theta}J(\Theta)
    backpropagation algorithm

    Note that D_{ij}^{(l)} is wrong in picture above, it should be as follows
    D_{ij}^{(l)}=\frac{1}{m} \Delta_{ij}^{(l)} \quad for \ j=0\\ D_{ij}^{(l)}=\frac{1}{m} \Delta_{ij}^{(l)}+\frac{\lambda}{m} \Theta_{ij}^{(l)} \quad for \ j \geq 1
    backpropagation algorithm.png
  • Unrolling parameters
    If the dimensions of Theta 1 is 10\times11, Theta 2 is 10 \times 11 and Theta 3 is 1\times11 , then we "unroll" all the elements and put them into one long vecotr:
thetaVector = [Theta1(:); Theta2(:); Theta3(:);]
deltaVector = [D1(:); D2(:); D3(:)]

We can also get back our original matrices from the "unrolled" versions as follows:

Theta1 = reshape(thetaVector(1:100), 10, 11)
Theta2 = reshape(thetaVector(111:220), 10, 11)
Theta3 = reshape(thetaVector(221:231), 1, 11)
  • Gradient checking
    \frac{\partial}{\partial \Theta_j} J(\Theta) \approx \frac{J(\Theta_1,...,\Theta_j+\epsilon,...,\Theta_n)-J(\Theta1,...,\Theta_j-\epsilon,...,\Theta_n)}{2\epsilon}
    A small value for {\epsilon}ϵ (epsilon) such as {\epsilon = 10^{-4}}, guarantees that the math works out properly.
epsilon = 1e-4;
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) += epsilon;
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*epsilon)
end;

Once we compute gradApprox vector, we can check that gradApprox \approx deltaVector.

  • Random initialization (symmetry breaking)
    Initializing each \Theta_{ij}^{(l)} to a random value between [-\epsilon, \epsilon].
If the dimensions of Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11.

Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
  • Training a neural network
  1. Randomly initialize the weights
  2. Implement forward propagation to get h_\Theta(x^{(i)}) for any x^{(i)}
  3. Implement the cost function
  4. Implement backpropagation to compute partial derivatives
  5. Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
  6. Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.

Programming Assignment: Neural Networks Learning

  • Feedforward and regularized cost function
  • Backpropagation with random initialization
  • Adding regularization to gradient
  • Gradient checking
  • Learning a good set of parameters (Theta1, Theta2...) using fmincg()
  • Optional exercise: To see how the performance of the neural network varies with the regularization parameter and number of training steps (the MaxIter option when using fmincg)

Week 6

Course content

  • Debugging a learning algorithm
    1. Get more training examples
    2. Try smaller sets of features
    3. Try getting additional features
    4. Try adding polynomial features (x_1^2, x_2^2, x_1x_2 etc.)
    5. Try decreasing \lambda
    6. Try increasing \lambda
  • Typical split of dataset into training set and test set is 70% and 30% respectively.
  • The procedure using training set and test set:
    1. Learn \Theta and minimize J_{train}(\Theta) using the training set
    2. Compute the test set error J_{test}(\Theta)
  • The test set error
    1. For linear regression mean square error stuff.
      J_{test}(\Theta)=\frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} (h_{\Theta}(x_{test}^{(i)}) - y_{test}^{(i)})^2
    2. For classification 0/1 misclassification error
      J_{test}(\Theta)=- \frac{1}{m_{test}} \sum_{i=1}^{m_{test}} y_{test}^{(i)} log(h_{\Theta}(x_{test}^{(i)}) + (1-y_{test}^{(i)}) log(h_{\Theta}(x_{test}^{(i)}))
  • Model selection (e.g., degree of polynomial)
  • Train/(Cross) validation/Test sets (typically 60%/20%/20%)
  • We can now calculate three separate error values for the three different sets using the following method:
    1. Optimize the parameters in Θ using the training set for each polynomial degree.
    2. Find the polynomial degree d with the least error using the cross validation set.
    3. Estimate the generalization error using the test set with J_{test}(\Theta^{(d)}), (d = theta from polynomial with lower error);
  • High bias problem (underfitting problem) or high variance problem (overfitting problem)


    The relationship between degree of the polynomial d and the underfitting or overfitting of our hypothesis
  • Choosing the regularization parameter \lambda
    Screenshot 2020-12-08 at 20.31.32.png
  • In order to choose the model and the regularization term λ, we need to
    1. Create a list of lambdas (i.e. λ∈{0,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10.24});
    2. Create a set of models with different degrees or any other variants.
    3. Iterate through the \lambdas and for each \lambda go through all the models to learn some \Theta.
    4. Compute the cross validation error using the learned Θ (computed with λ) on the J_{CV}(\Theta) without regularization or λ = 0.
    5. Select the best combo that produces the lowest error on the cross validation set.
    6. Using the best combo Θ and λ, apply it on J_{test}(\Theta) to see if it has a good generalization of the problem.
  • Learning curve
    1. If a learning algorithm is suffering from high bias (underfitting), adding polynomial features, or obtaining additional features will help, but getting more training data will not (by itself) help much.


      Typical learning curve for high bias
    2. If a learning algorithm is suffering from high variance (overfitting), reducing feature set or increasing the regularization parameter or getting more training data is likely to help.


      Typical learning curve for high variance
  • "Small" neural network is prone to underfitting, while "large" neural network is more prone to overfitting (use regularization, e.g., increase \lambda to address the overfitting)
  • How to improve the accuracy of the classifier:
    1. Collect lots of data (for example "honeypot" project but doesn't always work)
    2. Develop sophisticated features (for example: using email header data in spam emails)
    3. Develop algorithms to process your input in different ways (recognizing misspellings in spam).
  • Error analysis
  • The recommended approach to solving machine learning problems is to:
    1. Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
    2. Plot learning curves to decide if more data, more features, etc. are likely to help.
    3. Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.
  • Accuracy = \frac{(True \quad positives+true \quad negatives)}{total \quad examples}
  • Error Metrics for Skewed Classes
    1. Precision/Recall


      Precision/Recall
  • Trading off precision and recall
  • F score
    2\frac{Precision*Recall}{Precision+Recall}

Programming Assignment: Regularized linear regression and bias vs. variance

  • Regularized linear regression
    1. Visualizing the dataset
    2. Regularized linear regression cost function
    3. Regularized linear regression gradient
    4. Fitting linear regression
  • Bias-variance
    1. Learning curve
  • Polynomial regression
    1. Learning polynomial regression
    2. Selecting lambda using a cross validation set
  • Large data rationale
    1. Use a learning algorithm with many parameters (e.g., logistic regression/linear regression with many features; neural network with many hidden units). - low bias algorithms
    2. Using a very large training set (unlikely to overfit) - low variance
      A combination of both hopefully will deliver a small J_{train}(\Theta) \approx J_{test}{(\Theta)}
      Remember to normalizing polynomial features before using them
      If you are using cost function (linearRegCostFunction) to compute the training, cross validation and test error, you should call the function with the lambda argument set to 0. Do note that you will still need to use lambda when running the training to obtain the theta parameters.
      Add an error_metrics.m function to calculate error metrics.

Week 7

Course content

  • SVM hypothesis
    \min_{\theta} C \sum_{i=1}^{m}[y^{(i)} cost_1(\theta^Tx^{(i)}) + (1-y^{(i)})cost_0(\theta^Tx^{(i)})] + \frac{1}{2} \sum_{i=1}^{n} \theta^2_j
    Hypothesis:
    h_{\theta}(x)=1 if \theta^Tx \geq 0
    h_{\theta}(x)=0 otherwise
  • SVM Decision Boundary
    \min_{\theta} C*0 +\frac{1}{2} \sum_{i=1}^{n} \theta^2_j
    s.t. \theta^Tx^{(i)} \geq 1 if y^{(i)}=1
    \theta^Tx^{(i)} \leq -1 if y^{(i)}=0
  • Large margin intuition
    C is \frac{1}{\lambda}
  • Kernels and similarity:
    Gaussian kernel function:
    f_1=similarity(x,l^{(1)})=exp(-\frac{||x-l^{(1)}||^2}{2\sigma^2})=exp(-\frac{\sum_{j=1}^{n}(x_j - l_j^{(1)})^2}{2\sigma^2})
    If x \approx l^{(1)}:
    f_1 \approx exp(-\frac{0^2}{2\sigma^2}) \approx 1
    If x is far from l^{(1)}:
    f_1 = exp(-\frac{(large number)^2}{2 \sigma^2}) \approx 0
  • SVM with Kernels
  • SVM parameters
    Large C(=\frac{1}{\lambda}): lower bias, high variance
    Small C(=\frac{1}{\lambda}): higher bias, low variance
    Large \sigma^2: Features f_i vary more smoothly, higher bias, lower variance
    Small \sigma^2: Feature f_i vary less smoothly, lower bias, higher variance
  • Use SVM software package (e.g., liblinear, libsvm)
    Need to specify:
    • Choice of parameter C
    • choice of kernel (similarity function)
  • Kernel (similarity) function
    Do perform feature scaling before using the Gaussian kernel
  • Many off-the-shelf kernels available
    • Polynomial kernel
    • String kernel
  • n=number of features, m=number of training examples
    If n is large, use logistic regression or SVM without a kernel ("linear kernel")
    If n is small, m is intermediate, use SVM with Gaussian kernel
    If n is small, m is large, create/add more features, then use logistic regression or SVM without a kernel.
    Neural network likely to work well for most of these settings, but may be slower to train.

Programming Assignment: Support Vector Machines

  1. Part 1: using a Gaussian kernel with SVMs, 2D datasets
    The C parameter is a positive value that controls the penalty for misclassified training examples.
    Implementation Note:Most SVM software packages (including svmTrain.m) automatically add the extra feature x_0=1 for you and automatically take care of learning the intercept term \theta_0. So when passing your training data to the SVM software, there is no need to add this extra feature x_0=1 yourself.
    In the last section of first part, Optimal C and \sigma parameter is found by using the cross validation set.
    Recall that for classification, the error is defined as the fraction of the cross validation examples that were classified incorrectly. In MATLAB, you can compute this error using mean(double(predictions ~= yval)), where predictions is a vector containing all the predictions from the SVM, and yval are the true labels from the cross validation set.
  2. Part 2: spam classification

Week 8

Course content

  • Unsupervised learning
    While supervised learning algorithms need labeled examples (x,y), unsupervised learning algorithms need only the input (x).

  • K-means algorithm
    input:

    • K (number of clusters)
    • Training set \{x^{(1)}, x^{(2)}, ..., x^{(m)}\}
      c^{(i)} := index (from 1 to K) of cluster centroid closest to x^{(i)}
      \mu_k := average (mean) of points assigned to cluster k
  • Optimization objective
    J(c^{(1)}, ..., c^{(m)}, \mu_1, ..., \mu_K) = \frac{1}{m} \sum_{i=1}^m ||x^{(i)}-\mu_{c^{(i)}}||^2 \\ \min_{c^{(1)}, ..., c^{(m)}, \mu_1, ..., \mu_K} J(c^{(1)}, ..., c^{(m)}, \mu_1, ..., \mu_K)

  • K-means algorithm
    Randomly initialize K cluster centroids \mu_1, \mu_2, ..., \mu_K \in R^n
    Repeat{
    for i=1 to m
    c^{(i)} := index (from 1 to K) of cluster centroid closest to x^{(i)}.
    for k=1 to K
    \mu_k := average (mean) of points assigned to cluster k.
    }

  • Random initialization
    Should have K<m m is the number of samples
    Randomly pick K training examples
    Set \mu_1, ..., \mu_K equal to these K examples.
    Try multiple random initialization to avoid that k-mean is stuck in local optimum.

  • Choosing the number of clusters
    Elbow method
    Evaluate K-means based on a metric for how well it performs for that later purpose.

  • Dimensionality reduction

    1. Data compression (Maybe for batteries, this is not necessary, since there is lower dimensional data)
    2. Visualization
      After dimensionality reduction, k=2 or 3 since we can plot 2D or 3D data but don't have ways to visualize higher dimensional data.
  • Principal Component Analysis (PCA) problem formulation
    Reduce from n-dimension to k-dimension: Find k vectors (u^{(1)}, u^{(2)}, ..., u^{(k)} \in R^n) onto which to project the data so as to minimize the projection error.
    PCA is not linear regression

  • Principal Component Analysis algorithm
    Reduce data from n-dimensions to k-dimensions

    1. Compute "covariance matrix":
      \Sigma = \frac{1}{m} \sum_{i=1}^n (x^{(i)}) (x^{(i)})^T
    2. Compute "eigen vectors" of matrix \Sigma
      [U, S, V] = svd(Sigma);\\ Ureduce = U(:, 1:k);\\ z = Ureduce'*x;
      svd - singular value decomposition

    Obtaining z_j=(u^{(j)})^Tx

  • Reconstruction from compressed representation

  • Choosing the number of principal components k
    Typically, choose k to be smallest value so that
    \frac{\frac{1}{m} \sum_{i=1}^m || x^{(i)} - x_{approx}^{(i)} ||^2}{\frac{1}{m} \sum_{i=1}^{m} || x^{(i)} ||^2} \leq 0.01
    or
    \frac{\sum_{i=1}^{k} S_{ii}} {\sum_{i=1}^{m} S_{ii}} \geq 0.99
    "99% of variance is retained"

  • Advice for applying PCA

    1. Supervised learning speedup
      Bad use of PCA: To prevent overfitting. Use regularization instead
  • Before implementing PCA, first try running whatever you want to do with the original/raw data x^{(i)}. Only if that doesn't do what you want, then implement PCA and consider using z^{(i)}.

Programming Assignment: K-Means clustering and principal component analysis

  1. K-Means clustering
    K-means algorithm is as follows:
% Initialize centroids
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
    % Cluster assignment step: Assign each data point to the
    % closest centroid. idx(i) corresponds to c^(i), the index
    % of the centroid assigned to example i
    idx = findClosestCentroids(X, centroids);

    % Move centroid step: Compute means based on centroid
    % assignments
    centroids = computeMeans(X, idx, K);
end

1.1 Implementing K-Means
1.1.1 finding closest centroids
1.1.2 Computing centroid means


Week 9

Course content

  • Anomaly detection
    Check which have p(x) < \epsilon
  • Density estimation
    Training set: \{x^{(1)}, ..., x^{(m)} \}
    Each example is x \in R^n
    \begin{align} p(x) &= p(x_1;\mu_1, \sigma_1^2)p(x_2; \mu_2, \sigma_2^2)...p(x_n;\mu_n, \sigma_n^2)\\ &=\prod_{j=1}^n p(x_j;\mu_j, \sigma_j^2) \end{align}
  • Anomaly detection algorithm
  1. Choose features x_i={x_i^{(1)}, ..., x_i^{(m)}} that you think might be indicative of anomalous examples, each example is x_i \in R^n.
  2. Fit parameters \mu_1, ..., \mu_n, \sigma_1^2, ..., \sigma_n^2
    \begin{align} \mu_j = \frac{1}{m} \sum_{i=1}^m x_j^{(i)};\\ \sigma_j^2 = \frac{1}{m} \sum_{i=1}^m (x_j^{(i)}-\mu_j)^2 \end{align}
  3. Given new example x, compute p(x):
    \begin{align} p(x) &= \prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2)\\ &= \prod_{j=1}^n \frac{1}{\sqrt {2\pi}\sigma_j} \exp (- \frac{(x_j-\mu_j)^2}{2\sigma_j^2}) \end{align}
    Anomaly if p(x) < \epsilon
  • Developing and evaluating an anomaly detection system.


    Comparison of anomaly detection and supervised learning
  • For non-gaussian features, some transformation on original data is needed in order to fit gaussian to it:
  1. log(x)
  2. log(x+c)
  3. \sqrt{x}
  4. x^{(1/3)}
  • Choosing features that might take on unusually large or small values in the event of an anomaly
  • Multivariate Gaussian (Normal) distribution
    x \in R^n, don't model p(x_1), p(x_2), ... etc. separately. Model p(x) all in one go.
    Parameters: \mu \in R^n, \Sigma \in R^{(n \times n)}(covariance matrix)
Model expression Original model Multivariate Gaussian
expression p(x_1;\mu_1, \sigma_1^2 \times ... \times p(x_n; \mu_n, \sigma_n^2)) p(x; \mu, \Sigma)
Correlation Manually create features to capture anomalies where x_1, x_2 take unusual combinations of values Automatically captures correlations between features
Computation power cheaper expensive
Application OK even if m(training set size) is small must have m>n or else \Sigma is non-invertible

Note: If \Sigma is non-invertible, check 1. m>n, 2. if there are redundant features.

  • Recommender systems
  • Collaborative filtering
  • vectorization: low rank matrix factorization
  • Mean normalization

This week's content is so similar to sensor fusion course.

Programming Assignment: Anomaly detection and recommender system

  1. Anomaly detection
    1.1 Gaussian distribution
    1.2 Estimating parameters for a Gaussian
    1.3 Selecting the threshold, \epsilon
    1.4 High dimensional dataset
  2. Recommender systems
    2.1 Collaborative filtering cost
    2.2 Collaborative filtering gradient
    2.3 Regularized cost
    2.4 Regularized gradient

Week 10

Course content

  • Learning with large datasets
    "It is not who has the best algorithm that wins, it is who has the most data."
  • Stochastic gradient descent
    1. Randomly shuffle dataset
    2. Repeat{
      for i := 1, ..., m{
      \theta_j := \theta_j - \alpha(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}
      (for every j=0,...,n)
      }
      }
      Note:
  1. When the training set size m is very large, stochastic gradient descent can be much faster than gradient descent.
  2. The cost function should go down with every literation of batch gradient descent but not necessarily with stochastic gradient descent.
  • Mini-batch gradient descent
    Batch gradient descent: Use all m examples in each iteration.
    Stochastic gradient descent: Use 1 example in each iteration.
    Mini-batch gradient descent: Use b (\leq m) examples in each iteration.
  • Checking for convergence
    Plot cost(\theta, (x^{(i)}, y^{(i)})), averaged over the last 1000 examples. If curve is too noisy, try to increase number of examples or reducing learning rate \alpha.
    Learning rate \alpha is typically held constant can slowly decrease \alpha over time if we want \theta to converge. (E.g., \alpha = \frac{const1}{iterationNumber + const2})
  • Online learning for large scale data
    Learn p(y=1|x;\theta), which is called likelihood in sensor fusion course. x stands for features here.
  • Map reduce and data parallelism
    Map-reduce

    Multi-core machines

    This week may be used in my project later in the future when we have gigabytes data available from cloud.

Quiz

  • As for linear regression with stochastic gradient descent, there is no need to apply map reduce techniques to them again, since computational power has already been decreased.
  • suppose that you are training a logistic regression classifier using stochastic gradient descent. It is found that the cost(cost(\theta, (x^{(i)}, y^{(i)})), averaged over the last 500 examples), plotted as a function of the number of iterations, is slowly increasing over time. Then you should try decrease learning rate \alpha
  • Before running stochastic gradient descent, you should randomly shuffle the training set.

Week 11

Course content

  • Photo (Optical character recognition) OCR pipeline
    1. Text detection
    2. Character segmentation
    3. Character recognition
  • Sliding windows for text detection
  • Make sure you have a low bias classifier before getting more data.
  • Estimating the errors due to each component (ceiling analysis)
    What part of the pipeline should you spend the most time trying to improve?


    Ceiling analysis example

Questions arise from this course:

  1. Why don't go directly to deep learning methods instead of machine learning.
    Because feature matters more especially for batteries with less amount of data!
  2. Possible research ideas inspired by this course.
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 204,732评论 6 478
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 87,496评论 2 381
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 151,264评论 0 338
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 54,807评论 1 277
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 63,806评论 5 368
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 48,675评论 1 281
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 38,029评论 3 399
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,683评论 0 258
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 41,704评论 1 299
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,666评论 2 321
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 37,773评论 1 332
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,413评论 4 321
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 39,016评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 29,978评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,204评论 1 260
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 45,083评论 2 350
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,503评论 2 343

推荐阅读更多精彩内容