Course syllabus
Week 1:
- Introduction
- Model and Cost Function
- Gradient Descent
- Linear regression with one variable
- Linear algebra review
Week 2:
- Linear regression with multiple variables
- Octave/Matlab Tutorial
Week 3:
- Classification and Hypothesis Representation
- Multiclass Classification
- Logistic regression
- Regularization
Week 4:
- Neural networks: representation
Week 5:
- Neural networks: learning
- Cost Function and Backpropagation
- Gradient Checking and Random Initialization
Week 6:
- Advice for applying machine learning
- Evaluating a learning algorithm
- Bias vs. Variance
- Handling Skewed Data
Week 7:
- Support vector machines
- Large Margin Classification
- Kernels
Week 8:
- Unsupervised learning
- K-Means
- Dimensionality reduction
Week 9:
- Density Estimation
- Anomaly detection
- Recommender systems
Week 10:
- Gradient Descent with Large Datasets
Week 11:
- Application example: photo OCR
Software
- Octave/MATLAB
Week 1
Course content
- Machine learning definition: "the field of study that gives computers the ability to learn without being explicitly programmed."
- Machine learning algorithms: supervised learning, unsupervised learning, reinforcement learning, recommender systems.
- Supervised learning (e.g., regression, classification, naive Bayes, decision trees, random forests, neural networks, support vector machines)
- Unsupervised learning (e.g., clustering)
- Model representation
m = number of training examples
x = "input" variable / features
y = "output" variable / "target" variable
h = hypothesis, a function that maps from x's to y's
- Cost function (squared error):
  J(θ_0, θ_1) = (1/(2m)) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
  (Figures: J is bowl-shaped, shown as a 2D contour plot and a 3D surface plot.)
- Gradient descent algorithm
Repeat until convergence:
  θ_j := θ_j − α · ∂/∂θ_j J(θ_0, θ_1)   (simultaneously update j = 0 and j = 1)
α is called the learning rate. A smaller α results in a smaller step and a larger α results in a larger step. The direction of the step is determined by the partial derivative of J(θ_0, θ_1). Depending on where one starts on the graph, one could end up at different points (local optima).
- If α is too small, gradient descent can be slow, while if α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
- As we approach a local minimum, gradient descent automatically takes smaller steps, so there is no need to decrease α over time.
- Gradient descent for linear regression (repeat until convergence):
  θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
  θ_1 := θ_1 − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x^(i)
- "Batch": Each step of gradient descent uses all the training examples.
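A minimal Octave sketch of this batch update for one-variable linear regression, assuming a design matrix X = [ones(m,1) x] and a column vector y (function and variable names are illustrative, not from the course code):

% Batch gradient descent for one-variable linear regression (sketch).
% X is m x 2 (first column all ones), y is m x 1, theta is 2 x 1.
function theta = gradientDescentSketch(X, y, theta, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    h = X * theta;                                   % predictions for all examples
    theta = theta - (alpha / m) * (X' * (h - y));    % simultaneous parameter update
  end
end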
- A vector is an n × 1 matrix.
- Matrices are usually denoted by uppercase names while vectors are lowercase.
- "Scalar" means that an object is a single value, not a vector or matrix.
Week 2
Course content
- Multiple features (notation)
  x_j^(i) = value of feature j in the i-th training example
  x^(i) = the input (features) of the i-th training example
  m = the number of training examples
  n = the number of features
- Hypothesis (prediction): h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + … + θ_n x_n = θᵀx  (vectorized, with x_0 = 1)
- Cost function: J(θ) = (1/(2m)) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
- Gradient descent
  Repeat until convergence:
    θ_j := θ_j − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i)   (simultaneously update for j = 0, …, n)
  Vectorized implementation: θ := θ − (α/m) Xᵀ(Xθ − y)
- Feature scaling: get every feature into approximately a −1 ≤ x_i ≤ 1 range.
  Feature scaling speeds up gradient descent by making it require fewer iterations to reach a good solution.
- Mean normalization: replace x_i with (x_i − μ_i)/s_i so that features have approximately zero mean (do not apply to x_0 = 1), where
  μ_i = average value of x_i in the training set
  s_i = range (max − min) or standard deviation of x_i
- Normal equation: θ = (XᵀX)⁻¹ Xᵀy
- Comparison of gradient descent and the normal equation:

  | Gradient descent | Normal equation |
  |---|---|
  | Need to choose α | No need to choose α |
  | Needs many iterations | No need to iterate |
  | Works well even when n (features) is large | Slow if n (features) is very large |
- What if XᵀX is non-invertible?
  - Redundant features (linearly dependent)
  - Too many features (e.g., m ≤ n): delete some features, or use regularization.
- Unvectorized implementation of the hypothesis: h_θ(x) = Σ_{j=0}^{n} θ_j x_j (a for-loop over j)
- Vectorized implementation: h_θ(x) = θᵀx (see the Octave sketch below)
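A minimal Octave sketch contrasting the two forms, assuming theta and x are (n+1)-dimensional column vectors and X is the m × (n+1) design matrix (names are illustrative):

% Unvectorized hypothesis: loop over the n+1 parameters.
prediction = 0;
for j = 1:n + 1
  prediction = prediction + theta(j) * x(j);
end

% Vectorized hypothesis, for one example and for the whole training set.
prediction = theta' * x;          % scalar, x is an (n+1) x 1 column vector
predictions = X * theta;          % m x 1 vector of predictions

% Normal equation (pinv is robust even when X'X is non-invertible).
theta = pinv(X' * X) * X' * y;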
Programming Assignment: Linear regression
Week 3
Course content
- Classification (logistic regression)
- Hypothesis representation: h_θ(x) = g(θᵀx), where g is the logistic (sigmoid) function g(z) = 1 / (1 + e^(−z))
- Decision boundary
  Predict y = 1 if h_θ(x) ≥ 0.5, i.e. θᵀx ≥ 0;
  Predict y = 0 if h_θ(x) < 0.5, i.e. θᵀx < 0.
- Cost function
  A convex problem has one global optimum, while a non-convex problem has multiple local optima and it is hard to tell which one is global.
  Cost(h_θ(x), y) = −log(h_θ(x))      if y = 1
  Cost(h_θ(x), y) = −log(1 − h_θ(x))  if y = 0
  When y = 1, the plot of Cost vs. h_θ(x) goes to 0 as h_θ(x) → 1 and to ∞ as h_θ(x) → 0.
  When y = 0, the plot of Cost vs. h_θ(x) goes to 0 as h_θ(x) → 0 and to ∞ as h_θ(x) → 1.
  So Cost = 0 if y = 1 and h_θ(x) = 1, while Cost → ∞ if y = 1 and h_θ(x) → 0 (and symmetrically for y = 0).
  Note that writing the cost function in this way guarantees that J(θ) is convex for logistic regression.
- Simplified logistic cost function:
  J(θ) = −(1/m) Σ_{i=1}^{m} [ y^(i) log(h_θ(x^(i))) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]
  To fit the parameters θ: min_θ J(θ).
  To make a prediction given a new x: output h_θ(x) = 1 / (1 + e^(−θᵀx)).
  Gradient descent: θ_j := θ_j − (α/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i)
  Vectorized implementation: θ := θ − (α/m) Xᵀ(g(Xθ) − y)
- Advanced optimization algorithms:
- Gradient descent
- Conjugate gradient
- BFGS
- L-BFGS
With algorithms 2, 3 and 4, the learning rate α does not need to be picked manually, and they are often faster than gradient descent, but they are more complex.
function [J_value, gradient] = costFunction(theta)
  J_value = [code to compute the cost J(theta)];
  gradient(1) = [code to compute the partial derivative of J with respect to theta_0];
  ...
  gradient(n+1) = [code to compute the partial derivative of J with respect to theta_n];
end
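A minimal usage sketch pairing this template with Octave's fminunc (initialTheta and the option values are illustrative):

% Hand the cost/gradient function to a built-in optimizer.
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(n + 1, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);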
- Multiclass classification
one-vs-all (one-vs-rest):
We are basically choosing one class and lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returns the highest value as our prediction.
- Overfitting
- underfitting, high bias
- just right
- overfitting, high variance
Definition of overfitting: if we have too many features, the learned hypothesis may fit the training set very well (J(θ) ≈ 0) but fail to generalize to new examples.
Note: if we have a lot of features and very little training data, overfitting can become a problem.
- How to address overfitting:
  - Reduce the number of features
    - Manually select which features to keep
    - Model selection algorithm
  - Regularization
    - Keep all the features, but reduce the magnitude/values of the parameters θ_j
    - Works well when we have a lot of features, each of which contributes a bit to predicting y.
- In regularized linear regression, we choose θ to minimize:
  J(θ) = (1/(2m)) [ Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))² + λ Σ_{j=1}^{n} θ_j² ]
  If λ is set to an extremely large value, the algorithm results in underfitting (it fails to fit even the training set).
- Regularized linear regression
  Gradient descent (repeat):
    θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_0^(i)
    θ_j := θ_j (1 − α λ/m) − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i)   (j = 1, …, n)
  where (1 − α λ/m) will always be less than 1.
  Normal equation: θ = (XᵀX + λL)⁻¹ Xᵀy
  where L is a matrix with 0 at the top left, 1's down the rest of the diagonal and 0's everywhere else; it has dimension (n+1) × (n+1).
  Recall that if m ≤ n, then XᵀX is non-invertible. However, when we add the term λL, XᵀX + λL becomes invertible. (See the sketch below.)
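A minimal Octave sketch of the regularized normal equation above (X, y and lambda assumed given; the number of columns of X is n+1):

% Regularized normal equation: theta = (X'X + lambda*L)^(-1) X'y
L = eye(size(X, 2));   % (n+1) x (n+1) identity to start from
L(1, 1) = 0;           % do not regularize the intercept term theta_0
theta = pinv(X' * X + lambda * L) * X' * y;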
- Regularized logistic regression: gradient descent (repeat until convergence)
  θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_0^(i)
  θ_j := θ_j − α [ (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i) + (λ/m) θ_j ]   (simultaneously update for j = 1, …, n)
  where h_θ(x) = 1 / (1 + e^(−θᵀx)).
Programming Assignment: Logistic regression
- Feature mapping
  One way to fit the data better is to create more features from each data point. In the provided function mapFeature.m, we map the features into all polynomial terms of x_1 and x_2 up to the sixth power.
- Different regularization parameters λ
  - With a small λ, the classifier gets almost every training example correct, but draws a very complicated boundary, thus overfitting the data.
  - If λ is set too high, you will not get a good fit and the decision boundary will not follow the data well, thus underfitting the data.
Week 4
Course content
- Suppose you are learning to recognize cars from 2×2 pixel images (grayscale, not RGB), and let the features be the pixel intensity values. If you train logistic regression including all the quadratic terms (x_i · x_j) as features, you will have 10 features. With more pixels the number of such features quickly grows into the millions, which is why a neural network is needed for this kind of problem.
- Model representation
  Neuron model: a logistic unit with the sigmoid (logistic) activation function g(z) = 1 / (1 + e^(−z)).
  "Weights" means the parameters θ of the model.
  a_i^(j) = "activation" of unit i in layer j
  Θ^(j) = matrix of weights controlling the function mapping from layer j to layer j+1
  If the network has s_j units in layer j and s_{j+1} units in layer j+1, then Θ^(j) will be of dimension s_{j+1} × (s_j + 1).
- Forward propagation, vectorized implementation (see the sketch below)
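A minimal Octave sketch of vectorized forward propagation for a network with one hidden layer, assuming Theta1 and Theta2 are the weight matrices and x is one example as a column vector (names are illustrative):

% Vectorized forward propagation for a three-layer network.
a1 = [1; x];                       % input activations plus the bias unit
z2 = Theta1 * a1;
a2 = [1; 1 ./ (1 + exp(-z2))];     % hidden-layer activations plus the bias unit
z3 = Theta2 * a2;
h  = 1 ./ (1 + exp(-z3));          % output h_Theta(x)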
- Examples and intuitions
  A simple example of applying a neural network is predicting x_1 AND x_2. The XNOR operator can be built using a hidden layer with two nodes.
- Multiclass classification
Programming Assignment: Multi-class Classification and Neural Networks
- You will be using multiple one-vs-all logistic regression models to build a multi-class classifier. Since there are 10 classes, you will need to train 10 separate logistic regression classifiers.
- Regularized logistic regression
  Cost function:
  J(θ) = −(1/m) Σ_{i=1}^{m} [ y^(i) log(h_θ(x^(i))) + (1 − y^(i)) log(1 − h_θ(x^(i))) ] + (λ/(2m)) Σ_{j=1}^{n} θ_j²
  Partial derivatives of the regularized logistic regression cost for θ_j:
  ∂J/∂θ_0 = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_0^(i)
  ∂J/∂θ_j = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i) + (λ/m) θ_j   (for j ≥ 1)
  For the vectorized implementation, see the Matlab code uploaded to GitHub.
- One-vs-all classification
In this part of the exercise, you will implement one-vs-all classification by training multiple regularized logistic regression classifiers, one for each of the classes in our dataset. - One-vs-all prediction
*Logistic regression cannot form more complex hypotheses, as it is only a linear classifier. (You could add more features such as polynomial features to logistic regression, but that can be very expensive to train.) A neural network can represent complex models that form non-linear hypotheses.* - Your goal is to implement the feedforward propagation algorithm to use our trained weights for prediction. In next week's exercise, you will write the backpropagation algorithm for learning the neural network parameters.
Week 5
Course content
- Neural network (classification) notation
  L = total number of layers in the network
  s_l = number of units (not counting the bias unit) in layer l
  K = number of output units/classes
- Cost function of neural networks:
  J(Θ) = −(1/m) Σ_{i=1}^{m} Σ_{k=1}^{K} [ y_k^(i) log((h_Θ(x^(i)))_k) + (1 − y_k^(i)) log(1 − (h_Θ(x^(i)))_k) ] + (λ/(2m)) Σ_{l=1}^{L−1} Σ_{i=1}^{s_l} Σ_{j=1}^{s_{l+1}} (Θ_{ji}^{(l)})²
- Backpropagation: the algorithm used to minimize the cost function, i.e. to compute the partial derivatives ∂J(Θ)/∂Θ_{ij}^{(l)}.
  δ^(L) = a^(L) − y; for earlier layers, δ^(l) = (Θ^(l))ᵀ δ^(l+1) .* g'(z^(l)); accumulating Δ^(l) := Δ^(l) + δ^(l+1) (a^(l))ᵀ over the training examples gives the gradients D^(l).
  (Figure: backpropagation algorithm.)
- Unrolling parameters
If the dimensions of Theta1 are 10×11, Theta2 is 10×11 and Theta3 is 1×11, then we "unroll" all the elements and put them into one long vector:
thetaVector = [Theta1(:); Theta2(:); Theta3(:)];
deltaVector = [D1(:); D2(:); D3(:)];
We can also get back our original matrices from the "unrolled" versions as follows:
Theta1 = reshape(thetaVector(1:110), 10, 11)
Theta2 = reshape(thetaVector(111:220), 10, 11)
Theta3 = reshape(thetaVector(221:231), 1, 11)
- Gradient checking
A small value for ε (epsilon), such as ε = 10⁻⁴, guarantees that the math works out properly.
epsilon = 1e-4;
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + epsilon;
  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end;
Once we have computed the gradApprox vector, we can check that gradApprox ≈ deltaVector.
- Random initialization (symmetry breaking)
Initialize each Θ_{ij}^{(l)} to a random value in [−ε_init, ε_init].
If the dimensions of Theta1 are 10×11, Theta2 is 10×11 and Theta3 is 1×11:
Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
- Training a neural network
- Randomly initialize the weights
- Implement forward propagation to get h_Θ(x^(i)) for any x^(i)
- Implement the cost function
- Implement backpropagation to compute partial derivatives
- Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
- Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.
Programming Assignment: Neural Networks Learning
- Feedforward and regularized cost function
- Backpropagation with random initialization
- Adding regularization to gradient
- Gradient checking
- Learning a good set of parameters (Theta1, Theta2...) using fmincg()
- Optional exercise: To see how the performance of the neural network varies with the regularization parameter and number of training steps (the MaxIter option when using fmincg)
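A minimal sketch of the fmincg call used in the assignment, assuming nnCostFunction returns [J, grad] for an unrolled parameter vector with the assignment's argument list (MaxIter and the layer-size names are illustrative):

% Minimize the neural-network cost with the course-provided fmincg,
% working on the unrolled parameter vector.
options = optimset('MaxIter', 50);
costFunc = @(p) nnCostFunction(p, input_layer_size, hidden_layer_size, ...
                               num_labels, X, y, lambda);
[nn_params, cost] = fmincg(costFunc, initial_nn_params, options);

% Reshape the learned parameters back into Theta1 and Theta2.
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + hidden_layer_size * (input_layer_size + 1)):end), ...
                 num_labels, (hidden_layer_size + 1));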
Week 6
Course content
- Debugging a learning algorithm
- Get more training examples
- Try smaller sets of features
- Try getting additional features
- Try adding polynomial features (x_1², x_2², x_1x_2, etc.)
- Try decreasing λ
- Try increasing λ
- Typical split of dataset into training set and test set is 70% and 30% respectively.
- The procedure using a training set and a test set:
  - Learn parameters θ by minimizing J_train(θ) using the training set
  - Compute the test set error J_test(θ)
- The test set error:
  - For linear regression: the mean squared error
    J_test(θ) = (1/(2 m_test)) Σ_{i=1}^{m_test} (h_θ(x_test^(i)) − y_test^(i))²
  - For classification: the 0/1 misclassification error
- Model selection (e.g., degree of polynomial d)
- Train / (cross) validation / test sets (typically 60% / 20% / 20%)
- We can calculate three separate error values for the three different sets using the following method:
  - Optimize the parameters Θ using the training set for each polynomial degree.
  - Find the polynomial degree d with the least error on the cross validation set.
  - Estimate the generalization error using the test set with J_test(Θ^(d)), where d is the degree with the lowest cross validation error.
- High bias problem (underfitting) vs. high variance problem (overfitting)
  (Figure: the relationship between the degree of the polynomial d and the underfitting or overfitting of the hypothesis.)
- Choosing the regularization parameter λ
  In order to choose the model and the regularization term λ, we need to:
- Create a list of lambdas (i.e. λ∈{0,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10.24});
- Create a set of models with different degrees or any other variants.
- Iterate through the λ's, and for each λ go through all the models to learn some Θ.
- Compute the cross validation error J_CV(Θ) using the learned Θ (computed with λ), evaluated without regularization (λ = 0).
- Select the best combination that produces the lowest error on the cross validation set.
- Using the best combination of Θ and λ, compute J_test(Θ) to see whether it generalizes well. (See the sketch after this list.)
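A minimal Octave sketch of the λ-selection loop above, assuming the assignment's trainLinearReg and linearRegCostFunction helpers (train with regularization, evaluate errors with lambda set to 0); the lambda list mirrors the one above:

lambda_vec = [0 0.01 0.02 0.04 0.08 0.16 0.32 0.64 1.28 2.56 5.12 10.24]';
error_train = zeros(length(lambda_vec), 1);
error_val   = zeros(length(lambda_vec), 1);

for i = 1:length(lambda_vec)
  lambda = lambda_vec(i);
  theta = trainLinearReg(X, y, lambda);                     % fit with regularization
  error_train(i) = linearRegCostFunction(X, y, theta, 0);   % evaluate with lambda = 0
  error_val(i)   = linearRegCostFunction(Xval, yval, theta, 0);
end

[~, best] = min(error_val);          % pick the lambda with the lowest CV error
best_lambda = lambda_vec(best);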
- Learning curves
  - If a learning algorithm is suffering from high bias (underfitting), adding polynomial features or obtaining additional features will help, but getting more training data will not (by itself) help much.
    (Figure: typical learning curve for high bias.)
  - If a learning algorithm is suffering from high variance (overfitting), reducing the feature set, increasing the regularization parameter, or getting more training data is likely to help.
    (Figure: typical learning curve for high variance.)
- A "small" neural network is prone to underfitting, while a "large" neural network is more prone to overfitting (use regularization, e.g., increase λ, to address the overfitting).
- How to improve the accuracy of the classifier:
- Collect lots of data (for example the "honeypot" project, though more data does not always help)
- Develop sophisticated features (for example: using email header data in spam emails)
- Develop algorithms to process your input in different ways (recognizing misspellings in spam).
- Error analysis
- The recommended approach to solving machine learning problems is to:
- Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
- Plot learning curves to decide if more data, more features, etc. are likely to help.
- Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.
- Error metrics for skewed classes
  - Precision/Recall
    Precision = true positives / (true positives + false positives)
    Recall = true positives / (true positives + false negatives)
  - Trading off precision and recall: predict y = 1 only if h_θ(x) ≥ threshold; a higher threshold gives higher precision and lower recall, a lower threshold gives higher recall and lower precision.
  - F score (F1 score): F = 2PR / (P + R)  (see the sketch below)
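A minimal Octave sketch of these metrics, assuming predictions and yval are 0/1 column vectors (names are illustrative; roughly what the error_metrics.m suggested further below could contain):

% Precision, recall and F1 score for 0/1 labels.
tp = sum((predictions == 1) & (yval == 1));   % true positives
fp = sum((predictions == 1) & (yval == 0));   % false positives
fn = sum((predictions == 0) & (yval == 1));   % false negatives

precision = tp / (tp + fp);
recall    = tp / (tp + fn);
F1        = 2 * precision * recall / (precision + recall);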
Programming Assignment: Regularized linear regression and bias vs. variance
- Regularized linear regression
- Visualizing the dataset
- Regularized linear regression cost function
- Regularized linear regression gradient
- Fitting linear regression
- Bias-variance
- Learning curve
- Polynomial regression
- Learning polynomial regression
- Selecting lambda using a cross validation set
- Large data rationale
  - Use a learning algorithm with many parameters (e.g., logistic/linear regression with many features; a neural network with many hidden units): a low-bias algorithm.
  - Use a very large training set (unlikely to overfit): low variance.
  A combination of both will hopefully deliver a small J_test(θ).
Remember to normalize polynomial features before using them.
If you use the cost function (linearRegCostFunction) to compute the training, cross validation and test errors, call it with the lambda argument set to 0. Note that you still need to use lambda when running the training to obtain the theta parameters.
Add an error_metrics.m function to calculate error metrics.
Week 7
Course content
- SVM hypothesis
  min_θ  C Σ_{i=1}^{m} [ y^(i) cost_1(θᵀx^(i)) + (1 − y^(i)) cost_0(θᵀx^(i)) ] + (1/2) Σ_{j=1}^{n} θ_j²
  Hypothesis: h_θ(x) = 1 if θᵀx ≥ 0, and 0 otherwise.
- SVM decision boundary
  When C is very large, the first term is driven to (approximately) zero, so the problem becomes:
  min_θ (1/2) Σ_{j=1}^{n} θ_j²
  s.t. θᵀx^(i) ≥ 1 if y^(i) = 1
       θᵀx^(i) ≤ −1 if y^(i) = 0
- Large margin intuition
  C plays a role similar to 1/λ.
- Kernels and similarity:
Gaussian kernel function: f_i = similarity(x, l^(i)) = exp(−‖x − l^(i)‖² / (2σ²))
If x ≈ l^(i): f_i ≈ 1.
If x is far from l^(i): f_i ≈ 0.
(See the Octave sketch below.)
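A minimal Octave sketch of this similarity function, assuming x1 and x2 are feature vectors (essentially what gaussianKernel.m computes in the assignment):

function sim = gaussianKernel(x1, x2, sigma)
  % Gaussian (RBF) similarity between two feature vectors.
  diff = x1(:) - x2(:);
  sim = exp(-(diff' * diff) / (2 * sigma^2));
end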
- SVM with kernels
  Given the training examples as landmarks l^(1), …, l^(m), compute features f_i = similarity(x, l^(i)) and predict y = 1 if θᵀf ≥ 0.
- SVM parameters
  Large C (= 1/λ, so small λ): lower bias, higher variance
  Small C (large λ): higher bias, lower variance
  Large σ²: features f_i vary more smoothly; higher bias, lower variance
  Small σ²: features f_i vary less smoothly; lower bias, higher variance
- Use an SVM software package (e.g., liblinear, libsvm)
  Need to specify:
  - choice of parameter C
  - choice of kernel (similarity function)
- Kernel (similarity) functions
  Do perform feature scaling before using the Gaussian kernel.
- Many off-the-shelf kernels are available:
- Polynomial kernel
- String kernel
- Choosing logistic regression vs. SVM (n = number of features, m = number of training examples):
  If n is large (relative to m), use logistic regression or an SVM without a kernel ("linear kernel").
  If n is small and m is intermediate, use an SVM with the Gaussian kernel.
  If n is small and m is large, create/add more features, then use logistic regression or an SVM without a kernel.
  A neural network is likely to work well in most of these settings, but may be slower to train.
Programming Assignment: Support Vector Machines
- Part 1: using a Gaussian kernel with SVMs, 2D datasets
The C parameter is a positive value that controls the penalty for misclassified training examples.
Implementation note: most SVM software packages (including svmTrain.m) automatically add the extra feature x_0 = 1 for you and take care of learning the intercept term θ_0. So when passing your training data to the SVM software, there is no need to add this x_0 = 1 feature yourself.
In the last section of the first part, the optimal C and σ parameters are found using the cross validation set.
Recall that for classification, the error is defined as the fraction of the cross validation examples that were classified incorrectly. In MATLAB, you can compute this error using mean(double(predictions ~= yval)), where predictions is a vector containing all the predictions from the SVM, and yval are the true labels from the cross validation set.
- Part 2: spam classification
Week 8
Course content
Unsupervised learning
While supervised learning algorithms need labeled examples (x, y), unsupervised learning algorithms need only the inputs (x).
- K-means algorithm
  Input:
  - K (number of clusters)
  - Training set {x^(1), x^(2), …, x^(m)}
  Notation:
  c^(i) := index (from 1 to K) of the cluster centroid closest to x^(i)
  μ_k := average (mean) of the points assigned to cluster k
  Optimization objective (distortion): J(c^(1),…,c^(m), μ_1,…,μ_K) = (1/m) Σ_{i=1}^{m} ‖x^(i) − μ_{c^(i)}‖²
  K-means algorithm:
  Randomly initialize K cluster centroids μ_1, …, μ_K
  Repeat {
    for i = 1 to m
      c^(i) := index (from 1 to K) of the cluster centroid closest to x^(i)
    for k = 1 to K
      μ_k := average (mean) of the points assigned to cluster k
  }
  Random initialization
  Should have K < m, where m is the number of samples.
  Randomly pick K training examples and set μ_1, …, μ_K equal to these K examples.
  Try multiple random initializations to avoid K-means getting stuck in a local optimum.
  Choosing the number of clusters:
  - Elbow method
  - Evaluate K-means based on a metric for how well it performs for the later (downstream) purpose.
- Dimensionality reduction
  - Data compression (maybe not necessary for battery data, since the data is already low-dimensional)
  - Visualization
    For visualization, choose k = 2 or 3, since we can plot 2D or 3D data but have no good way to visualize higher-dimensional data.
  Principal Component Analysis (PCA) problem formulation
  Reduce from n dimensions to k dimensions: find k vectors u^(1), …, u^(k) onto which to project the data so as to minimize the projection error.
  PCA is not linear regression.
- Principal Component Analysis algorithm
  Reduce data from n dimensions to k dimensions:
  - Compute the "covariance matrix": Sigma = (1/m) Σ_{i=1}^{m} x^(i) (x^(i))ᵀ = (1/m) XᵀX
  - Compute the "eigenvectors" of Sigma using svd (singular value decomposition): [U, S, V] = svd(Sigma)
  - Obtain the reduced representation: z = U_reduce' x, where U_reduce = U(:, 1:k)
  Reconstruction from the compressed representation: x_approx = U_reduce z
  Choosing the number of principal components k:
  Typically, choose k to be the smallest value so that
  (Σ_{i=1}^{k} S_ii) / (Σ_{i=1}^{n} S_ii) ≥ 0.99,
  i.e. "99% of variance is retained". (See the Octave sketch at the end of this PCA section.)
- Advice for applying PCA
  - Supervised learning speedup
  - Bad use of PCA: to prevent overfitting. Use regularization instead.
Before implementing PCA, first try running whatever you want to do with the original/raw data x^(i). Only if that doesn't do what you want, implement PCA and consider using z^(i).
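A minimal Octave sketch of the PCA steps above, assuming X is the m × n data matrix with features already mean-normalized and k chosen in advance (names are illustrative):

[m, n] = size(X);                % one example per row, features mean-normalized
Sigma = (1 / m) * (X' * X);      % n x n covariance matrix
[U, S, V] = svd(Sigma);          % columns of U are the principal directions

U_reduce = U(:, 1:k);            % keep the first k directions
Z = X * U_reduce;                % m x k projected data
X_approx = Z * U_reduce';        % m x n reconstruction from the compressed data

s = diag(S);                     % singular values
retained = sum(s(1:k)) / sum(s); % fraction of variance retained by k components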
Programming Assignment: K-Means clustering and principal component analysis
- K-Means clustering
K-means algorithm is as follows:
% Initialize centroids
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
% Cluster assignment step: Assign each data point to the
% closest centroid. idx(i) corresponds to c^(i), the index
% of the centroid assigned to example i
idx = findClosestCentroids(X, centroids);
% Move centroid step: Compute means based on centroid
% assignments
centroids = computeMeans(X, idx, K);
end
1.1 Implementing K-Means
1.1.1 Finding closest centroids
1.1.2 Computing centroid means
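Hedged sketches of these two steps, roughly what the assignment's findClosestCentroids.m and computeCentroids.m implement (kept loop-based for clarity):

function idx = findClosestCentroids(X, centroids)
  % For each example, store the index of the nearest centroid.
  idx = zeros(size(X, 1), 1);
  for i = 1:size(X, 1)
    dists = sum(bsxfun(@minus, centroids, X(i, :)).^2, 2);  % squared distances
    [~, idx(i)] = min(dists);
  end
end

function centroids = computeCentroids(X, idx, K)
  % Each centroid becomes the mean of the points assigned to it.
  centroids = zeros(K, size(X, 2));
  for k = 1:K
    centroids(k, :) = mean(X(idx == k, :), 1);
  end
end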
Week 9
Course content
- Anomaly detection
  Check which examples have p(x) < ε.
- Density estimation
  Training set: {x^(1), …, x^(m)}, each example x ∈ ℝⁿ.
  p(x) = p(x_1; μ_1, σ_1²) · p(x_2; μ_2, σ_2²) · … · p(x_n; μ_n, σ_n²) = Π_{j=1}^{n} p(x_j; μ_j, σ_j²)
- Anomaly detection algorithm (see the sketch after this list)
  - Choose features x_j that you think might be indicative of anomalous examples; each example is x ∈ ℝⁿ.
  - Fit parameters μ_1, …, μ_n, σ_1², …, σ_n²:
    μ_j = (1/m) Σ_{i=1}^{m} x_j^(i),  σ_j² = (1/m) Σ_{i=1}^{m} (x_j^(i) − μ_j)²
  - Given a new example x, compute p(x); flag an anomaly if p(x) < ε.
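A minimal Octave sketch of this algorithm, assuming X is the m × n training matrix, x is one new example as a 1 × n row vector, and epsilon is given (names are illustrative):

% Fit an independent Gaussian to each feature.
mu = mean(X, 1);                               % 1 x n feature means
sigma2 = mean(bsxfun(@minus, X, mu).^2, 1);    % 1 x n maximum-likelihood variances

% Density of the new example under the independent-Gaussian model.
p = prod((1 ./ sqrt(2 * pi * sigma2)) .* exp(-((x - mu).^2) ./ (2 * sigma2)));

is_anomaly = (p < epsilon);                    % epsilon chosen on a cross validation set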
- Developing and evaluating an anomaly detection system
- Comparison of anomaly detection and supervised learning
- For non-Gaussian features, some transformation of the original data is needed in order to fit a Gaussian to it, e.g.:
- log(x)
- log(x+c)
- Choosing features that might take on unusually large or small values in the event of an anomaly
- Multivariate Gaussian (Normal) distribution
  For x ∈ ℝⁿ, don't model p(x_1), p(x_2), etc. separately; model p(x) all in one go.
  Parameters: μ ∈ ℝⁿ, Σ ∈ ℝⁿˣⁿ (covariance matrix).
| | Original model | Multivariate Gaussian |
|---|---|---|
| Model expression | p(x) = p(x_1; μ_1, σ_1²) · … · p(x_n; μ_n, σ_n²) | p(x; μ, Σ) |
| Correlations | Manually create features to capture anomalies where features take unusual combinations of values | Automatically captures correlations between features |
| Computational cost | cheaper | more expensive |
| Application | OK even if m (training set size) is small | must have m > n, or Σ is non-invertible |
Note: if Σ is non-invertible, check (1) that m > n, and (2) whether there are redundant features. (See the sketch below.)
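A minimal Octave sketch of the multivariate Gaussian density, assuming X is the m × n data matrix and each row is evaluated (loop kept simple for clarity):

% Multivariate Gaussian density p(x; mu, Sigma) for each row of X.
[m, n] = size(X);
mu = mean(X, 1)';                % n x 1
Sigma = cov(X, 1);               % n x n covariance, normalized by m

p = zeros(m, 1);
for i = 1:m
  d = X(i, :)' - mu;
  p(i) = (2 * pi)^(-n / 2) * det(Sigma)^(-1 / 2) * exp(-0.5 * d' * (Sigma \ d));
end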
- Recommender systems
- Collaborative filtering
- vectorization: low rank matrix factorization
- Mean normalization
This week's content is very similar to the sensor fusion course.
Programming Assignment: Anomaly detection and recommender system
- Anomaly detection
1.1 Gaussian distribution
1.2 Estimating parameters for a Gaussian
1.3 Selecting the threshold ε
1.4 High dimensional dataset
- Recommender systems
2.1 Collaborative filtering cost
2.2 Collaborative filtering gradient
2.3 Regularized cost
2.4 Regularized gradient
Week 10
Course content
- Learning with large datasets
  "It is not who has the best algorithm that wins, it is who has the most data."
- Stochastic gradient descent
  cost(θ, (x^(i), y^(i))) = (1/2) (h_θ(x^(i)) − y^(i))²
  - Randomly shuffle the dataset.
  - Repeat {
      for i = 1 to m {
        θ_j := θ_j − α (h_θ(x^(i)) − y^(i)) x_j^(i)   (for every j = 0, …, n)
      }
    }
  Note:
  - When the training set size m is very large, stochastic gradient descent can be much faster than batch gradient descent.
  - The cost function should go down with every iteration of batch gradient descent, but not necessarily with stochastic gradient descent.
  (See the Octave sketch below.)
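A minimal Octave sketch of one stochastic-gradient-descent pass for linear regression, assuming X is the m × (n+1) design matrix and theta, alpha are given (names and the single-pass structure are illustrative):

% One pass (epoch) of stochastic gradient descent for linear regression.
m = size(X, 1);
perm = randperm(m);            % randomly shuffle the dataset
for i = perm
  xi = X(i, :)';               % one example, (n+1) x 1
  yi = y(i);
  theta = theta - alpha * (xi' * theta - yi) * xi;   % update from this single example
end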
- Mini-batch gradient descent
  Batch gradient descent: use all m examples in each iteration.
  Stochastic gradient descent: use 1 example in each iteration.
  Mini-batch gradient descent: use b examples in each iteration.
- Checking for convergence
  Plot cost(θ, (x^(i), y^(i))), averaged over the last 1000 examples. If the curve is too noisy, try increasing the number of examples averaged over or reducing the learning rate α.
  The learning rate α is typically held constant, but it can be slowly decreased over time if we want θ to converge (e.g., α = const1 / (iterationNumber + const2)).
- Online learning for large-scale data
  Learn p(y = 1 | x; θ), which is called the likelihood in the sensor fusion course. Here x stands for the features.
- Map reduce and data parallelism
Map-reduce
Multi-core machines
This week's material may be useful in my project later on, when gigabytes of data become available from the cloud.
Quiz
- For linear regression trained with stochastic gradient descent, there is no need to apply map-reduce on top of it, since the computational cost per update has already been reduced.
- Suppose you are training a logistic regression classifier with stochastic gradient descent and find that the cost (cost(θ, (x^(i), y^(i))), averaged over the last 500 examples), plotted as a function of the number of iterations, is slowly increasing over time. You should then try decreasing the learning rate α.
- Before running stochastic gradient descent, you should randomly shuffle the training set.
Week 11
Course content
- Photo OCR (optical character recognition) pipeline
- Text detection
- Character segmentation
- Character recognition
- Sliding windows for text detection
- Make sure you have a low bias classifier before getting more data.
- Estimating the errors due to each component (ceiling analysis)
  Which part of the pipeline should you spend the most time trying to improve?
  (Figure: ceiling analysis example.)
Questions arising from this course:
- Why not go directly to deep learning methods instead of classical machine learning?
  Because features matter more, especially for batteries, where less data is available.
- Possible research ideas inspired by this course.