Module 1: Fundamentals of Machine Learning
Machine learning brings together statistics, computer science, and other fields to build models that learn from data.
The two main types are supervised learning and unsupervised learning.
Core components of a learning algorithm: Representation -> Evaluation -> Optimization
- scikit-learn
- SciPy
- NumPy
- Pandas
- Matplotlib
It is important to examine your dataset before using it for a machine learning task:
- To understand how much missing data there is in the dataset
- To get an idea of whether the features need further cleaning
- To find out whether the problem even requires machine learning
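A quick first pass with pandas covers the first two points; this sketch uses a small made-up DataFrame purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset, for illustration only
df = pd.DataFrame({
    "age": [25, 32, np.nan, 45],
    "income": [50000, np.nan, np.nan, 72000],
    "label": [0, 1, 0, 1],
})

# Count missing values per column before any modeling
missing_per_column = df.isnull().sum()
print(missing_per_column)
```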
- K-Nearest Neighbors Classification
- A low value of “k” (close to 1) is more likely to overfit the training data and lead to worse accuracy on the test data, compared to higher values of “k”.
- Setting “k” to the number of points in the training set will result in a classifier that always predicts the majority class.
- The k-nearest neighbors classification algorithm has to memorize all of the training examples to make a prediction.
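The points above can be seen in a short experiment on synthetic data (not from the course): with k=1 the classifier memorizes the training set, so training accuracy is perfect while test accuracy typically suffers; a larger k smooths the decision boundary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k=1 memorizes the training set: perfect training accuracy
knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
# A larger k averages over more neighbors and smooths the boundary
knn_15 = KNeighborsClassifier(n_neighbors=15).fit(X_train, y_train)

print("k=1  train:", knn_1.score(X_train, y_train), "test:", knn_1.score(X_test, y_test))
print("k=15 train:", knn_15.score(X_train, y_train), "test:", knn_15.score(X_test, y_test))
```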
Module 2: Supervised Machine Learning
- Linear regression
- Logistic regression
- Support Vector Machines
- Multi-class classification
- Kernelized Support Vector Machine
- Cross validation
- Decision tree (requires less preprocessing of data; easy to interpret and visualize)
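Cross validation, from the list above, can be sketched in a few lines; this example uses logistic regression on the built-in iris dataset purely as an illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross validation: each fold is held out once for evaluation,
# giving five accuracy estimates instead of a single train/test split
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("fold scores:", scores)
print("mean accuracy:", scores.mean())
```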
Module 3: Model Evaluation and Selection
- Represent/Train/Evaluate/Refine cycle
- Preamble
- Dummy Classifier (strategy: most_frequent, stratified, uniform, constant)
- Dummy Regressor (strategy: mean, median, quantile, constant)
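A dummy model gives a sanity-check baseline: any real classifier should beat it. A minimal sketch with imbalanced toy labels (illustrative only):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Imbalanced toy labels: 90% class 0
X = np.zeros((100, 1))
y = np.array([0] * 90 + [1] * 10)

# most_frequent always predicts the majority class,
# so accuracy equals the majority-class fraction (0.9 here)
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
print(dummy.score(X, y))
```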
- Confusion matrices & basic evaluation metrics
Recall is also known as:
- True Positive Rate (TPR)
- Sensitivity
- Probability of detection
Specificity is the True Negative Rate (TNR), which equals 1 - FPR; the FPR itself is 1 - specificity.
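These quantities fall straight out of the confusion matrix; a small sketch on toy labels (illustrative only):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy ground truth and predictions
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 0, 1, 1, 1]

# ravel() flattens the 2x2 matrix to (TN, FP, FN, TP)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall / TPR / sensitivity: TP / (TP + FN)
recall = tp / (tp + fn)
# False positive rate: FP / (FP + TN); specificity (TNR) is 1 - FPR
fpr = fp / (fp + tn)

print("recall:", recall, "fpr:", fpr)
print("matches recall_score:", recall == recall_score(y_true, y_pred))
```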
- Classifier Decision Functions
- Precision-recall and ROC curves
- Multi-class evaluation, Macro average vs Micro average
- Regression evaluation (r2_score is usually sufficient)
- Model selection: Optimizing classifiers for different evaluation metrics
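Optimizing for a metric other than accuracy can be done by passing `scoring` to a grid search; a minimal sketch on synthetic data, tuning an SVM's C for recall:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary classification data, for illustration
X, y = make_classification(n_samples=200, random_state=0)

# scoring="recall" makes the search pick the C that maximizes recall,
# not the default accuracy
grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]},
                    scoring="recall", cv=3)
grid.fit(X, y)
print("best params:", grid.best_params_)
print("best recall:", grid.best_score_)
```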
Module 4: Supervised Machine Learning - Part 2
- Naive Bayes Classifier ("naive" means each feature of an instance is assumed independent of all the others, given the class)
- Bernoulli, Multinomial (suitable for text classification)
- Gaussian (suitable for continuous features, including high-dimensional data)
- Random Forests
- Gradient Boosted Decision Trees
- Neural Networks
- Data leakage
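Of the models above, random forests are a useful default; a minimal sketch on synthetic data (not from the course):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data for illustration
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An ensemble of decision trees: each tree is trained on a bootstrap
# sample and splits on random feature subsets, then trees vote
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```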
Optional module: Unsupervised Machine Learning
- Dimensionality Reduction and Manifold Learning
- K-means clustering
- Agglomerative clustering
- DBSCAN Clustering
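K-means, the first of these clustering methods, can be sketched on synthetic blobs (illustrative only):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated synthetic clusters in 2D
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# K-means alternates assigning points to the nearest centroid
# and moving each centroid to the mean of its assigned points
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("centroids shape:", kmeans.cluster_centers_.shape)
```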