- What is regularization? What are the differences between Lasso and Ridge?
- Regularization: a process of adding a tuning parameter (penalty term) to a model to induce smoothness of the weights and keep the coefficients from fitting the training data too perfectly, in order to prevent overfitting. It is most often done by adding a constant multiple of a norm of the weight vector to the loss function (L1 for Lasso, L2 for Ridge); the model is then fit by minimizing this regularized loss.
- L2 (Ridge): penalizes the sum of the squares of the weights; it has an analytical solution and higher computational efficiency.
- L1 (Lasso): penalizes the sum of the absolute values of the weights; it drives some coefficients exactly to zero, so it performs feature selection and works better in sparse cases (see the sketch below).
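A minimal sketch with scikit-learn (the data and the alpha values are made up for illustration) showing the practical difference: Lasso drives some coefficients exactly to zero, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical data: 100 samples, 10 features, only the first 3 are informative.
rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + 0.1 * rng.randn(100)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks all weights smoothly
lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty: sets some weights exactly to zero

print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))  # expect zeros for uninformative features
```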
- How to deal with overfitting?
- Use simpler models.
- Choose hyperparameters carefully when using a learning algorithm.
- Cross-validation: a standard way to estimate out-of-sample prediction error. This is more representative of the error you would expect when predicting a future value, rather than just how well you can fit the data at hand.
- Regularization: some form of regularization can help penalize certain sources of overfitting. A common choice for linear models is Ridge regression or LASSO, which penalize the model when the norm of the coefficients gets too large. (A quick overfitting check is sketched below.)
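As a rough illustration (assuming scikit-learn and a synthetic dataset), comparing the training score with a cross-validated score is a quick way to spot overfitting, and a simpler model usually narrows the gap:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# An unconstrained tree can memorize the training data.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print("train accuracy:", deep_tree.score(X, y))                          # close to 1.0
print("CV accuracy:   ", cross_val_score(deep_tree, X, y, cv=5).mean())  # noticeably lower

# A simpler (depth-limited) tree trades a little training fit for better generalization.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
print("CV accuracy (max_depth=3):", cross_val_score(shallow_tree, X, y, cv=5).mean())
```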
- What are the disadvantages of linear regression?
- Linear regressions are sensitive to outliers.
- Linear regressions are meant to describe linear relationships between variables. (However, this can be compensated by transforming some of the parameters with a log, square root, etc. transformation.)
- Linear regression assumes that the observations (more precisely, the errors) are independent.
- Explain what precision and recall are. How do they relate to the ROC curve?
- In binary classification:
1). TN / True Negative: case was negative and predicted negative
2). TP / True Positive: case was positive and predicted positive
3). FN / False Negative: case was positive but predicted negative
4). FP / False Positive: case was negative but predicted positive
- Precision: TP/(TP+FP), the proportion of predicted positives that are truly positive, i.e., a measure of how many of the samples the classifier labels as positive are indeed positive.
- Recall: TP/(TP+FN), the proportion of actual positives that are predicted positive, i.e., a measure of how many of the positive samples have been identified as positive.
- ROC: the ROC curve plots sensitivity (recall, the true positive rate) against 1 - specificity (the false positive rate, not precision) and is commonly used to measure the performance of binary classifiers. (A short computation is sketched below.)
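A small sketch (assuming scikit-learn; the labels and scores are made up) of how precision, recall, and the ROC AUC are computed from a classifier's output:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                   # actual labels
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                   # hard predictions from some classifier
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]   # predicted probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("ROC AUC:  ", roc_auc_score(y_true, y_score))    # built from the scores, not the hard labels
```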
- What is "long" ("tall") and "wide" format data, and the basic ways to deal with the data?
- “Long” (“tall”) format data: many more records (rows) than features (columns); the main ways to deal with this kind of data are sample reduction or feature engineering (such as extracting more features).
- “Wide” format data: a small number of records but a large number of features; the main way to deal with this kind of data is dimensionality reduction (such as feature selection, or feature reduction like PCA).
- What are the differences between supervised learning and unsupervised learning? Give me examples.
- Supervised learning: if you train your machine learning model with a corresponding target for every input, it is called supervised learning; after sufficient training, the model can provide a target for any new input.
e.g.: You have a dataset containing data from three classes, and you want to train a model that predicts which class a new input belongs to.
- Unsupervised learning: if you train your machine learning model with only a set of inputs (no targets), it is called unsupervised learning; the model finds structure or relationships among the inputs.
e.g.: You have a dataset, and you want to train a model to divide the data into several clusters.
- During analysis, how do you treat missing values?
- Whether to treat missing values at all is another important point to consider: if 80% of the values for a variable are missing, you may drop the variable instead of treating the missing values.
- Deleting the observations: when you have sufficient data points and the deletion will not introduce bias.
- Imputation with the mean / median / mode, or setting a default value.
- Imputation with models such as KNN, MICE, etc.
- Use other features to build a model to predict the missing part.
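A brief sketch (assuming pandas and scikit-learn; the toy DataFrame is hypothetical) of mean imputation versus model-based KNN imputation:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31], "income": [50, 60, np.nan, 58]})

mean_imputed = SimpleImputer(strategy="mean").fit_transform(df)  # fill gaps with column means
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(df)        # fill gaps from similar rows

print(mean_imputed)
print(knn_imputed)
```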
- What is cross-validation? How to do it right?
Cross Validation is generally used to assess the error of given models and select the most appropriate model.
- Steps:
1). Divide the sample data into training set and test set;
2). Partition the training data into K equal-sized folds;
3). For k = 1, 2, ..., K, fit the model on the other K-1 folds and calculate the prediction error on the k-th fold;
4). Take the average of the prediction errors as an estimate of model performance; select the model that results in the lowest average prediction error;
5). Train the selected model on the entire training data and test on the held-out test set. The prediction error is an estimate of the model’s performance in the real world.
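A minimal sketch of these steps (assuming scikit-learn and a synthetic regression problem; the candidate alphas are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# 1) Hold out a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 2)-4) Compare candidate models by average prediction error over K folds.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for alpha in [0.1, 1.0, 10.0]:
    fold_errors = []
    for train_idx, val_idx in kfold.split(X_train):
        model = Ridge(alpha=alpha).fit(X_train[train_idx], y_train[train_idx])
        fold_errors.append(mean_squared_error(y_train[val_idx], model.predict(X_train[val_idx])))
    print(f"alpha={alpha}: CV MSE = {np.mean(fold_errors):.2f}")

# 5) Refit the selected model on all training data and report the held-out test error.
best = Ridge(alpha=1.0).fit(X_train, y_train)  # use whichever alpha had the lowest CV MSE
print("test MSE:", mean_squared_error(y_test, best.predict(X_test)))
```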
- What do you understand by Bias Variance trade off?
- Bias error quantifies how far, on average, the predicted values are from the actual values. A high bias error means we have an under-performing model that keeps missing important trends.
- Variance, on the other hand, quantifies how much the predictions for the same observation vary when the model is trained on different samples of the data. A high-variance model will overfit the training data and perform badly on any observation beyond the training set. (A small demonstration is sketched below.)
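A hedged demonstration (synthetic sine data, arbitrary polynomial degrees) of the trade-off: the low-degree model underfits (high bias), while the high-degree model fits the training points almost perfectly but generalizes poorly (high variance).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 15))[:, None]                  # 15 noisy training points
y = np.sin(2 * np.pi * X.ravel()) + 0.2 * rng.randn(15)
X_test = np.linspace(0, 1, 100)[:, None]                     # noiseless test grid
y_test = np.sin(2 * np.pi * X_test.ravel())

for degree in [1, 4, 15]:  # underfit, reasonable, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
    train_mse = np.mean((model.predict(X) - y) ** 2)
    test_mse = np.mean((model.predict(X_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```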
- What is latent semantic indexing? What is it used for? What are the specific limitations of the method?
- Latent Semantic Indexing (LSI, also called Latent Semantic Analysis, LSA) is essentially Principal Component Analysis applied to document analysis: it applies PCA to (the variance-covariance matrix of) the term-document matrix X and uses the principal directions (eigenvectors) to define topics.
- It uses a term-document matrix X that describes the occurrences of terms in documents. Rows correspond to terms (the vocabulary) and columns correspond to documents. The elements of X are typically weights proportional to the number of times a term appears in a document, with rare terms upweighted to reflect their relative importance (e.g., tf-idf). The matrix X is usually large and sparse.
- LSA finds a low-rank approximation of the original term-document matrix, which merges the dimensions of terms that have similar meanings. Its main limitations are that the resulting dimensions (topics) can be hard to interpret, the bag-of-words representation ignores word order and struggles with polysemy, and computing the decomposition for a very large matrix is expensive.
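A compact sketch of LSI (assuming scikit-learn; the four documents are made up) as a truncated SVD of a weighted document-term matrix:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors trade stocks",
]

# Note: scikit-learn builds a document-term matrix (rows = documents),
# i.e. the transpose of the term-document convention described above.
X = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsa.fit_transform(X)     # documents projected into a 2-dimensional "topic" space
print(doc_topics)                     # pet-related and finance-related documents separate
```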
- How is KNN different from k-means clustering?
- KNN needs labeled points and is thus supervised learning, while k-means doesn’t and is thus unsupervised learning.
- K-Nearest Neighbors is a supervised classification algorithm, while k-means clustering is an unsupervised clustering algorithm. While the mechanisms may seem similar at first, what this really means is that in order for K-Nearest Neighbors to work, you need labeled data into which you classify an unlabeled point (hence the "nearest neighbor" part). K-means clustering requires only a set of unlabeled points and the number of clusters k: the algorithm iteratively assigns points to clusters and learns how to group them by computing the mean of the points in each cluster. (Both are contrasted in the sketch below.)
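A short sketch (iris data via scikit-learn) contrasting the two: KNN needs the labels to classify a new point, while k-means only needs the unlabeled points and the number of clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Supervised: KNN uses the labels y to classify a new observation.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("KNN prediction:", knn.predict([[5.1, 3.5, 1.4, 0.2]]))

# Unsupervised: k-means sees only X and the number of clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("k-means cluster assignments:", kmeans.labels_[:10])
```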
- What is Bayes’ Theorem? How is it useful in a machine learning context?
- Bayes’ Theorem gives you the posterior probability of an event given what is known as prior knowledge: P(A|B) = P(B|A) * P(A) / P(B). In the diagnostic-test phrasing, the posterior is the true positive rate times the prior, divided by the overall probability of a positive result (the true positive rate times the prior plus the false positive rate times one minus the prior).
Bayes’ Theorem is the basis behind a branch of machine learning that most notably includes the Naive Bayes classifier.
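A worked numeric example of the formula (the disease prevalence and test rates are hypothetical):

```python
# Posterior P(disease | positive test) via Bayes' Theorem.
p_disease = 0.01               # prior P(A)
p_pos_given_disease = 0.95     # likelihood P(B|A): the test's true positive rate
p_pos_given_healthy = 0.05     # P(B|not A): the test's false positive rate

# P(B) by the law of total probability.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # about 0.161, despite the seemingly accurate test
```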
- Why is 'Naive' Bayes naive?
Naive Bayes is considered “naive” because it makes an assumption that is virtually never satisfied by real-life data: the conditional probability of the features given the class is calculated as the pure product of the individual feature probabilities. This implies absolute (conditional) independence of the features, a condition probably never met in practice.
- What is the difference between a generative and a discriminative model?
A generative model learns how the data in each category is generated (the joint distribution of inputs and labels), while a discriminative model simply learns the distinction between different categories of data (the decision boundary). Discriminative models will generally outperform generative models on classification tasks.
- What is the difference between stochastic gradient descent (SGD) and gradient descent (GD)?
- Both algorithms are methods for finding a set of parameters that minimize a loss function by evaluating parameters against data and then making adjustments.
- In standard gradient descent, you evaluate all training samples for each update of the parameters. This takes big, slow steps toward the solution.
- In stochastic gradient descent, you evaluate only a subset of the training samples (often a single sample or a mini-batch) before updating the parameters. This takes small, quick steps toward the solution. (See the sketch below.)
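A minimal NumPy sketch (synthetic linear data; the step sizes and iteration counts are arbitrary) contrasting full-batch gradient descent with single-sample stochastic updates for least-squares regression:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.randn(200)

# Full-batch gradient descent: every step uses all training samples.
w_gd = np.zeros(3)
for _ in range(100):
    grad = 2 * X.T @ (X @ w_gd - y) / len(y)
    w_gd -= 0.1 * grad

# Stochastic gradient descent: every step uses one randomly chosen sample.
w_sgd = np.zeros(3)
for _ in range(1000):
    i = rng.randint(len(y))
    grad_i = 2 * X[i] * (X[i] @ w_sgd - y[i])
    w_sgd -= 0.01 * grad_i

print("GD estimate: ", np.round(w_gd, 2))   # both should land near [2.0, -1.0, 0.5]
print("SGD estimate:", np.round(w_sgd, 2))
```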
- What are the advantages and disadvantages of k-nearest neighbors?
- Advantages: K-Nearest Neighbors has a nice intuitive explanation, and it tends to work very well for problems where comparables are inherently indicative. For example, you could build a kNN housing-price model from other houses in the area with a similar number of bedrooms, floor space, etc.
- Disadvantages: It is memory-intensive. It also has no built-in feature selection or regularization, so it does not handle high dimensionality well.
- What are the advantages and disadvantages of neural networks?
- Advantages: Neural networks (specifically deep NNs) have led to performance breakthroughs on unstructured data such as images, audio, and video. Their flexibility allows them to learn patterns that are difficult for other ML algorithms to capture.
- Disadvantages: They require a large amount of training data to converge. It is also difficult to pick the right architecture, and the internal "hidden" layers are hard to interpret.
- Describe the basic steps to do the PCA (Principal Components Analysis)
- Standardize the data.
- Obtain the eigenvectors and eigenvalues from the covariance matrix or correlation matrix, or perform Singular Value Decomposition.
- Sort eigenvalues in descending order and choose the k eigenvectors that correspond to the k largest eigenvalues where k is the number of dimensions of the new feature subspace (k≤d).
- Construct the projection matrix W from the selected k eigenvectors.
- Transform the original dataset X via W to obtain a k-dimensional feature subspace Y.
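A NumPy sketch of these steps (random correlated data standing in for a real dataset; k is chosen arbitrarily):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 5) @ rng.randn(5, 5)   # hypothetical correlated data: 100 samples, d = 5 features

# 1. Standardize the data.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Eigen-decomposition of the covariance matrix.
cov = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort eigenvalues in descending order and keep the k leading eigenvectors.
order = np.argsort(eigenvalues)[::-1]
k = 2
W = eigenvectors[:, order[:k]]            # 4. projection matrix W (d x k)

# 5. Project onto the k-dimensional feature subspace.
Y = X_std @ W
print(Y.shape)                            # (100, 2)
```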
- Tell me some major issues that need to be considered in supervised machine learning.
- Bias-variance tradeoff
- Function complexity and amount of training data
- Dimensionality of the input space
- Noise in the output values
- Input data problems, such as heterogeneity of the data, redundancy in the data, and the presence of interactions and non-linearities.