- What is regularization? What are the differences between Lasso and Ridge?
- Regularization: a process of adding a tuning parameter (penalty term) to a model to induce smoothness of the weights and keep the coefficients from fitting the training data too perfectly, in order to prevent overfitting. It is most often done by adding a constant multiple of a norm of the weight vector to the loss function (L1 for Lasso, L2 for Ridge); the model is then fit by minimizing this regularized loss.
- L2 (Ridge): penalizes the sum of the squares of the weights; it has an analytical solution and higher computational efficiency.
- L1 (Lasso): penalizes the sum of the absolute values of the weights; it drives some coefficients exactly to zero, so it performs feature selection and works better in sparse cases (see the sketch below).
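A minimal sketch with scikit-learn (the data and the alpha values are made up for illustration) showing the practical difference: Lasso drives some coefficients exactly to zero, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical data: 100 samples, 10 features, only the first 3 are informative.
rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + 0.1 * rng.randn(100)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks all weights smoothly
lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty: sets some weights exactly to zero

print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))  # expect zeros for uninformative features
```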
- How to deal with overfitting?
- Use simpler models.
- Choose hyperparameters carefully when using a learning algorithm.
- Cross-validation: a standard way to estimate out-of-sample prediction error. This is more representative of the error you would expect when predicting a future value, rather than just how well you can fit the data at hand.
- Regularization: some form of regularization can help penalize certain sources of overfitting. A common choice for linear models is Ridge regression or LASSO, which penalize the model when the norm of the coefficients gets too large. (A quick overfitting check is sketched below.)
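As a rough illustration (assuming scikit-learn and a synthetic dataset), comparing the training score with a cross-validated score is a quick way to spot overfitting, and a simpler model usually narrows the gap:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# An unconstrained tree can memorize the training data.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print("train accuracy:", deep_tree.score(X, y))                          # close to 1.0
print("CV accuracy:   ", cross_val_score(deep_tree, X, y, cv=5).mean())  # noticeably lower

# A simpler (depth-limited) tree trades a little training fit for better generalization.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
print("CV accuracy (max_depth=3):", cross_val_score(shallow_tree, X, y, cv=5).mean())
```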
- What are the disadvantages of linear regression?
- Linear regressions are sensitive to outliers.
- Linear regressions are meant to describe linear relationships between variables. (However, this can be compensated by transforming some of the parameters with a log, square root, etc. transformation.)
- Linear regression assumes that the observations (more precisely, the errors) are independent.
- Explain what precision and recall are. How do they relate to the ROC curve?
- In binary classification:
1). TN / True Negative: case was negative and predicted negative
2). TP / True Positive: case was positive and predicted positive
3). FN / False Negative: case was positive but predicted negative
4). FP / False Positive: case was negative but predicted positive
- Precision: TP/(TP+FP), the proportion of predicted positives that are truly positive, i.e., a measure of how many of the samples the classifier labels as positive are indeed positive.
- Recall: TP/(TP+FN), the proportion of actual positives that are predicted positive, i.e., a measure of how many of the positive samples have been identified as positive.
- ROC: the ROC curve plots sensitivity (recall, the true positive rate) against 1 - specificity (the false positive rate, not precision) and is commonly used to measure the performance of binary classifiers. (A short computation is sketched below.)
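A small sketch (assuming scikit-learn; the labels and scores are made up) of how precision, recall, and the ROC AUC are computed from a classifier's output:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                   # actual labels
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                   # hard predictions from some classifier
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]   # predicted probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("ROC AUC:  ", roc_auc_score(y_true, y_score))    # built from the scores, not the hard labels
```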
- What is "long" ("tall") and "wide" format data, and the basic ways to deal with the data?
- “Long” (“tall”) format data: many more records (rows) than features (columns); the main ways to deal with this kind of data are sample reduction or feature engineering (such as extracting more features).
- “Wide” format data: a small number of records but a large number of features; the main way to deal with this kind of data is dimensionality reduction (such as feature selection, or feature reduction like PCA).
- What are the differences between supervised learning and unsupervised learning? Give me examples.
- Supervised learning: if you train your machine learning model with a corresponding target for every input, it is called supervised learning; after sufficient training, the model can provide a target for any new input.
e.g.: You have a dataset containing data from three classes, and you want to train a model that predicts which class a new input belongs to.
- Unsupervised learning: if you train your machine learning model with only a set of inputs (no targets), it is called unsupervised learning; the model finds structure or relationships among the inputs.
e.g.: You have a dataset, and you want to train a model to divide the data into several clusters.
- During analysis, how do you treat missing values?
- Whether to treat missing values at all is another important point to consider: if 80% of the values for a variable are missing, you may drop the variable instead of treating the missing values.
- Deleting the observations: when you have sufficient data points and the deletion will not introduce bias.
- Imputation with the mean / median / mode, or setting a default value.
- Imputation with models such as KNN, MICE, etc.
- Use other features to build a model to predict the missing part.
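A brief sketch (assuming pandas and scikit-learn; the toy DataFrame is hypothetical) of mean imputation versus model-based KNN imputation:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31], "income": [50, 60, np.nan, 58]})

mean_imputed = SimpleImputer(strategy="mean").fit_transform(df)  # fill gaps with column means
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(df)        # fill gaps from similar rows

print(mean_imputed)
print(knn_imputed)
```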
- What is cross-validation? How to do it right?
Cross Validation is generally used to assess the error of given models and select the most appropriate model.
- Steps:
1). Divide the sample data into training set and test set;
2). Partition the training data into K equal-sized folds;
3). For k = 1, 2, ..., K, fit the model on the other K-1 folds and calculate the prediction error on the k-th fold;
4). Take the average of the prediction errors as an estimate of model performance; select the model that results in the lowest average prediction error;
5). Train the selected model on the entire training data and test on the held-out test set. The prediction error is an estimate of the model’s performance in the real world.
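A minimal sketch of these steps (assuming scikit-learn and a synthetic regression problem; the candidate alphas are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# 1) Hold out a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 2)-4) Compare candidate models by average prediction error over K folds.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for alpha in [0.1, 1.0, 10.0]:
    fold_errors = []
    for train_idx, val_idx in kfold.split(X_train):
        model = Ridge(alpha=alpha).fit(X_train[train_idx], y_train[train_idx])
        fold_errors.append(mean_squared_error(y_train[val_idx], model.predict(X_train[val_idx])))
    print(f"alpha={alpha}: CV MSE = {np.mean(fold_errors):.2f}")

# 5) Refit the selected model on all training data and report the held-out test error.
best = Ridge(alpha=1.0).fit(X_train, y_train)  # use whichever alpha had the lowest CV MSE
print("test MSE:", mean_squared_error(y_test, best.predict(X_test)))
```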
- What do you understand by Bias Variance trade off?
- Bias error quantifies how far, on average, the predicted values are from the actual values. A high bias error means we have an under-performing model that keeps missing important trends.
- Variance, on the other hand, quantifies how much the predictions for the same observation vary when the model is trained on different samples of the data. A high-variance model will overfit the training data and perform badly on any observation beyond the training set. (A small demonstration is sketched below.)
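A hedged demonstration (synthetic sine data, arbitrary polynomial degrees) of the trade-off: the low-degree model underfits (high bias), while the high-degree model fits the training points almost perfectly but generalizes poorly (high variance).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 15))[:, None]                  # 15 noisy training points
y = np.sin(2 * np.pi * X.ravel()) + 0.2 * rng.randn(15)
X_test = np.linspace(0, 1, 100)[:, None]                     # noiseless test grid
y_test = np.sin(2 * np.pi * X_test.ravel())

for degree in [1, 4, 15]:  # underfit, reasonable, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
    train_mse = np.mean((model.predict(X) - y) ** 2)
    test_mse = np.mean((model.predict(X_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```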
- What is latent semantic indexing? What is it used for? What are the specific limitations of the method?
- Latent Semantic Indexing (LSI, also called Latent Semantic Analysis, LSA) is essentially Principal Component Analysis applied to document analysis: it applies PCA to (the variance-covariance matrix of) the term-document matrix X and uses the principal directions (eigenvectors) to define topics.
- It uses a term-document matrix X that describes the occurrences of terms in documents. Rows correspond to terms (the vocabulary) and columns correspond to documents. The elements of X are typically weights proportional to the number of times a term appears in a document, with rare terms upweighted to reflect their relative importance (e.g., tf-idf). The matrix X is usually large and sparse.
- LSA finds a low-rank approximation of the original term-document matrix, which merges the dimensions of terms that have similar meanings. Its main limitations are that the resulting dimensions (topics) can be hard to interpret, the bag-of-words representation ignores word order and struggles with polysemy, and computing the decomposition for a very large matrix is expensive.
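A compact sketch of LSI (assuming scikit-learn; the four documents are made up) as a truncated SVD of a weighted document-term matrix:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors trade stocks",
]

# Note: scikit-learn builds a document-term matrix (rows = documents),
# i.e. the transpose of the term-document convention described above.
X = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsa.fit_transform(X)     # documents projected into a 2-dimensional "topic" space
print(doc_topics)                     # pet-related and finance-related documents separate
```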
- How is KNN different from k-means clustering?
- KNN needs labeled points and is thus supervised learning, while k-means doesn’t and is thus unsupervised learning.
- K-Nearest Neighbors is a supervised classification algorithm, while k-means clustering is an unsupervised clustering algorithm. While the mechanisms may seem similar at first, what this really means is that in order for K-Nearest Neighbors to work, you need labeled data into which you classify an unlabeled point (hence the "nearest neighbor" part). K-means clustering requires only a set of unlabeled points and the number of clusters k: the algorithm iteratively assigns points to clusters and learns how to group them by computing the mean of the points in each cluster. (Both are contrasted in the sketch below.)
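A short sketch (iris data via scikit-learn) contrasting the two: KNN needs the labels to classify a new point, while k-means only needs the unlabeled points and the number of clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Supervised: KNN uses the labels y to classify a new observation.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("KNN prediction:", knn.predict([[5.1, 3.5, 1.4, 0.2]]))

# Unsupervised: k-means sees only X and the number of clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("k-means cluster assignments:", kmeans.labels_[:10])
```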
- What is Bayes’ Theorem? How is it useful in a machine learning context?
- Bayes’ Theorem gives you the posterior probability of an event given what is known as prior knowledge: P(A|B) = P(B|A) * P(A) / P(B). In the diagnostic-test phrasing, the posterior is the true positive rate times the prior, divided by the overall probability of a positive result (the true positive rate times the prior plus the false positive rate times one minus the prior).
Bayes’ Theorem is the basis behind a branch of machine learning that most notably includes the Naive Bayes classifier.
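A worked numeric example of the formula (the disease prevalence and test rates are hypothetical):

```python
# Posterior P(disease | positive test) via Bayes' Theorem.
p_disease = 0.01               # prior P(A)
p_pos_given_disease = 0.95     # likelihood P(B|A): the test's true positive rate
p_pos_given_healthy = 0.05     # P(B|not A): the test's false positive rate

# P(B) by the law of total probability.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # about 0.161, despite the seemingly accurate test
```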
- Why is 'Naive' Bayes naive?
Naive Bayes is considered “naive” because it makes an assumption that is virtually never satisfied by real-life data: the conditional probability of the features given the class is calculated as the pure product of the individual feature probabilities. This implies absolute (conditional) independence of the features, a condition probably never met in practice.
- What is the difference between a generative and a discriminative model?
A generative model learns how the data in each category is generated (the joint distribution of inputs and labels), while a discriminative model simply learns the distinction between different categories of data (the decision boundary). Discriminative models will generally outperform generative models on classification tasks.
- What is the difference between stochastic gradient descent (SGD) and gradient descent (GD)?
- Both algorithms are methods for finding a set of parameters that minimize a loss function by evaluating parameters against data and then making adjustments.
- In standard gradient descent, you evaluate all training samples for each update of the parameters. This takes big, slow steps toward the solution.
- In stochastic gradient descent, you evaluate only a subset of the training samples (often a single sample or a mini-batch) before updating the parameters. This takes small, quick steps toward the solution. (See the sketch below.)
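A minimal NumPy sketch (synthetic linear data; the step sizes and iteration counts are arbitrary) contrasting full-batch gradient descent with single-sample stochastic updates for least-squares regression:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.randn(200)

# Full-batch gradient descent: every step uses all training samples.
w_gd = np.zeros(3)
for _ in range(100):
    grad = 2 * X.T @ (X @ w_gd - y) / len(y)
    w_gd -= 0.1 * grad

# Stochastic gradient descent: every step uses one randomly chosen sample.
w_sgd = np.zeros(3)
for _ in range(1000):
    i = rng.randint(len(y))
    grad_i = 2 * X[i] * (X[i] @ w_sgd - y[i])
    w_sgd -= 0.01 * grad_i

print("GD estimate: ", np.round(w_gd, 2))   # both should land near [2.0, -1.0, 0.5]
print("SGD estimate:", np.round(w_sgd, 2))
```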
- What are the advantages and disadvantages of k-nearest neighbors?
- Advantages: K-Nearest Neighbors has a nice intuitive explanation, and it tends to work very well for problems where comparables are inherently indicative. For example, you could build a kNN housing-price model from other houses in the area with a similar number of bedrooms, floor space, etc.
- Disadvantages: It is memory-intensive. It also has no built-in feature selection or regularization, so it does not handle high dimensionality well.
- What are the advantages and disadvantages of neural networks?
- Advantages: Neural networks (specifically deep NNs) have led to performance breakthroughs on unstructured data such as images, audio, and video. Their flexibility allows them to learn patterns that are difficult for other ML algorithms to capture.
- Disadvantages: They require a large amount of training data to converge. It is also difficult to pick the right architecture, and the internal "hidden" layers are hard to interpret.
- Describe the basic steps to do the PCA (Principal Components Analysis)
- Standardize the data.
- Obtain the eigenvectors and eigenvalues from the covariance matrix or correlation matrix, or perform Singular Value Decomposition.
- Sort eigenvalues in descending order and choose the k eigenvectors that correspond to the k largest eigenvalues where k is the number of dimensions of the new feature subspace (k≤d).
- Construct the projection matrix W from the selected k eigenvectors.
- Transform the original dataset X via W to obtain a k-dimensional feature subspace Y.
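A NumPy sketch of these steps (random correlated data standing in for a real dataset; k is chosen arbitrarily):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 5) @ rng.randn(5, 5)   # hypothetical correlated data: 100 samples, d = 5 features

# 1. Standardize the data.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Eigen-decomposition of the covariance matrix.
cov = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort eigenvalues in descending order and keep the k leading eigenvectors.
order = np.argsort(eigenvalues)[::-1]
k = 2
W = eigenvectors[:, order[:k]]            # 4. projection matrix W (d x k)

# 5. Project onto the k-dimensional feature subspace.
Y = X_std @ W
print(Y.shape)                            # (100, 2)
```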
- Tell me some major issues that need to be considered in supervised machine learning.
- Bias-variance tradeoff
- Function complexity and amount of training data
- Dimensionality of the input space
- Noise in the output values
- Input data problems, such as heterogeneity of the data, redundancy in the data, and the presence of interactions and non-linearities.