1. How would you define clustering? Can you name a few clustering algorithms?
In Machine Learning, clustering is the unsupervised task of grouping similar instances together. The notion of similarity depends on the task at hand: for example, in some cases two nearby instances will be considered similar, while in others similar instances may be far apart as long as they belong to the same densely packed group. Popular clustering algorithms include K-Means, DBSCAN, agglomerative clustering, BIRCH, Mean-Shift, affinity propagation, and spectral clustering.
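As a minimal sketch of the idea (assuming scikit-learn and a synthetic dataset from make_blobs, neither of which the question specifies), here is K-Means grouping nearby instances:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 densely packed groups
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Group the instances into 4 clusters
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels[:10])              # cluster index assigned to each instance
print(kmeans.cluster_centers_)  # coordinates of the 4 centroids
```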
2. What are some of the main applications of clustering algorithms?
The main applications of clustering algorithms include data analysis, customer segmentation, recommender systems, search engines, image segmentation, semi-supervised learning, dimensionality reduction, anomaly detection, and novelty detection.
3. Describe two techniques to select the right number of clusters when using K-Means.
The elbow rule is a simple technique to select the number of clusters when using K-Means: plot the inertia (the mean squared distance from each instance to its nearest centroid) as a function of the number of clusters, and find the point in the curve where the inertia stops dropping fast (the "elbow"). This is generally close to the optimal number of clusters. Another approach is to plot the silhouette score as a function of the number of clusters. There will often be a peak, and the optimal number of clusters is generally nearby. The silhouette score is the mean silhouette coefficient over all the instances. This coefficient varies from +1 for instances that are well inside their cluster and far from other clusters, to -1 for instances that are very close to another cluster. You may also plot the silhouette diagrams and perform a more thorough analysis.
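Here is a minimal sketch of both techniques with scikit-learn (the synthetic dataset and the range of k values are illustrative assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=5, random_state=42)

# Fit K-Means for k = 2..9, recording the inertia and the silhouette score
for k in range(2, 10):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sil = silhouette_score(X, kmeans.labels_)
    print(f"k={k}: inertia={kmeans.inertia_:.1f}, silhouette={sil:.3f}")
```

Look for the elbow in the inertia column and the peak in the silhouette column; the two usually point to the same neighborhood of k.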
4. What is label propagation? Why would you implement it, and how?
Labeling a dataset is costly and time-consuming. Therefore, it is common to have plenty of unlabeled instances but few labeled instances. Label propagation is a technique that consists in copying some (or all) of the labels from the labeled instances to similar unlabeled instances. This can greatly extend the number of labeled instances, and thereby allow a supervised algorithm to reach better performance (this is a form of semi-supervised learning). One approach is to use a clustering algorithm such as K-Means on all the instances, then for each cluster find the most common label or the label of the most representative instance (i.e., the one closest to the centroid) and propagate it to the unlabeled instances in the same cluster.
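A rough sketch of this approach, using scikit-learn's digits dataset as a stand-in for a mostly unlabeled corpus (k=50 and the use of the true labels in place of a human annotator are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# Cluster all instances; fit_transform returns each instance's distance
# to every centroid
k = 50
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
X_dist = kmeans.fit_transform(X)

# The most representative instance of each cluster is the one closest
# to its centroid; in practice, a human would label just these 50
representative_idx = np.argmin(X_dist, axis=0)
y_representative = y[representative_idx]

# Propagate each representative's label to its whole cluster
y_propagated = y_representative[kmeans.labels_]
print(f"Propagated-label accuracy: {(y_propagated == y).mean():.2%}")
```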
5. Can you name two clustering algorithms that can scale to large datasets? And two that look for regions of high density?
K-Means and BIRCH scale well to large datasets. DBSCAN and Mean-Shift look for regions of high density.
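For illustration only (the dataset and hyperparameters below are assumptions), this is how BIRCH and DBSCAN are typically invoked in scikit-learn:

```python
from sklearn.cluster import Birch, DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)

# BIRCH summarizes the data in a compact tree, so it scales to large datasets
birch = Birch(n_clusters=2).fit(X)

# DBSCAN grows clusters from dense regions; sparse points get the label -1
dbscan = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(dbscan.labels_))
```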
6. Can you think of a use case where active learning would be useful? How would you implement it?
Active learning is useful whenever you have plenty of unlabeled instances but labeling is costly. In this case (which is very common), rather than randomly selecting instances to label, it is often preferable to perform active learning, where human experts interact with the learning algorithm, providing labels for specific instances when the algorithm requests them. A common approach is uncertainty sampling: the model is trained on the labeled instances and used to make predictions on the unlabeled instances; the instances for which the model is most uncertain (i.e., whose estimated class probabilities are lowest) are sent to an expert for labeling, and the process repeats until the performance gain is no longer worth the labeling effort.
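A toy sketch of uncertainty sampling with scikit-learn, where the dataset's true labels stand in for the human expert (the pool sizes, classifier, and batch size of 10 are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
rng = np.random.default_rng(42)

# Start with a small random labeled pool; the rest is "unlabeled"
labeled = list(rng.choice(len(X), size=50, replace=False))
unlabeled = [i for i in range(len(X)) if i not in labeled]

for _ in range(5):  # five querying rounds
    model = LogisticRegression(max_iter=1000)
    model.fit(X[labeled], y[labeled])

    # Query the 10 instances whose top predicted probability is lowest,
    # i.e., the ones the model is least certain about
    confidence = model.predict_proba(X[unlabeled]).max(axis=1)
    query = np.argsort(confidence)[:10]
    for i in sorted(query, reverse=True):
        labeled.append(unlabeled.pop(i))  # the "expert" supplies the label

print(f"Labeled pool grew to {len(labeled)} instances")
```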
7. What is the difference between anomaly detection and novelty detection?
Many people use the terms anomaly detection and novelty detection interchangeably, but they are not exactly the same. In anomaly detection, the algorithm is trained on a dataset that may contain outliers, and the goal is typically to identify these outliers (within the training set), as well as outliers among new instances. In novelty detection, the algorithm is trained on a dataset that is presumed to be "clean," and the objective is to detect novelties strictly among new instances. Some algorithms work best for anomaly detection (e.g., Isolation Forest), while others are better suited for novelty detection (e.g., one-class SVM).
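To make the distinction concrete, here is a minimal sketch with scikit-learn (the synthetic data and test points are assumptions):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
X_train = rng.normal(size=(500, 2))          # "normal" training data
X_new = np.array([[0.0, 0.0], [6.0, 6.0]])   # a likely inlier and a clear outlier

# Anomaly detection: Isolation Forest tolerates outliers in the training set
iso = IsolationForest(random_state=42).fit(X_train)
print(iso.predict(X_new))   # +1 = inlier, -1 = outlier

# Novelty detection: one-class SVM assumes the training set is clean
svm = OneClassSVM(gamma="scale", nu=0.1).fit(X_train)
print(svm.predict(X_new))
```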
8. What is a Gaussian mixture? What tasks can you use it for?
A Gaussian mixture model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown. In other words, the assumption is that the data is grouped into a finite number of clusters, each with an ellipsoidal shape (but the clusters may have different ellipsoidal shapes, sizes, orientations, and densities), and we don't know which cluster each instance belongs to. This model is useful for density estimation, clustering, and anomaly detection.
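A brief sketch of these uses with scikit-learn's GaussianMixture (the synthetic data and the 2% density threshold are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

# Fit a mixture of 3 Gaussians
gm = GaussianMixture(n_components=3, n_init=10, random_state=42).fit(X)

# Clustering: hard and soft assignments
print(gm.predict(X[:5]))        # most likely cluster for each instance
print(gm.predict_proba(X[:5]))  # per-cluster membership probabilities

# Density estimation / anomaly detection: flag instances in
# low-density regions (here, the 2% lowest-density points)
densities = gm.score_samples(X)
threshold = np.percentile(densities, 2)
anomalies = X[densities < threshold]
print(f"{len(anomalies)} instances flagged as anomalies")
```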
9. Can you name two techniques to find the right number of clusters when using a Gaussian mixture model?
One way to find the right number of clusters when using a Gaussian mixture model is to plot the Bayesian information criterion (BIC) or the Akaike information criterion (AIC) as a function of the number of clusters, then choose the number of clusters that minimizes the BIC or AIC. Another technique is to use a Bayesian Gaussian mixture model, which automatically selects the number of clusters.
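A minimal sketch of both techniques (the synthetic data, the range of k, and the n_components=10 cap are illustrative assumptions):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture, GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

# Technique 1: choose the number of clusters that minimizes the BIC (or AIC)
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, n_init=10, random_state=42).fit(X)
    print(f"k={k}: BIC={gm.bic(X):.1f}, AIC={gm.aic(X):.1f}")

# Technique 2: a Bayesian GMM pushes the weights of unnecessary clusters
# toward zero, so set n_components higher than needed and inspect weights_
bgm = BayesianGaussianMixture(n_components=10, n_init=10, random_state=42).fit(X)
print(bgm.weights_.round(2))
```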