1、Supervised Learning (监督学习)
- Definition: to give the algorithm a data set in which the "right answers" were given.
For example, in a housing price prediction model, for each size of house in this data set, we told it what is the right price (the actual price that house sold for), and the task of the algorithm was to just produce more of these right answers (such as you want to buy a new size of house, and you want the algorithm give you the corresponding price).
This is also called a regression problem (回归问题): to predict continuous valued output (price).
And here's another supervised learning example, in a breast cancer model, for each size of tumor in this data set, we told it which is malignant (恶性的) and which is benign (良性的), and the task of the algorithm was to try to predict of a new breast cancer as malignant or benign.
This is an example of a classification problem: to predict discrete (离散的) valued output (is malignant or not, and the discrete valued output may be two or more than it).
Summary
- The basic idea of supervised learning is: in supervised learning, for each data in the data set, there is a corresponding "correct answer" -- training set, and the algorithm makes predictions based on these.
- Regression problem, that is, to predict a continuous value output.
- Classification problem, the goal is to predict discrete value output.
Question
You're running a company, and you want to develop learning algorithms to address each of two problems.
- Problem 1: You have a large inventory of identical items. You want to predict how many of these items will sell over the next 3 months.
- Problem 2: You'd like software to examine individual customer accounts, and for each account decide if it has been hacked/compromised.
Should you treat these as classification or as regression problems?
Through the above summary,we can easly draw a conclusion that the first problem is regression problems, and the second problem is classification problems.
2、Unsupervised Learning (无监督学习)
In supervised learning, we were told explicitly what is the so-called right answer (the blue cycle or the red cross).
But in unsupervised learning, we give the data that doesn't have any labels (or that all have the same labels), and the task of the algorithm was to find some structure in the data.
For example, the unsupervised learning algorithm can break these data into two separate clusters:
And this is called a clustering algorithm (聚类算法). Here are some specific applications of clustering algorithm:
Cocktail party problem (鸡尾酒宴问题)
Imagine a dinner party, a room full of people, all sitting together and all talking at the same time. In this case, it's hard to hear the person in front of you. So let's make another assumption. Imagine a dinner party with only two people in the room, each with a microphone in front of them, and both speaking at the same time. Then a bored researcher recorded the voices of the two men.
When we turn on this recording, we hear two people talking. Similarly, in unsupervised learning, we input the voices of two people and find out some classification through some algorithm. The voice of the first person is separated from that of the second person. This algorithm is called the cocktail party algorithm.
Summary
- Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables.
- We can derive this structure by clustering the data based on relationships among the variables in the data.
- With unsupervised learning there is no feedback based on the prediction results.
Question
Of the following examples, which would you address using an unsupervised learning algorithm?
A. Given email labeled as spam/not spam, learn a spam filter. (supervised learning)
B. Given a set of news articles found on the web, group them into set of articles about the same story.(unsupervised learning)
C. Given a database of customer data, automatically discover market segments and group customs into different market segments.(unsupervised learning)
D. Given a dataset of patients diagnosed as either having diabetes or not, learn to classify new patients as having diabetes or not.(supervised learning)