Machine learning system design
Prioritizing what to work on: Spam classification example
Building a spam classifier
How should you spend your time to give the classifier a low error?
- Collect lots of data.
- Develop sophisticated features based on email routing information (from the email header).
- Develop sophisticated features for the message body.
- Develop a sophisticated algorithm to detect misspellings.
Error analysis
Recommended approach
- Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data.
- Plot learning curves to decide whether more data, more features, and so on are likely to help.
- Error analysis: Manually examine the examples (in the cross-validation set) that your algorithm made errors on. See if you can spot any systematic trend in the types of examples it gets wrong.
- Numerical evaluation: Try to find a single number that measures your algorithm's performance, so that you can quickly compare one version against another.
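The error-analysis step above can be sketched as a simple tally. This is only an illustration: the categories, the notes, and the `misclassified` records are hypothetical hand-assigned labels, not data from these notes.

```python
from collections import Counter

# Hypothetical records: cross-validation emails the classifier got wrong,
# each labelled by hand with a category and any cues that might have helped.
misclassified = [
    {"id": 1, "category": "pharma", "notes": ["deliberate misspelling"]},
    {"id": 2, "category": "phishing", "notes": ["unusual routing", "unusual punctuation"]},
    {"id": 3, "category": "phishing", "notes": ["unusual punctuation"]},
    {"id": 4, "category": "replica", "notes": []},
    {"id": 5, "category": "phishing", "notes": ["unusual routing"]},
    {"id": 6, "category": "other", "notes": ["unusual punctuation"]},
]


def tally_errors(examples):
    """Count misclassified examples per category and per hand-written note."""
    by_category = Counter(e["category"] for e in examples)
    by_note = Counter(note for e in examples for note in e["notes"])
    return by_category, by_note


by_category, by_note = tally_errors(misclassified)

# The most frequent categories and cues suggest where new features may pay off.
for category, count in by_category.most_common():
    print(f"{category}: {count}")
for note, count in by_note.most_common():
    print(f"{note}: {count}")
```

If one category or cue dominates the tally, that is a hint about which of the candidate improvements is worth trying first.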
Error metrics for skewed classes
Skewed classes: The ratio of positive to negative examples is very close to one of the two extremes, so one class is far rarer than the other.
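A small sketch of why plain accuracy can mislead on skewed classes. The data set here is made up (5 positives in 1000 examples), and `always_negative` is a hypothetical "classifier" that ignores its input.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples where the prediction matches the label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical skewed data set: only 0.5% of the examples are positive.
y_true = [1] * 5 + [0] * 995

# A "classifier" that always predicts the majority (negative) class.
always_negative = [0] * len(y_true)

acc = accuracy(y_true, always_negative)
detected = sum(1 for t, p in zip(y_true, always_negative) if t == 1 and p == 1)

print(f"accuracy: {acc:.1%}")
print(f"positives detected: {detected} of {sum(y_true)}")
```

The accuracy looks excellent even though not a single positive example is found, which is why a different metric is needed for skewed classes.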
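Precision (P): Of all patients where we predicted $y = 1$ (cancer), what fraction actually has cancer? That is, true positives divided by all predicted positives.
Recall (R): Of all patients that actually have cancer, what fraction did we correctly detect as having cancer? That is, true positives divided by all actual positives.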
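A minimal sketch of both metrics computed from true positives (TP), false positives (FP), and false negatives (FN). The labels and predictions are made-up values, and returning 0.0 when a denominator is zero is a convention assumed here, not something the notes specify.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: report 0.0 when the metric is undefined (zero denominator).
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical labels (1 = has cancer) and predictions for ten patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

Here TP = 2, FP = 1, and FN = 2, so precision is 2/3 and recall is 1/2. A classifier that always predicts 0 would score zero recall, no matter how high its accuracy.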
Trading off precision and recall
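By changing the threshold applied to the classifier's predicted probability, we can balance precision and recall. A higher threshold generally gives higher precision and lower recall; a lower threshold generally gives higher recall and lower precision.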
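The trade-off can be illustrated by sweeping the threshold over a fixed set of predicted probabilities. The `scores` and labels below are invented for illustration, and the helper name `precision_recall_at` is hypothetical.

```python
def precision_recall_at(y_true, scores, threshold):
    """Predict 1 when score >= threshold, then return (precision, recall)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical labels and predicted probabilities for eight examples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

for threshold in (0.3, 0.5, 0.9):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.9 moves precision up from 2/3 to 1 while recall falls from 1 to 1/4.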
F1 Score (F Score)
F1 Score: $F_1 = \frac{2PR}{P + R}$, where P is precision and R is recall.
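A short sketch of using the F1 score to compare classifiers, assuming the standard definition F1 = 2PR / (P + R). The three (precision, recall) pairs are illustrative values, and the names `A`, `B`, and `C` are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic-mean style combination of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# Hypothetical (precision, recall) pairs for three candidate classifiers.
candidates = {
    "A": (0.5, 0.4),
    "B": (0.7, 0.1),
    "C": (0.02, 1.0),  # e.g. predicts the positive class almost always
}

best_by_f1 = max(candidates, key=lambda name: f1_score(*candidates[name]))
best_by_mean = max(candidates, key=lambda name: sum(candidates[name]) / 2)

for name, (p, r) in candidates.items():
    print(f"{name}: mean={(p + r) / 2:.3f}, F1={f1_score(p, r):.3f}")
print("best by F1:", best_by_f1)
print("best by simple mean:", best_by_mean)
```

The simple average favours classifier C, whose precision is almost zero, while F1 favours A. F1 is only high when both precision and recall are reasonably high.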
Data for machine learning
How much data to train on?
There is a saying, "It's not who has the best algorithm that wins. It's who has the most data."
Large data rationale
Assume the features contain sufficient information to predict the label accurately.
Assume the training set is large enough to use a learning algorithm with many parameters, so that the model has low bias and is unlikely to overfit.