The Big Picture

Data analysts carry out their job of engaging data using a well refined, multi-phased process. The goal of their job is to interpret raw data by converting it into useful, and actionable intelligence. What does this process entail exactly?

Collecting Data from various sources
Wrangling Data to make it more reliable
Exploring Data using statistics and visualizations
Transforming Data to prepare it for modeling
Modeling Data using the right machine learning algorithms
Evaluating the results of the data models

Machine Learning In General

The traditional way programmers get computers to do something is by providing a very detailed list of instructions specific to the task they wish to accomplish. This is known as hard coding. A different approach would be to get the computer to accomplish the same task without specifically being programmed to, and instead by training it against data samples. Machine Learning is the name given to generalizable algorithms that enable a computer to carry out a task by examining data rather than hard programming.

With machine learning, results are attributed more towards your data than to the algorithm, since it's the data that guides the computer with what to do. In other words, the algorithm is generic; but the data is specific to the problem being solved. You might feed data into a machine learning algorithm about food your wife likes and food she doesn't like, and the task the algorithm accomplishes is learning how to differentiate between the two. On the other hand, she might feed the same algorithm data about clothes you like / dislike. Without altering a single line of code, the same algorithm can solve the new task based solely on the data!

Arthur Samuel coined the term "machine learning" in the 1950's, noting that a computer can exhibit a behavior just as a pet would after learning through intense training. This seeming ability for computers to 'learn' and 'discover', as well as their known ability to effortlessly crunch through millions of records, is precisely why machine learning and other aspects of data science are being applied to so many fields.

Where we would fail due to our inherent inability to comprehend even medium amounts of data (or to not get very, very bored while doing so), computers excel spectacularly. Further, with advances in storage capacity and computing speeds, machine learning performance increases by the day.

Types of Machine Learning

There are three main types of machine learning, however in this course you'll be focusing on two: unsupervised learning and supervised learning. These two types enable a computer to do a generic task by examining reams of data. To illustrate the difference between the two, let's do a case study:

Let's say your name is Angie, and your husband's name is Craig. You're both organized people and enjoy maintaining lists. In fact, you have a list of every company you've ever done business with, and some details about them, such as how reliable they were. Craig also has a list. His list is of every transaction he's ever made, along with some details about those transactions, such as the items transacted and their costs.

Unsupervised Learning

You and Craig decide to get into the computer resale business. Being data driven people, you wish to do so in an educated way, so you use your lists. Going through yours, you notice the companies that sell computers seem to all be located around some place called 'Silicon Valley'. You also notice the companies with the best reviews are those that pay attention to sleek case designs. Oh, except this one company called 'Fast-Computers', which has awesome reviews, but has computers that look like they were designed by an 8-year old.

Unsupervised learning is similar to this. Given a lot of data, the computer hasn't the slightest idea what any of it means. Yet, it's still able to figure out if there are any meaningful groupings and patterns within the data, along with instances where that data seems out of place!

Supervised Learning

Craig approaches the computer resale challenge differently. All he's really interested in is:

How much should you price a computer at, given its stats?
What price should you buy a computer for, to maximize your profits?
Looking through his data list, he discovers a correlation between a computer's processor speed, its storage space, and its cost. In fact, Craig is able to calculate a precise equation that models this! He's also able to create a set of "IF this AND that THEN transact" rules to decide when it's best to buy or sell a computer.

Unlike unsupervised learning where the computer had no idea what any of the data meant, with supervised learning, the computer is in charge of taking your data and then fitting rules and equations to it. Once it learns these generalized rules through a process called modeling, the rules can be applied to data the computer has never seen before.

Machine Learning And You

Part of being a great programmer is the ability to take complex ideas and abstract them down into solvable constituents. Machine learning operates similarly. Starting with your expertise in an area, look for interesting problems to tackle. Break those problems down into smaller pieces, so that they're either entirely solvable with machine learning, or at least partly so.

Not every problem is machine solvable, nor should every problem even be approached with machine learning! If your issue is directly solvable through some simple means, such as a few yes / no decisions, or if it does not require examining loads of data, there is probably a more fitting solution for you than machine learning. However if your issue is prohibitive on large scales, or requires fine tuning multiple knobs of potentially correlated parameters, then machine learning to the rescue. In the next chapter, we'll check out the types of problems machine learning is strong at solving.

Classification (Supervised)
The goal of classification is to find what class a sample belongs to.
A class could be something like Windows 10 Mobile, and a sample could be something like phone. To get classification working, you have to feed the machine learning algorithm a decent amount phone examples, some of them labeled Windows 10 Mobile, and others labeled, well... non-Windows 10 Mobile. With enough training samples, a classifier will eventually be able to generalize what similarities constitute a Windows 10 Mobile phone and voilà, you've trained a computer to figure out phone types!

Classification
Classification

More Examples
Given a list of emails marked spam and not-spam, figure out if a newly received message is actually spam or not.
Given many images of your friends, perform facial recognition on a brand-new, never-before seen image of one of your friends.
After being trained with a few books, decide which of the before seen authors wrote an article.
Given a list of physical symptoms, determine what ailment a person has.

Classification falls into the realm of supervised learning because in order for it to work, you have to guide the computer by proving it with examples of correctly labeled records. Once you're done training the computer, you can test it by seeing how accurately it scores those records.

Regression (Supervised)
The goal of regression is to predict a continuous-valued feature associated with many samples. Continuous-valued meaning small changes in the input result in small changes of the output.
Imagine hiking from Portland, OR to Seattle, WA. As time progresses, your distance from Portland increases and your distance to Seattle decreases. Even though you stop for meals and to rest, these distance values transition smoothly. Throughout the course of your hiking, you never magically teleport a large distance. Instead, you smoothly and incrementally make your way step-by-step.
With regression, a mathematical relationship is modeled for your samples so that as you gently alter an input feature, an output feature responds by being altered as well. For example, your time journeying and your distance to your destination.

Regression Example
Regression Example

More Examples
Calculate an equation to predict the size of a house given its price; or the price of a house given its size.
Explore if a correlation exists between the hours a student spends studying, spends watching TV, and their final exam score.
Estimate how many power plants should be constructed in the next 50 years, based upon the historical energy consumption per household.
Figure out how many days a person has left to live based on the severity of their symptoms.

Regression falls into the realm of supervised learning because you have to provide the computer with labeled samples. It then attempts to fit an equation to the samples' features.

Clustering (Unsupervised)
The goal of clustering is to automatically group similar samples into sets.
Since a clustering algorithm has no prior knowledge of how the sets should be defined, and furthermore, since the clustering process is unsupervised, the clustering algorithm needs to have a way to tell which samples are the most similar, so it can group them. It does this the same way we humans do: by looking at the various characteristics and features of the sample.

Clustering Example
Clustering Example

More Examples
Match similar people on a matrimonial site based on their profile question answers.
Based on search history, recommend houses a prospective home-buyer might be interested in considering.
Pinpoint the most likely location for a future earthquake using past earthquake seismic data.
Identify new characteristics shared by different people suffering from the same disease.

There are different types of clustering algorithms, some supervised, some unsupervised. There are even semi-supervised clustering methods as well. In this course, you'll only be dealing with unsupervised clustering. In other words, the clustering algorithm you will use won't need anything except your raw data. No labels hinting at desired clustering outcome will be provided to the algorithm.

Dimensionality Reduction (Unsupervised)
The goal of dimensionality reduction is to systematically and intelligently reduce the number of features considered in a dataset. Stated differently, trim the fat off. Often times, in one's eagerness to collect enough data for machine learning to be effective, you might add irrelevant features to your dataset. Bad features have the effect of hindering the machine learning process, and make your data harder to understand. Dimensionality reduction attempts to trim your dataset down to the bare essentials needed for decision-making.

Dimen Reduct Sample
Dimen Reduct Sample

More Examples
Given a 100 question survey, attempt to find the gist of what is truly being assessed; then rephrase it in just 5 questions.
Build a robot that can recognize pictures of similar objects, even if they are rotated at odd angles and orientations.
Compress a video stream by reducing the number of colors.
Summarize a long book.

Dimensionality reduction falls into the realm of unsupervised learning because you don't instruct the computer which features you want it to build; the computer infers this information automatically by examining your unlabeled data.

Reinforcement Learning
The goal of reinforcement learning is to maximize a cumulative reward function (or equivalently, minimize a cumulative cost function), given a set of actions and results. Reinforcement learning is modeled to mimic the way we learn in the real world. We try to solve problems using different techniques. Most of the time, nothing of merit results from our experiments. But occasionally, we stumble upon a set of actions that result in a sweet reward. When this happens, we attempt to repeat those actions that result in our getting rewarded. If we are rewarded yet again, we further associate those actions with the reward, and that is known as the reinforcement cycle. The entire process is also known as performance maximization.

Reinforcement Learning Sample
Reinforcement Learning Sample

More Examples
Discover how to fly a quadcopter by minimizing the function which evaluates the chances of crashing.
Learn to beat a video game like 'Super Mario Bros.' by minimizing the time it takes to get to the castle.
Attempt to take a photo and "re-draw" it in the style of a particular artist.
Automate the trading of stocks and securities by maximizing profit and minimizing transaction fees.

Reinforcement learning is actually a completely different category of learning from supervised and unsupervised learning. It's closer to supervised learning than it is to unsupervised learning, but you could get away with calling it semi-supervised learning. To learn more about reinforcement learning, take a look at the dive deeper section. We won't return to reinforcement learning in this class, but it's important to be aware of it as a data scientist. Much work has been done using reinforcement learning and deep neural networks that are of benefit to machine learning.

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 213,711评论 6 493
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 91,079评论 3 387
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 159,194评论 0 349
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 57,089评论 1 286
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 66,197评论 6 385
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,306评论 1 292
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,338评论 3 412
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,119评论 0 269
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,541评论 1 306
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,846评论 2 328
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,014评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,694评论 4 337
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,322评论 3 318
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,026评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,257评论 1 267
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,863评论 2 365
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,895评论 2 351

推荐阅读更多精彩内容

  • **2014真题Directions:Read the following text. Choose the be...
    又是夜半惊坐起阅读 9,444评论 0 23
  • PLEASE READ THE FOLLOWING APPLE DEVELOPER PROGRAM LICENSE...
    念念不忘的阅读 13,457评论 5 6
  • 根据著名辙学家大头一欧牛的研究,只要占以下任何一条,就过了一个美好的节日。 1)收到节日的祝福:不论是来自长辈、子...
    圣诞夜阅读 570评论 0 0
  • 不安和骚动是人类幸福的大敌,如果能够将它打败,那你就接近佛陀,并且将幸福余生。我这里有几个关于克服不安和骚动的建议...
    鸡仔说阅读 539评论 0 0
  • 武汉的昙华林里,未斑驳的老房子覆盖着年轻而缤纷的色彩,可街道仍然像很多年前一样安静,连墙上张牙舞爪的涂鸦也不能让它...
    宿宿Susu阅读 367评论 0 0