Data analysts carry out their job of engaging data using a well refined, multi-phased process. The goal of their job is to interpret raw data by converting it into useful, and actionable intelligence. What does this process entail exactly?
Collecting Data from various sources
Wrangling Data to make it more reliable
Exploring Data using statistics and visualizations
Transforming Data to prepare it for modeling
Modeling Data using the right machine learning algorithms
Evaluating the results of the data models
Machine Learning In General
The traditional way programmers get computers to do something is by providing a very detailed list of instructions specific to the task they wish to accomplish. This is known as hard coding. A different approach would be to get the computer to accomplish the same task without specifically being programmed to, and instead by training it against data samples. Machine Learning is the name given to generalizable algorithms that enable a computer to carry out a task by examining data rather than hard programming.
With machine learning, results are attributed more towards your data than to the algorithm, since it's the data that guides the computer with what to do. In other words, the algorithm is generic; but the data is specific to the problem being solved. You might feed data into a machine learning algorithm about food your wife likes and food she doesn't like, and the task the algorithm accomplishes is learning how to differentiate between the two. On the other hand, she might feed the same algorithm data about clothes you like / dislike. Without altering a single line of code, the same algorithm can solve the new task based solely on the data!
Arthur Samuel coined the term "machine learning" in the 1950's, noting that a computer can exhibit a behavior just as a pet would after learning through intense training. This seeming ability for computers to 'learn' and 'discover', as well as their known ability to effortlessly crunch through millions of records, is precisely why machine learning and other aspects of data science are being applied to so many fields.
Where we would fail due to our inherent inability to comprehend even medium amounts of data (or to not get very, very bored while doing so), computers excel spectacularly. Further, with advances in storage capacity and computing speeds, machine learning performance increases by the day.
Types of Machine Learning
There are three main types of machine learning, however in this course you'll be focusing on two: unsupervised learning and supervised learning. These two types enable a computer to do a generic task by examining reams of data. To illustrate the difference between the two, let's do a case study:
Let's say your name is Angie, and your husband's name is Craig. You're both organized people and enjoy maintaining lists. In fact, you have a list of every company you've ever done business with, and some details about them, such as how reliable they were. Craig also has a list. His list is of every transaction he's ever made, along with some details about those transactions, such as the items transacted and their costs.
Unsupervised Learning
You and Craig decide to get into the computer resale business. Being data driven people, you wish to do so in an educated way, so you use your lists. Going through yours, you notice the companies that sell computers seem to all be located around some place called 'Silicon Valley'. You also notice the companies with the best reviews are those that pay attention to sleek case designs. Oh, except this one company called 'Fast-Computers', which has awesome reviews, but has computers that look like they were designed by an 8-year old.
Unsupervised learning is similar to this. Given a lot of data, the computer hasn't the slightest idea what any of it means. Yet, it's still able to figure out if there are any meaningful groupings and patterns within the data, along with instances where that data seems out of place!
Supervised Learning
Craig approaches the computer resale challenge differently. All he's really interested in is:
How much should you price a computer at, given its stats?
What price should you buy a computer for, to maximize your profits?
Looking through his data list, he discovers a correlation between a computer's processor speed, its storage space, and its cost. In fact, Craig is able to calculate a precise equation that models this! He's also able to create a set of "IF this AND that THEN transact" rules to decide when it's best to buy or sell a computer.
Unlike unsupervised learning where the computer had no idea what any of the data meant, with supervised learning, the computer is in charge of taking your data and then fitting rules and equations to it. Once it learns these generalized rules through a process called modeling, the rules can be applied to data the computer has never seen before.
Machine Learning And You
Part of being a great programmer is the ability to take complex ideas and abstract them down into solvable constituents. Machine learning operates similarly. Starting with your expertise in an area, look for interesting problems to tackle. Break those problems down into smaller pieces, so that they're either entirely solvable with machine learning, or at least partly so.
Not every problem is machine solvable, nor should every problem even be approached with machine learning! If your issue is directly solvable through some simple means, such as a few yes / no decisions, or if it does not require examining loads of data, there is probably a more fitting solution for you than machine learning. However if your issue is prohibitive on large scales, or requires fine tuning multiple knobs of potentially correlated parameters, then machine learning to the rescue. In the next chapter, we'll check out the types of problems machine learning is strong at solving.
Classification (Supervised)
The goal of classification is to find what class a sample belongs to.
A class could be something like Windows 10 Mobile, and a sample could be something like phone. To get classification working, you have to feed the machine learning algorithm a decent amount phone examples, some of them labeled Windows 10 Mobile, and others labeled, well... non-Windows 10 Mobile. With enough training samples, a classifier will eventually be able to generalize what similarities constitute a Windows 10 Mobile phone and voilà, you've trained a computer to figure out phone types!
More Examples
Given a list of emails marked spam and not-spam, figure out if a newly received message is actually spam or not.
Given many images of your friends, perform facial recognition on a brand-new, never-before seen image of one of your friends.
After being trained with a few books, decide which of the before seen authors wrote an article.
Given a list of physical symptoms, determine what ailment a person has.
Classification falls into the realm of supervised learning because in order for it to work, you have to guide the computer by proving it with examples of correctly labeled records. Once you're done training the computer, you can test it by seeing how accurately it scores those records.
Regression (Supervised)
The goal of regression is to predict a continuous-valued feature associated with many samples. Continuous-valued meaning small changes in the input result in small changes of the output.
Imagine hiking from Portland, OR to Seattle, WA. As time progresses, your distance from Portland increases and your distance to Seattle decreases. Even though you stop for meals and to rest, these distance values transition smoothly. Throughout the course of your hiking, you never magically teleport a large distance. Instead, you smoothly and incrementally make your way step-by-step.
With regression, a mathematical relationship is modeled for your samples so that as you gently alter an input feature, an output feature responds by being altered as well. For example, your time journeying and your distance to your destination.
More Examples
Calculate an equation to predict the size of a house given its price; or the price of a house given its size.
Explore if a correlation exists between the hours a student spends studying, spends watching TV, and their final exam score.
Estimate how many power plants should be constructed in the next 50 years, based upon the historical energy consumption per household.
Figure out how many days a person has left to live based on the severity of their symptoms.
Regression falls into the realm of supervised learning because you have to provide the computer with labeled samples. It then attempts to fit an equation to the samples' features.
Clustering (Unsupervised)
The goal of clustering is to automatically group similar samples into sets.
Since a clustering algorithm has no prior knowledge of how the sets should be defined, and furthermore, since the clustering process is unsupervised, the clustering algorithm needs to have a way to tell which samples are the most similar, so it can group them. It does this the same way we humans do: by looking at the various characteristics and features of the sample.
More Examples
Match similar people on a matrimonial site based on their profile question answers.
Based on search history, recommend houses a prospective home-buyer might be interested in considering.
Pinpoint the most likely location for a future earthquake using past earthquake seismic data.
Identify new characteristics shared by different people suffering from the same disease.
There are different types of clustering algorithms, some supervised, some unsupervised. There are even semi-supervised clustering methods as well. In this course, you'll only be dealing with unsupervised clustering. In other words, the clustering algorithm you will use won't need anything except your raw data. No labels hinting at desired clustering outcome will be provided to the algorithm.
Dimensionality Reduction (Unsupervised)
The goal of dimensionality reduction is to systematically and intelligently reduce the number of features considered in a dataset. Stated differently, trim the fat off. Often times, in one's eagerness to collect enough data for machine learning to be effective, you might add irrelevant features to your dataset. Bad features have the effect of hindering the machine learning process, and make your data harder to understand. Dimensionality reduction attempts to trim your dataset down to the bare essentials needed for decision-making.
More Examples
Given a 100 question survey, attempt to find the gist of what is truly being assessed; then rephrase it in just 5 questions.
Build a robot that can recognize pictures of similar objects, even if they are rotated at odd angles and orientations.
Compress a video stream by reducing the number of colors.
Summarize a long book.
Dimensionality reduction falls into the realm of unsupervised learning because you don't instruct the computer which features you want it to build; the computer infers this information automatically by examining your unlabeled data.
Reinforcement Learning
The goal of reinforcement learning is to maximize a cumulative reward function (or equivalently, minimize a cumulative cost function), given a set of actions and results. Reinforcement learning is modeled to mimic the way we learn in the real world. We try to solve problems using different techniques. Most of the time, nothing of merit results from our experiments. But occasionally, we stumble upon a set of actions that result in a sweet reward. When this happens, we attempt to repeat those actions that result in our getting rewarded. If we are rewarded yet again, we further associate those actions with the reward, and that is known as the reinforcement cycle. The entire process is also known as performance maximization.
More Examples
Discover how to fly a quadcopter by minimizing the function which evaluates the chances of crashing.
Learn to beat a video game like 'Super Mario Bros.' by minimizing the time it takes to get to the castle.
Attempt to take a photo and "re-draw" it in the style of a particular artist.
Automate the trading of stocks and securities by maximizing profit and minimizing transaction fees.
Reinforcement learning is actually a completely different category of learning from supervised and unsupervised learning. It's closer to supervised learning than it is to unsupervised learning, but you could get away with calling it semi-supervised learning. To learn more about reinforcement learning, take a look at the dive deeper section. We won't return to reinforcement learning in this class, but it's important to be aware of it as a data scientist. Much work has been done using reinforcement learning and deep neural networks that are of benefit to machine learning.