【W1-01】 Specialization Motivation
About this course
The key word in data science is "science", not "data"
- An introduction to the key ideas behind working with data in a scientific way that will produce new and reproducible insight
- An introduction to the tools that allow you to execute a data analytic strategy, from raw data in a database to a complete report with interactive graphics
- Hands-on practice
1 [W1-01] Specialization Motivation
1.1 Why do data science
Credit belongs to the person who is actually trying to ==get things done==, even when there are obstacles in the way.
It's important to strive valiantly to do these sorts of things, even if you're going to take some criticism.
1.2 The key challenge of Data Science
The heart of the data science philosophy is ==answering questions with data==. The question should come first, and the data should follow.
- Finding a problem worth answering
- Having the right information to answer the question
- Having the information in advance
- Having the right amount of data (no more and no less)
The challenge is to answer the question you are interested in with the data that you have.
1.3 Why data science
- Data deluge: data is now much cheaper and easier to collect, store, and process
- Big data: we have data in areas where we didn't have it before, which lets us answer questions we never could before
1.4 Why statistical data science
- Statistics is the science of learning from data
- Statistics deals with the uncertainty that arises when answering questions with data
1.5 Why now
- Explosive growth of data in every possible area
- Tools, competitions, and websites are being developed around the idea of helping people learn from data
- Huge investment in algorithm and prediction development
- Opportunities to get involved in projects with very high-profile results
1.6 Why R
- It is increasingly one of the most commonly used programming languages in data science
- It has a comprehensive set of packages for every step of data science (from the rawest of raw files to interactive reports and web apps)
- It is free
- It has one of the best development environments, RStudio
- It has an amazing ecosystem of developers
- Packages are easy to install and integrate
1.7 Who is a data scientist
- Someone who uses data to answer all kinds of questions
1.8 Goal
- Hacking skills
- Computer programming: access data, clean data, analyze data, and plot data
- Figure out answers for yourself
2 [W1-02] The Toolbox
2.1 What data scientists do
- Define the question
- Identify the ideal data set
- Determine what data is accessible
- Obtain data
- Clean data
- Exploratory data analysis, including making plots and clustering to identify patterns in the data set
- Statistical modeling
- Interpret results
- Synthesize and write up results
- Create reproducible code
- Distribute results
- Interactive graphics, write-ups, presentations, and interactive apps
2.2 Main workhorses
- R, RStudio, R scripts, R Markdown
- Git & GitHub (distributed version control)
3 [W1-03] Getting Help and Finding Answers
3.1 Asking questions
- Often the fastest answer is the one you find yourself
- Be an active participant in online communities (message boards, Stack Overflow, etc.)
- If you figure out an answer to a question, post it back to the message board
3.1.1 How to ask an R question
- What steps will reproduce the problem
- What is the expected output and what do you get instead
- What versions of R and of the R packages are you using
- What operating system are you using
3.1.2 How to ask a data analysis question
- What is the question you are trying to answer
- What steps or tools did you use to answer it
- What is the expected output and what do you get instead
- What other solutions have you thought about
3.2 Find the answer for yourself
- Google it or search on Stack Overflow
- Post the solution you found
3.3 Getting help with R ( see Evernote )
3.4 Key characteristics of hackers
- Willing to find answers on their own
- Knowledgeable about where to find answers (e.g. CrossValidated for data analysis/statistics)
- Unintimidated by new data types or R packages
- Unafraid to say they don't know
- Polite but relentless
3.5 How to search
- Stack Overflow with the "[r]" tag
- R mailing list for software questions
- CrossValidated for more general questions
- Google "[data type] R package"
4 Types of Data Science Questions
4.1 Descriptive analysis
Goal: Describe a set of data
- It is the first kind of data analysis performed
- Most commonly applied to census data
- The description and interpretation of the data are different steps
- Descriptions usually cannot be generalized without additional statistical modeling
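A minimal sketch of a descriptive analysis, written in Python with only the standard library (the course itself works in R). The `ages` values are made up for illustration; the summary describes only these eight values and says nothing about a wider population.

```python
import statistics

# Hypothetical sample: ages of eight survey respondents (made-up values).
ages = [23, 35, 31, 47, 52, 29, 35, 40]

# Describe the data set itself; no generalization is attempted.
summary = {
    "n": len(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "stdev": statistics.stdev(ages),  # sample standard deviation
    "min": min(ages),
    "max": max(ages),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```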
4.2 Exploratory analysis
Goal: Find new relationships but not necessarily confirm them
- Exploratory models are good for discovering new connections
- Define future studies to confirm the findings
- Exploratory analyses are usually not the final conclusion
- Exploratory analyses alone should not be used for generalizing or predicting
- Correlation does not imply causation
4.3 Inferential Analysis
Goal: Extrapolate or generalize from a small sample of data to a larger population
- Inference is commonly the goal of statistical models
- Inference involves estimating both the quantity you are interested in and the uncertainty of that estimate
- Inference depends heavily on both the population and the sampling scheme
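The idea of estimating a quantity together with its uncertainty can be sketched as follows (Python, standard library only; the course itself uses R). The simulated population, the sample size of 200, and the normal-approximation 95% interval are all assumptions chosen for illustration.

```python
import math
import random
import statistics

random.seed(1)

# Hypothetical population: 100,000 values with true mean 50 and sd 10.
population = [random.gauss(50, 10) for _ in range(100_000)]

# Sampling scheme: a simple random sample of 200 items.
sample = random.sample(population, 200)

# Estimate of the population mean, and the uncertainty of that estimate.
estimate = statistics.mean(sample)
std_error = statistics.stdev(sample) / math.sqrt(len(sample))

# Approximate 95% confidence interval (normal approximation).
ci_low = estimate - 1.96 * std_error
ci_high = estimate + 1.96 * std_error

print(f"estimate = {estimate:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

A different sampling scheme (for example, sampling only the largest values) would make the same formulas give a misleading answer, which is why the scheme matters as much as the arithmetic.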
4.4 Predictive Analysis
Goal: Use the data on some objects to predict values for another object
- If X predicts Y, it doesn't mean that X causes Y
- Accurate prediction depends heavily on measuring the right variables
- Although there are better and worse prediction models, more data and a simple model often work really well
- Prediction is very hard, especially about the future
4.5 Causal Analysis
Goal: To find out what happens to one variable when you change another variable
- Usually randomized studies are required to identify causation
- There are approaches to inferring causation in non-randomized studies, but they are complicated and sensitive to assumptions
- Causal relationships are usually identified as average effects, but may not apply to every individual
- Causal models are usually the "gold standard" for data analysis
4.6 Mechanistic Analysis
Goal: Understand the exact changes in variables that lead to changes in other variables for individual objects
- Incredibly hard to infer, except in simple situations
- Usually in situations that are modeled by a deterministic set of equations ( physical/engineering science)
- Generally the random component of the data is measurement error
- If the equations are known but the parameters are not, they may be inferred with data analysis
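As a hedged illustration of the last two points, the Python sketch below (standard library only; the course itself uses R) assumes a known deterministic equation, free fall with d = 0.5 * g * t^2, treats the parameter g as unknown, and recovers it from simulated measurements whose only randomness is measurement error. The constants `TRUE_G` and `NOISE_SD` are assumptions made up for the simulation.

```python
import random

random.seed(0)

TRUE_G = 9.81    # assumed parameter, used only to simulate the data
NOISE_SD = 0.05  # assumed measurement error (metres)

times = [0.1 * k for k in range(1, 21)]  # 0.1 s, 0.2 s, ..., 2.0 s

# Known equation d = 0.5 * g * t^2, plus random measurement error.
distances = [0.5 * TRUE_G * t**2 + random.gauss(0, NOISE_SD) for t in times]

# Least squares through the origin for d = g * u, where u = 0.5 * t^2.
u = [0.5 * t**2 for t in times]
g_hat = sum(ui * di for ui, di in zip(u, distances)) / sum(ui * ui for ui in u)

print(f"estimated g = {g_hat:.3f}")
```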
5 What is Data
5.1 Definition of Data
Data are values of qualitative or quantitative variables, belonging to a set of items
* set of items: sometimes called the population; the set of objects you are interested in; the set of things you make measurements on
* variables: a measurement or characteristic of an item
* qualitative: not necessarily ordered and not necessarily measured on a scale
* quantitative: usually measured on a continuous scale, with an ordering on that scale
5.2 Data is the Second Most Important Thing in Data Science
- The most important thing in data science is the question, so the data should follow the question
- Often the data will limit or enable the question
- Start with the question; if you do not have the data to answer it, you may have to modify the question
- But having data is useless if you don't have a question
6 What about Big Data
- One way to solve a big data problem is to wait until hardware catches up with the size of the data
- Most questions you are trying to answer don't have a big data component that requires a huge number of computers
- It is now possible to collect and analyze much more data, much more cheaply, than before
- But the question is how much of that data is useful for answering the question that you are interested in
- Regardless of the size of the data, you need the right data
7 Experimental Design
7.1 Why should we care
An exciting result can lead you astray if you are not very careful about experimental design and analysis
Things to be aware of when designing an experiment or running a data science project:
- Know and care about the analysis plan
- Pay attention to every aspect of the study's design and analysis, from data cleaning to analysis to reporting, so that you know which key issues can trip you up
- Have a plan for data and code sharing
- e.g. GitHub for code sharing
- e.g. figshare for data sharing
- The Leek group guide to data sharing on GitHub
7.2 Formulate your question in advance
- This is the first and most important step in performing an experiment
7.3 Statistical inference
7.3.1
7.3.2 Confounding and spurious correlation
- Be careful about other variables that may be causing a relationship
- Correlation is not causation
- Even if you observe that two variables are correlated with each other, you still have to rule out that the correlation is due to some other variable you didn't measure
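A small simulation can make the confounding point concrete. In this Python sketch (standard library only; the course itself uses R), a hypothetical variable `z` drives both `x` and `y`, which have no direct link to each other. The helper functions `pearson` and `residuals` are written here for illustration and are not from any library.

```python
import math
import random
import statistics

random.seed(2)

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def residuals(y, x):
    """Residuals of y after a simple linear regression on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = sum((u - mx) * (v - my) for u, v in zip(x, y)) / sum((u - mx) ** 2 for u in x)
    return [v - my - slope * (u - mx) for u, v in zip(x, y)]

n = 5000
z = [random.gauss(0, 1) for _ in range(n)]   # the confounder
x = [zi + random.gauss(0, 1) for zi in z]    # depends on z only
y = [zi + random.gauss(0, 1) for zi in z]    # depends on z only

raw_corr = pearson(x, y)                                   # clearly positive
adjusted_corr = pearson(residuals(x, z), residuals(y, z))  # near zero

print(f"raw correlation:      {raw_corr:.2f}")
print(f"adjusted for z:       {adjusted_corr:.2f}")
```

The adjustment is only possible here because `z` was measured; an unmeasured confounder would leave the raw correlation looking like a real relationship.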
7.3.3 Deal with potential confounders: Randomization and Blocking
- Fix some variables
- Stratify on some variables and measure within each stratum
- Randomize variables; the aim is to balance the confounding effect across groups
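The sketch below illustrates randomization and blocking in Python (standard library only; the course itself uses R). The subjects, their ages, and the cut at age 40 are hypothetical; age stands in for any potential confounder.

```python
import random
import statistics

random.seed(3)

# Hypothetical subjects; age is a potential confounder.
ages = [random.gauss(40, 12) for _ in range(1000)]

# Randomization: shuffle subject indices; the first half gets the treatment.
order = list(range(len(ages)))
random.shuffle(order)
treated = set(order[: len(order) // 2])

mean_treated = statistics.fmean([a for i, a in enumerate(ages) if i in treated])
mean_control = statistics.fmean([a for i, a in enumerate(ages) if i not in treated])
age_gap = abs(mean_treated - mean_control)  # small on average, not guaranteed

# Blocking: split subjects into age blocks, then randomize within each block,
# so every block contributes (almost) equally to both groups.
blocks = {"under_40": [], "40_plus": []}
for i, a in enumerate(ages):
    blocks["under_40" if a < 40 else "40_plus"].append(i)

blocked_treated = set()
for members in blocks.values():
    random.shuffle(members)
    blocked_treated.update(members[: len(members) // 2])

print(f"age gap between randomized groups: {age_gap:.2f} years")
```

Randomization balances confounders only on average, including ones you did not measure; blocking forces balance on the variables you chose to block on.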
7.4 Prediction
7.4.1
7.4.2 Prediction versus inference
- For prediction, you need the distributions to be more separated
- It is important to pay attention to the relative size of effects when considering prediction versus inference
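To illustrate why prediction needs well-separated distributions, this Python sketch (standard library only; the course itself uses R) compares two simulated groups. The shifts of 0.2 and 3.0 standard deviations, the group size, and the midpoint classifier are assumptions chosen for illustration, and accuracy is measured on the same data used to set the cutoff.

```python
import math
import random
import statistics

random.seed(4)

def compare(shift, n=5000):
    """Return (z statistic for the mean difference, accuracy of a midpoint classifier)."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(shift, 1) for _ in range(n)]
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)

    # Inference: is the difference in means distinguishable from zero?
    se = math.sqrt(statistics.variance(a) / n + statistics.variance(b) / n)
    z = (mean_b - mean_a) / se

    # Prediction: guess the group of each value using the midpoint as a cutoff.
    cutoff = (mean_a + mean_b) / 2
    correct = sum(v < cutoff for v in a) + sum(v >= cutoff for v in b)
    return z, correct / (2 * n)

z_small, acc_small = compare(shift=0.2)  # small effect, overlapping distributions
z_large, acc_large = compare(shift=3.0)  # large effect, separated distributions

print(f"small effect: z = {z_small:.1f}, accuracy = {acc_small:.2f}")
print(f"large effect: z = {z_large:.1f}, accuracy = {acc_large:.2f}")
```

With enough data, even the small effect is easy to detect, yet it barely helps in predicting which group an individual value came from.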
7.4.3 Prediction key quantities
7.5 Data dredging
Data dredging (also data fishing, data snooping, and p-hacking) is the use of data mining to uncover patterns in data that can be presented as statistically significant, without first devising a specific hypothesis as to the underlying causality.
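A hedged Python sketch of how dredging produces "findings" from pure noise (standard library only; the course itself uses R). Every predictor here is random and unrelated to the outcome, and the number of observations, the number of predictors, and the approximate critical value `T_CRIT` are assumptions chosen for illustration.

```python
import math
import random
import statistics

random.seed(5)

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

n_obs, n_predictors = 50, 400
T_CRIT = 2.01  # approximate two-sided 5% critical value of t with 48 df

# The outcome is pure noise.
outcome = [random.gauss(0, 1) for _ in range(n_obs)]

n_significant = 0
for _ in range(n_predictors):
    # Each predictor is also pure noise, unrelated to the outcome.
    predictor = [random.gauss(0, 1) for _ in range(n_obs)]
    r = pearson(predictor, outcome)
    t = r * math.sqrt((n_obs - 2) / (1 - r * r))  # t statistic for a correlation
    if abs(t) > T_CRIT:
        n_significant += 1

print(f"{n_significant} of {n_predictors} noise predictors look 'significant' at the 5% level")
```

Roughly 5% of the tests are expected to pass the threshold by chance alone, which is why a hypothesis stated before looking at the data matters.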