[W1-01] Specialization Motivation

About this course

The key word in data science is "science", not "data"

An introduction to the key ideas behind working with data in a scientific way that produces new and reproducible insight
An introduction to the tools that allow you to execute a data analytic strategy, from raw data in a database to a complete report with interactive graphics
Hands-on practice

1 [W1-01] Specialization Motivation

1.1 Why do data science

Credit belongs to the person who is actually trying to ==get things done==, even when there are obstacles in the way.

It's important to strive valiantly to do these sorts of things, even if you're going to take some criticism.

1.2 The key challenge of Data Science

The heart of the philosophy of data science is ==answering questions with data==. The question should come first, and the data should follow.

Finding a problem worth answering
Having the right information to answer the question
Having the information in advance
Having the right amount of data (no more, no less)

Answering the question you are interested in with the data you have.

1.3 Why data science

Data deluge: data is much cheaper and easier to collect, store, and process
Big data: we have data in areas where we didn't have it before, which allows us to answer questions we never could before

1.4 Why statistical data science

Statistics is the science of learning from data
Statistics deals with the uncertainty involved in answering questions with data

1.5 Why now

Explosive growth of data in every possible area
Tools, competitions and websites are all developed around the idea of helping to learn from data
Huge investment in algorithm and prediction development
Opportunities to get involved in projects with very high-profile results

1.6 Why R

Increasingly the most commonly used programming language in data science
It has a comprehensive set of packages for every process involved in data science (from the rawest of raw files to interactive reports and web apps)
It is free
It has one of the best development environments: RStudio
It has an amazing ecosystem of developers
Packages are easy to install and integrate

1.7 Who is a data scientist

Someone who uses data to answer all kinds of questions

1.8 Goal

Data science Venn diagram

Hacking skills
- Computer programming: access, clean, analyze, and plot data
- Figure out answers for yourself

2 [W1-02] The Toolbox

2.1 What data scientists do

Define the question
Identify the ideal data set
Determine what data is accessible
Obtain data
Clean data
Perform exploratory data analysis, including plots and clustering, to identify patterns in the data set
Build statistical models
Interpret results
Synthesize and write up results
Create reproducible code
Distribute results
- Interactive graphics, write ups, presentations and interactive apps

2.2 Main workhorses

R, RStudio, R scripts, R Markdown
Git & GitHub (distributed version control)

3 [W1-03] Getting Help and Finding Answers

3.1 Asking questions

Often the fastest answer is the one you find yourself
Be an active participant in online communities (message boards, Stack Overflow, etc.)
- If you figure out the answer to a question, post it back to the message board

3.1.1 How to ask an R question.

What steps will reproduce the problem
What is the expected output and what do you get instead
What versions of R and of the R packages are you using
What operating system are you using

3.1.2 How to ask a data analysis question

What is the question you are trying to answer
What steps or tools did you use to answer it
What is the expected output and what do you get instead
What other solutions have you thought about

3.2 Find the answer for yourself

Google it or search on Stack Overflow
Post the solution you found

3.3 Getting help with R ( see Evernote )

3.4 Key characteristics of a hacker

Willing to find answers on their own
Knowledgeable about where to find answers (e.g. Cross Validated for data analysis/statistics)
Unintimidated by new data types or R packages
Unafraid to say they don't know
Polite but relentless

3.5 How to search

Stack Overflow with the "[r]" tag
R mailing list for software questions
Cross Validated for more general questions
Google "[data type] R package"

4 Types of Data Science Questions

4.1 Descriptive analysis

Goal: Describe a set of data

It is the first kind of data analysis performed
Most commonly applied to census data
The description and interpretation of the data are different steps
- Descriptions usually cannot be generalized without additional statistical modeling

4.2 Exploratory analysis

Goal: Find new relationships but not necessarily confirm them

Exploratory models are good for discovering new connections
Define future studies to confirm the findings
Exploratory analyses are usually not the final conclusion
Exploratory analysis alone should not be used for generalizing or predicting
Correlation does not imply causation

4.3 Inferential Analysis

Goal: Extrapolate or generalize from a small sample of data to a larger population

Inference is commonly the goal of statistical models
Inference involves estimating both the quantity you are interested in and the uncertainty about that estimate
Inference depends heavily on both the population and the sampling scheme
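
To make "estimate plus uncertainty" concrete, here is a minimal sketch in Python (standard library only). The simulated population, the sample size and the function name `estimate_mean` are illustrative assumptions, not part of the course material, and the interval uses the normal approximation (z = 1.96), which is only reasonable for fairly large samples.

```python
import math
import random
import statistics


def estimate_mean(sample, z=1.96):
    """Return (estimate, standard error, approximate 95% interval) for a mean.

    Assumes a simple random sample; z=1.96 is the normal approximation.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error: sample standard deviation divided by sqrt(n).
    se = statistics.stdev(sample) / math.sqrt(n)
    return mean, se, (mean - z * se, mean + z * se)


# Hypothetical population (e.g. heights in cm), simulated for illustration.
rng = random.Random(0)
population = [rng.gauss(170, 10) for _ in range(100_000)]

# Inference: observe only a small random sample, then generalize.
sample = rng.sample(population, 200)
mean, se, (low, high) = estimate_mean(sample)
print(f"estimate = {mean:.2f}, standard error = {se:.2f}")
print(f"approximate 95% interval = ({low:.2f}, {high:.2f})")
```

The interval is only meaningful because the sample was drawn at random from the population of interest. With a different sampling scheme the same arithmetic could be badly misleading.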

4.4 Predictive Analysis

Goal: Use the data on some objects to predict values for another object

If X predicts Y, it doesn't mean that X causes Y
Accurate prediction depends heavily on measuring the right variables
Although there are better and worse prediction models, more data and a simple model often work really well
Prediction is very hard, especially about the future

4.5 Causal Analysis

Goal: Find out what happens to one variable when you change another variable

Usually randomized studies are required to identify causation
There are approaches to inferring causation in non-randomized studies, but they are complicated and sensitive to assumptions
Causal relationships are usually identified as average effects, but may not apply to every individual
Causal models are usually the "gold standard" for data analysis

4.6 Mechanistic Analysis

Goal: Understand the exact changes in variables that lead to changes in other variables for individual objects

Incredibly hard to infer, except in simple situations
Usually applies to situations modeled by a deterministic set of equations (physical/engineering sciences)
Generally the random component of the data is measurement error
If the equations are known but the parameters are not, they may be inferred with data analysis

5 What is Data

5.1 Definition of Data

Data are values of qualitative or quantitative variables, belonging to a set of items

* set of items: sometimes called the population; the set of objects you are interested in; the set of things you make measurements on
* variables: a measurement or characteristic of an item
* qualitative: not necessarily ordered and not necessarily measured on a scale
* quantitative: usually measured on a continuous scale, with an ordering on that scale

5.2 Data is the Second Most Important Thing in Data Science

The most important thing in data science is the question, so the data should follow the question
Often the data will limit or enable the question
- You start with the question, but you may not have the data to answer it, so you have to modify the question
But having data is useless if you don't have a question

6 What about Big Data

One way to solve a big data problem is to wait until hardware catches up with the size of the data
Most questions you are trying to answer don't have a big data component that requires a huge number of computers
It is now possible to collect and analyze much more data, much more cheaply, than before
But the question is how much of that data is useful for answering the question that you are interested in
Regardless of size of data, you need the right data

7 Experimental Design

7.1 Why should we care

An exciting result can lead you astray if you are not very careful about experimental design and analysis

Things to be aware of when designing an experiment or running a data science project:

Know and care about the analysis plan
- Pay attention to every aspect of the design and analysis of the study, so that you are aware of the key issues that can trip you up, from data cleaning to data analysis to reporting
Have a plan for data and code sharing
- e.g. GitHub for code sharing
- e.g. figshare for data sharing
- The Leek group guide to data sharing on GitHub

7.2 Formulate your question in advance

This is the first and most important step in performing an experiment

7.3 Statistical inference

7.3.1

[Figure: image.png]

7.3.2 Confounding and spurious correlation

Pay attention to other variables that may be causing a relationship
Correlation is not causation
Even if you observe that two variables are correlated, you have to rule out that the correlation is due to some other variable that was not measured
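
A small simulation can show how a confounder produces a correlation between two variables that have no direct link. This is a sketch under stated assumptions: the variables `z`, `x` and `y`, the noise levels and the helper `pearson` are all invented for illustration, and the "adjustment" step simply subtracts the confounder because its effect is known by construction (in a real analysis it would have to be estimated, for example by regression).

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(42)

# z is the confounder. x and y both depend on z but not on each other.
z = [rng.gauss(0, 1) for _ in range(5000)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

# Ignoring z, x and y look strongly related.
raw = pearson(x, y)

# Removing the (known, simulated) effect of z makes the relationship vanish.
x_adj = [xi - zi for xi, zi in zip(x, z)]
y_adj = [yi - zi for yi, zi in zip(y, z)]
adjusted = pearson(x_adj, y_adj)

print(f"correlation ignoring z:      {raw:.2f}")
print(f"correlation after removing z: {adjusted:.2f}")
```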

7.3.3 Deal with potential confounders: Randomization and Blocking

Fix some variables
Stratify on some variables (blocking) and measure within each stratum
Randomize variables; the aim is to balance the confounding effect across groups
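
Blocking and randomization can be combined: stratify the units on a variable you suspect is a confounder, then randomize the treatment within each stratum. The sketch below is one simple way to do that in Python. The function name `block_randomize`, the participant IDs and the "young"/"old" age groups are hypothetical, chosen only for illustration.

```python
import random
from collections import defaultdict


def block_randomize(units, block_of, rng):
    """Assign treatment/control at random within each block (stratum).

    units    : list of hashable experimental units
    block_of : function mapping a unit to its block label
    rng      : random.Random instance, so the assignment is reproducible
    """
    blocks = defaultdict(list)
    for unit in units:
        blocks[block_of(unit)].append(unit)

    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        # First half of the shuffled block is treated, the rest are controls.
        for i, unit in enumerate(shuffled):
            assignment[unit] = "treatment" if i < half else "control"
    return assignment


# Hypothetical study: 20 participants, blocked on age group.
units = [(f"p{i:02d}", "young" if i < 10 else "old") for i in range(20)]
assignment = block_randomize(units, block_of=lambda u: u[1], rng=random.Random(7))

for unit in units:
    print(unit[0], unit[1], assignment[unit])
```

Because each age group is split evenly between the two arms, age cannot explain a difference between treatment and control, and the random assignment within blocks balances the unmeasured variables on average.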

7.4 Prediction

7.4.1

[Figure: image.png]

7.4.2 Prediction versus inference

For prediction, you need the distributions to be more separated
It is important to pay attention to the relative size of effects when considering prediction versus inference

7.4.3 Prediction key quantities

[Figure: image.png]

7.5 Data dredging

Data dredging (also data fishing, data snooping, and p-hacking) is the use of data mining to uncover patterns in data that can be presented as statistically significant, without first devising a specific hypothesis as to the underlying causality.
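
The danger is easy to demonstrate with a simulation: test enough unrelated variables against an outcome and some will look significant by chance alone. The sketch below is illustrative only. The function names, the sample size of 30 and the 1000 noise predictors are assumptions, and 0.361 is the approximate two-sided 5% critical value of Pearson's r for n = 30 (28 degrees of freedom).

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def count_spurious_hits(n_obs=30, n_predictors=1000, r_crit=0.361, seed=1):
    """Count pure-noise predictors whose |r| with the outcome exceeds r_crit."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_predictors):
        # Each predictor is random noise, unrelated to the outcome.
        noise = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(pearson(noise, outcome)) > r_crit:
            hits += 1
    return hits


hits = count_spurious_hits()
print(f"{hits} of 1000 pure-noise predictors pass the 5% threshold")
```

Roughly 5% of the noise predictors are expected to pass the threshold even though none of them is related to the outcome, which is why the hypothesis should be stated before looking at the data.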

Data Scientist's Toolbox
