Data Scientist's Toolbox

【W1-01】 Specialization Motivation

About this course

The key word in data science is "science", not "data"

  • An introduction to the key ideas behind working with data in a scientific way that will produce new and reproducible insights
  • An introduction to the tools that allow you to execute a data analytic strategy, from raw data in a database to a complete report with interactive graphics
  • Hands-on practice

1 [W1-01] Specialization Motivation

1.1 Why do data science

Credit belongs to the person who is actually trying to ==get things done==, even when there are obstacles in the way.

It is important to strive valiantly to do these sorts of things, even if you are going to take some criticism.

1.2 The key challenge of Data Science

The heart of the philosophy of data science is ==answering questions with data==. The question should come first, and the data should follow.

  • Finding a problem worth answering
  • Having the right information to answer the question
  • Having the information in advance
  • Having the right amount of data (no more and no less)

The challenge is to answer the question that you are interested in with the data that you have.

1.3 Why data science

  • Data deluge: data is much cheaper and easier to collect, store, and process
  • Big data: we now have data in areas where we did not have it before, which allows us to answer questions we never could before

1.4 Why statistical data science

  • Statistics is the science of learning from data
  • Statistics deals with the uncertainty that arises when answering questions with data

1.5 Why now

  • Explosive growth of data in every possible area
  • Tools, competitions, and websites are being developed around the idea of helping people learn from data
  • Huge investment in algorithm and prediction development
  • Opportunities to get involved in projects with very high-profile results

1.6 Why R

  • It is increasingly one of the most commonly used programming languages in data science
  • It has a comprehensive set of packages for every step of data science (from the rawest of raw files to interactive reports and web apps)
  • It is free
  • It has one of the best development environments: RStudio
  • It has an amazing ecosystem of developers
  • Packages are easy to install and integrate

1.7 Who is a data scientist

  • Someone who uses data to answer all kinds of questions

1.8 Goal

Data science Venn diagram
  • Hacking skills
    • Computer programming: access data, clean data, analyze data, and plot data
    • Figure out answers for yourself

2 [W1-02] The Toolbox

2.1 What data scientists do

  • Define the question
  • Identify the ideal data set
  • Determine what data is accessible
  • Obtain data
  • Clean data
  • Exploratory data analysis, including plots and clustering to identify patterns in the data set
  • Statistical modeling
  • Interpret results
  • Synthesize and write up results
  • Create reproducible code
  • Distribute results
    • Interactive graphics, write-ups, presentations, and interactive apps

2.2 Main workhorses

  • R, RStudio, R scripts, R Markdown
  • Git & GitHub (distributed version control)

3 [W1-03] Getting Help and Finding Answers

3.1 Asking questions

  • Often the fastest answer is the one you find yourself
  • Be an active participant in online community environments (message boards, Stack Overflow, etc.)
    • If you figure out the answer to a question, post it back to the message board

3.1.1 How to ask an R question

  • What steps will reproduce the problem
  • What is the expected output and what do you get instead
  • What version of R and of the R packages is being used
  • What operating system is being used

3.1.2 How to ask a data analysis question

  • What is the question you are trying to answer
  • What steps or tools did you use to answer it
  • What is the expected output and what do you get instead
  • What other solutions have you thought about

3.2 Find the answer for yourself

  • Google it or search on Stack Overflow
  • Post the solution you found

3.3 Getting help with R ( see Evernote )

3.4 Key characteristics of hackers

  • Willing to find answers on their own
  • Knowledgeable about where to find answers (e.g., Cross Validated for data analysis/statistics)
  • Unintimidated by new data types or R packages
  • Unafraid to say they don't know
  • Polite but relentless

3.5 How to search

  • Stack Overflow with the "[r]" tag
  • R mailing list for software questions
  • Cross Validated for more general questions
  • Google "[data type] R package"

4 Types of Data Science Questions

4.1 Descriptive analysis

Goal: Describe a set of data
  • It is the first kind of data analysis performed
  • Most commonly applied to census data
  • The description and interpretation of the data are different steps
    • Descriptions usually cannot be generalized without additional statistical modeling
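
As a small illustration of a purely descriptive analysis, the sketch below summarizes one variable with Python's standard `statistics` module. The `ages` values are made up for the example; nothing here generalizes beyond the six listed items.

```python
import statistics

# Hypothetical ages for a small group of survey respondents (illustrative data).
ages = [23, 35, 35, 41, 52, 60]

# Describe the data set itself; no claim is made about any larger population.
summary = {
    "n": len(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
    "stdev": statistics.stdev(ages),  # sample standard deviation
    "min": min(ages),
    "max": max(ages),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Interpreting these numbers (for example, deciding whether the group is "old") is a separate step from computing them.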

4.2 Exploratory analysis

Goal: Find new relationships but not necessarily confirm them
  • Exploratory models are good for discovering new connections
  • Define future studies to confirm the findings
  • Exploratory analyses are usually not the final conclusion
  • Exploratory analysis alone should not be used for generalizing or predicting
  • Correlation does not imply causation

4.3 Inferential Analysis

Goal: Extrapolate or generalize from a small sample of data to a larger population
  • Inference is commonly the goal of statistical models
  • Inference involves estimating both the quantity you are interested in and the uncertainty about the estimate
  • Inference depends heavily on both the population and the sampling scheme
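
A minimal sketch of these points, assuming a simulated population and a simple random sample (both are illustrative choices, not part of the course material): the sample mean is the estimate, and the standard error and an approximate 95% confidence interval express the uncertainty.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 heights in cm (simulated, not real data).
population = [rng.gauss(170, 10) for _ in range(100_000)]

# Simple random sample: every item has the same chance of being selected.
sample = rng.sample(population, 200)

# The quantity of interest and the uncertainty about it.
estimate = statistics.mean(sample)
std_error = statistics.stdev(sample) / math.sqrt(len(sample))

# Approximate 95% confidence interval based on the normal approximation.
ci_low = estimate - 1.96 * std_error
ci_high = estimate + 1.96 * std_error

print(f"estimate = {estimate:.2f} cm, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

If the sample were drawn in a biased way (say, only from a basketball team), the same arithmetic would still run but the interval would no longer describe the population.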

4.4 Predictive Analysis

Goal: To use the data on some objects to predict values for another object
  • If X predicts Y, it doesn't mean that X causes Y
  • Accurate prediction depends heavily on measuring the right variables
  • Although there are better and worse prediction models, more data and a simple model often work really well
  • Prediction is very hard, especially about the future
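
The idea of a simple model applied to new objects can be sketched as follows. The data are simulated under an assumed linear relationship, and the 80/20 split and variable names are illustrative: a least-squares line is fitted on one part of the data and its predictions are checked on held-out objects.

```python
import math
import random

rng = random.Random(0)

# Simulated data (an assumption for illustration): y depends linearly on x plus noise.
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]

# Hold out the last 20 observations to check predictions on unseen objects.
train_x, test_x = xs[:80], xs[80:]
train_y, test_y = ys[:80], ys[80:]

# Ordinary least squares for a single predictor.
mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y)) / sum(
    (x - mean_x) ** 2 for x in train_x
)
intercept = mean_y - slope * mean_x

# Predict the held-out values and measure the root mean squared error.
predictions = [intercept + slope * x for x in test_x]
rmse = math.sqrt(
    sum((p - y) ** 2 for p, y in zip(predictions, test_y)) / len(test_y)
)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, held-out RMSE = {rmse:.2f}")
```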

4.5 Causal Analysis

Goal: To find out what happens to one variable when you change another variable
  • Usually randomized studies are required to identify causation
  • There are approaches to inferring causation in non-randomized studies, but they are complicated and sensitive to assumptions
  • Causal relationships are usually identified on average effects, but may not apply to every individual
  • Causal models are usually the "gold standard" for data analysis

4.6 Mechanistic Analysis

Goal: Understand the exact changes in variables that lead to changes in other variables for individual objects
  • Incredibly hard to infer, except in simple situations
  • Usually applied in situations that are modeled by a deterministic set of equations (physical/engineering science)
  • Generally the random component of the data is measurement error
  • If the equations are known but the parameters are not, they may be inferred with data analysis
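
As a hedged example of the last point, take free fall, where the equation d = (g / 2) t^2 is known but the parameter g is treated as unknown. The measurements below are simulated with an assumed measurement error, and g is recovered by least squares.

```python
import random

rng = random.Random(1)

TRUE_G = 9.81  # m/s^2, used only to simulate the measurements

# Drop times in seconds and measured fall distances in metres.
# The only randomness is measurement error (standard deviation 0.05 m, assumed).
times = [0.5 + 0.25 * i for i in range(11)]  # 0.5, 0.75, ..., 3.0
distances = [0.5 * TRUE_G * t**2 + rng.gauss(0, 0.05) for t in times]

# The model d = (g / 2) * t^2 is linear in u = t^2 with no intercept, so the
# least-squares slope is sum(u * d) / sum(u * u), and g is twice that slope.
u = [t**2 for t in times]
slope = sum(ui * di for ui, di in zip(u, distances)) / sum(ui * ui for ui in u)
g_hat = 2 * slope

print(f"estimated g = {g_hat:.3f} m/s^2")
```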

5 What is Data

5.1 Definition of Data

Data are values of qualitative or quantitative variables, belonging to a set of items

* set of items: sometimes called the population; the set of objects you are interested in and make measurements on
* variables: a measurement or characteristic of an item
* qualitative: not necessarily ordered and not necessarily measured on a scale
* quantitative: usually measured on a continuous scale, with an ordering on that scale

5.2 Data is the Second Most Important Thing in Data Science

  • The most important thing in data science is the question, so the data should follow the question
  • Often the data will limit or enable the question
    • You may start with a question and find that you do not have the data to answer it, so you have to modify the question
  • But having data is useless if you don't have a question

6 What about Big Data

  • One way to solve a big data problem is to wait until hardware catches up with the size of the data
  • Most questions that you are trying to answer do not have a big data component that requires a huge number of computers
  • It is now possible to collect and analyze much more data, much more cheaply, than before
  • But the question is how much of that data is useful for answering the question that you are interested in
  • Regardless of the size of the data, you need the right data

7 Experimental Design

7.1 Why should we care

An exciting result can lead you astray if you are not very careful about experimental design and analysis

Things to be aware of when designing an experiment or a data science project:

  • Know and care about the analysis plan
    • Pay attention to every aspect of the design and analysis of the study, so that you are aware of the key issues that can trip you up, from data cleaning through data analysis to reporting
  • Have a plan for data and code sharing
    • e.g., GitHub for code sharing
    • e.g., figshare for data sharing
    • The Leek group guide to data sharing on GitHub

7.2 Formulate your question in advance

  • The first and most important step in performing an experiment

7.3 Statistical inference

7.3.1

image.png

7.3.2 Confounding and spurious correlation

  • Be careful about other variables that may be causing a relationship
  • Correlation is not causation
  • Even if you observe that two variables are correlated, you have to rule out the possibility that the correlation is driven by some other variable you did not measure
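
A small simulation can make the role of a confounder concrete. In this hypothetical setup, age drives both shoe size and reading score, and shoe size has no effect on reading at all. All numbers and the `pearson` helper are assumptions made for the sketch.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(7)

# Age (the confounder) drives both variables; they do not affect each other.
ages = [rng.choice([6, 8, 10, 12]) for _ in range(2000)]
shoe_size = [age + rng.gauss(0, 1) for age in ages]
reading = [5 * age + rng.gauss(0, 5) for age in ages]

overall = pearson(shoe_size, reading)

# Fix the confounder: look only at children of one age.
idx = [i for i, age in enumerate(ages) if age == 10]
within = pearson([shoe_size[i] for i in idx], [reading[i] for i in idx])

print(f"overall correlation:    {overall:.2f}")
print(f"within age 10 stratum:  {within:.2f}")
```

The overall correlation is strong, yet it largely disappears once age is held fixed, which is what a spurious correlation looks like.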

7.3.3 Deal with potential confounders: Randomization and Blocking

  • Fix some variables
  • Stratify some variables, taking measurements within each stratum
  • Randomize some variables; the aim is to balance the confounding effect
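
To see why randomization helps, the sketch below compares two hypothetical study designs on simulated patients. The `severity` confounder, the effect size, and the assignment rules are all assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(3)
TRUE_EFFECT = 1.0  # assumed treatment effect used to simulate outcomes


def outcome(severity, treated):
    # Outcome worsens with severity (the confounder) and improves with treatment.
    return -2.0 * severity + TRUE_EFFECT * treated + rng.gauss(0, 1)


def difference_in_means(assign):
    """Simulate a study and return mean(treated) - mean(control)."""
    treated, control = [], []
    for _ in range(5000):
        severity = rng.random()  # uniform on [0, 1)
        is_treated = assign(severity)
        (treated if is_treated else control).append(outcome(severity, is_treated))
    return statistics.mean(treated) - statistics.mean(control)


# Confounded design: sicker patients are more likely to receive the treatment.
confounded = difference_in_means(lambda severity: rng.random() < severity)

# Randomized design: a coin flip decides, independent of severity.
randomized = difference_in_means(lambda severity: rng.random() < 0.5)

print(f"true effect:         {TRUE_EFFECT:.2f}")
print(f"confounded estimate: {confounded:.2f}")
print(f"randomized estimate: {randomized:.2f}")
```

In the confounded design the treated group is sicker on average, so the treatment looks weaker than it is; the coin flip balances severity across the two groups.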

7.4 Prediction

7.4.1

image.png

7.4.2 Prediction versus inference

  • For prediction, the distributions of the groups need to be more clearly separated than for inference
  • It is important to pay attention to the relative size of effects when considering prediction versus inference
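
The difference can be sketched with two simulated groups whose means differ only slightly (the group sizes and the 0.2 shift are assumed values). With enough data the difference in means is clearly detectable, yet predicting which group an individual belongs to is barely better than guessing.

```python
import math
import random
import statistics

rng = random.Random(11)
n = 5000

# Two simulated groups whose means differ by only 0.2 standard deviations.
group_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
group_b = [rng.gauss(0.2, 1.0) for _ in range(n)]

# Inference: is the difference in means distinguishable from zero?
diff = statistics.mean(group_b) - statistics.mean(group_a)
std_error = math.sqrt(
    statistics.variance(group_a) / n + statistics.variance(group_b) / n
)
z = diff / std_error

# Prediction: guess the group of each individual from its value, using the
# midpoint of the two sample means as the cut-off.
cutoff = (statistics.mean(group_a) + statistics.mean(group_b)) / 2
correct = sum(x < cutoff for x in group_a) + sum(x >= cutoff for x in group_b)
accuracy = correct / (2 * n)

print(f"difference = {diff:.3f}, z = {z:.1f}")
print(f"classification accuracy = {accuracy:.3f}")
```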

7.4.3 Prediction key quantities

image.png
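
The figure for this subsection is not reproduced in these notes. Assuming it refers to the usual quantities derived from a table of predicted versus actual outcomes, a minimal sketch with made-up counts looks like this:

```python
# Hypothetical counts from comparing predictions with the truth.
true_pos, false_neg = 40, 10    # actual positives: 50
false_pos, true_neg = 20, 130   # actual negatives: 150

sensitivity = true_pos / (true_pos + false_neg)   # P(predicted + | actually +)
specificity = true_neg / (true_neg + false_pos)   # P(predicted - | actually -)
ppv = true_pos / (true_pos + false_pos)           # P(actually + | predicted +)
npv = true_neg / (true_neg + false_neg)           # P(actually - | predicted -)
accuracy = (true_pos + true_neg) / (true_pos + false_neg + false_pos + true_neg)

for name, value in [
    ("sensitivity", sensitivity),
    ("specificity", specificity),
    ("positive predictive value", ppv),
    ("negative predictive value", npv),
    ("accuracy", accuracy),
]:
    print(f"{name}: {value:.3f}")
```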

7.5 Data dredging

Data dredging (also data fishing, data snooping, and p-hacking) is the use of data mining to uncover patterns in data that can be presented as statistically significant, without first devising a specific hypothesis as to the underlying causality.
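
A rough simulation shows why this is dangerous. Every comparison below is drawn from the same distribution, so there is nothing to find, yet a fraction of them still cross the usual significance threshold. The group sizes, the number of comparisons, and the z-test approximation are assumptions of the sketch.

```python
import math
import random
import statistics

rng = random.Random(2024)


def z_statistic(a, b):
    """Two-sample z statistic for a difference in means."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


n_tests = 1000
false_positives = 0
for _ in range(n_tests):
    # Both groups come from the same distribution, so there is no real effect.
    group_a = [rng.gauss(0, 1) for _ in range(30)]
    group_b = [rng.gauss(0, 1) for _ in range(30)]
    if abs(z_statistic(group_a, group_b)) > 1.96:  # roughly the 5% level
        false_positives += 1

print(f"{false_positives} of {n_tests} null comparisons look significant")
```

Roughly one comparison in twenty is expected to look significant by chance alone, so reporting only those without a prior hypothesis would present noise as a finding.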
