Course syllabus
Week 1 - Introduction to data
Week 2 - Univariate data
Week 3 - Multivariate data
Week 4 - Populations and samples
Week 1
Course content
- Data can be numbers, images, words, audio
- Two key types of data: organic (process) data and data from designed data collection
- i.i.d. means independent and identically distributed
- When data are not i.i.d., the dependencies and differences must be accounted for in the analysis.
- variable types
- continuous vs discrete
- ordinal vs nominal
Quantitative discrete variables are numeric, measurable quantities with a set range of countable values.
Nominal variables consist of groups or names in which there is no inherent ordering.
Ordinal variables consist of groups or names with an inherent ordering or ranking.
- Data types in Python
- Introduction to libraries and data management
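A minimal sketch of how the variable types above might map onto built-in Python types. The records, field names and `EDUCATION_ORDER` are made up for illustration and are not from the course material. Note that nominal and ordinal values are both stored as `str`, so an ordinal variable needs its ordering stated explicitly.

```python
# Hypothetical survey records; field names and values are made up for illustration.
records = [
    {"height_cm": 172.5, "n_siblings": 2, "blood_type": "A", "education": "bachelor"},
    {"height_cm": 160.1, "n_siblings": 0, "blood_type": "O", "education": "high school"},
    {"height_cm": 181.0, "n_siblings": 1, "blood_type": "AB", "education": "master"},
]

# Ordinal variables need an explicit ordering; plain strings would sort alphabetically.
EDUCATION_ORDER = ["high school", "bachelor", "master"]


def python_types(record):
    """Map each field name to the name of the Python type that stores it."""
    return {name: type(value).__name__ for name, value in record.items()}


def sort_by_education(rows):
    """Sort rows by the rank of their ordinal 'education' value."""
    return sorted(rows, key=lambda row: EDUCATION_ORDER.index(row["education"]))


# height_cm: continuous (float), n_siblings: discrete (int),
# blood_type: nominal (str), education: ordinal (str plus EDUCATION_ORDER)
types = python_types(records[0])
ranked = [row["education"] for row in sort_by_education(records)]
print(types)
print(ranked)
```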
Week 2
Course content
- categorical data, tables, bar charts and pie charts
- histograms: shape, center, spread, outliers
- numerical summaries: min, 1st quartile (25%), median (50%), 3rd quartile (75%), max
- standard score (z-score) and the empirical rule (68-95-99.7 rule)
- Boxplots
Boxplots can hide gaps and clusters
- Seaborn library (sns)
sns.distplot().set()
sns.boxplot()
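The numerical summaries, standard score and empirical rule from this week can be sketched with the standard library `statistics` module (Python 3.8 or later). The sample data is made up. The 1.5 × IQR outlier fence is the common boxplot convention and an assumption here, since the notes do not state which rule the course uses.

```python
import statistics

# Made-up sample with one unusually large value.
data = [1, 3, 5, 7, 9, 11, 13, 15, 40]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


def iqr_outliers(values):
    """Flag values more than 1.5 * IQR outside the quartiles (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


def z_scores(values):
    """Standard score: distance from the mean in units of the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    normal = statistics.NormalDist()
    return normal.cdf(k) - normal.cdf(-k)


summary = five_number_summary(data)
outliers = iqr_outliers(data)
print("five-number summary:", summary)
print("outliers:", outliers)
for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.4f}")
```

The three printed shares are where the 68-95-99.7 rule comes from; the rule only applies to data that is roughly bell-shaped.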
Week 3
Course content
- Gathering multivariate categorical data
- Two way or contingency table
- Marginal and conditional distribution
- Two univariate bar charts, side-by-side bar chart, stacked bar chart, mosaic plot
- Association type: linear, quadratic, no association
- Positive linear association, negative linear association
- Association strength (weak, moderate, strong): measured by the Pearson correlation coefficient (commonly written r or ρ), a number between -1 and 1
Correlation does not imply causation
- Simpson's paradox
- Multivariate data selection
- Multivariate distributions
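A rough sketch of a two-way table, its marginal and conditional distributions, and the Pearson correlation, using only the standard library. The responses and all function names are hypothetical. In practice a library such as pandas would probably be used instead.

```python
from collections import Counter
import math

# Made-up (group, answer) pairs for two categorical variables.
responses = [
    ("student", "yes"), ("student", "yes"), ("student", "no"),
    ("staff", "yes"), ("staff", "no"), ("staff", "no"), ("staff", "no"),
]


def contingency_table(pairs):
    """Count how often each (row, column) combination occurs."""
    return Counter(pairs)


def marginal(pairs, axis):
    """Marginal distribution of one variable (axis 0 = first, 1 = second)."""
    counts = Counter(pair[axis] for pair in pairs)
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


def conditional(pairs, given):
    """Distribution of the second variable among pairs whose first value is `given`."""
    counts = Counter(col for row, col in pairs if row == given)
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


table = contingency_table(responses)
print(table)
print("marginal of answer:", marginal(responses, axis=1))
print("answer given student:", conditional(responses, "student"))
print("r for a perfect positive line:", pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```

Comparing `conditional(responses, "student")` with `conditional(responses, "staff")` is one way to see whether the two categorical variables look associated.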
Week 4
Course content
- Sampling from well-defined populations
Option 1: Conducting a population census
Option 2: Probability sampling
Option 3: Non-probability sampling
- Probability sampling
Simple random sampling
Complex samples
- Non-probability sampling
- Sampling distribution
- Sampling variance
- A sampling distribution is the distribution of all possible estimates that would arise from hypothetical repeated sampling. Larger sample sizes give a sampling distribution with less variance, so the estimates are more precise.
- Making population inference based on only one sample
- Inference for non-probability samples
- Complex samples (stratification)
- The empirical rule of distribution
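The claim that larger samples give a less variable sampling distribution can be checked with a small simulation. This is a sketch under stated assumptions: the population is synthetic (10,000 draws from a normal distribution with mean 50 and standard deviation 10), the estimate is the sample mean, and samples are simple random samples without replacement. The function name and parameters are hypothetical.

```python
import random
import statistics


def sampling_distribution(population, sample_size, n_samples, rng):
    """Means of repeated simple random samples drawn without replacement."""
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Synthetic population; any list of numbers would do.
population = [rng.gauss(50, 10) for _ in range(10_000)]

small = sampling_distribution(population, sample_size=10, n_samples=500, rng=rng)
large = sampling_distribution(population, sample_size=100, n_samples=500, rng=rng)

var_small = statistics.variance(small)
var_large = statistics.variance(large)
print(f"variance of sample means, n=10:  {var_small:.2f}")
print(f"variance of sample means, n=100: {var_large:.2f}")
```

In a real study only one sample is available, so this repeated sampling is hypothetical. The simulation only illustrates why a single larger sample tends to give a more precise estimate.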