Introduction to Data Science
Before we start
Libraries that we are using
There's bunch of library that data scientist will use during the research, here are some important libraries:
- NumPy: provides multidimensional array with basic operation, linear algebra function
- SciPy: provides numerical algorithms and domain-specific toolbox. Signal processing, optimization, statistics...
- Matplotlib: data visualization
- SCIKIT-Learn: machine learning library, classification, regression, clustering, dimensiality reduction, model selection, and preprocessing.
- PANDAS: data analysis library. Importing CSV, text file, xls, SQL databases, and some other files. It can handling missing data and intelligent data alignment.
Python version and import toolboxes
This book use Anaconda distribution of python. It has integrated some of the most imoportant package that we will use. Detail about installing will not be covered in this note. And we are using Jupyter Notebook Environment. All the code was on python 2.7. The reason why we don't use python 3.6 is the poor 3rd party module support. Everything runs fine with 2.7.
Begin with python
Remember to import the toolboxes that we will use in the program
import pandas as pd
import numpy as np
import matplotlib.pylot as plt
#so that we can use short cut pd, np, plt
To run the code in just one cell, use Cell
-> Run
or just press Ctrl+Enter
To add a cell below, use Insert
-> Insert Cell Below
or just press Ctrl+B
Data input and output
The data structure of Pandas called DataFrame
object. It's basiclly a tabular data structure, that's rows and columns.
Colums are called series
, consist of a list of serval values.
e.g.
data = { 'year': [
2010, 2011, 2012,
2010, 2011, 2012,
2010, 2011, 2012
],
'team': [
'FCBarcelona', 'FCBarcelona',
'FCBarcelona', 'RMadrid',
'RMadrid', 'RMadrid',
'ValenciaCF', 'ValenciaCF',
'ValenciaCF'
],
'wins': [30, 28, 32, 29, 32, 26, 21, 17, 19],
'draws': [6, 7, 4, 5, 4, 7, 8, 10, 8],
'losses': [2, 3, 2, 4, 2, 5, 9, 11, 11]
}
football = pd.DataFrame(data, columns = [
'year', 'team', 'wins', 'draws', 'losses'
]
)
index | year | team | wins | draws | losses |
---|---|---|---|---|---|
0 | 2010 | FCBarcelona | 30 | 6 | 2 |
1 | 2011 | FCBarcelona | 28 | 7 | 3 |
2 | 2012 | FCBarcelona | 32 | 4 | 2 |
3 | 2010 | RMadrid | 29 | 5 | 4 |
4 | 2011 | RMadrid | 32 | 4 | 2 |
5 | 2012 | RMadrid | 26 | 7 | 5 |
6 | 2010 | ValenciaCF | 21 | 8 | 9 |
7 | 2011 | ValenciaCF | 17 | 10 | 11 |
8 | 2012 | ValenciaCF | 19 | 8 | 11 |
this is what the result looks like
index | year | team | wins | draws | losses |
---|---|---|---|---|---|
0 | 2010 | FCBarcelona | 30 | 6 | 2 |
1 | 2011 | FCBarcelona | 28 | 7 | 3 |
2 | 2012 | FCBarcelona | 32 | 4 | 2 |
3 | 2010 | RMadrid | 29 | 5 | 4 |
4 | 2011 | RMadrid | 32 | 4 | 2 |
5 | 2012 | RMadrid | 26 | 7 | 5 |
6 | 2010 | ValenciaCF | 21 | 8 | 9 |
7 | 2011 | ValenciaCF | 17 | 10 | 11 |
8 | 2012 | ValenciaCF | 19 | 8 | 11 |
We can not only use dictionary to input data, but also a CSV file.
e.g.
edu = pd.read_csv('files/Data.csv',
na_values = ':',
usecols = ["A","B","C"])
edu
na_values
set the character that represents NaN
usecols
set the name of the columns.
You can also use Pandas to read some other files, by using read_excel()
, read_hdf()
, read_table()
, read_clipboard()
by calling edu.head()
edu.tail()
we can see the first 5 rows and the last 5 rows of the Data in edu
.
If we want to know the name of the column or the index or the value, call attribute columns
, index
, value
.
If you want a quick view how this statistic is like. call edu.describe()
It give us the count, mean, standard diviation, min, max, 25% 50% 75% quantile of each column or series
Selecting Data
Select or filter some of the data normally use square bracket [ ].
# select just the column name "Value"
edu[`Value`]
#select some rows
edu[10: 14]
# select some rows of some suluns
edu.i[90: 94, ['TIME', 'GEO']]
We can also fiter out some of the data that we don't want. Criteria for filtering can use all the Boolean operator.
# we want the Value larger than 6.5, and just last 5 row of the data
edu[edu['Value'] > 6.5].tail()
to represent NaN
using isnull()
edu[edu["Value"].isnull]
Manipulating Data
Aggregation function
We can call the aggregation functions to get some information about the data that we want to know about.
Here're some commenly used functions:count()
, sum()
, mean()
, median()
, max()
, min()
, std()
, var()
, prod()
-> product of values
They have a common parameter axis.
axis = 0
means function apply to the rows for each column.
axis = 1
means function apply to the columns for each row. (Normally this don't make sence)
Note that the max function of Pandas and of Python are different.
In Pandas, NaN
are exclude.
max function of python actually regard
NaN
as maximum
print "Pandas max function:" edu['Value'].max() #return 8.81
print "Python max function:" max(edu['Value']) #return NaN
we can also apply functions in NumPy by append. apply(np.functionname)
e.g.
s = edu["Value"].apply(np.sqrt)
s.head()
we can also apply a lambda function
s = edu["Value"].apply(lambda d: d**2)
Lambda function is a easy and better way to define a simple function that with no name and will only be use one time.
def f(x):
return x**2
print f(8) # result: 64
#is equal to
g = lambda x: x**2
print g(8) # result: 64
Insert and remove
We can simply use assign operator over a DataFrame.
edu['ValueNorm'] = edu['Value']/edu['Value'].max()
Some of the data should be removed. Parameter axis
shows whether we are removing the row or the column. Parameter inplace
shows whether we are changing the duplicate or the original. If you don't want the old data , set True
edu.drop('ValueNorm', axis = 1, inplace = True)
Inssert new row at bottom using append
.
remember to set ignore_index
to True
, otherwise it will be set to index 0, which will lead to an error if index 0 already exist.
edu = edu.append({"TIME": 2000, "Value": 5.00, "GEO": 'a'}, ignore_index = True)
deleting the last row would be easy if you write like this.
edu.drop(max(edu.index), axis = 0, inplace = True)
To delete NaN
we can use isnull()
function and set axis
to 0 to delate the whole row.
eduDrop = edu.drop(edu['Value'].isnull(), axis = 0)
or we can use specific dropna()
function
eduDrop = edu.dropna(how = 'any', subset = ['Value'])
If we want to set a default value to all the NaN
we can use fillna()
eduFilled = edu.fillna(value = {"Value": 0})
Sorting
Data that sorted by column are commenly seen. We can use sort
function.
edu.sort_values(by = 'Value', ascending = False,
inplace = True)
note that if the inplace
was True
, there will be no new DataFrame return.
Sorted data have their oringinal index, so we can just use sort_index
to return the original order
edu.sort_index(axis = 0, ascending = True, inplace = True)
Grouping Data
We want to group data that meet some commen criteria. Pandas has the groupby
function to do this.
# shows the mean of the value of each country
group = edu[["GEO", "Value"]].groupby("GEO").mean()
- | Value |
---|---|
GEO | |
Austria | 5.618333 |
Belgium | 6.189091 |
Bulgaria | 4.093333 |
Cyprus | 7.023333 |
Czech Republic | 4.16833 |
result:
- | Value |
---|---|
GEO | |
Austria | 5.618333 |
Belgium | 6.189091 |
Bulgaria | 4.093333 |
Cyprus | 7.023333 |
Czech Republic | 4.16833 |
Rearranging Data
Sometimes we need to make the index meaningful by rearranging data. The result is normally a new spreadsheet containing the data that we want and well arranged.
filtered_data = edu[edu["TIME"] > 2005]
pivedu = pd.pivot_table(filtered_data, values = 'Value',
index = ['GEO'],
columns = ['Timer'])
pivedu.head()
TIME | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 |
---|---|---|---|---|---|---|
GEO | ||||||
Austria | ... | ... | ... | ... | .. | .. |
Belgium | .... | ... | ... | ... | ... | ... |
Bulgaria | ... | ... | .. | ... | ... | .. |
Cyprus | ... | .. | .. | .. | ... | .. |
Czech Republic | ... | .... | ... | .... | .. | ... |
result:
TIME | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 |
---|---|---|---|---|---|---|
GEO | ||||||
Austria | ... | ... | ... | ... | .. | .. |
Belgium | .... | ... | ... | ... | ... | ... |
Bulgaria | ... | ... | .. | ... | ... | .. |
Cyprus | ... | .. | .. | .. | ... | .. |
Czech Republic | ... | .... | ... | .... | .. | ... |
and we can directly select specific row and column by lable of the new spreadsheet that we just made.
pivedu.ix[['Spain','Portugal'], [2006, 2011]]
Ranking
Ranking number is really useful. Forexample, we want to know the rank of each country by year.
in the example, we first rename the Germany that with a long long description. Then we drop all the ``NaN`. Finally we do rank.
pivedu = pivedu.rename(index = {'Germany(until 1990 former territory of the FRG)': Germany})
pivedu = pivedu.dropna()
pivedu.rank(ascending = False, method = 'first').head()
if we want to see the rank from top to bottom, we can first rank then sort.
totalSum = pivedu.sum(axis = 1)
totalSum.rank(ascending = False, method = 'dense')
.sort_values().head()
GEO | - |
---|---|
Denmark | 1 |
Cyprus | 2 |
Finland | 3 |
Malta | 4 |
Belgium | 5 |
result:
GEO | - |
---|---|
Denmark | 1 |
Cyprus | 2 |
Finland | 3 |
Malta | 4 |
Belgium | 5 |
Plotting
By using plot
function, we can plot the graphic.
totalSum = pivedu.sum(axis = 1)
.sort_values(ascending = False)
totalSum.plot(kind = 'bar', style = 'b', alpha = 0.4, title = "Total Values for Country")
kind
define which kind of graphic are we using.
Style
refer to the style properties, in bar gragh means color of the bar.
alpha
refer to percentage transparency of the color
if we want to add a legend, a description of the graph, use legend
function.
myplot.legend(loc = 'center left', # set reletive location of the legend with respect to the plot
bbox_to_anchor = (1, .5)) # set a absolute position with respect to the plot
Basic of statistics in Python
First we introduce two concept. Population and sample. Population is a collection ob object, a sample is a part of the population that is observed.
Before the statistics we should prepare the data that suit our need.
Data Prepareation
Data prepareation include following operation:
- Obtaining the data: e.g. read a file, scraping from internet
- Parsing the data: to unwrap the datapackage
- Cleaning the data: deal with incomplete records
- Building data structure: e.g. a database
file = open('files/ch03/adult.data','r')
data = [] # set a empty list
for line in file:
data1 = line.split(', ')
if len(data1) == 15:
data.append([.., .., .., .., .., .., .., .., ..., ..., .., .., .., .., ..])
df = pd.DataFrame(data)
df.columns = ['..', ... ,'..']
# give them a data lable in column property
after fit it in DataFrame, we can command shape
to know the exact number of the data samples, in rows and column.
df.shape
# return (32561, 15)
we can also group the dataFrame like we did earlier.
ml50 = df[(df.sex == 'Male') & (df.income == '>50K\n')]
Summerize some statistic value
- Mean
$$\mu=\frac{1}{n}\sum_{i=1}^nx_i$$
ml['age'].mean()
- Sample Variance
$$\sigma2=\frac{1}{n}\sum_{i}(x_i-\mu)2$$
σ is the standard deviation
ml['age'].var()
ml['age'].std()
- Sample Median:
ml['age'].median()
- Quantil: a fraction p of the data value is less or equal to $x_p$
- Data distribution: commenly use histogram to show the distribution
maAge = ml['age']
mlAge.hist(normed = 0, histtype = 'stepfilled', bins = 20)
# bins: width of the bin
# normed: 1 to show the value of the density function at the bin.(0-1)
# 0 to show the number of the sample in the bin(normally a real number)
to make two histogram compareable, we can overlapping each other in a same graphic
import seaborn as sns
fmAge.hist(normed = 0, histtype = 'stepfilled', alpha = .5, bins = 20)
mlAge.hist(normed = 0, histtype = 'stepfilled', alpha = 0.5,
color = sns.desaturate("indianred", .75),
bin = 10)
if we want to show the cumulative histogram, add a parameter with cumulative = True
Outlier Treatment
First define the standard of the outlier. Normally the samples that far from the median and whose value exceed the mean by 2 or 3 standard deviation.
df2 = df.drop(df.index[
(df.income == '>50K\n') &
(df['age'] > df['age'].median() + 35) &
(df['age'] > df['age'].median() - 15)
])
you can check the removal of your data by overlapping the graphic before and after the removing.
histogram()
returns
-
hist
: value of the bar(according tonormed
), Array -
bin.edge
: edge of the bin, array of float
Measureing Asymmetry: Skewness and Pearson's Median Skewness Coefficient
Skewness: how asymmetric the distribution is.
Skewness of a set of n data samples, $x_i$:
$$ g_1 = \frac{1}{n}\frac{\sum_{i}(x_i - \mu3)}{\sigma3} $$
nagative skewness means the distribution "skews left". A normal distribution have a 0 skewness.
we can define a function to calculate the skewness of a data set
def skewness()
res = 0
m = x.mean()
s = x.std()
for i in x:
res += (i-m) * (i-m) * (i-m)
res /= (len(x) * s * s * s)
return res
Normal Distribution
Also called the Gaussian distribution,
$$ P(x) = \frac{1}{\sqrt{2\pi\sigma2}}e{- \frac{({x-\mu})2}{2\sigma2}} $$
really important in the statistics
Kernel Density
We don't really get every distribution's parameter of every data set.
Instead, we use kernel density estimation. We use Gaussian kernel to generate the density around the data. The sum of these Gaussian distribution can give us a continuous function. So that we can have a continuous representation of the original distribution
x1 = np.random.normal(-1, 0.5, 15)
# random.normal have 3 parameter
# loc: center of the normal distribution
# scale: standard deviation
# size: output shape, canbe a tuple of int or int
x2 = np.random.normal(6, 1, 10)
y = np.r_[x1, x2]
# r_[] merging all the input to a new Array.
x = np.linspace(min(y), max(y), 100)
# linspace return evenly distribute number, array
s = 0.4 # smoothing parameter
#calculate the kernels
kernels = np.transpose([norm.pdf(x, yi, s) for yi in y])
#norm.pdf(x, loc, scale)
#pdf stands for probability density function
plt.plot(x, kernels, 'k:')
plt.plot(x, kernels.sum(1), 'r')
plt.plot(y, np.zeros(len(y)), 'bo', ms = 10)
we can also call the build in function of SciPy
from scipy.stats import kde
density = kde.gaussian_kde(y)
# kde.gaussian_kde Representation of a kernel-density estimate using Gaussian kernels.
xgrid = np.linspace(x.min(), x.max(), 200)
plt.hist(y, bins = 28, normed = True)
plt.plot(xgrid, density(xgrid), 'r-')
Estimation
Mean
we guess the mean $\mu$ of the distribution using the sample mean. This process called estimation and the statistic called estimator.
The sample mean minimize the following mean squared error
$$ MSE = \frac{1}{n}\sum{(\bar{x}-\mu)^2} $$
NTs = 200
mu = 0.0
var = 1.0
err = 0.0
NPs = 1000
for i in range(NTs)
x = np.random.normal(mu, var, NPs)
err += (x.mean()-mu)**2
print 'MSE: ', err/NTests
Variance
If the sample number is large enough, we can use sample variance estimator
$$ \bar{\sigma}^2 = \frac{1}{n}\sum(x_i-\bar{x})^2 $$
if we don't have enough sample, this estimator won't works well, it will be biased. So better estimator is
$$ \bar{\sigma}^2 = \frac{1}{n-1}\sum(x_i-\bar{x})^2 $$
Standard Score
$$ z_i = \frac{(x_i-\mu)}{\sigma} $$
It's a normalize number set. After the normalizing, it has a distribution, mean of 0 and variace of 1. It inherits the shape of the orincginal dataset.
Covariance, Rank Correlation
Covarience only will be mention when two variable share the same tendency.
We center the data with respect to their mean, and look for the difference.
$$ Cov(X, Y) = \frac{1}{n}\sum_{i=1}^ndx_idy_i $$
Pearson's correlation is after we normalized the data and multiply them together.
$$ \rho_i = \frac{x_i-\mu_X}{\rho_X}\frac{y_i-\mu_Y}{\rho_Y} $$
$$ \rho = \frac{1}{n}\sum_{i=1}^n{\rho_i}=\frac{Cov(X,Y)}{\sigma_X\sigma_Y} $$
Pearson's correlation is always between -1 and +1. The magnitude shows the correlation. The sign shows positive or negative relation
$\rho = 0$ dose not mean that they have no correlation, it only shows the correlation of first order. And note that it dosen't work well in the presence of outliers.
Spearmans Rank Correlation is a robust solution especially when the data contain outliers. Use rank to sort sample data.
$$ r_s = 1-\frac{6\sum_i{d_i2}}{n\cdot(n2-1)} $$
with $$ d_i = rg(x_i)-rg(y_i)$$
is the difference of the rank of $x$ and $y$
Statistical Inference
Theres two way to address the problem of statistical inference.
- frequentist approach
- bayesian approach:
The frequentist approach
If we accept this asumption, that means we concern about the population parameters from analysis of sample.
There's three things we mostly concern about:
- point estimation: a particular value that approximates some parameter of interest. e.g. mean or variance
- confidence intervals, or set estimattes: range of the value that best represent the parameter of insterest
- hypothesis: accept or reject
Estimation
Remember estimation is not truth. More data the estimation will be better.
Point estimation
Sampling
data = pd.read_csv("files/aaa.csv")
data['Date'] = data[u'Dia de mes'].apply(lambda x: str(x)) +
'-' +
data[u'Mes de any'].apply(lambda x: str(x))
data['Date'] = pd.to_datetime(data['Date'])
accidents = data.groupby(['Date']).size()
print accidents.mean()
How we estimate the sampling distribution of the sample mean
- Draw $s$ (a large number) independent samples $ { x^1,..., x^s} $ from the population where each element $x^j$ is composed of ${x_i^j}_{i=1,...,n}$.
- Evaluate the sample mean $\hat{\mu}j=\frac{1}{n}\sum_{i=1}nx_i^j$ of each sample
- Estimate the sampling distribution of $\hat{\mu}$ by the empirical distribution of the sample replications.
# population
df = accidents.to_frame() # convert accidents to DataFrame
nTest = 10000
elements = 200
# mean array of samples
means = [0] * nTest # build a sero array with nTest 0
#sample generation
for i in range(nTest):
rows = np.random.choice(df.index.values, elements)
sampled_df = df.ix[rows]
means[i] = sampled_df.mean()
Traditional Approach
But actually we can't always get access to the population, so we have to solve the problem by using some theoretical results from traditonal statistics.
Standard error
$$ SE = \frac{\sigma_x}{\sqrt{n}} $$
This formula use the standard deviation $\sigma_x$, which is not known. If the population distribution is not skewed and estimation is sufficiently good if $n>30$. THis allow us to use the standard error of the sample value if we can't get access to the population.
rows = np.random.choice(df.index.values, 200)
sampledDf = df.ix[rows]
estSigmaMean = sampledDf.std()/math.sqrt(200)
Computationally intensive approach
When the full dataset has super large population, a alternative to the traditional.
In the bootstrap, we draw n observation with replacement from the oringinal data to create a bootstrap sample or resample. Then, we can calculate the mean for this resample. By repeating this process more times, we can get a good approximation of the mean sampling distribution.
def meanBootStrap (X, numberb):
x = [0] * numberb
for i in range (numberb):
sample = [X[j]
for j in np.random.randint(len(X), size = len(X)
]
x[i] = np.mean(sample)
return x
m = meanBootstrap(accidents, 10000)
np.mean(m)
Confidence Interval
Confidence Level | 90% | 95% | 99% | 99.9% |
---|---|---|---|---|
z Value | 1.65 | 1.96 | 2.58 | 3.291 |
A point estimate $\Theta$, is hardly to be perfect. So we want to provide a plausible range of value of the parameter. This range called confidence interval.
We care about the interval and the plausibility .
If the point estimate follow the normal model with standard error SE, then a confidence interval for the population parameter is $$\Theta \pm z \times SE$$
Confidence Level | 90% | 95% | 99% | 99.9% |
---|---|---|---|---|
z Value | 1.65 | 1.96 | 2.58 | 3.291 |
Use bootstrapping to calculate the 95% confidence interval
m= meanBootstrap(accidents, 10000)
sampleMean = np.mean(m)
sampleSe = np.std(m)
# Confidence Interval
ci = [np.percentile(m, 2.5), np.precentile(m, 97.5)]
95% confidence means:
In 95% of the cases, when I compute the 95% confidence interval from this sample, the true mean of the population will fall within the interval define by the bounds: $\pm1.96\times SE$
Hypothesis Testing
Two kinds of hypotheses:
- $H_0$: Two distribution is the same. Null hypothesis
- $H_1$: Two distribution is different. Alternative hypothesis
n = len(counts2013)
mean = counts2013.mean()
s = counts2013.std()
ci = [ mean - s*1.96/np.sqrt(n), mean + s*1.96/np.sqrt()]
Interpretation of CI Test
The result canbe reject, or refuse to reject.
if we use a 95% confidence interval to test a problem where the null hypothesis is true, we will make an error whenever the point estimate is at least $1.96\sigma$ away from the population parameter. This hapen about 5% of the time.