Introduction to Data Science

Before we start

Libraries that we are using

There's bunch of library that data scientist will use during the research, here are some important libraries:

NumPy: provides multidimensional array with basic operation, linear algebra function
SciPy: provides numerical algorithms and domain-specific toolbox. Signal processing, optimization, statistics...
Matplotlib: data visualization
SCIKIT-Learn: machine learning library, classification, regression, clustering, dimensiality reduction, model selection, and preprocessing.
PANDAS: data analysis library. Importing CSV, text file, xls, SQL databases, and some other files. It can handling missing data and intelligent data alignment.

Python version and import toolboxes

This book use Anaconda distribution of python. It has integrated some of the most imoportant package that we will use. Detail about installing will not be covered in this note. And we are using Jupyter Notebook Environment. All the code was on python 2.7. The reason why we don't use python 3.6 is the poor 3rd party module support. Everything runs fine with 2.7.

Begin with python

Remember to import the toolboxes that we will use in the program

import pandas as pd
import numpy as np
import matplotlib.pylot as plt
#so that we can use short cut pd, np, plt

To run the code in just one cell, use Cell -> Run or just press Ctrl+Enter
To add a cell below, use Insert -> Insert Cell Below or just press Ctrl+B

Data input and output

The data structure of Pandas called DataFrame object. It's basiclly a tabular data structure, that's rows and columns.
Colums are called series, consist of a list of serval values.
e.g.

data = { 'year': [
2010, 2011, 2012,
2010, 2011, 2012,
2010, 2011, 2012
],
'team': [
'FCBarcelona', 'FCBarcelona',
'FCBarcelona', 'RMadrid',
'RMadrid', 'RMadrid',
'ValenciaCF', 'ValenciaCF',
'ValenciaCF'
],
'wins':     [30, 28, 32, 29, 32, 26, 21, 17, 19],
'draws':    [6, 7, 4, 5, 4, 7, 8, 10, 8],
'losses':   [2, 3, 2, 4, 2, 5, 9, 11, 11]
}

football = pd.DataFrame(data, columns = [
'year', 'team', 'wins', 'draws', 'losses'
]
)

index	year	team	wins	draws	losses
0	2010	FCBarcelona	30	6	2
1	2011	FCBarcelona	28	7	3
2	2012	FCBarcelona	32	4	2
3	2010	RMadrid	29	5	4
4	2011	RMadrid	32	4	2
5	2012	RMadrid	26	7	5
6	2010	ValenciaCF	21	8	9
7	2011	ValenciaCF	17	10	11
8	2012	ValenciaCF	19	8	11

this is what the result looks like

index	year	team	wins	draws	losses
0	2010	FCBarcelona	30	6	2
1	2011	FCBarcelona	28	7	3
2	2012	FCBarcelona	32	4	2
3	2010	RMadrid	29	5	4
4	2011	RMadrid	32	4	2
5	2012	RMadrid	26	7	5
6	2010	ValenciaCF	21	8	9
7	2011	ValenciaCF	17	10	11
8	2012	ValenciaCF	19	8	11

We can not only use dictionary to input data, but also a CSV file.
e.g.

edu = pd.read_csv('files/Data.csv',
na_values = ':',
usecols = ["A","B","C"])
edu

na_values set the character that represents NaN
usecols set the name of the columns.
You can also use Pandas to read some other files, by using read_excel(), read_hdf(), read_table(), read_clipboard()
by calling edu.head() edu.tail() we can see the first 5 rows and the last 5 rows of the Data in edu.
If we want to know the name of the column or the index or the value, call attribute columns, index, value.
If you want a quick view how this statistic is like. call edu.describe()
It give us the count, mean, standard diviation, min, max, 25% 50% 75% quantile of each column or series

Selecting Data

Select or filter some of the data normally use square bracket [ ].

# select just the column name "Value"
edu[`Value`]

#select some rows
edu[10: 14]

# select some rows of some suluns
edu.i[90: 94, ['TIME', 'GEO']]

We can also fiter out some of the data that we don't want. Criteria for filtering can use all the Boolean operator.

# we want the Value larger than 6.5, and just last 5 row of the data
edu[edu['Value'] > 6.5].tail()

to represent NaN using isnull()

edu[edu["Value"].isnull]

Manipulating Data

Aggregation function

We can call the aggregation functions to get some information about the data that we want to know about.
Here're some commenly used functions:count(), sum(), mean(), median(), max(), min(), std(), var(), prod() -> product of values
They have a common parameter axis.
axis = 0 means function apply to the rows for each column.
axis = 1 means function apply to the columns for each row. (Normally this don't make sence)
Note that the max function of Pandas and of Python are different.
In Pandas, NaN are exclude.

max function of python actually regard NaN as maximum

print "Pandas max function:" edu['Value'].max()  #return 8.81
print "Python max function:" max(edu['Value'])  #return NaN

we can also apply functions in NumPy by append. apply(np.functionname)
e.g.

s = edu["Value"].apply(np.sqrt)
s.head()

we can also apply a lambda function

s = edu["Value"].apply(lambda d: d**2)

Lambda function is a easy and better way to define a simple function that with no name and will only be use one time.

def f(x):
return x**2
print f(8) # result: 64

#is equal to
g = lambda x: x**2
print g(8) # result: 64

Insert and remove

We can simply use assign operator over a DataFrame.

edu['ValueNorm'] = edu['Value']/edu['Value'].max()

Some of the data should be removed. Parameter axis shows whether we are removing the row or the column. Parameter inplace shows whether we are changing the duplicate or the original. If you don't want the old data , set True

edu.drop('ValueNorm', axis = 1, inplace = True)

Inssert new row at bottom using append.
remember to set ignore_index to True, otherwise it will be set to index 0, which will lead to an error if index 0 already exist.

edu = edu.append({"TIME": 2000, "Value": 5.00, "GEO": 'a'}, ignore_index = True)

deleting the last row would be easy if you write like this.

edu.drop(max(edu.index), axis = 0, inplace = True)

To delete NaN we can use isnull() function and set axis to 0 to delate the whole row.

eduDrop = edu.drop(edu['Value'].isnull(), axis = 0)

or we can use specific dropna() function

eduDrop = edu.dropna(how = 'any', subset = ['Value'])

If we want to set a default value to all the NaN we can use fillna()

eduFilled = edu.fillna(value = {"Value": 0})

Sorting

Data that sorted by column are commenly seen. We can use sort function.

edu.sort_values(by = 'Value', ascending = False,
inplace = True)

note that if the inplace was True, there will be no new DataFrame return.
Sorted data have their oringinal index, so we can just use sort_index to return the original order

edu.sort_index(axis = 0, ascending = True, inplace = True)

Grouping Data

We want to group data that meet some commen criteria. Pandas has the groupby function to do this.

# shows the mean of the value of each country
group = edu[["GEO", "Value"]].groupby("GEO").mean()

-	Value
GEO
Austria	5.618333
Belgium	6.189091
Bulgaria	4.093333
Cyprus	7.023333
Czech Republic	4.16833

result:

-	Value
GEO
Austria	5.618333
Belgium	6.189091
Bulgaria	4.093333
Cyprus	7.023333
Czech Republic	4.16833

Rearranging Data

Sometimes we need to make the index meaningful by rearranging data. The result is normally a new spreadsheet containing the data that we want and well arranged.

filtered_data = edu[edu["TIME"] > 2005]
pivedu = pd.pivot_table(filtered_data, values = 'Value',
index = ['GEO'],
columns = ['Timer'])
pivedu.head()

TIME	2006	2007	2008	2009	2010	2011
GEO
Austria	...	...	...	...	..	..
Belgium	....	...	...	...	...	...
Bulgaria	...	...	..	...	...	..
Cyprus	...	..	..	..	...	..
Czech Republic	...	....	...	....	..	...

result:

TIME	2006	2007	2008	2009	2010	2011
GEO
Austria	...	...	...	...	..	..
Belgium	....	...	...	...	...	...
Bulgaria	...	...	..	...	...	..
Cyprus	...	..	..	..	...	..
Czech Republic	...	....	...	....	..	...

and we can directly select specific row and column by lable of the new spreadsheet that we just made.

pivedu.ix[['Spain','Portugal'], [2006, 2011]]

Ranking

Ranking number is really useful. Forexample, we want to know the rank of each country by year.
in the example, we first rename the Germany that with a long long description. Then we drop all the ``NaN`. Finally we do rank.

pivedu = pivedu.rename(index = {'Germany(until 1990 former territory of the FRG)': Germany})
pivedu = pivedu.dropna()
pivedu.rank(ascending = False, method = 'first').head()

if we want to see the rank from top to bottom, we can first rank then sort.

totalSum = pivedu.sum(axis = 1)
totalSum.rank(ascending = False, method = 'dense')
.sort_values().head()

GEO	-
Denmark	1
Cyprus	2
Finland	3
Malta	4
Belgium	5

result:

GEO	-
Denmark	1
Cyprus	2
Finland	3
Malta	4
Belgium	5

Plotting

By using plot function, we can plot the graphic.

totalSum = pivedu.sum(axis = 1)
.sort_values(ascending = False)
totalSum.plot(kind = 'bar', style = 'b', alpha = 0.4, title = "Total Values for Country")

kind define which kind of graphic are we using.
Style refer to the style properties, in bar gragh means color of the bar.
alpha refer to percentage transparency of the color
if we want to add a legend, a description of the graph, use legend function.

myplot.legend(loc = 'center left', # set reletive location of the legend with respect to the plot
bbox_to_anchor = (1, .5)) # set a absolute position with respect to the plot

Basic of statistics in Python

First we introduce two concept. Population and sample. Population is a collection ob object, a sample is a part of the population that is observed.
Before the statistics we should prepare the data that suit our need.

Data Prepareation

Data prepareation include following operation:

Obtaining the data: e.g. read a file, scraping from internet
Parsing the data: to unwrap the datapackage
Cleaning the data: deal with incomplete records
Building data structure: e.g. a database

file = open('files/ch03/adult.data','r')
data = [] # set a empty list
for line in file:
data1 = line.split(', ')
if len(data1) == 15:
data.append([.., .., .., .., .., .., .., .., ..., ..., .., .., .., .., ..])

df = pd.DataFrame(data)
df.columns = ['..', ... ,'..']
# give them a data lable in column property

after fit it in DataFrame, we can command shape to know the exact number of the data samples, in rows and column.

df.shape
# return (32561, 15)

we can also group the dataFrame like we did earlier.

ml50 = df[(df.sex == 'Male') & (df.income == '>50K\n')]

Summerize some statistic value

Mean
$$\mu=\frac{1}{n}\sum_{i=1}^nx_i$$

ml['age'].mean()

Sample Variance
$$\sigma^{2=\frac{1}{n}\sum_{i}(x_i-\mu)}2$$
σ is the standard deviation

ml['age'].var()
ml['age'].std()

Sample Median: ml['age'].median()
Quantil: a fraction p of the data value is less or equal to $x_p$
Data distribution: commenly use histogram to show the distribution

maAge = ml['age']
mlAge.hist(normed = 0, histtype = 'stepfilled', bins = 20)
# bins: width of the bin
# normed: 1 to show the value of the density function at the bin.(0-1)
#         0 to show the number of the sample in the bin(normally a real number)

to make two histogram compareable, we can overlapping each other in a same graphic

import seaborn as sns
fmAge.hist(normed = 0, histtype = 'stepfilled', alpha = .5, bins = 20)
mlAge.hist(normed = 0, histtype = 'stepfilled', alpha = 0.5,
color = sns.desaturate("indianred", .75),
bin = 10)

if we want to show the cumulative histogram, add a parameter with cumulative = True

Outlier Treatment

First define the standard of the outlier. Normally the samples that far from the median and whose value exceed the mean by 2 or 3 standard deviation.

df2 = df.drop(df.index[
(df.income == '>50K\n') &
(df['age'] > df['age'].median() + 35) &
(df['age'] > df['age'].median() - 15)
])

you can check the removal of your data by overlapping the graphic before and after the removing.
histogram() returns

hist: value of the bar(according to normed), Array
bin.edge: edge of the bin, array of float

Measureing Asymmetry: Skewness and Pearson's Median Skewness Coefficient

Skewness: how asymmetric the distribution is.
Skewness of a set of n data samples, $x_i$:
$$ g_1 = \frac{1}{n}\frac{\sum_{i}(x_i - \mu^3)}{\sigma3} $$
nagative skewness means the distribution "skews left". A normal distribution have a 0 skewness.
we can define a function to calculate the skewness of a data set

def skewness()
res = 0
m = x.mean()
s = x.std()
for i in x:
res += (i-m) * (i-m) * (i-m)
res /= (len(x) * s * s * s)
return res

Normal Distribution

Also called the Gaussian distribution,
$$ P(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e{- \frac{({x-\mu})^2}{2\sigma2}} $$
really important in the statistics

Kernel Density

We don't really get every distribution's parameter of every data set.
Instead, we use kernel density estimation. We use Gaussian kernel to generate the density around the data. The sum of these Gaussian distribution can give us a continuous function. So that we can have a continuous representation of the original distribution

x1 = np.random.normal(-1, 0.5, 15)
# random.normal have 3 parameter
# loc: center of the normal distribution
# scale: standard deviation
# size: output shape, canbe a tuple of int or int
x2 = np.random.normal(6, 1, 10)
y = np.r_[x1, x2]
# r_[] merging all the input to a new Array.
x = np.linspace(min(y), max(y), 100)
# linspace return evenly distribute number, array

s = 0.4 # smoothing parameter

#calculate the kernels
kernels = np.transpose([norm.pdf(x, yi, s) for yi in y])
#norm.pdf(x, loc, scale)
#pdf stands for probability density function
plt.plot(x, kernels, 'k:')
plt.plot(x, kernels.sum(1), 'r')
plt.plot(y, np.zeros(len(y)), 'bo', ms = 10)

we can also call the build in function of SciPy

from scipy.stats import kde
density = kde.gaussian_kde(y)
# kde.gaussian_kde Representation of a kernel-density estimate using Gaussian kernels.
xgrid = np.linspace(x.min(), x.max(), 200)
plt.hist(y, bins = 28, normed = True)
plt.plot(xgrid, density(xgrid), 'r-')

Estimation

Mean

we guess the mean $\mu$ of the distribution using the sample mean. This process called estimation and the statistic called estimator.
The sample mean minimize the following mean squared error
$$ MSE = \frac{1}{n}\sum{(\bar{x}-\mu)^2} $$

NTs = 200
mu = 0.0
var = 1.0
err = 0.0
NPs = 1000
for i in range(NTs)
x = np.random.normal(mu, var, NPs)
err += (x.mean()-mu)**2
print 'MSE: ', err/NTests

Variance

If the sample number is large enough, we can use sample variance estimator
$$ \bar{\sigma}^2 = \frac{1}{n}\sum(x_i-\bar{x})^2 $$
if we don't have enough sample, this estimator won't works well, it will be biased. So better estimator is
$$ \bar{\sigma}^2 = \frac{1}{n-1}\sum(x_i-\bar{x})^2 $$

Standard Score

$$ z_i = \frac{(x_i-\mu)}{\sigma} $$
It's a normalize number set. After the normalizing, it has a distribution, mean of 0 and variace of 1. It inherits the shape of the orincginal dataset.

Covariance, Rank Correlation

Covarience only will be mention when two variable share the same tendency.
We center the data with respect to their mean, and look for the difference.
$$ Cov(X, Y) = \frac{1}{n}\sum_{i=1}^ndx_idy_i $$
Pearson's correlation is after we normalized the data and multiply them together.
$$ \rho_i = \frac{x_i-\mu_X}{\rho_X}\frac{y_i-\mu_Y}{\rho_Y} $$
$$ \rho = \frac{1}{n}\sum_{i=1}^n{\rho_i}=\frac{Cov(X,Y)}{\sigma_X\sigma_Y} $$
Pearson's correlation is always between -1 and +1. The magnitude shows the correlation. The sign shows positive or negative relation
$\rho = 0$ dose not mean that they have no correlation, it only shows the correlation of first order. And note that it dosen't work well in the presence of outliers.
Spearmans Rank Correlation is a robust solution especially when the data contain outliers. Use rank to sort sample data.
$$ r_s = 1-\frac{6\sum_i{d_i^2}}{n\cdot(n2-1)} $$
with $$ d_i = rg(x_i)-rg(y_i)$$
is the difference of the rank of $x$ and $y$

Statistical Inference

Theres two way to address the problem of statistical inference.

frequentist approach
bayesian approach:

The frequentist approach

If we accept this asumption, that means we concern about the population parameters from analysis of sample.
There's three things we mostly concern about:

point estimation: a particular value that approximates some parameter of interest. e.g. mean or variance
confidence intervals, or set estimattes: range of the value that best represent the parameter of insterest
hypothesis: accept or reject

Estimation

Remember estimation is not truth. More data the estimation will be better.

Point estimation

Sampling

data = pd.read_csv("files/aaa.csv")
data['Date'] = data[u'Dia de mes'].apply(lambda x: str(x)) +
'-' +
data[u'Mes de any'].apply(lambda x: str(x))
data['Date'] = pd.to_datetime(data['Date'])
accidents = data.groupby(['Date']).size()
print accidents.mean()

How we estimate the sampling distribution of the sample mean

Draw $s$ (a large number) independent samples $ { x^1,..., x^s} $ from the population where each element $x^j$ is composed of ${x_i^j}_{i=1,...,n}$.
Evaluate the sample mean $\hat{\mu}^{j=\frac{1}{n}\sum_{i=1}}nx_i^j$ of each sample
Estimate the sampling distribution of $\hat{\mu}$ by the empirical distribution of the sample replications.

# population
df = accidents.to_frame() # convert accidents to DataFrame
nTest = 10000
elements = 200

# mean array of samples
means = [0] * nTest # build a sero array with nTest 0

#sample generation
for i in range(nTest):
rows = np.random.choice(df.index.values, elements)
sampled_df = df.ix[rows]
means[i] = sampled_df.mean()

Traditional Approach

But actually we can't always get access to the population, so we have to solve the problem by using some theoretical results from traditonal statistics.
Standard error
$$ SE = \frac{\sigma_x}{\sqrt{n}} $$
This formula use the standard deviation $\sigma_x$, which is not known. If the population distribution is not skewed and estimation is sufficiently good if $n>30$. THis allow us to use the standard error of the sample value if we can't get access to the population.

rows = np.random.choice(df.index.values, 200)
sampledDf = df.ix[rows]
estSigmaMean = sampledDf.std()/math.sqrt(200)

Computationally intensive approach

When the full dataset has super large population, a alternative to the traditional.
In the bootstrap, we draw n observation with replacement from the oringinal data to create a bootstrap sample or resample. Then, we can calculate the mean for this resample. By repeating this process more times, we can get a good approximation of the mean sampling distribution.

def meanBootStrap (X, numberb):
x = [0] * numberb
for i in range (numberb):
sample = [X[j]
for j in np.random.randint(len(X), size = len(X)
]
x[i] = np.mean(sample)
return x
m = meanBootstrap(accidents, 10000)
np.mean(m)

Confidence Interval

Confidence Level	90%	95%	99%	99.9%
z Value	1.65	1.96	2.58	3.291

A point estimate $\Theta$, is hardly to be perfect. So we want to provide a plausible range of value of the parameter. This range called confidence interval.
We care about the interval and the plausibility .
If the point estimate follow the normal model with standard error SE, then a confidence interval for the population parameter is $$\Theta \pm z \times SE$$

Confidence Level	90%	95%	99%	99.9%
z Value	1.65	1.96	2.58	3.291

Use bootstrapping to calculate the 95% confidence interval

m= meanBootstrap(accidents, 10000)
sampleMean = np.mean(m)
sampleSe = np.std(m)

# Confidence Interval
ci = [np.percentile(m, 2.5), np.precentile(m, 97.5)]

95% confidence means:

In 95% of the cases, when I compute the 95% confidence interval from this sample, the true mean of the population will fall within the interval define by the bounds: $\pm1.96\times SE$

Hypothesis Testing

Two kinds of hypotheses:

$H_0$: Two distribution is the same. Null hypothesis
$H_1$: Two distribution is different. Alternative hypothesis

n = len(counts2013)
mean = counts2013.mean()
s = counts2013.std()
ci = [ mean - s*1.96/np.sqrt(n), mean + s*1.96/np.sqrt()]

Interpretation of CI Test
The result canbe reject, or refuse to reject.

if we use a 95% confidence interval to test a problem where the null hypothesis is true, we will make an error whenever the point estimate is at least $1.96\sigma$ away from the population parameter. This hapen about 5% of the time.

Introduction to Data Science (still reading...)

Introduction to Data Science (still reading...)

Introduction to Data Science

Before we start

Libraries that we are using

Python version and import toolboxes

Begin with python

Data input and output

Selecting Data

Manipulating Data

Aggregation function

Insert and remove

Sorting

Grouping Data

Rearranging Data

Ranking

Plotting

Basic of statistics in Python

Data Prepareation

Summerize some statistic value

Outlier Treatment

Measureing Asymmetry: Skewness and Pearson's Median Skewness Coefficient

Normal Distribution

Kernel Density

Estimation

Mean

Variance

Standard Score

Covariance, Rank Correlation

Statistical Inference

The frequentist approach

Estimation

Point estimation

Traditional Approach

Computationally intensive approach

Confidence Interval

Hypothesis Testing

推荐阅读更多精彩内容