Data Mining with Python: Study Notes


These are my study notes on Learning Data Mining with Python, recording some of the key points I came across while working through it. The book uses Python 3.4.
Python packages used in the book
scikit-learn: http://scikit-learn.org/stable/
The scikit-learn package is a machine learning library, written in Python. It contains numerous algorithms, datasets, utilities, and frameworks for performing machine learning.

"Data mining provides a way for a computer to learn how to make decisions with data. "

Datasets comprise two aspects:
• Samples that are objects in the real world. This can be a book, photograph, animal, person, or any other object.
• Features that are descriptions of the samples in our dataset. Features could be the length, frequency of a given word, number of legs, date it was created, and so on.
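
To make the samples/features split concrete, here is a minimal sketch assuming the common convention of storing a dataset as a NumPy array with one row per sample and one column per feature. The animals, feature names, and values are made up for illustration.

```python
import numpy as np

# Hypothetical toy dataset: each row is one sample (an animal),
# each column is one feature describing that sample.
feature_names = ["number_of_legs", "body_length_cm"]
X = np.array([
    [4, 50.0],   # a dog
    [2, 30.0],   # a bird
    [4, 25.0],   # a cat
])

# The array shape is (number of samples, number of features).
n_samples, n_features = X.shape
print(n_samples, n_features)
```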

Support is the number of times that a rule occurs in a dataset, which is computed by simply counting the number of samples that the rule is valid for.
While support measures how often a rule holds, confidence measures how accurate the rule is when it can be applied.
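
A minimal sketch of both measures for a rule of the form "if a purchase contains X, it also contains Y". The transactions and the helper name `support_and_confidence` are hypothetical, and confidence is taken here as support divided by the number of samples where the premise holds.

```python
# Hypothetical toy transactions: each set lists the items in one purchase.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "apples"},
    {"bread"},
    {"milk"},
    {"bread", "apples"},
]

def support_and_confidence(transactions, premise, conclusion):
    """Return (support, confidence) for the rule 'premise -> conclusion'."""
    # Samples where the rule can be applied: the premise item is present.
    applicable = [t for t in transactions if premise in t]
    # Support: samples where the rule is valid (premise and conclusion both present).
    support = sum(1 for t in applicable if conclusion in t)
    # Confidence: fraction of the applicable samples where the rule held.
    confidence = support / len(applicable) if applicable else 0.0
    return support, confidence

support, confidence = support_and_confidence(transactions, "bread", "milk")
print(support, confidence)
```

Here bread appears in four purchases and milk accompanies it in two of them, so the rule has a support of 2 and a confidence of 0.5.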

Overfitting is the problem of creating a model that classifies our training dataset very well, but performs poorly on new samples. *The solution is quite simple: never use training data to test your algorithm.*
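
The effect can be sketched with a deliberately extreme case. The choice of `DecisionTreeClassifier` and the purely random data below are my own assumptions for illustration, not something taken from the book: since the labels are random there is nothing real to learn, yet an unrestricted tree still memorizes the training set.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(14)

# Hypothetical data: random features with random labels (no real pattern).
X_train = rng.rand(200, 5)
y_train = rng.randint(0, 2, size=200)
X_new = rng.rand(200, 5)
y_new = rng.randint(0, 2, size=200)

# An unrestricted decision tree can memorize every training sample.
tree = DecisionTreeClassifier(random_state=14)
tree.fit(X_train, y_train)

# Accuracy on the data the model was trained on vs. data it has never seen.
train_accuracy = tree.score(X_train, y_train)
new_accuracy = tree.score(X_new, y_new)
print(train_accuracy, new_accuracy)
```

Scoring on the training data makes the model look perfect, while the score on unseen samples shows it has learned nothing useful, which is why the test must use held-out data.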

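The scikit-learn library contains a function to split data into training and testing components. In current releases it is imported with from sklearn.model_selection import train_test_split; the book's import, from sklearn.cross_validation import train_test_split, only works in old releases (the sklearn.cross_validation module was deprecated in 0.18 and removed in 0.20).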

'''
This function will split the dataset into two sub-datasets,
according to a given ratio (which by default uses 25 percent of the dataset for testing).
Xd_train contains our data for training and Xd_test contains our data for testing.
y_train and y_test give the corresponding class values for these datasets.
random_state: setting the random state will give the same split every time the same value is entered.
It will look random, but the algorithm used is deterministic and the output will be consistent.
'''
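# test_size is the proportion of samples held out for testing; if it is an integer, it is the number of test samples.
# random_state is the seed for the random number generator: different seeds give different splits, the same seed gives the same split.
# Older scikit-learn releases (as used in the book) imported this from sklearn.cross_validation.
from sklearn.model_selection import train_test_split
# X_d (feature matrix) and y (class labels) are assumed to have been loaded already.
Xd_train, Xd_test, y_train, y_test = train_test_split(X_d, y, test_size=0.4, random_state=14)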
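
For a version that runs on its own, here is a small sketch with made-up data standing in for X_d and y. It assumes a scikit-learn release where train_test_split lives in sklearn.model_selection, and it also checks that the same random_state reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, one class label per sample.
X_d = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 40 percent of the samples for testing.
Xd_train, Xd_test, y_train, y_test = train_test_split(
    X_d, y, test_size=0.4, random_state=14
)
print(Xd_train.shape, Xd_test.shape)

# Splitting again with the same seed selects the same test samples.
_, Xd_test_again, _, _ = train_test_split(X_d, y, test_size=0.4, random_state=14)
same_split = np.array_equal(Xd_test, Xd_test_again)
print(same_split)
```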

scikit-learn estimators

support vector machines (SVM)
random forests
neural networks

Estimators are scikit-learn's abstraction for learning algorithms such as those listed above.
Estimators are used for classification.
Estimators have the following two main functions:

• fit(): This performs the training of the algorithm and sets internal parameters. It takes two inputs, the training sample dataset and the corresponding classes for those samples.
• predict(): This predicts the class of the testing samples that are given as input. This function returns an array with the predictions of each input testing sample.
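
The two functions can be sketched with one of the estimators listed above, a support vector machine. The toy data and the choice of a linear kernel are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated groups of points.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0], [10.0, 10.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.5, 0.5], [9.5, 9.5]])

estimator = SVC(kernel="linear")

# fit(): train on the training samples and their corresponding classes.
estimator.fit(X_train, y_train)

# predict(): return an array with one predicted class per testing sample.
y_predicted = estimator.predict(X_test)
print(y_predicted)
```

Because every estimator exposes the same fit() and predict() interface, swapping in a different classifier only requires changing the line that constructs it.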
