Data Mining with Python: Study Notes

These are my study notes on Learning Data Mining with Python, recording some of the key points I picked up along the way. The book uses Python 3.4.
Python packages used in the book
scikit-learn: http://scikit-learn.org/stable/
The scikit-learn package is a machine learning library, written in Python. It contains numerous algorithms, datasets, utilities, and frameworks for performing machine learning.

"Data mining provides a way for a computer to learn how to make decisions with data."


Datasets comprise two aspects:
Samples that are objects in the real world. This can be a book, photograph, animal, person, or any other object.
Features that are descriptions of the samples in our dataset. Features could be the length, frequency of a given word, number of legs, date it was created, and so on.

Support is the number of times that a rule occurs in a dataset, which is computed by simply counting the number of samples that the rule is valid for.
While support measures how often a rule holds, confidence measures how accurate the rule is when it can be applied.
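The two measures can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the baskets are made-up data, and support_and_confidence is a hypothetical helper, not something from the book or from scikit-learn.

```python
# Each sample is the set of items in one (made-up) shopping basket.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "apples"},
    {"bread"},
    {"milk"},
    {"bread", "milk"},
]


def support_and_confidence(samples, premise, conclusion):
    """Return (support, confidence) for the rule: premise -> conclusion."""
    # Samples where the premise holds, i.e. where the rule can be used.
    applicable = [s for s in samples if premise in s]
    # Support: number of samples for which the whole rule is valid.
    support = sum(1 for s in applicable if conclusion in s)
    # Confidence: fraction of applicable samples where the rule is right.
    confidence = support / len(applicable) if applicable else 0.0
    return support, confidence


# Rule: "if a basket contains bread, it also contains milk".
support, confidence = support_and_confidence(baskets, "bread", "milk")
print(support, confidence)  # 3 0.75
```

Four baskets contain bread and three of those also contain milk, so the rule has a support of 3 and a confidence of 3/4.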

Overfitting is the problem of creating a model that classifies our training dataset very well, but performs poorly on new samples. *The solution is quite simple: never use training data to test your algorithm.*
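A deliberately extreme sketch of why testing on training data is misleading: a "model" that simply memorizes its training samples. The data and the classify and accuracy helpers are hypothetical and exist only for this illustration.

```python
# Made-up data: integers labelled by parity.
train_X = [1, 2, 3, 4]
train_y = ["odd", "even", "odd", "even"]
test_X = [5, 6, 7, 8]
test_y = ["odd", "even", "odd", "even"]

# "Training" is pure memorization: a lookup table of the samples seen.
table = dict(zip(train_X, train_y))
fallback = train_y[0]  # arbitrary guess for samples never seen before


def classify(x):
    """Return the memorized label, or the fallback for an unseen sample."""
    return table.get(x, fallback)


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


train_acc = accuracy([classify(x) for x in train_X], train_y)
test_acc = accuracy([classify(x) for x in test_X], test_y)
print(train_acc, test_acc)  # 1.0 0.5
```

Scored on its own training data the memorizer looks perfect, while on unseen samples it does no better than a constant guess. Only the second number says anything about how it handles new data.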

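The scikit-learn library contains a function to split data into training and testing components: from sklearn.model_selection import train_test_split (the book imports it from sklearn.cross_validation, a module that was removed in scikit-learn 0.20).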

'''
This function will split the dataset into two sub-datasets,
according to a given ratio (which by default uses 25 percent of the dataset for testing).
Xd_train contains our data for training and Xd_test contains our data for testing.
y_train and y_test give the corresponding class values for these datasets.
random_state: setting the random state will give the same split every time the same value is entered.
It will look random, but the algorithm used is deterministic and the output will be consistent.
'''
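# test_size is the fraction of samples placed in the test set; if it is an integer, it is the absolute number of test samples.
# random_state is the seed for the random number generator: different seeds give different random splits, and the same seed always gives the same split.
# X_d (the feature matrix) and y (the class labels) are assumed to have been defined earlier.
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in sklearn.model_selection.
from sklearn.model_selection import train_test_split
Xd_train, Xd_test, y_train, y_test = train_test_split(X_d, y, test_size=0.4, random_state=14)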
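
To make the role of the seed concrete, here is a pure-Python sketch of a seeded shuffle-and-cut split. It is not scikit-learn's implementation; seeded_split is a hypothetical helper, and rounding the test-set size is my own simplification.

```python
import random


def seeded_split(samples, test_size=0.25, seed=None):
    """Hypothetical helper: shuffle with a seeded generator, then cut."""
    rng = random.Random(seed)  # private generator, so the seed fully decides the order
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    # The first n_test shuffled samples become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


train_a, test_a = seeded_split(range(10), test_size=0.4, seed=14)
train_b, test_b = seeded_split(range(10), test_size=0.4, seed=14)

print(len(train_a), len(test_a))                 # 6 4
print(train_a == train_b and test_a == test_b)   # True: same seed, same split
```

The split looks random, but repeating the call with the same seed reproduces it exactly, which is what makes experiments repeatable.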

scikit-learn estimators

  1. support vector machines (SVM)
  2. random forests
  3. neural networks
  • Estimators are scikit-learn's abstraction.
  • Estimators are used for classification.
    Estimators have the following two main functions:

• fit(): This performs the training of the algorithm and sets internal parameters. It takes two inputs, the training sample dataset and the corresponding classes for those samples.
• predict(): This predicts the class of the testing samples that are given as input. This function returns an array with the predictions of each input testing sample.
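
The fit()/predict() convention can be sketched with a toy class that follows the same interface. MajorityClassifier is a hypothetical class written for these notes, not part of scikit-learn, and it returns a plain list where a scikit-learn estimator would return an array.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical estimator-style class; not part of scikit-learn."""

    def fit(self, X, y):
        # "Training" sets one internal parameter: the most frequent class.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # One prediction per input sample, always the majority class.
        return [self.majority_ for _ in X]


# Made-up training samples (two binary features each) and their classes.
X_train = [[0, 1], [1, 1], [1, 0]]
y_train = [1, 1, 0]
X_test = [[0, 0], [1, 1]]

clf = MajorityClassifier()
clf.fit(X_train, y_train)           # learn from samples and their classes
predictions = clf.predict(X_test)   # classify new samples
print(predictions)  # [1, 1]
```

Because every estimator exposes the same two methods, one algorithm can be swapped for another without changing the surrounding training and testing code.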
