A Journey into Data Mining
- An introduction to data mining and its application scenarios
- Setting up a Python data mining environment
- An affinity analysis example: recommending products based on purchasing habits
- A classic classification example: predicting a plant's species from its measurements
Introduction to Data Mining
Data mining aims to let computers make decisions based on existing data. A decision might be predicting tomorrow's weather, filtering out spam email, detecting the language of a website, or finding a new romantic partner on a dating site.
Data mining draws on algorithms, statistics, engineering, optimization theory, computer science, and related fields.
Affinity Analysis
- Offering website users diversified, customized services, or serving them targeted advertising
- Recommending movies or products to users, and then selling them related accessories
- Finding people with family ties based on their genes
Product Recommendation
By analyzing users' historical transaction data, we look for products that tend to be bought together, and then recommend the same or similar products to users through marketing techniques such as discounts, purchase limits, and pre-sales.
A Simple Ranking of Rules
There are several ways to measure how good a rule is; the most commonly used are support and confidence.
- Support: the number of times the rule applies in the dataset, i.e. how frequently the premise and the conclusion occur together
- Confidence: a measure of how accurate the rule is, i.e. of the transactions that contain the premise, the fraction that also contain the conclusion
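As a quick illustration of the two measures (a minimal sketch on a made-up toy dataset, not the affinity dataset used below), support counts the transactions in which both the premise and the conclusion appear, while confidence divides that count by the number of transactions containing the premise:
# Toy illustration only: three hypothetical transactions, rule "bread -> milk"
transactions = [
    {"bread", "milk"},
    {"bread"},
    {"bread", "milk", "cheese"},
]
premise, conclusion = "bread", "milk"
support = sum(1 for t in transactions if premise in t and conclusion in t)   # rule applies in 2 transactions
confidence = support / sum(1 for t in transactions if premise in t)          # 2 / 3 ~= 0.667
print(support, confidence)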
A Simple Example
import numpy as np
# Load the data
# Feature columns: bread, milk, cheese, apples, bananas
# 0 means the item was not bought, 1 means it was bought
dataset_filename = "affinity_dataset.txt"
X = np.loadtxt(dataset_filename)
print(X[:5])
# Check the support and confidence of the rule "if a person buys apples, they also buy bananas"
# The names of the features, for your reference.
features = ["bread", "milk", "cheese", "apples", "bananas"]
num_apple_purchases = 0
for sample in X:
    if sample[3] == 1:
        num_apple_purchases += 1
print("{0} people bought Apples".format(num_apple_purchases))
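The rule announced in the comment above ("if a person buys apples, they also buy bananas") can be checked directly. The following is a small sketch that reuses X and num_apple_purchases from the code above, with column 3 as apples and column 4 as bananas:
# Count how often the rule "apples -> bananas" is valid vs. invalid
rule_valid = 0
rule_invalid = 0
for sample in X:
    if sample[3] == 1:        # the person bought apples
        if sample[4] == 1:    # ...and also bought bananas
            rule_valid += 1
        else:                 # ...but did not buy bananas
            rule_invalid += 1
print("{0} cases of the rule being valid were discovered".format(rule_valid))
print("{0} cases of the rule being invalid were discovered".format(rule_invalid))
# Support is the number of valid cases; confidence divides it by all apple purchases
print("Support: {0}, Confidence: {1:.3f}".format(rule_valid, rule_valid / num_apple_purchases))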
# defaultdict returns a default value when a missing key is looked up
from collections import defaultdict
# Times the rule was valid (premise and conclusion bought together)
valid_rules = defaultdict(int)
# Times the rule was violated (premise bought, conclusion not bought)
invalid_rules = defaultdict(int)
# Times each premise item was bought at all (the denominator for confidence);
# rules where the premise equals the conclusion ("people who buy apples also buy apples") are skipped
num_occurances = defaultdict(int)
n_features = X.shape[1]  # 5 features in this dataset
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0:
            continue
        num_occurances[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion:
                continue
            if sample[conclusion] == 1:
                valid_rules[(premise, conclusion)] += 1
            else:
                invalid_rules[(premise, conclusion)] += 1
# Compute support and confidence for each rule
support = valid_rules
confidence = defaultdict(float)
for premise, conclusion in valid_rules.keys():
    rule = (premise, conclusion)
    confidence[rule] = valid_rules[rule] / num_occurances[premise]
# Helper that pretty-prints a rule together with its support and confidence
def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
premise = 1
conclusion = 3
print_rule(premise, conclusion, support, confidence, features)
# Find the best rules, ranked by support
from operator import itemgetter
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    premise, conclusion = sorted_support[index][0]
    print_rule(premise, conclusion, support, confidence, features)
# Find the best rules, ranked by confidence
sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    print_rule(premise, conclusion, support, confidence, features)
Dataset location and explanation (extraction code: hjyc)
A Brief Introduction to Classification
The goal of classification is to train a model on a dataset whose class labels are already known, and then use that model to classify data whose classes are unknown.
What is a class, and how should class values be interpreted? Consider the following examples:
- Determining a plant's species from measurements. The class value answers the question "which species does this plant belong to?"
- Deciding whether an image contains a dog. The class answers "is there a dog in this image?"
- Deciding whether a patient is infected based on test results. The class answers "is the patient infected?"
Implementing the OneR Algorithm: Description
The idea behind OneR is simple: an individual is assigned the class that individuals with the same feature value most often belong to in the existing data.
OneR first iterates over every value of every feature. For each feature value, it counts how often that value appears in each class, finds the class in which it appears most frequently, and records how often it appears in the other classes; that count is the value's error.
OneR then selects the feature with the lowest total error as the single classification criterion.
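A minimal, hand-made illustration of this counting step (a sketch on one hypothetical binary feature, not the iris data used below):
# Toy illustration of OneR's per-value counting for a single discretized feature
from collections import Counter
feature_values = [0, 0, 1, 1, 1]   # the feature value of five samples
classes        = [0, 0, 1, 1, 2]   # their true classes
for value in set(feature_values):
    counts = Counter(c for f, c in zip(feature_values, classes) if f == value)
    prediction, hits = counts.most_common(1)[0]
    error = sum(counts.values()) - hits
    print("value {0}: predict class {1}, error {2}".format(value, prediction, error))
# value 0 predicts class 0 with error 0; value 1 predicts class 1 with error 1,
# so this feature's total error is 1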
A Simple Example
# Load the data
import numpy as np
from sklearn.datasets import load_iris
dataset = load_iris()
X = dataset.data
y = dataset.target
n_samples, n_features = X.shape
# Compute the mean of each attribute
attribute_means = X.mean(axis=0)
assert attribute_means.shape == (n_features,)
# Discretize the data, using each attribute's mean as its threshold
X_d = np.array(X >= attribute_means, dtype='int')
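# Optional sanity check (an added illustration, not part of the original walkthrough):
# compare a few raw rows with their discretized counterparts to see the effect of the thresholds
print(X[:3])
print(X_d[:3])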
# Split the dataset into a training set and a test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_d, y, random_state=14)
print("There are {} training samples".format(y_train.shape[0]))
print("There are {} testing samples".format(y_test.shape[0]))
from collections import defaultdict
from operator import itemgetter
def train(X, y_true, feature):
    """Compute the predictors and error for a given feature using the OneR algorithm

    Parameters
    ----------
    X: array [n_samples, n_features]
        The two-dimensional array holding the dataset. Each row is a sample, each column is a feature.
    y_true: array [n_samples,]
        The one-dimensional array holding the class values, aligned with X so that
        y_true[i] is the class value for sample X[i].
    feature: int
        The index of the feature to test, with 0 <= feature < n_features.

    Returns
    -------
    predictors: dictionary of tuples: (value, prediction)
        For each feature value, the class to predict when the feature takes that value.
    error: float
        The total number of training samples this rule predicts incorrectly.
    """
    # Check that the feature index is valid
    n_samples, n_features = X.shape
    assert 0 <= feature < n_features
    # Get all of the unique values that this feature takes
    values = set(X[:, feature])
    # Stores the predictors dictionary that is returned
    predictors = dict()
    errors = []
    for current_value in values:
        most_frequent_class, error = train_feature_value(X, y_true, feature, current_value)
        predictors[current_value] = most_frequent_class
        errors.append(error)
    # Compute the total error of classifying on this feature
    total_error = sum(errors)
    return predictors, total_error
def train_feature_value(X, y_true, feature, value):
    # A simple dictionary to count how frequently each class appears
    # among the samples that have the given feature value
    class_counts = defaultdict(int)
    # Iterate over every sample and count the frequency of each class/value pair
    for sample, y in zip(X, y_true):
        if sample[feature] == value:
            class_counts[y] += 1
    # Get the best class by sorting (highest count first) and taking the first item
    sorted_class_counts = sorted(class_counts.items(), key=itemgetter(1), reverse=True)
    most_frequent_class = sorted_class_counts[0][0]
    # The error is the number of samples not classified as the most frequent class
    error = sum([class_count for class_value, class_count in class_counts.items()
                 if class_value != most_frequent_class])
    return most_frequent_class, error
# Compute the predictors for every feature
all_predictors = {variable: train(X_train, y_train, variable) for variable in range(X_train.shape[1])}
errors = {variable: error for variable, (mapping, error) in all_predictors.items()}
# Choose the best feature and save it as the "model"
# Sort by error, lowest first
best_variable, best_error = sorted(errors.items(), key=itemgetter(1))[0]
print("The best model is based on variable {0} and has error {1:.2f}".format(best_variable, best_error))
model = {'variable': best_variable,
         'predictor': all_predictors[best_variable][0]}
# Define the prediction function
def predict(X_test, model):
    variable = model['variable']
    predictor = model['predictor']
    y_predicted = np.array([predictor[int(sample[variable])] for sample in X_test])
    return y_predicted
y_predicted = predict(X_test, model)
# Compute the accuracy by taking the mean of the amounts that y_predicted is equal to y_test
accuracy = np.mean(y_predicted == y_test) * 100
print("The test accuracy is {:.1f}%".format(accuracy))