For the past few years my family has been pushing me to get married. Dating-site matchmakers, brick-and-mortar agencies, relatives and friends have introduced me to plenty of women of all kinds, but because I had drawn up a long list of criteria, I ended up offending just about every relative and friend who could be offended, and I am still on my own. Then, more or less by accident, it occurred to me to use a data-driven approach to check whether common matchmaking criteria actually hold up, starting with the supposed zodiac conflict explored in this post (I am writing it down mainly because the exercise took some time and deserves a simple record). I scraped some comments off the web and wrote a small program to classify them along two dimensions: whether a comment reports a real case or merely an opinion, and whether its evaluation is positive or negative. I use the Rooster-Dog "mutual harm" pairing only as a worked example to get the discussion going; anyone interested can take the code at the end and adapt it to a question they care about. The rough conclusions are as follows:
1. On the opinion side, fewer than 23% of commenters think the Rooster and the Dog do not clash; yet on the real-case side, 68% of the Rooster-Dog couples say their relationship is good or at least workable, and only 32% report that it went badly or failed. Why the gap? My guess, loosely appealing to the idea of entropy, is that concrete, talk-worthy opinions carry more information, so they spread and get discussed more easily and tend to harden into exaggerated prejudice.
2. As for why parent-child relationships feel more prone to friction: most of the parent-child cases here involve the commenter's mother, and most commenters are presumably women. Differences in outlook between mother and daughter inevitably cause some unpleasant moments, and those tend to get voiced, but that does not mean the parent-child relationship itself is bad.
3. From this data we only know that roughly 70% of Rooster-Dog couples get along well; it cannot show whether the Rooster-Dog pairing has any positive or negative effect, since no other zodiac pairings were evaluated for comparison. Still, judging from these samples alone, a quick skim suggests that at most half of the comments classified as discordant mention a breakup or divorce, so the breakup/divorce rate here should be below 20% (see the quick check right after this list); readers can compare that with the average divorce rate in China over the past ten years.
4. The data come from Douban and Zhihu comments, so the sample may well be skewed.
5. Finally, my takeaway from this data is that many factors shape how harmonious a marriage or relationship is; prevailing opinion does introduce some prejudice and negative pressure, but even under that prejudice most Rooster-Dog couples still happily enjoy their time together.
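A quick sanity check of the bound claimed in conclusion 3, assuming (as stated there) that about 32% of the pairs were classified as discordant and that at most half of those comments mention a breakup or divorce:

0.32 × 0.5 = 0.16, i.e. at most about 16%, which is indeed below 20%.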
OK, that's it for the conclusions. The implementation steps follow (a bit technical; non-programmers may want to skip them):
1. Search Baidu for a few Douban and Zhihu discussion threads on the topic, then write a small crawler to fetch them.
2. Deduplicate the roughly 500 scraped comments and annotate about 100 of them with three labels: fact vs. opinion, parent-child vs. couple relationship, and sentiment toward the relationship.
3. Clean and segment the text, then run the three classification tasks with maximum entropy, random forest, and a few boosting/stacking methods. Some time has passed between finishing the experiments and writing this post, so I no longer remember which method performed best on which task; feel free to try them yourself.
4. Since all three label tasks classified reasonably well, I did not tune the algorithms much. When the models disagreed, the final label was decided by a simple weighted vote, using each model's accuracy on the test set as its weight (see the sketch right after this list).
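Step 4 is the only part that does not appear in the code below, so here is a minimal sketch of what such an accuracy-weighted vote could look like in R. `weightedVote` is an illustrative helper written for this post, not the code I actually ran, and the names in the usage comment (`dtPre` and the accuracy values) are placeholders.

# Accuracy-weighted majority vote over the predictions of several models.
# pred_list: list of predicted label vectors, one per model (factors or characters)
# acc_vec:   test-set accuracy of each model, used as its voting weight
weightedVote <- function(pred_list, acc_vec){
  pred_list <- lapply(pred_list, as.character)
  labels <- sort(unique(unlist(pred_list)))
  n <- length(pred_list[[1]])
  sapply(seq_len(n), function(i){
    # total accuracy weight behind each candidate label for comment i
    scores <- sapply(labels, function(lab){
      sum(acc_vec[sapply(pred_list, function(p) p[i] == lab)])
    })
    labels[which.max(scores)]   # the label with the largest weighted vote wins
  })
}
# Example (dtPre and the accuracies are placeholders):
# finalPre <- weightedVote(list(rfPre, mxPre, dtPre), c(0.82, 0.78, 0.74))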
Part of the code is shown below:
scrapy.py
import re
import urllib.parse
import urllib.request

import requests
from bs4 import BeautifulSoup


# UrlManger: new_urls holds URLs waiting to be crawled, old_urls holds URLs already crawled
class UrlManger(object):
    """docstring for UrlManger"""
    def __init__(self):
        self.new_urls = set()   # URLs not yet parsed
        self.old_urls = set()   # URLs already parsed

    # pop one URL to crawl and move it to the crawled set
    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url

    # add a batch of URLs to the pending set
    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    # add a single URL, skipping any we have already seen
    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    # check whether any URLs are left to crawl
    def has_new_url(self):
        return len(self.new_urls) != 0


# HtmlDowloader: downloads a page
class HtmlDowloader(object):
    """docstring for HtmlDowloader"""
    def __init__(self):
        pass

    # fetch the page with requests and return its raw content
    def download(self, url):
        if url is None:
            return None
        headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0'}
        response = requests.get(url, headers=headers)
        if response.status_code != 200:
            print(response.status_code)
            return None
        return response.content


class HtmlParser(object):
    """docstring for HtmlParser"""
    def __init__(self):
        pass

    # collect links from the page with BeautifulSoup and return them as new_urls
    def _get_new_urls(self, page_url, soup):
        new_urls = set()
        # links whose href contains "discussion"
        links_discuss = soup.find_all('a', href=re.compile(r"/discussion"))
        # topic links whose anchor text mentions 属鸡 (year of the Rooster)
        links_topic = soup.find_all('a', href=re.compile(r"/topic/\d+"), text=re.compile(u'属鸡'))
        for link1 in links_topic:
            print(link1.get_text())  # print the topic title
        links = links_discuss
        for link in links:
            new_url = link['href']
            # urljoin merges the page URL and the relative link into one full URL
            new_full_url = urllib.parse.urljoin(page_url, new_url)
            new_urls.add(new_full_url)
        return new_urls

    def _get_new_data(self, page_url, soup):
        res_data = {}
        people_node = soup.find('a', href=re.compile(r"/people/\d+"))
        print(people_node)
        res_data['people'] = people_node.get_text()
        return res_data

    def parse(self, page_url, html_cont):
        if page_url is None or html_cont is None:
            return
        soup = BeautifulSoup(html_cont, "html.parser", from_encoding='utf-8')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data


class HtmlOutputer(object):
    """docstring for HtmlOutputer"""
    def __init__(self):
        pass

    def output(self):
        print('craw successfully')

    def collect_data(self, new_data):
        print('get new data successfully')


class SpiderMain(object):
    """docstring for SpiderMain"""
    def __init__(self):
        print('SpiderMain begin')
        self.urls = UrlManger()
        self.downloader = HtmlDowloader()
        self.parser = HtmlParser()
        self.outputer = HtmlOutputer()

    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print('craw %d: %s' % (count, new_url))
                html_cont = self.downloader.download(new_url)
                new_urls, new_data = self.parser.parse(new_url, html_cont)
                self.urls.add_new_urls(new_urls)
                if count == 3:
                    break
                count = count + 1
            except Exception:
                print(count, 'craw failed')
        self.outputer.output()


if __name__ == '__main__':
    root_url = 'https://www.douban.com/group/148995/'
    obj_spider = SpiderMain()
    obj_spider.craw(root_url)
Main.py
import urllib.request
from bs4 import BeautifulSoup


def getHtml(url):
    """Fetch the page at url and return its HTML as text."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                             '(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
    req = urllib.request.Request(url, headers=headers)
    req = urllib.request.urlopen(req)
    content = req.read().decode('utf-8')
    return content


def getComment(url):
    """Parse the HTML page and return the comments on it."""
    html = getHtml(url)
    soupComment = BeautifulSoup(html, 'html.parser')
    comments = soupComment.findAll('p', '')
    onePageComments = []
    for comment in comments:
        onePageComments.append(comment.getText() + '\n')
    return onePageComments


if __name__ == '__main__':
    urlSets = set()
    urlSets.add('https://www.douban.com/group/topic/110824994/')
    urlSets.add('https://www.douban.com/group/topic/105044195/')
    urlSets.add('https://www.douban.com/group/topic/72264943/')
    urlSets.add('https://www.douban.com/group/topic/82920501/')
    urlSets.add('https://www.douban.com/group/topic/57243174/')
    i = 1
    for urlbefore in urlSets:
        filename = '鸡狗相害page' + str(i) + '.txt'
        i += 1
        f = open(filename, 'w', encoding='utf-8')
        # Douban paginates topic comments with the ?start= offset;
        # crawling several pages may require login/verification.
        for page in range(5):
            url = urlbefore + '?start=' + str(100 * page)
            print('Comments on page %s:' % (page + 1))
            print(url + '\n')
            for line in getComment(url):
                f.write(line)
                print(line)
            print('\n')
        f.close()
docUtil.R
library(rJava)
library(Rwordseg)
library(tm)
library(maxent)

########################### Clean the raw text ############################################################
docClean <- function(sentence){
  sentence <- gsub(pattern = " ", replacement = "", sentence)
  sentence <- gsub("\t", "", sentence)
  sentence <- gsub(",", ",", sentence)   # replace full-width commas with half-width ones
  sentence <- gsub("~|'", "", sentence)
  sentence <- gsub("\\\"", "", sentence)
  return(sentence)
}

############################## Strip special characters ################################################
docCleanSp <- function(sentence){
  juzi <- as.vector(sentence)            # convert the text to a vector
  juzi <- gsub("[[:digit:]]*", "", juzi) # remove digits
  juzi <- gsub("[a-zA-Z]", "", juzi)     # remove English letters
  juzi <- gsub("\\.", "", juzi)          # remove dots
  juzi <- juzi[!is.na(juzi)]             # drop empty (NA) sentences
  juzi <- juzi[!nchar(juzi) < 2]         # `nchar` counts characters; "!" is logical NOT in R
  return(juzi)
}

######################### Build, for every term in a comment, the matching columns (class label, view label) and bind everything together
doc2termVec <- function(juzi){
  system.time(x <- segmentCN(strwords = juzi))   # word segmentation (Rwordseg)
  temp <- lapply(x, length)                      # number of terms each comment was split into
  temp <- unlist(temp)                           # lapply returns a list, so unlist it
  id <- rep(df[, "id"], temp)                    # repeat each id once per term so ids line up with terms
  class <- rep(df[, "class"], temp)              # repeat each id's class label the same number of times
  term <- unlist(x)                              # flatten the list of terms into a vector
  view <- rep(df[, "view"], temp)
  testterm <- as.data.frame(cbind(id, term, class, view), stringsAsFactors = F)
  return(testterm)
}

################################## Remove stop words ##################################
removeStopWords <- function(x, words){
  ret <- character(0)
  index <- 1
  it_max <- length(x)
  while(index <= it_max){
    if(length(words[words == x[index]]) < 1) ret <- c(ret, x[index])
    index <- index + 1
  }
  ret
}
MdlBuild.R
### Load the required packages and environment
library(rpart)
library(e1071)
library(sjmisc)
library(ROSE)
library(ROCR)
library(DMwR)
library(maxent)

######### AUC for model selection
calcAuc2 <- function(predcol, outcol){
  perf <- performance(prediction(predcol, outcol == 1), 'auc')
  as.numeric(perf@y.values)
}
##################### Common modelling functions ###############
### Naive Bayes classifier
bayesMdl <- function(predcol, trainset, testset){
  precol = colnames(predcol)
  print(table(predcol[, precol]))
  print(table(testset[, precol]))
  mdl <- naiveBayes(trainset[, precol] ~ ., data = trainset)
  pred <- predict(mdl, newdata = testset)
  print(table(pred, testset[, precol]))
  print(cbind(pred, testset[, precol]))
  print(testset[, precol])
  print(roc.curve(testset[, precol], pred, plotit = F))
  #return(pred)
}

########################### Decision-tree model ##########################
dtMdl <- function(predcol, trainset, testset){
  precol = colnames(predcol)
  print(table(trainset[, precol]))
  print(table(testset[, precol]))
  treeimb <- rpart(trainset[, precol] ~ ., data = trainset)
  pred <- predict(treeimb, newdata = testset)
  level <- as.vector(colnames(pred))
  dfPred <- as.data.frame(pred)
  # pick the class whose predicted probability exceeds 0.5
  dfPred$class <- ifelse(pred[, 1] > 0.5, level[1], ifelse(pred[, 2] > 0.5, level[2], level[3]))
  dfPred$pre <- ifelse(testset[, precol] == dfPred$class, 1, 0)
  test <- cbind(testset[, precol], dfPred)
  print(paste("there is ratio :", sum(dfPred$pre) / nrow(dfPred)))
  return(pred)
}
##### Maximum-entropy model; also time the training step
maxentMdl <- function(precol, trainset, testset){
  ptm <- proc.time()
  colnum <- dim(trainset)[2]
  model <- maxent(trainset[, 2:colnum], precol)
  ptms <- proc.time() - ptm
  print(ptms)
  m <- testset[, 2:colnum]
  n <- testset[, 1]
  # time the prediction step of the maxent model
  ptm <- proc.time()
  ms <- predict(model, m)   # predict on the test set
  ptms <- proc.time() - ptm
  print(ptms)
  # compute the accuracy
  kn <- as.character(n)     # true labels
  km <- ms[, 1]             # predicted labels
  print(table(km))
  calculate_mean <- function(kn, km){
    num <- 0
    for(i in 1:length(kn)){
      if(kn[i] == km[i]){
        num <- num + 1
      }
    }
    return(num / length(kn))
  }
  print(calculate_mean(kn, km))
  return(km)
}
################################### Random-forest model ##################################
rfMdl <- function(predcol, trainset, testset){
  library(randomForest)
  set.seed(5123512)
  precol = colnames(predcol)
  tmdl <- randomForest(x = trainset, y = trainset[, precol], ntree = 50, importance = T)
  pred <- predict(tmdl, newdata = testset)
  print(head(pred, 25))
  print(head(testset[, precol], 25))
  print(table(pred))
  dfPred <- as.data.frame(pred)
  dfPred$pre <- ifelse(testset[, precol] == pred, 1, 0)
  print(paste("there is ratio :", sum(dfPred$pre) / nrow(dfPred)))
  return(pred)
}
## EM algorithm for a Gaussian mixture model #####
### Load the required packages and set up the environment
Sys.setenv(JAVA_HOME = "c:/Program Files/Java/jre1.8.0_201/")
setwd("E:/R_workspace/rdmdata/")
library(rJava)
library(xlsx)
library(mclust)
require(mclust)
bodys <- read.csv("bodys_em.csv", header = T, sep = ',')
mean0 <- 170
mean1 <- 170
std0 <- 3
std1 <- 3
xVec <- bodys$height

################## E-step for two components: responsibility of component 0 for one observation #########
estep_two <- function(x, a0, mean0, mean1, std0, std1){
  m0 <- (x - mean0)^2 / (2 * std0^2)
  m1 <- (x - mean1)^2 / (2 * std1^2)
  # Gaussian densities up to the common 1/sqrt(2*pi) factor, weighted by the mixing proportions
  w_est0 <- (1 / std0) * exp(-m0) * a0
  print(paste("w_est0 is : ", w_est0))
  w_est1 <- (1 / std1) * exp(-m1) * (1 - a0)
  print(paste("w_est1 is : ", w_est1))
  w_est <- w_est0 / (w_est0 + w_est1)
  print(paste("w_est is : ", w_est))
  return(w_est)
}
em_gaussian_two <- function(xVec, mean0, mean1, std0, std1, threshold, iters){
  ########################### Initialise the parameters ####################
  w0 <- 0.5
  wVec <- sapply(xVec, function(x, w = w0, miu0 = mean0, miu1 = mean1, st0 = std0, st1 = std1)
    estep_two(x, w, miu0, miu1, st0, st1))
  m_est <- mean(wVec)
  iter = 1
  ########################## Alternate E-step and M-step #####################
  while(iter < iters){
    W <- sum(wVec) / length(wVec)                       # updated mixing proportion of component 0
    mean0_new <- (wVec %*% xVec) / sum(wVec)            # responsibility-weighted mean of component 0
    print(paste("mean0_new is :", mean0_new))
    dValue <- xVec - as.vector(mean0_new)
    std0_new <- sqrt(sum(wVec * dValue^2) / sum(wVec))  # responsibility-weighted standard deviation
    mean1_new <- ((1 - wVec) %*% xVec) / sum(1 - wVec)
    print(paste("mean1_new is :", mean1_new))
    std1_new <- sqrt(sum((1 - wVec) * (xVec - as.vector(mean1_new))^2) / sum(1 - wVec))
    wVec <- sapply(xVec, function(x, w = W, miu0 = mean0_new, miu1 = mean1_new, st0 = std0_new, st1 = std1_new)
      estep_two(x, w, miu0, miu1, st0, st1))
    delta <- abs(sum(wVec) / length(wVec) - W)
    print(paste("delta is :", delta))
    if(delta < threshold){
      print(paste("delta is below threshold:", delta))
      break
    }
    iter <- iter + 1
    print(paste("iter is :", iter))
  }
  return(c(std1 = std0_new, mean1 = mean0_new, std2 = std1_new, mean2 = mean1_new))
}
###### E-step when we assume a mixture of m Gaussian components
estep <- function(X, W, AVG, STD){   # W is the vector of mixing proportions of the assumed components
  len <- length(W)
  m <- vector(length = len)
  w_est <- vector(length = len)
  for(k in 1:len){
    m[k] <- (X - AVG[k])^2 / (2 * STD[k]^2)
    w_est[k] <- (1 / STD[k]) * exp(-m[k]) * W[k]   # unnormalised responsibility of component k
  }
  return(w_est)   # vector of responsibility estimates for W
}
# Logistic-regression classifier
glmMdl <- function(predcol, trainset, testset){
  precol = colnames(predcol)
  print(table(predcol[, precol]))
  print(table(testset[, precol]))
  mdl <- glm(formula = trainset[, precol] ~ ., data = trainset)
  modelGlm2 <- step(mdl, trace = 0)      # stepwise variable selection
  summary(modelGlm2)
  # significance test of the model
  anova(object = modelGlm2, test = "Chisq")
  HL_test <- hoslem_gof(x = modelGlm2)
  pred <- predict(modelGlm2, newdata = testset)
  accuracy.meas(as.numeric(testset[, precol]), as.numeric(pred > 0.5))
}

logLikelyhood <- function(outcol, predcol){
  sum(ifelse(outcol == pos, log(predcol), log(1 - predcol)))
}
########################## Bagging ############################
ntrain <- dim(dTrain)[1]
n <- ntrain
ntree <- 100
fv <- paste(outcome, '==1 ~ ', paste(selVars, collapse = ' + '), sep = '')
##### Draw the bootstrap samples: ntree iterations, each producing one vector of sampled row indices
samples <- sapply(1:ntree, FUN = function(iter){
  sample(1:ntrain, size = n, replace = T)
})
treelist <- lapply(1:ntree, FUN = function(iter){
  samp <- samples[, iter]
  rpart(fv, data = dTrain[samp, ],
        control = rpart.control(cp = 0.001, minsplit = 1000, minbucket = 1000, maxdepth = 5))
})
# average the predictions of all trees in the bag
predict.bag <- function(treelist, newdata){
  preds <- sapply(1:length(treelist), FUN = function(iter){
    predict(treelist[[iter]], newdata = newdata)
  })
  predsums <- rowSums(preds)
  predsums / length(treelist)
}

accuracyMeasures <- function(pred, truth, name = "model"){
  dev.norm <- -2 * logLikelyhood(as.numeric(truth), pred) / length(pred)  # normalised deviance
  ctable <- table(truth = truth,
                  pred = (pred > 0.5))
  accuracy <- sum(diag(ctable)) / sum(ctable)
  precision <- ctable[2, 2] / sum(ctable[, 2])
  recall <- ctable[2, 2] / sum(ctable[2, ])
  f1 <- 2 * precision * recall / (precision + recall)
  data.frame(model = name, accuracy = accuracy, f1 = f1, dev.norm)
}

knnPred <- function(nK, df){
  library(class)
  knnDecision <- knn(knnTrain, df[, selVars], knnCl, k = nK, prob = T)
  pred <- ifelse(knnDecision == TRUE,
                 attributes(knnDecision)$prob,
                 1 - (attributes(knnDecision)$prob))
  print(paste(calcAuc(pred, df[, outcome]), ' nk is : ', nK))
}
dataStats.R
library(Hmisc)
library(DMwR)
# The plotting helpers below also need these packages (qqPlot is in car, vioplot in vioplot, performance/prediction in ROCR)
library(car)
library(vioplot)
library(ROCR)

# Basic descriptive statistics: size, mean, standard deviation, skewness, kurtosis
mystats <- function(x, na.omit = F){
  if(na.omit)
    x <- x[!is.na(x)]
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  skew <- sum((x - m)^3 / s^3) / n
  kurt <- sum((x - m)^4 / s^4) / n - 3
  return(c(n = n, mean = m, stdev = s, skew = skew, kurt = kurt))
}

# Histogram with density line for one column of a data frame
mySimpleSummary <- function(x, na.omit = F){
  colName <- colnames(x)
  hist(x[, colName], prob = T, xlab = colName, main = paste("Hist of ", colName))
  lines(density(x[, colName], na.rm = T))
  rug(jitter(x[, colName]))   # rug shows how dense the data are along the x axis; jitter adds small random noise
}
# Descriptive statistics and diagnostic plots for one column of a data frame
mySummary <- function(x, na.omit = F){
  print(summary(x))
  colName <- colnames(x)
  par(mfrow = c(2, 2))
  hist(x[, colName], prob = T, xlab = colName, main = paste("Hist of ", colName))
  lines(density(x[, colName], na.rm = T))
  rug(jitter(x[, colName]))   # data density along the x axis
  qqPlot(x[, colName], xlab = colName, main = paste("qq plot of ", colName))
  # simple cleaning: fill missing values with the mean
  x[which(is.na(x)), ] <- mean(x[, 1], na.rm = T)
  # box plot of the distribution
  boxplot(x[, colName], ylab = paste("distribution of", colName), col = "gold")
  title(paste("box plot of ", colName))
  rug(jitter(x[, colName]), side = 2)   # data density along the y axis
  abline(h = mean(x[, colName], na.rm = T), lty = 1)
  abline(h = mean(x[, colName], na.rm = T) + sd(x[, colName], na.rm = T), lty = 2)
  abline(h = median(x[, colName], na.rm = T), lty = 3)
  # violin plot of the distribution
  vioplot(x[, colName], names = colName, col = "blue")
  title(paste("viobox plot of ", colName))
  rug(jitter(x[, colName]), side = 2)   # data density along the y axis
  abline(h = mean(x[, colName], na.rm = T), lty = 1)
}
# Conditioned box-percentile plot
myBwplot <- function(x){
  bwplot(size ~ a1, data = test_data, panel = panel.bpplot,
         probs = seq(.01, .49, by = .01), datadensity = TRUE,
         ylab = paste('river ', size), xlab = paste('Algal ', a1))
}

calcAuc <- function(predcol, outcol){
  perf <- performance(prediction(predcol, outcol == pos), 'auc')
  as.numeric(perf@y.values)
}

# Show the rows of a data frame whose number of NAs exceeds a threshold
naDataView <- function(x){
  m <- floor(length(x) / 5)
  b <- rowSums(is.na(x)) >= m
  return(x[b, ])
}

# Return, for each variable, the proportion of values that are NA
naColpercent <- function(x){
  a <- colSums(is.na(x))
  per <- a / nrow(x)
  return(per)
}
## Compact display of the correlation matrix
symnum(cor(df_clean, use = 'complete.obs'))

# Drop the samples with too many missing values and impute the rest
dataClean <- function(x, y, df){
  factorCount <- table(x[, y])            # number of samples per class
  naFaCount <- table(naDataView(x)[, y])  # per class, the number of samples with enough NAs to be dropped
  naPercent <- naFaCount / factorCount    # per class, the proportion of droppable samples
  c <- sapply(naPercent, function(p) p <= 0.05)
  if(sum(c) == length(naPercent)){        # if every class loses at most 5% of its samples, drop the NA rows
    x <- na.omit(x)
  }else{
    dataReplace(df)                       # otherwise fall back to imputation
  }
  return(x)
}
# Fill the missing values of a variable from a linear relationship (lineCor holds the intercept and slope)
fillCorNa1 = function(x) sapply(x[is.na(x), 1], function(x)
  if(is.na(x))
    return(NA)
  else
    return(lineCor[1] + lineCor[2] * x)
)

# Compute a replacement for one missing value from a linear relationship
fillCorNa2 <- function(x, lineCor){
  if(is.na(x))
    return(NA)
  else{
    return(lineCor[1] + lineCor[2] * x)
  }
}

# Simple data-frame cleaning: drop rows with too many NAs, then impute the rest with kNN
simpleDfClean <- function(df, x){
  df <- df[-manyNAs(x, 0.1), ]
  df <- knnImputation(df, k = 10)
  return(df)
}

# Fit a linear relationship between two vectors and return its intercept and slope
getlineCor <- function(x, y){
  lm1 <- lm(y ~ x)
  cof <- c(lm1$coefficients[1], lm1$coefficients[2])
  return(cof)
}
test.R
library(rJava)
library(xlsx)
library(Rwordseg)
library(tm)

### Read the labelled text and build the working data frame
test <- read.xlsx2("生肖匹配分析.xlsx", 1, header = T, fileEncoding = "UTF-8")
class <- read.csv("class.csv", header = T)
view <- read.csv("view.csv", header = T)
df <- cbind(class, view, test)
id <- seq(1, 454, by = 1)
df <- cbind(id, df)
#head(df, 10)

### Clean the comment text (docClean is defined in docUtil.R)
df$sentence <- docClean(df$sentence)
sentence <- unique(df$sentence)

##################### Remove stop words from every document and build the document list ##################################
data_stw <- readLines("chineseStopWords.txt", encoding = "UTF-8")
doc_CN <- list()
for(j in 1:dim(df)[1]){
  x <- segmentCN(as.character(df[j, 4]), nosymbol = TRUE)   # segment the document into words
  doc_CN[[j]] <- removeStopWords(x, data_stw)               # drop the stop words
}
######################################## Build the corpus ###############################################
kvid <- Corpus(VectorSource(doc_CN))   # build a tm corpus from the segmented documents
meta(kvid, "class") <- class
#unique_class <- unique(class)
kvid <- tm_map(kvid, stripWhitespace)  # remove the blanks left behind by stop-word removal

######################## Build the term-document matrix (TDM); only the sentence part of the documents is used ####################
# removePunctuation drops punctuation, minDocFreq = 1 keeps terms appearing at least once,
# wordLengths sets the allowed word length; the default weighting is plain term frequency.
control = list(removePunctuation = T, minDocFreq = 1, wordLengths = c(1, Inf))
tdm = TermDocumentMatrix(kvid, control)        # term-document matrix
ts.tdm <- DocumentTermMatrix(kvid, control)
sample.dtm <- TermDocumentMatrix(kvid, control = list(wordLengths = c(2, Inf)))
tdm_removed3 = removeSparseTerms(ts.tdm, 0.99) # drop very sparse terms
tdm_matrix4 <- as.matrix(tdm_removed3)
# Read the classes and their counts so the class column can be appended to the term matrix for classification.
##typ_text = read.table("部门类别及数量.txt",sep='\t',header = TRUE,row.names=1,fileEncoding = "UTF-8")
n = 1
######################################################################
# For each of the three classes, bind the class label to its rows of the document-term matrix
# and write them to the file named after that class.
for(i in 1:3){
  m = n + table(class)[[i]]
  #ts <- inspect(tdm[1:length(tdm$dimnames$Terms), n:m-1])   ### document(row)-term(column) matrix of one class
  colnum <- ncol(tdm_matrix4)
  tk <- tdm_matrix4[n:m-1, 4:colnum]
  tf <- as.matrix(class[n:m-1, ])
  colnames(tf) <- "class"
  tm <- cbind(tf, tk)                     # bind each document's class label to its document-term row
  filename <- paste(i, '.txt', sep = "")  # then write the rows to the file for that class
  write.table(tm, filename, sep = "\t", col.names = NA, fileEncoding = "UTF-8")
  n = n + table(class)[[i]]
}
########################### Text-classification code ######################
library(tm)
trainSet <- data.frame(NULL)
testSet <- data.frame(NULL)
# Split classes 1 and 3 in half to get a train/test split for a first round of testing
filename <- paste(1, '.txt', sep = "")
text = read.table(filename, header = TRUE, sep = "\t", row.names = 1, fileEncoding = "UTF-8")
len <- dim(text)[1]
colnum <- dim(text)[2]
sam <- trunc(len * 1 / 2)                   # take half of the documents; trunc rounds down
trainSet <- rbind(trainSet, text[1:sam, ])  # first half goes to the training set
k <- sam + 1
testSet <- rbind(testSet, text[k:len, ])    # the rest goes to the test set
filename <- paste(3, '.txt', sep = "")
text = read.table(filename, header = TRUE, sep = "\t", row.names = 1, fileEncoding = "UTF-8")
len <- dim(text)[1]
sam <- trunc(len * 1 / 2)
trainSet <- rbind(trainSet, text[1:sam, ])  # first half goes to the training set
k <- sam + 1
testSet <- rbind(testSet, text[k:len, ])    # the rest goes to the test set

######################## Build the training set ###########################
filename <- paste(1, '.txt', sep = "")
text = read.table(filename, header = TRUE, sep = "\t", row.names = 1, fileEncoding = "UTF-8")
len <- dim(text)[1]
colnum <- dim(text)[2]
trainSet <- rbind(trainSet, text)
filename <- paste(3, '.txt', sep = "")
text = read.table(filename, header = TRUE, sep = "\t", row.names = 1, fileEncoding = "UTF-8")
trainSet <- rbind(trainSet, text)           # classes 1 and 3 go to the training set
filename <- paste(2, '.txt', sep = "")
text = read.table(filename, header = TRUE, sep = "\t", row.names = 1, fileEncoding = "UTF-8")
testSet <- rbind(testSet, text)             # class 2 goes to the test set

################ Build the complete data set ###########################
sets <- data.frame(NULL)
for(i in 1:3){
  filename <- paste(i, '.txt', sep = "")
  text = read.table(filename, header = TRUE, sep = "\t", row.names = 1, fileEncoding = "UTF-8")
  sets <- rbind(sets, text)
}
############## Build the full matrix containing view, class, sentence and the segmented terms ##############################
filename <- paste('result2', '.txt', sep = "")
testSet = read.table(filename, header = TRUE, sep = "\t", row.names = 1, fileEncoding = "UTF-8")
id <- row.names(testSet)
sets[id, "class"] <- testSet[id, "pred"]   # copy the predicted class labels back into the full set
sets <- sets[which(row.names(sets) != "701"), ]
sets <- sets[which(row.names(sets) != "3801"), ]
sentence <- as.data.frame(df$sentence)
setence <- sentence[, 1]
view <- read.csv("view.csv", header = T)
result <- cbind(view, sets[, "class"], setence, sets)
filename <- paste('result1', '.txt', sep = "")   # write the combined table to a file
write.table(result, filename, sep = "\t", col.names = NA, fileEncoding = "UTF-8")
factsSet <- result[which(result$class == 'facts'), ]
viewSet <- result[which(result$class == 'view'), ]
trainSet <- data.frame(NULL)
testSet <- data.frame(NULL)
colnum <- dim(result)[2]
resultSet <- result[which(result$view != 'unkown'), c(1, 5:colnum)]   # comments whose view label is known
testSet <- result[which(result$view == 'unkown'), c(1, 5:colnum)]     # comments still to be labelled
len <- dim(resultSet)[1]
colnum <- dim(resultSet)[2]

############################# Build a train/test split of the labelled data for model selection ##################
sam <- sample.int(len, len / 3, replace = FALSE)
#sam <- trunc(len * 1 / 2)
trainSet <- rbind(trainSet, resultSet[sam, ])
testSet <- rbind(testSet, resultSet[-sam, ])
trainSet$view <- factor(trainSet$view, levels = c("mid", "neg", "pos"))
testSet$view <- factor(testSet$view, levels = c("mid", "neg", "pos"))
############################# Build the real training and test sets for final modelling and classification ##################
trainSet <- data.frame(NULL)
trainSet <- resultSet                 # train on all labelled comments
len <- dim(testSet)[1]
trainSet$view <- factor(trainSet$view, levels = c("mid", "neg", "pos"))
# give the unlabelled test set placeholder levels so the factor has all three levels
testSet[1:10, 1] <- "mid"
testSet[11:20, 1] <- "neg"
testSet[21:len, 1] <- "pos"
testSet$view <- factor(testSet$view, levels = c("mid", "neg", "pos"))
dtMdl(trainSet[1], trainSet, testSet)
rfPre <- rfMdl(trainSet[1], trainSet, testSet)
mxPre <- maxentMdl(trainSet$view, trainSet, testSet)
# attach the original sentences and both models' predictions to the test rows
id <- row.names(testSet)
sentence <- as.data.frame(df$sentence)
test_sentence <- sentence[id, 1]
test <- cbind(rfPre, mxPre, test_sentence, testSet)
id2 <- row.names(trainSet)
class <- result[which(result$view == 'unkown'), 2]
test <- cbind(class, rfPre, mxPre, test_sentence, testSet)
# for the training rows the "prediction" is simply the known label
test_sentence <- sentence[id2, 1]
rfPre <- trainSet$view
mxPre <- trainSet$view
class <- result[which(result$view != 'unkown'), 2]
train <- cbind(class, rfPre, mxPre, test_sentence, trainSet)
result2 <- rbind(test, train)
filename <- paste('result211', '.txt', sep = "")   # write the final table to a file
write.table(result2, filename, sep = "\t", col.names = NA, fileEncoding = "UTF-8")
#### It seems Jianshu doesn't let me upload files, so I'll stop here; go scrape the Douban comments yourselves. I also have the resulting labelled dataset, and anyone interested can ask me for it.