A Machine Learning Model for Resume Screening

https://github.com/LaPetiteChat/resumeparsing/blob/master/README.md

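Resume screening starts with preprocessing: a data specialist encodes each applicant's degree, education, and skills as numeric values. The data is then split into a training set and a test set, and a model is fitted to it. The model's output is turned into a score, and candidates are ranked by that score. The work proceeds in the following steps: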

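Step 1: Handle the data;

Step 2: Summarize the data by class;

Step 3: Make predictions;

Step 4: A/B test the predictions;

Step 5: Measure the accuracy of the predictions;

Step 6: Tie all the code together and run it.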

I based the code on Jason Brownlee's from-scratch Python implementation of the naive Bayes classifier: https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/

After preprocessing, the data should be in the following format:

2,80,75,77.5,170,76.5623006,100,0.477837813,0.48099608;

1,60,60,60,170,62.56563666,100,0.41483181,0.018300161;

1,70,60,65,180,11.8019322,100,0.587375064,0.59734792;

2,95,90,92.5,180,9.192119736,100,0.724807147,0.793548513;

2,70,60,65,190,61.05618588,100,0.210516292,0.847069275;

2,80,60,70,175,53.8670648,100,0.611277028,0.548198959;

2,98,80,89,180,97.73447261,100,0.327901412,0.901802248;

2,100,70,85,190,15.40299551,100,0.102116189,0.737199743;

2,80,60,70,170,19.10572653,100,0.304586768,0.990202928;

2,75,80,77.5,175,56.2798471,100,0.188033658,0.031831499;

2,60,60,60,108,66.80843217,100,0.31256692,0.721842211;

1,60,60,60,128,97.83551224,100,0.31916056,0.874652191;

1,60,60,60,118,87.06427613,100,0.60115086,0.350288914;

1,80,60,70,148,18.23221236,100,0.203243346,0.647979462;

1,70,60,65,96,92.66992657,100,0.636451279,0.674574266;

2,70,60,65,56,30.32381303,100,0.052958132,0.367499219;

2,70,60,65,78,97.02955264,100,0.838578477,0.237851524;

1,75,60,67.5,99,43.84248184,100,0.429382186,0.963473002;

1,80,60,70,80,16.93151228,100,0.590318547,0.612625236;

2,89,60,74.5,76,68.57730357,100,0.812078797,0.023798746;

2,75,60,67.5,70,34.40610591,100,0.798953348,0.747503739;

2,80,60,70,70,68.67798479,100,0.594131327,0.35679144;

1,80,65,72.5,70,42.83178707,100,0.20246207,0.759385263;

2,80,70,75,78,94.55595774,100,0.989418371,0.719005387;

2,80,70,75,80,31.86938847,100,0.011526203,0.93627719;

2,95,80,87.5,80,33.70595686,100,0.416061459,0.396876282;

2,80,70,75,80,32.24077154,100,0.306767368,0.380221452;

2,75,70,72.5,70,73.98029002,100,0.869762403,0.36895025;

2,95,80,87.5,75,42.24168725,100,0.145999377,0.84569649;

2,70,60,65,65,21.85085821,100,0.433577036,0.139770985;

2,80,60,70,70,94.27238981,100,0.272838291,0.514772471;

2,80,80,80,180,50.52999248,100,0.478211656,0.115011178;

2,80,60,70,70,28.98835362,100,0.337907792,0.163271458;

2,70,70,70,70,42.22976876,100,0.923559771,0.159577817;

2,95,80,87.5,88,99.49893906,100,0.629810507,0.240372341;

2,70,60,65,60,58.33858912,100,0.532191584,0.311663668;

2,75,60,67.5,65,37.64067938,100,0.017666922,0.123165325;

2,95,70,82.5,76,44.28547402,100,0.614640793,0.57501755;

2,80,60,70,75,55.65959732,100,0.905169823,0.959682442;

4,80,70,75,70,4.119718261,100,0.131295751,0.838565154;

1,95,80,87.5,88,62.25351263,100,0.046046188,0.749358644;

2,95,70,82.5,138,11.04213716,100,0.830072134,0.261641346;

1,80,70,75,148,22.23522883,100,0.951078931,0.112332275;

2,75,60,67.5,118,94.81548956,100,0.090262754,0.968252204;

1,75,60,67.5,108,45.05536366,100,0.841871612,0.846410278;

2,80,70,75,180,19.3636175,70,0.855828096,0.759138723;

2,85,70,77.5,100,60.06062389,70,0.131260472,0.110782428;

2,80,60,70,180,9.138379193,70,0.766458516,0.222152779;

1,60,60,60,180,40.79941226,70,0.853640842,0.491858233;

1,70,60,65,118,61.37785318,70,0.685999587,0.854444086;

2,85,70,77.5,178,94.46345188,70,0.066474014,0.352098625;

1,80,70,75,100,95.4459327,70,0.285268038,0.025587163;

2,95,80,87.5,100,15.66591961,70,0.790453718,0.254196496;

2,60,60,60,128,0.44465934,70,0.398131268,0.069557991;

1,80,70,75,118,19.29234555,70,0.704691473,0.063151475;

4,80,70,75,148,49.61073014,70,0.156932286,0.754304454;

1,75,70,72.5,180,57.61196301,70,0.934810363,0.315597648;

2,75,70,72.5,138,7.793499915,70,0.48824784,0.987923607;

2,80,70,75,128,61.27801532,70,0.073013419,0.989033413;

1,75,60,67.5,138,73.70227221,70,0.188176053,0.127624627;

2,70,60,65,180,51.67259957,70,0.31772058,0.103774004;

2,80,70,75,178,36.12180926,70,0.316417386,0.846812165;

1,75,60,67.5,128,91.51184891,70,0.775280982,0.268165881;

2,95,80,87.5,128,50.04427752,70,0.093976794,0.326069983;

2,80,70,75,138,84.97538663,80,0.029917753,0.760080103;

2,75,60,67.5,148,65.63419798,80,0.537541227,0.781482787;

1,80,70,75,130,55.12483937,80,0.24535984,0.508181576;

1,80,70,75,130,83.15918437,80,0.132191626,0.45038017;

2,80,70,75,128,10.61381313,80,0.566090299,0.688339999;

2,80,70,75,128,99.07948967,80,0.769833272,0.108526708;

2,70,60,65,130,85.71394939,80,0.494068544,0.548402184;

2,75,60,67.5,148,47.73810582,80,0.624882073,0.421783056;

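Assumptions:

Degree: Bachelor=1, Master=2, PhD=4

Education1: a score assigned from the national ranking of Chinese universities, weight 50%

Education2: a score taken from the combined ranking of universities' computer science programs, Fudan=80, Shanghai Jiao Tong=95, Nanjing=80, …, weight 50%

Education3: the overall education score, computed from the two scores above at a 1:1 ratio

Skills: C++=100, Java=90, Python=98, SQL=80, …

Work Experience: ibm=80, Huawei=90, Microsoft=70, …

Position: dev=100, qa=80, manager=70, no hire=50

Years of Work Experience:

Assigned by the number of skill separators: one skill is scored 1, two skills are scored 2, and so on.

Plus: ranked by how often key skill keywords appear, where debug=5, dev=4, program=4, test=3, …

(All of the values above are assumptions; a real calculation should follow the actual design. To save time, I auto-generated random numbers between 0 and 1 to fill in the blank fields.)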

The dataset contains 72 records, that is, the data of 72 candidates.

Where a candidate is a no hire, the null value can be filled in as 50.
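To make the encoding concrete, here is a minimal sketch of how one parsed resume might be turned into numbers under the assumptions above. The dictionary keys, the `encode` and `education_score` names, and the choice of output columns are hypothetical and only for illustration; the lookup values are the ones assumed in this post.

```python
# Hypothetical lookup tables; the values mirror the assumptions listed above.
DEGREE = {"Bachelor": 1, "Master": 2, "PhD": 4}
POSITION = {"dev": 100, "qa": 80, "manager": 70, "no hire": 50}


def education_score(education1, education2):
    """Combine the two ranking scores at a 1:1 ratio (50% weight each)."""
    return 0.5 * education1 + 0.5 * education2


def encode(candidate):
    """Turn one parsed resume (a dict with hypothetical keys) into numbers."""
    edu1 = candidate["education1"]
    edu2 = candidate["education2"]
    return [
        DEGREE[candidate["degree"]],
        edu1,
        edu2,
        education_score(edu1, edu2),
        # A missing position is treated like "no hire" and scored 50.
        POSITION.get(candidate.get("position"), POSITION["no hire"]),
    ]


row = encode({"degree": "Master", "education1": 80, "education2": 75, "position": "dev"})
print(row)  # [2, 80, 75, 77.5, 100]
```

Keeping the lookup tables in plain dictionaries makes it easy to swap in the real scoring design later without touching the rest of the pipeline.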

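How the dataset is split:

The dataset can be split into 67% training data and 33% test data. When making predictions, the data is divided into 5 folds to check how well the results hold up.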
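The two ways of dividing the data described above can be sketched as follows. This is only an illustration: the `split_dataset` and `k_folds` names, the fixed seed, and the integer stand-ins for the 72 records are assumptions, not part of the original code.

```python
import random


def split_dataset(rows, split_ratio, rng):
    """Shuffle the rows and put split_ratio of them into the training set."""
    remaining = list(rows)
    rng.shuffle(remaining)
    train_size = int(len(remaining) * split_ratio)
    return remaining[:train_size], remaining[train_size:]


def k_folds(rows, k, rng):
    """Shuffle the rows and deal them into k folds of near-equal size."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


rng = random.Random(1)          # fixed seed so the split is repeatable
rows = list(range(72))          # stand-ins for the 72 candidate records
train, test = split_dataset(rows, 0.67, rng)
print(len(train), len(test))    # 48 24

folds = k_folds(rows, 5, rng)
print([len(f) for f in folds])  # [15, 15, 14, 14, 14]

# Each fold is held out once while the other four are used for training.
for i, held_out in enumerate(folds):
    training = [r for j, fold in enumerate(folds) if j != i for r in fold]
    assert len(training) + len(held_out) == len(rows)
```

With only 72 records, averaging the accuracy over the 5 held-out folds gives a steadier estimate than a single 48/24 split.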

The full code is as follows:

# Example of Naive Bayes implemented from scratch in Python (Python 3)
import csv
import math
import random

def loadCsv(filename):
    # Read every non-empty row and convert each field to a float.
    # A trailing ';' (as in the listing above) is stripped before conversion.
    with open(filename, newline='') as f:
        rows = [row for row in csv.reader(f) if row]
    return [[float(x.strip().rstrip(';')) for x in row] for row in rows]

def splitDataset(dataset, splitRatio):
    # Randomly move rows into the training set until it reaches the target size.
    trainSize = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainSize:
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    return [trainSet, copy]

def separateByClass(dataset):
    # Group the rows by class; the last column of each row is the class label.
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if (vector[-1] not in separated):
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

def mean(numbers):
    return sum(numbers)/float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x-avg,2) for x in numbers])/float(len(numbers)-1)
    return math.sqrt(variance)

def summarize(dataset):
    # Mean and standard deviation of every column; the label column is dropped.
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]
    return summaries

def summarizeByClass(dataset):
    separated = separateByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        summaries[classValue] = summarize(instances)
    return summaries

def calculateProbability(x, mean, stdev):
    # Gaussian probability density of x for the given mean and standard deviation.
    exponent = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))
    return (1 / (math.sqrt(2*math.pi) * stdev)) * exponent

def calculateClassProbabilities(summaries, inputVector):
    # Multiply the per-attribute densities together for each class.
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mean, stdev = classSummaries[i]
            x = inputVector[i]
            probabilities[classValue] *= calculateProbability(x, mean, stdev)
    return probabilities

def predict(summaries, inputVector):
    # Pick the class with the largest combined probability.
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel

def getPredictions(summaries, testSet):
    predictions = []
    for i in range(len(testSet)):
        result = predict(summaries, testSet[i])
        predictions.append(result)
    return predictions

def getAccuracy(testSet, predictions):
    # Percentage of test rows whose predicted label matches the true label.
    correct = 0
    for i in range(len(testSet)):
        if testSet[i][-1] == predictions[i]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0

def main():
    filename = 'employeesdata1.csv'
    splitRatio = 0.67
    dataset = loadCsv(filename)
    trainingSet, testSet = splitDataset(dataset, splitRatio)
    print('Split {0} rows into train={1} and test={2} rows'.format(len(dataset), len(trainingSet), len(testSet)))
    # prepare model
    summaries = summarizeByClass(trainingSet)
    # test model
    predictions = getPredictions(summaries, testSet)
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: {0}%'.format(accuracy))

main()
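As a quick sanity check of the Gaussian density at the heart of the classifier, the sketch below evaluates it for one value and then classifies a single made-up input. The `gaussian_pdf` name and all of the numbers are illustrative assumptions. Like the listing, it multiplies the per-attribute densities and does not weight classes by how often they occur.

```python
import math


def gaussian_pdf(x, mean, stdev):
    """Density of a normal distribution with the given mean and stdev at x."""
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)


def predict(summaries, input_vector):
    """Return the class whose per-attribute densities multiply to the largest value."""
    scores = {}
    for label, attribute_stats in summaries.items():
        score = 1.0
        for x, (mean, stdev) in zip(input_vector, attribute_stats):
            score *= gaussian_pdf(x, mean, stdev)
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up (mean, stdev) summaries for one attribute and two classes.
summaries = {"A": [(1.0, 0.5)], "B": [(20.0, 5.0)]}
print(round(gaussian_pdf(71.5, 73.0, 6.2), 4))  # 0.0625
print(predict(summaries, [1.1]))                # A
```

The input 1.1 lies close to the mean of class A and almost four standard deviations away from the mean of class B, so A wins by a wide margin.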

Running the complete listing should finally print a summary like this:

1. Split 72 rows into train=48 and test=24 rows

2. Accuracy: 76.3779527559%
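The listing stops at accuracy, while the introduction also mentions scoring and ranking candidates. One possible way to do that, sketched under assumptions, is to normalize the per-class probabilities and sort candidates by the share that falls on the positive class. The class labels "hire" and "no hire", the candidate names, and the raw scores below are all made up for illustration.

```python
def normalize(probabilities):
    """Scale raw class scores so that they sum to 1."""
    total = sum(probabilities.values())
    return {label: p / total for label, p in probabilities.items()}


def rank_candidates(class_scores, positive_label):
    """Sort candidates by their normalized score for positive_label, best first."""
    scored = {
        name: normalize(probs)[positive_label]
        for name, probs in class_scores.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Made-up raw class scores for three hypothetical candidates.
class_scores = {
    "candidate_a": {"hire": 0.006, "no hire": 0.002},
    "candidate_b": {"hire": 0.001, "no hire": 0.009},
    "candidate_c": {"hire": 0.004, "no hire": 0.004},
}
ranking = rank_candidates(class_scores, "hire")
for name, score in ranking:
    print(name, round(score, 2))
```

In this toy input, candidate_a ranks first, candidate_c second, and candidate_b last.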
