通过八爪鱼软件,爬取了前程无忧网和猎聘网关键词“数据分析”的全国范围内1月内发布的企业职位。
本次分析源数据的职位发布日期:2019年7月20日-2019年7月30日
清洗完成后,用于分析的数据2875条
数据清洗
数据结构
爬取到的原始数据有X条,使用MySQL对爬取的原始数据进行清洗。
原始数据字段:
id int(255):序号
jobtitle VARCHAR(255) :职位名称
company VARCHAR(255):公司名称
salary VARCHAR(255):薪酬
position VARCHAR(255):工作地
qualification VARCHAR(255):职位要求,包括学历、工作经验、语言、年龄
tag VARCHAR(255):职位标签,包括五险一金、年底双薪、节日福利等
industry VARCHAR(255):行业
size VARCHAR(255):公司规模,人员数
职位名称
删除空值:
DELETE
FROM
liepin
WHERE
jobtitle IS NULL;
删除与数据分析无关的职位:
DELETE
FROM
51job
WHERE
jobtitle NOT LIKE '%数据%分析%'
AND jobtitle NOT LIKE '%大数据%'
AND jobtitle NOT LIKE '%数据%运营%'
AND jobtitle NOT LIKE '%DBA%';
薪酬数据
将薪酬算为平均年薪:
--提取 salary,获得 min - max - average salary
--新建列
ALTER TABLE liepin ADD ( min_salary FLOAT ( 10 ), max_salary FLOAT ( 10 ), average_salary FLOAT ( 10 ) );
--when `salary` LIKE '%面议%',min / max / average = 0
UPDATE liepin
SET min_salary = 0,
max_salary = 0,
average_salary = 0
WHERE
salary LIKE '%面议%';
--salary不是面谈时,salary格式 'min-max万' --min_salary = min
UPDATE liepin
SET min_salary = SUBSTRING_INDEX( salary, '-', 1 )
WHERE
salary LIKE '%-%万%';
--max_Salary = max
UPDATE liepin
SET max_salary = (
SUBSTRING( SUBSTRING_INDEX( salary, '万', 1 ), LOCATE( '-', SUBSTRING_INDEX( salary, '万', 1 ) ) + 1 )
)
WHERE
salary LIKE '%-%万%';
--average_salary = ( min + max ) / 2
UPDATE liepin
SET average_salary = ( min_salary + max_salary ) / 2
WHERE
salary LIKE '%-%万%';
工作地域
将工作地域分成两个字段,分成城市和区域,例如:字段1:北京 字段2:朝阳区
ALTER TABLE liepin ADD (
position1 VARCHAR(255),
position2 VARCHAR(255)
);
UPDATE liepin SET position1 = position
WHERE position NOT LIKE '%-%';
UPDATE liepin SET position2 = 'null'
WHERE position NOT LIKE '%-%';
UPDATE liepin SET position1 = SUBSTRING_INDEX(`position`,'-',1)
WHERE position LIKE '%-%';
UPDATE liepin SET position2 = SUBSTRING(`position`,LOCATE('-',`position`)+1)
WHERE position LIKE '%-%';
职位要求
把职位要求分成学历、经验、年龄和语言:
ALTER TABLE liepin ADD ( education VARCHAR ( 255 ), experience VARCHAR ( 255 ), age VARCHAR ( 255 ), LANGUAGE VARCHAR ( 255 ) );
--学历
UPDATE liepin
SET education = substring_index( qualification, ' ', 1 );
--工作经验
UPDATE liepin
SET experience = SUBSTRING( SUBSTRING_INDEX( qualification, ' ', 37 ), - 5 );
--年龄要求
UPDATE liepin
SET age = substring_index( qualification, ' ',- 1 );
--工作语言
UPDATE liepin
SET LANGUAGE = SUBSTRING( SUBSTRING_INDEX( qualification, ' ', 73 ), - 5 );
职位标签
将职位标签分开存入不同字段:
ALTER TABLE liepin ADD ( label1 VARCHAR ( 255 ), label2 VARCHAR ( 255 ) );
UPDATE liepin
SET label1 = SUBSTRING_INDEX( tag, ' ', 1 );
UPDATE liepin
SET label2 = SUBSTRING_INDEX( tag, ' ', - 1 );
行业
将行业简化划分:
ALTER TABLE liepin ADD ( industry1 VARCHAR ( 255 ) );
UPDATE liepin
SET industry1 = '其他'
WHERE
industry IS NOT NULL;
UPDATE liepin
SET industry1 = '互联网/电子商务'
WHERE
industry LIKE '%互联网%';
UPDATE liepin
SET industry1 = '投资'
WHERE
industry LIKE '%投资%';
UPDATE liepin
SET industry1 = '计算机软件'
WHERE
industry LIKE '%计算机%';
UPDATE liepin
SET industry1 = 'IT服务'
WHERE
industry LIKE '%IT%';
UPDATE liepin
SET industry1 = '咨询'
WHERE
industry LIKE '%咨询%';
UPDATE liepin
SET industry1 = '保险'
WHERE
industry LIKE '%保险%';
UPDATE liepin
SET industry1 = '银行'
WHERE
industry LIKE '%银行%';
UPDATE liepin
SET industry1 = '服装'
WHERE
industry LIKE '%服装%';
UPDATE liepin
SET industry1 = '通信'
WHERE
industry LIKE '%通信%';
UPDATE liepin
SET industry1 = '食品'
WHERE
industry LIKE '%食品%';
UPDATE liepin
SET industry1 = '百货'
WHERE
industry LIKE '%百货%';
UPDATE liepin
SET industry1 = '游戏'
WHERE
industry LIKE '%游戏%';
UPDATE liepin
SET industry1 = '房地产'
WHERE
industry LIKE '%房地产%';
公司规模
将公司规模统一为:1-49人、50-99人、...、10000人以上
ALTER TABLE liepin ADD ( size1 VARCHAR ( 255 ) );
UPDATE liepin
SET size1 = SUBSTRING_INDEX( size, ':',- 1 )
WHERE
size LIKE '%人%';
存入新表
将清洗过的,需要用来分析的数据存入新表
CREATE TABLE data (
id INT ( 255 ) auto_increment PRIMARY KEY,
jobtitle VARCHAR ( 255 ),
company VARCHAR ( 255 ),
average_salary FLOAT ( 10 ),
position1 VARCHAR ( 255 ),
position2 VARCHAR ( 255 ),
education VARCHAR ( 255 ),
experience VARCHAR ( 255 ),
age VARCHAR ( 255 ),
language VARCHAR ( 255 ),
industry1 VARCHAR ( 255 ),
size1 VARCHAR ( 255 ) ,
label1 VARCHAR ( 255 ),
label2 VARCHAR ( 255 )
);
ALTER TABLE data auto_increment = 1;
INSERT INTO data ( jobtitle, company, average_salary, position1, position2, education, experience, age, LANGUAGE, industry1, size1, label1, label2 ) SELECT DISTINCT
jobtitle,
company,
average_salary,
position1,
position2,
education,
experience,
age,
language,
industry1,
size1,
label1,
label2
FROM
liepin;
数据清洗部分结束。