数据集:https://www.kaggle.com/ruiqurm/lianjia
本数据集包含2010年至2018年1月份链家网站上挂牌出售的二手房信息
第一部分:数据准备&清洗工作
1.1 数据准备
所使用到的数据包:
library (tidyverse)
library (psych)
library (xts)
library (tseries)
library (forecast)
library (dplyr)
library (ggplot2)
library (VIM)
library (ggmap)
library (xgboost)
library (dygraphs)
library (knitr)
library (Matrix)
library(RCurl)
library(geosphere)
library(viridis)
library(lubridate)
说明一下,这里的数据包可能并没有全部被使用,有一些是在尝试过程中使用的,但是并没有尝试成功,然后又忘了及时把没用到的数据包删掉,过了几天之后就忘记是哪个包有用哪个包没用了……
首先对数据集进行观察:
beijing_house = read.csv ('D:/R语言/House Price Model/new.csv')
#firstly, lets see the data set
glimpse (beijing_house)
可以看出该数据集当中数据量十分巨大,将近32万条数据,同时一共有26个不同的变量
url :房子对应的链家链接
id:在链家网上的id
Lng:房子所处经度
Lat:房子所处纬度
Cid:社区id
tradeTime:交易时间
DOM:Days on Market,挂牌出售时长
followers:关注人数
totalPrice:房产总价
price:每平米价格
square:面积
livingRoom:客厅数量
drawingRoom:书房数量
kitchen:厨房数量
bathRoom:卫生间数量
floor:楼层
buildingType:建筑形式,1 = 塔式,2 = 平房, 3 = 蝶式,4 = 板式,具体区别可以百度
constructionTime:建筑时间
renovationCondition:装修状态,1 = 其他,2 = 毛坯, 3 = 简装, 4 = 精装
buildingStructure:建筑结构,1 = 未知, 2 = 混合,3 = 木砖结构, 4 = 砖混凝土结构,5 = 钢构,6 = 钢筋混凝土结构
ladderRatio:梯户比
elevator:是否有电梯
fiveyearsproperty:是否已过五年产权期限
subway:是否有地铁
communityAverage:社区均价
district:所属区,
1 = 东城区,
2 = 丰台区,
3 = 大兴亦庄,
4 = 大兴区,
5 =房山区,
6 = 昌平区,
7 = 朝阳区,
8 = 海淀区,
9 = 石景山区,
10 = 顺义区,
11 = 通州区,
12 = 顺义区,
13 = 门头沟
第二部分 数据清洗和填充
2.1 数据清洗
从数据当中可以看出,url,id,Cid这几个数据对房价预测并没有什么卵用,所以直接去掉
house_price = select (beijing_house, -url, -id, -Cid)
同时floor变量当中存在着汉字与数据夹杂的情况,对于后面的分析步骤来说比较麻烦,因此需要分离和用数字进行分类
floor = substring (house_price$floor, 0, 1)
house_price$floorpos = floor
house_price$floor=sub('^.','',house_price$floor)
house_price = house_price %>%
mutate(floorpos = case_when(floorpos == "底" ~ 1,
floorpos == "低" ~ 2,
floorpos == "中" ~ 3,
floorpos == "高" ~ 4,
floorpos == "顶" ~ 5,
floorpos == "未" ~ 0))
2.2 数据填充
首先来康康都有哪些数据缺失了的
missing_data = data.frame(lapply(house_price,function(x) sum(is.na(x))))
missing_data
可以看出DOM这一变量的数据大量缺失,然后还有其他几个变量也有一定量的缺失值,现在我们来把缺失数据的数量变得更直观一点
missing_price = select (house_price, DOM, buildingType, elevator, fiveYearsProperty, subway, communityAverage)
aggr (missing_price, prop = TRUE, numbers = TRUE)
可以看出DOM缺失的频率占到了一半以上,这个太大了;其他几个变量缺失值的占比很小很小,因此我们可以直接忽略掉这些个数据。首先对DOM这个变量进行处理,看一下这个变量的分布情况
qqnorm(house_price$DOM)
qqline(house_price$DOM)
从QQ散点图可以看出这一部分变量是具有一定的斜率的,因此可以直接采用中位数进行填充
house_price$DOM<-ifelse(is.na(house_price$DOM),median(house_price$DOM,na.rm=TRUE),house_price$DOM)
剩下的缺失值直接删掉就好
house_price1 <- na.omit(house_price)
这时候再来看一下清洗之后的数据状态
dim (house_price1)
少了两千多个数据,和30万条数据比起来根本不算什么,所以还ok,数据清洗和准备工作告一段落。
第三部分 数据可视化
3.1 价格热力图
首先整一幅北京市地图,下载地址:https://www.kaggle.com/eraw0x/beijing-map/download/DtXJ8lv7gp5r6Mm3C6Zs%2Fversions%2FISfNUAnnDEEQIZbKxWlj%2Ffiles%2Fbeijing_map.RData?datasetVersionNumber=1
大概长这样:
有经纬度就很舒服,在R里加载地图
load(file = "D:/R语言/NTU/House Price Model/beijing_map.RData",verbose = TRUE)
把价格做成热力图并且呈现在地图上
beijing + geom_point(data =house_price1, aes(house_price1$Lng,house_price1$Lat,color=price),size=1.3,alpha=.5)+ scale_color_viridis()
可以大致看出北京市城市呈现出辐射状的分布,中心区域价格最高,然后呈环状向外价格逐层递减,西北区域略高于其他区域
3.2可视化数据处理
对要画图的数据在进行一些处理,将数字转换成文字
house_price2 = house_price1 %>%
mutate(district = case_when(district == 1 ~ "DongCheng",
district == 2 ~ "FengTai",
district == 3 ~ "DaXing",
district == 4 ~ "YiZhuang",
district == 5 ~ "FangShan",
district == 6 ~ "ChangPing",
district == 7 ~ "ChaoYang",
district == 8 ~ "HaiDian",
district == 9 ~ "ShiJingShan",
district == 10 ~ "XiCheng",
district == 11 ~ "TongZhou",
district == 12 ~ "ShunYi",
district == 13 ~ "MenTouGou"))
house_price2 = house_price2 %>%
mutate(buildingType = case_when(buildingType == 1 ~ "Tower",
buildingType == 2 ~ "Bungalow",
buildingType == 3 ~ "Plate&Tower",
buildingType == 4 ~ "Plate"))
house_price2 = house_price2 %>%
mutate(buildingStructure = case_when(buildingStructure == 1 ~ "Unavailable",
buildingStructure == 2 ~ "Mixed",
buildingStructure == 3 ~ "Brick/Wood",
buildingStructure == 4 ~ "Brick/Concrete",
buildingStructure == 5 ~ "Steel",
buildingStructure == 6 ~ "Steel/Concrete"))
house_price2 = house_price2 %>%
mutate(renovationCondition = case_when(renovationCondition == 1 ~ "Other",
renovationCondition == 2 ~ "Rough",
renovationCondition == 3 ~ "Simplicity",
renovationCondition == 4 ~ "Hardcover"))
house_price2 = house_price2 %>%
mutate(elevator = case_when(elevator == 1 ~ "Has_Elevator",
elevator != 1 ~ "No_elevator"))
house_price2 = house_price2 %>%
mutate(fiveYearsProperty = case_when(fiveYearsProperty == 1 ~ "Ownership < 5Yrs",
fiveYearsProperty != 1 ~ "Ownership > 5Yrs"))
house_price2 = house_price2 %>%
mutate(subway = case_when(subway == 1 ~ "Has_Subway",
subway != 1 ~ "No_Subway"))
然后开始针对其中一些变量进行可视化处理,我选择了行政区划、建筑形式、建筑结构、装修状态、电梯、五年产权限制和地铁几个因素进行可视化
3.3 行政区划价格箱线图
ggplot(house_price2, aes(reorder(x= district, -price), y=price, color = district))+geom_boxplot() + labs(title = "Prices of the District", y =" Price Per Sqft")+coord_flip()
可以看出西城区价格最高,门头沟价格最低
3.4 建筑形式价格箱线图
ggplot(house_price2 , aes(x= buildingType, y=price, color = buildingType))+geom_boxplot() + labs(title = "Prices In Function Of The Building Type", y =" Price Per Sqft")
平房建筑价格最高,北京寸土寸金,平房容积率低,大部分应该是豪华别墅和四合院之类的房子,因此最贵
3.5 建筑结构价格箱线图
ggplot(house_price2, aes(x= buildingStructure, y=price, color = buildingStructure))+geom_boxplot() + labs(title = "Prices In Function Of The Building Structure", y =" Price Per Sqft")
木砖结构价格最高,推测木砖结构房屋大多是处于中心城区的老房子、四合院一类的房子,因此价格偏高
3.6 装修状态价格箱线图
ggplot(house_price2, aes(x= renovationCondition, y=price, color = renovationCondition))+geom_boxplot() + labs(title = "Prices In Function Of The Renovation Condition", y =" Price Per Sqft")
基本符合生活常识,精装和简装价格要偏高一点
3.7 电梯价格箱线图
ggplot(house_price2, aes(x= elevator, y=price, color = elevator))+geom_boxplot() + labs(title = "Prices In Function Of The elevator", y =" Price Per Sqft")
价格差不多,北京作为一座历史城市老房子比较多,而且集中于中心城区,所以价格相差不多可以理解
3.8 地铁价格箱线图
ggplot(house_price2, aes(x= subway, y=price, color = subway))+geom_boxplot() + labs(title = "Prices In Function Of The subway", y =" Price Per Sqft")
有地铁那还是贵一点的
3.9 房价时间趋势图
house_price1$tradeTime = as.Date(house_price1$tradeTime)
house_price1$constructionTime = as.Date(house_price1$tradeTime)
house_price1$tradeTimeM = floor_date(house_price1$tradeTime, "month")
house_price1$tradeTimeY = floor_date(house_price1$tradeTime, "year")
housePrice_time1 = house_price1 %>%
filter(tradeTimeM >= ymd("2010-01-01") & tradeTimeM < ymd("2018-12-31")) %>%
group_by(tradeTimeM) %>%
summarize(mean = mean(price))
HP_xts <- xts(housePrice_time1[,-1], order.by = housePrice_time1$tradeTimeM)
dygraph(HP_xts, main = "Sales Count & Price Per Square Meter",
ylab = "Average Monthly Price") %>%
dySeries("mean", label = "Mean Price/SQFT") %>%
dyOptions(stackedGraph = TRUE) %>%
dyRangeSelector(height = 20)
从时间变化价格图能够看出来,2010年到2018年房价基本翻了7倍,房地产仍然是非常保值的投资
第四部分 预测模型
最近在Kaggle的数据竞赛当中,XGBOOST算法在很多比赛当中都取得了非常亮眼的成绩,同时XGBOOST模型本身对于二元数据和线性数据和时间序列都具有非常好的包容性,因此选择XGBOOST算法进行建模
4.1 训练集与测试集选择
training = house_price1 %>%
filter ( year(tradeTime) != 2018)
training = rbind (training, validation[0:100, ])
validation = house_price1 %>% filter (year (tradeTime)==2018)
validation = validation[101:219, ]
训练集选取2018年1月之前的所有数据以及2018年1月份一半的房产数据,测试集则选择2018年1月份下半月的数据,一共119条
4.2 训练集和测试集数据处理
将其中汉字部分装换为数字
training = training %>%
mutate(floorpos = case_when(floorpos == "底" ~ 1,
floorpos == "低" ~ 2,
floorpos == "中" ~ 3,
floorpos == "高" ~ 4,
floorpos == "顶" ~ 5,
floorpos == "未" ~ 0))
validation = validation %>%
mutate(floorpos = case_when(floorpos == "底" ~ 1,
floorpos == "低" ~ 2,
floorpos == "中" ~ 3,
floorpos == "高" ~ 4,
floorpos == "顶" ~ 5,
floorpos == "未" ~ 0))
4.3 距离变量加入
在训练集和测试集中加入新的‘distance’变量,该变量依照北京市环状辐射状的价格变化趋势,将天安门广场所在的坐标作为北京市市中心,计算房子和北京市中心之间的距离
location = data.frame (Lng = training$Lng, Lat = training$Lat)
location = data.matrix (location)
center = data.frame (Lng = 116.23, Lat = 39.54)
center = data.matrix (center)
distance = round (distm(location, center, fun=distVincentyEllipsoid), 0)#计算两点之间的距离
distance = data.frame (distance)
training = cbind(training, distance)
training1 = select (training, -totalPrice, -price)
location = data.frame (Lng = validation$Lng, Lat = validation$Lat)
location = data.matrix (location)
distance = round (distm(location, center, fun=distVincentyEllipsoid), 0)
distance = data.frame (distance)
validation = cbind(validation, distance)
validation1 = select (validation, -totalPrice )
4.4 XGBOOST模型训练
set.seed(122)
Price_xg = xgboost(data = data.matrix(training1), label = training$price, max.depth = 6, eta = 0.55, nrounds = 55, objective = "reg:linear")
pred_xgb = predict(Price_xg, data.matrix(validation1[, c('Lng',
'Lat',
'tradeTime',
'DOM',
'followers',
'square',
'livingRoom',
'drawingRoom',
"kitchen",
"bathRoom",
"floor",
"buildingType",
"constructionTime",
"renovationCondition",
"buildingStructure",
"ladderRatio",
"elevator",
"fiveYearsProperty",
"subway",
"district",
"communityAverage",
"floorpos",
"distance")]))
4.5 模型分析
计算RMSE
RMSE = sqrt (((sum((validation1$price - pred_xgb)^2))/219))
RMSE
对于均价将近6万每平方米的测试集来说,3500+的RMSE可以说是非常优秀的结果了
各种变量的重要性:
可以看出,社区均价所代表的房地产所处的地产板块对于价格起到了非常重要的作用,同时交易时间也是非常重要的一个因素,说明房价对时间十分敏感。
后续的研究方向:经济形势和相关政策对房价的影响
第五部分 CODE
library (tidyverse)
library (psych)
library (xts)
library (tseries)
library (forecast)
library (dplyr)
library (ggplot2)
library (VIM)
library (ggmap)
library (xgboost)
library (dygraphs)
library (knitr)
library (Matrix)
library(RCurl)
library(geosphere)
library(viridis)
library(lubridate)
beijing_house = read.csv ('D:/R语言/NTU/House Price Model/new.csv')
#firstly, lets see the data set
glimpse (beijing_house)
#By carefully examing the dataset, we found that there are some useless data: url, id, cid, so we should drop them
house_price = select (beijing_house, -url, -id, -Cid)
#In floor we can see there are Chinese character along with the floor numbers, therefore I split them into different columns as different factors
floor = substring (house_price$floor, 0, 1)
house_price$floorpos = floor
house_price$floor=sub('^.','',house_price$floor)
#check the missing data of the dataset
missing_data = data.frame(lapply(house_price,function(x) sum(is.na(x))))
missing_data
#Visualize the missing data
missing_price = select (house_price, DOM, buildingType, elevator, fiveYearsProperty, subway, communityAverage)
aggr (missing_price, prop = TRUE, numbers = TRUE)
qqnorm(house_price$DOM)
qqline(house_price$DOM)
house_price = as.data.frame(house_price)
#Using median
house_price$DOM<- ifelse(is.na(house_price$DOM),median(house_price$DOM,na.rm=TRUE),house_price$DOM)
aggr (house_price, prop = TRUE, numbers = TRUE)
#omit other mising data
house_price1 <- na.omit(house_price)
dim (house_price1)
dim (house_price)
glimpse (house_price1)
load(file = "D:/R语言/NTU/House Price Model/beijing_map.RData",verbose = TRUE)
beijing
house_price2 = house_price1 %>%
mutate(district = case_when(district == 1 ~ "DongCheng",
district == 2 ~ "FengTai",
district == 3 ~ "DaXing",
district == 4 ~ "FaXing",
district == 5 ~ "FangShan",
district == 6 ~ "ChangPing",
district == 7 ~ "ChaoYang",
district == 8 ~ "HaiDian",
district == 9 ~ "ShiJingShan",
district == 10 ~ "XiCheng",
district == 11 ~ "TongZhou",
district == 12 ~ "ShunYi",
district == 13 ~ "MenTouGou"))
house_price2 = house_price2 %>%
mutate(buildingType = case_when(buildingType == 1 ~ "Tower",
buildingType == 2 ~ "Bungalow",
buildingType == 3 ~ "Plate/Tower",
buildingType == 4 ~ "Plate"))
house_price2 = house_price2 %>%
mutate(buildingStructure = case_when(buildingStructure == 1 ~ "Unavailable",
buildingStructure == 2 ~ "Mixed",
buildingStructure == 3 ~ "Brick/Wood",
buildingStructure == 4 ~ "Brick/Concrete",
buildingStructure == 5 ~ "Steel",
buildingStructure == 6 ~ "Steel/Concrete"))
house_price2 = house_price2 %>%
mutate(renovationCondition = case_when(renovationCondition == 1 ~ "Other",
renovationCondition == 2 ~ "Rough",
renovationCondition == 3 ~ "Simplicity",
renovationCondition == 4 ~ "Hardcover"))
house_price2 = house_price2 %>%
mutate(elevator = case_when(elevator == 1 ~ "Has_Elevator",
elevator != 1 ~ "No_elevator"))
house_price2 = house_price2 %>%
mutate(fiveYearsProperty = case_when(fiveYearsProperty == 1 ~ "Ownership < 5Yrs",
fiveYearsProperty != 1 ~ "Ownership > 5Yrs"))
house_price2 = house_price2 %>%
mutate(subway = case_when(subway == 1 ~ "Has_Subway",
subway != 1 ~ "No_Subway"))
#Price & District
ggplot(house_price2, aes(reorder(x= district, -price), y=price, color = district))+geom_boxplot() + labs(title = "Prices of the District", y =" Price Per Sqft")+coord_flip()
#Pricing Mapping
beijing + geom_point(data =house_price1, aes(house_price1$Lng,house_price1$Lat,color=price),size=1.3,alpha=.5)+ scale_color_viridis()
#Price Comparison on Different BuildingType
ggplot(house_price2 , aes(x= buildingType, y=price, color = buildingType))+geom_boxplot() + labs(title = "Prices In Function Of The Building Type", y =" Price Per Sqft")
#Price Comparison on Different BuildingStructure
ggplot(house_price2, aes(x= buildingStructure, y=price, color = buildingStructure))+geom_boxplot() + labs(title = "Prices In Function Of The Building Structure", y =" Price Per Sqft")
#Price Comparison on Different RenovationCondition
ggplot(house_price2, aes(x= renovationCondition, y=price, color = renovationCondition))+geom_boxplot() + labs(title = "Prices In Function Of The Renovation Condition", y =" Price Per Sqft")
#Price Comparison on Elevator
ggplot(house_price2, aes(x= elevator, y=price, color = elevator))+geom_boxplot() + labs(title = "Prices In Function Of The elevator", y =" Price Per Sqft")
#Price Comparison on Subway
ggplot(house_price2, aes(x= subway, y=price, color = subway))+geom_boxplot() + labs(title = "Prices In Function Of The subway", y =" Price Per Sqft")
#Five Years Ownership
ggplot(house_price2, aes(x= fiveYearsProperty, y=price, color = fiveYearsProperty))+geom_boxplot() + labs(title = "Prices In Function Of The five Years Property Variable", y =" Price Per Sqft")
house_price1$tradeTime = as.Date(house_price1$tradeTime)
house_price1$constructionTime = as.Date(house_price1$tradeTime)
validation = house_price1 %>% filter (year (tradeTime)==2018)
training = house_price1 %>%
filter ( year(tradeTime) != 2018)
training = rbind (training, validation[0:100, ])
validation = validation[101:219, ]
training = training %>%
mutate(floorpos = case_when(floorpos == "底" ~ 1,
floorpos == "低" ~ 2,
floorpos == "中" ~ 3,
floorpos == "高" ~ 4,
floorpos == "顶" ~ 5,
floorpos == "未" ~ 0))
validation = validation %>%
mutate(floorpos = case_when(floorpos == "底" ~ 1,
floorpos == "低" ~ 2,
floorpos == "中" ~ 3,
floorpos == "高" ~ 4,
floorpos == "顶" ~ 5,
floorpos == "未" ~ 0))
#calculate the distance between center of Beijing (the Forbidden City) and location of teh house
location = data.frame (Lng = training$Lng, Lat = training$Lat)
location = data.matrix (location)
center = data.frame (Lng = 116.23, Lat = 39.54)
center = data.matrix (center)
distance = round (distm(location, center, fun=distVincentyEllipsoid), 0)
distance = data.frame (distance)
training = cbind(training, distance)
training1 = select (training, -totalPrice, -price)
location = data.frame (Lng = validation$Lng, Lat = validation$Lat)
location = data.matrix (location)
center = data.frame (Lng = 116.23, Lat = 39.54)
center = data.matrix (center)
distance = round (distm(location, center, fun=distVincentyEllipsoid), 0)
distance = data.frame (distance)
validation = cbind(validation, distance)
validation1 = select (validation, -totalPrice )
set.seed(122)
#8 0.55 56
Price_xg = xgboost(data = data.matrix(training1), label = training$price, max.depth = 6, eta = 0.55, nrounds = 55, objective = "reg:linear")
pred_xgb = predict(Price_xg, data.matrix(validation1[, c('Lng',
'Lat',
'tradeTime',
'DOM',
'followers',
'square',
'livingRoom',
'drawingRoom',
"kitchen",
"bathRoom",
"floor",
"buildingType",
"constructionTime",
"renovationCondition",
"buildingStructure",
"ladderRatio",
"elevator",
"fiveYearsProperty",
"subway",
"district",
"communityAverage",
"floorpos",
"distance")]))
RMSE = sqrt (((sum((validation1$price - pred_xgb)^2))/219))
RMSE
#获取变量的重要性
model = xgb.dump(Price_xg,with_stats = T)
model
names = dimnames(data.matrix(training1[,c(1:23)]))
importance_matrix = xgb.importance(names[[2]],model=Price_xg) # 计算变量重要性
names
xgb.plot.importance(importance_matrix[,])
house_price1$tradeTimeM = floor_date(house_price1$tradeTime, "month")
house_price1$tradeTimeY = floor_date(house_price1$tradeTime, "year")
housePrice_time1 = house_price1 %>%
filter(tradeTimeM >= ymd("2010-01-01") & tradeTimeM < ymd("2018-12-31")) %>%
group_by(tradeTimeM) %>%
summarize(mean = mean(price))
HP_xts <- xts(housePrice_time1[,-1], order.by = housePrice_time1$tradeTimeM)
dygraph(HP_xts, main = "Sales Count & Price Per Square Meter",
ylab = "Average Monthly Price") %>%
dySeries("mean", label = "Mean Price/SQFT") %>%
dyOptions(stackedGraph = TRUE) %>%
dyRangeSelector(height = 20)