Note: this post shows only part of the project's functionality.

1 Development Environment

Development language: Python
Technologies: Spark, Hadoop, Django, Vue, Echarts, and related frameworks
Database: MySQL
Development environment: PyCharm

2 System Design

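As Shanghai's urbanization advances and its population inflow keeps growing, rental market data has become massive, multidimensional, and updated in real time, and traditional ways of looking up rental listings no longer give renters a complete picture of the market. With big data technology maturing, this project uses Python, Spark, and Hadoop for data processing, together with Vue and Echarts for front-end visualization, to build a Spark- and Python-based rental data analysis, visualization, and price prediction system that covers data collection, processing, analysis, and visualization. The goal is to give renters data-driven decision support and to reduce the information asymmetry in today's rental market.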

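The system has both practical and social value. By mining the patterns and characteristics of the Shanghai rental market with big data analysis, it gives renters a multidimensional view of the market and helps them make more rational, better-informed decisions. It lowers the cost of gathering information and the risk of a poor decision, promotes transparency in the rental market through data, supports the digital transformation of the rental industry, and can serve as a data reference for government policy making.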

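This study takes the Shanghai rental market as its subject and builds a Spark- and Python-based rental data analysis, visualization, and price prediction system. It covers four core dimensions of the market: district analysis, transit convenience, listing characteristics, and market insight, and uses Python, Spark, and Hadoop to mine a large volume of rental data. The system stores structured data in MySQL, builds an interactive front end with Vue, and renders the visualizations with Echarts. The methods include statistical analysis, correlation analysis, cluster analysis, and density analysis, combined with a random forest regression model for rent prediction. By jointly analyzing key fields such as location, rent, floor area, layout, orientation, and metro information, the system gives renters data-driven support for finding a suitable listing in a complex market. The results support individual rental decisions and also offer a reference for the digital development of the real estate industry and for government policy making.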
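The cluster analysis named above can be illustrated on a single feature. The following is a minimal sketch of one-dimensional k-means in plain Python that groups monthly rents into price tiers. The sample rents, the choice of k = 3, and the names `kmeans_1d` and `assign_tier` are assumptions made for illustration; the post does not say which clustering algorithm the system actually uses.

```python
def kmeans_1d(values, k=3, iterations=50):
    """Cluster scalar values into k groups and return the sorted centroids."""
    ordered = sorted(values)
    # Deterministic initialisation: pick evenly spaced points from the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    return sorted(centroids)


def assign_tier(rent, centroids):
    """Index of the nearest centroid: 0 is the cheapest tier."""
    return min(range(len(centroids)), key=lambda i: abs(rent - centroids[i]))


# Hypothetical monthly rents in yuan.
rents = [2800, 3000, 3200, 5200, 5500, 5800, 9500, 10000, 11000]
centroids = kmeans_1d(rents, k=3)
tiers = [assign_tier(r, centroids) for r in rents]
```

On real data the tier boundaries would depend on the rent distribution, so the number of clusters is best chosen by inspecting the data rather than fixed in advance.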

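The system uses a modular design with four core analysis modules and one prediction module:
District analysis module: analyzes listing distribution, rent levels, price ranges, layout mix, value for money, and orientation across Shanghai's districts, giving renters a reference for choosing a district.
Transit convenience analysis module: based on the metro network, analyzes listing density along metro lines, the effect of distance to the metro on rent, and rent levels across lines, helping renters balance commuting convenience against rent.
Listing characteristics analysis module: analyzes how layout, floor area, orientation, rental type, and keywords affect rent, so renters can judge what each characteristic is worth.
Market insight module: applies cluster analysis, value-for-money analysis, and density analysis to interpret price tiers, supply and demand, and outliers in the market.
Rent prediction module: uses a machine learning model that combines district, floor area, layout, metro convenience, and orientation to predict the rent of a listing.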
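As a rough illustration of the value-for-money and outlier analysis listed for the market insight module, the sketch below computes rent per square metre and flags outliers with the common 1.5 × IQR rule. The IQR rule, the sample listings, and the function names are assumptions for illustration; the post does not describe how the system defines value for money or detects outliers.

```python
from statistics import quantiles


def rent_per_sqm(rent, area):
    """Monthly rent divided by floor area (yuan per square metre)."""
    return rent / area


def iqr_outliers(values):
    """Return the values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical listings as (monthly rent in yuan, floor area in square metres).
listings = [
    (4000, 50), (4250, 50), (5400, 60), (5700, 60),
    (7000, 70), (7350, 70), (8800, 80), (15000, 50),
]
unit_prices = [rent_per_sqm(rent, area) for rent, area in listings]
outliers = iqr_outliers(unit_prices)
```

A lower rent per square metre suggests better value only among comparable listings, so in practice the comparison would be made within a district or layout group.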

3 System Showcase

3.1 Dashboard Page

大屏上.png (dashboard, upper half)

大屏下.png (dashboard, lower half)

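3.2 Analysis Pages

地铁交通.png (metro transit analysis)

地铁交通2.png (metro transit analysis, part 2)

地铁交通3.png (metro transit analysis, part 3)

房源特征.png (listing characteristics)

房源特征2.png (listing characteristics, part 2)

区域房源.png (listings by district)

区域房源2.png (listings by district, part 2)

市场洞察.png (market insight)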

3.3 Basic Pages

数据管理.png (data management)

租房信息.png (rental listings)

4 Selected Function Code


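from pyspark.sql import SparkSession

# Shared Spark session with adaptive query execution enabled.
spark = (
    SparkSession.builder
    .appName("ShanghaiRentalAnalysis")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)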
def analyze_district_average_rent(rental_data_path):
    """Per-district rent statistics, sorted by average rent in descending order."""
    # Use the F alias so Spark's min/max/round do not shadow Python's built-ins.
    from pyspark.sql import functions as F
    df = spark.read.option("header", "true").option("inferSchema", "true").csv(rental_data_path)
    df = df.filter(F.col("rent").isNotNull() & F.col("location").isNotNull() & (F.col("rent") > 0))
    # Extract the district name from the free-text location column;
    # regexp_extract returns "" when nothing matches, so those rows are dropped.
    district_pattern = r'(浦东新区|黄浦区|静安区|徐汇区|长宁区|普陀区|虹口区|杨浦区|闵行区|宝山区|嘉定区|金山区|松江区|青浦区|奉贤区|崇明区)'
    df = df.withColumn("district", F.regexp_extract(F.col("location"), district_pattern, 1))
    df = df.filter(F.col("district") != "")
    total_houses = df.count()
    district_stats = df.groupBy("district").agg(
        F.round(F.avg("rent"), 2).alias("avg_rent"),
        F.count("*").alias("house_count"),
        F.min("rent").alias("min_rent"),
        F.max("rent").alias("max_rent"),
        F.round(F.stddev("rent"), 2).alias("rent_stddev"),
        F.expr("percentile_approx(rent, 0.5)").alias("median_rent"),
    )
    # Share of all matched listings that fall in each district, in percent.
    district_stats = district_stats.withColumn(
        "proportion", F.round(F.col("house_count") / total_houses * 100, 2)
    )
    # Labels: high-rent (> 8000), mid-rent (> 5000), and low-rent districts.
    district_stats = district_stats.withColumn("rent_level",
        F.when(F.col("avg_rent") > 8000, "高租金区域")
        .when(F.col("avg_rent") > 5000, "中等租金区域")
        .otherwise("低租金区域")
    )
    # Sort last: an ordering applied before a join is not guaranteed to survive it.
    result_list = district_stats.orderBy(F.desc("avg_rent")).collect()
    return [row.asDict() for row in result_list]
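To make the two per-row rules in the function above easier to test without a cluster, here is a plain-Python sketch of the same logic: extracting the district with the same regular expression and labelling a district by its average rent with the same 8000 and 5000 thresholds. The function names `extract_district` and `rent_level` and the sample addresses are assumptions for illustration.

```python
import re

# Same alternation as the Spark version: the 16 districts of Shanghai.
DISTRICT_PATTERN = re.compile(
    r'(浦东新区|黄浦区|静安区|徐汇区|长宁区|普陀区|虹口区|杨浦区'
    r'|闵行区|宝山区|嘉定区|金山区|松江区|青浦区|奉贤区|崇明区)'
)


def extract_district(location):
    """Return the first district found in the text, or "" if none matches."""
    match = DISTRICT_PATTERN.search(location)
    return match.group(1) if match else ""


def rent_level(avg_rent):
    """Label a district by average monthly rent (thresholds from the Spark code)."""
    if avg_rent > 8000:
        return "高租金区域"    # high-rent district
    if avg_rent > 5000:
        return "中等租金区域"  # mid-rent district
    return "低租金区域"        # low-rent district


# Hypothetical location strings.
print(extract_district("上海市浦东新区张江镇"))
print(rent_level(9000))
```

Returning an empty string for unmatched text mirrors Spark's `regexp_extract`, which is why the Spark version filters on `district != ""`.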
def analyze_metro_distance_rent_relationship(rental_data_path):
    """Rent statistics by distance to the metro, plus correlation and metro premium."""
    # Use the F alias so Spark's min/max/round do not shadow Python's built-ins.
    from pyspark.sql import functions as F
    df = spark.read.option("header", "true").option("inferSchema", "true").csv(rental_data_path)
    df = df.filter(F.col("rent").isNotNull() & F.col("metro_info").isNotNull() & (F.col("rent") > 0))
    df = df.withColumn("has_metro", F.when(F.col("metro_info").contains("地铁"), 1).otherwise(0))
    # Parse the distance in metres; listings without a metro get the sentinel 9999.
    distance_pattern = r'距离.*?(\d+)米'
    df = df.withColumn("metro_distance",
        F.when(F.col("has_metro") == 1, F.regexp_extract(F.col("metro_info"), distance_pattern, 1).cast("int"))
        .otherwise(9999)
    )
    # Keep listings within 2 km of a station and listings with no metro at all.
    # Rows that mention a metro but have no parsable distance are null and drop out here.
    df = df.filter((F.col("metro_distance") <= 2000) | (F.col("has_metro") == 0))
    df = df.withColumn("distance_range",
        F.when(F.col("metro_distance") <= 200, "200米以内")      # within 200 m
        .when(F.col("metro_distance") <= 500, "200-500米")
        .when(F.col("metro_distance") <= 800, "500-800米")
        .when(F.col("metro_distance") <= 1200, "800-1200米")
        .when(F.col("metro_distance") <= 2000, "1200-2000米")
        .otherwise("无地铁覆盖")                                  # no metro coverage
    )
    distance_rent_stats = df.groupBy("distance_range").agg(
        F.round(F.avg("rent"), 2).alias("avg_rent"),
        F.count("*").alias("house_count"),
        F.min("rent").alias("min_rent"),
        F.max("rent").alias("max_rent"),
        F.stddev("rent").alias("rent_stddev"),
        F.min("metro_distance").alias("sort_key"),
    )
    # Sort buckets from nearest to farthest; sorting the labels as strings would not do that.
    result_stats = distance_rent_stats.orderBy("sort_key").drop("sort_key").collect()
    correlation_value = df.filter(F.col("has_metro") == 1).select(
        F.corr("metro_distance", "rent").alias("distance_rent_correlation")
    ).first()["distance_rent_correlation"]
    # Average rent with and without a metro; either group may be missing from the data.
    avg_by_metro = {row["has_metro"]: row["avg_rent"]
                    for row in df.groupBy("has_metro").agg(F.avg("rent").alias("avg_rent")).collect()}
    metro_rent = avg_by_metro.get(1)
    no_metro_rent = avg_by_metro.get(0)
    premium_percentage = None
    if metro_rent is not None and no_metro_rent:
        premium_percentage = round((metro_rent - no_metro_rent) / no_metro_rent * 100, 2)
    return {
        "distance_rent_analysis": [row.asDict() for row in result_stats],
        "correlation_coefficient": round(correlation_value, 4) if correlation_value is not None else None,
        "metro_premium_percentage": premium_percentage,
        "metro_avg_rent": round(metro_rent, 2) if metro_rent is not None else None,
        "no_metro_avg_rent": round(no_metro_rent, 2) if no_metro_rent is not None else None
    }
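The distance parsing and the correlation coefficient used above can be checked by hand on a few rows. The sketch below applies the same regular expression in plain Python and computes the Pearson correlation directly from its definition. The sample `metro_info` string, the sample numbers, and the function names are assumptions for illustration.

```python
import re
from math import sqrt

# Same pattern as the Spark version: the digits directly before "米" after "距离".
DISTANCE_PATTERN = re.compile(r'距离.*?(\d+)米')


def parse_metro_distance(metro_info):
    """Return the distance to the metro in metres, or None if it cannot be parsed."""
    match = DISTANCE_PATTERN.search(metro_info)
    return int(match.group(1)) if match else None


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# The line number "2" is skipped because it is not followed by "米".
distance = parse_metro_distance("距离地铁2号线张江高科站350米")

# Hypothetical data in which rent falls linearly with distance.
distances = [200, 400, 600, 800]
rents = [8000, 7000, 6000, 5000]
r = pearson(distances, rents)
```

A coefficient near -1 means rent falls as distance grows, and a value near 0 means distance alone explains little of the variation in rent.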
def predict_rental_price(district, area, house_type, near_metro, metro_distance, orientation, model_path):
    """Train a random forest rent model and predict the rent of one listing.

    NOTE: model_path is accepted but never used; the model is retrained from
    training_data.csv on every call.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.regression import RandomForestRegressor
    # Use the F alias so Spark's min/max/abs do not shadow Python's built-ins.
    from pyspark.sql import functions as F
    training_df = spark.read.option("header", "true").option("inferSchema", "true").csv("training_data.csv")
    training_df = training_df.filter(F.col("rent").isNotNull() & (F.col("rent") > 0) & (F.col("area") > 0))
    # Listings without a nearby metro get a placeholder distance of 2000 m.
    training_df = training_df.withColumn("metro_distance_filled",
        F.when(F.col("near_metro") == 1, F.col("metro_distance")).otherwise(2000)
    )
    # VectorAssembler raises on nulls by default, so drop incomplete rows first.
    training_df = training_df.dropna(
        subset=["district", "house_type", "orientation", "near_metro", "metro_distance_filled"]
    )
    # handleInvalid="keep" lets the model score categories it did not see during training.
    district_indexer = StringIndexer(inputCol="district", outputCol="district_indexed", handleInvalid="keep")
    house_type_indexer = StringIndexer(inputCol="house_type", outputCol="house_type_indexed", handleInvalid="keep")
    orientation_indexer = StringIndexer(inputCol="orientation", outputCol="orientation_indexed", handleInvalid="keep")
    assembler = VectorAssembler(
        inputCols=["district_indexed", "area", "house_type_indexed", "near_metro", "metro_distance_filled", "orientation_indexed"],
        outputCol="features"
    )
    rf = RandomForestRegressor(featuresCol="features", labelCol="rent", numTrees=100, maxDepth=10, seed=42)
    pipeline = Pipeline(stages=[district_indexer, house_type_indexer, orientation_indexer, assembler, rf])
    train_data, test_data = training_df.randomSplit([0.8, 0.2], seed=42)
    model = pipeline.fit(train_data)
    predictions = model.transform(test_data)
    # Evaluate on the held-out 20% split.
    metrics = {
        name: RegressionEvaluator(labelCol="rent", predictionCol="prediction", metricName=name).evaluate(predictions)
        for name in ("rmse", "mae", "r2")
    }
    input_data = spark.createDataFrame(
        [(district, area, house_type, near_metro, metro_distance if near_metro else 2000, orientation)],
        ["district", "area", "house_type", "near_metro", "metro_distance_filled", "orientation"]
    )
    predicted_rent = model.transform(input_data).first()["prediction"]
    # Convert NumPy floats to Python floats so the result can be serialized to JSON.
    feature_names = ["district", "area", "house_type", "near_metro", "metro_distance", "orientation"]
    feature_importance = [float(x) for x in model.stages[-1].featureImportances.toArray()]
    importance_dict = dict(zip(feature_names, feature_importance))
    # Compare with listings of the same district and layout within 20% of the floor area.
    similar_properties = training_df.filter(
        (F.abs(F.col("area") - area) <= area * 0.2) &
        (F.col("district") == district) &
        (F.col("house_type") == house_type)
    )
    market_stats = similar_properties.agg(
        F.avg("rent").alias("market_avg"),
        F.min("rent").alias("market_min"),
        F.max("rent").alias("market_max")
    ).first()
    market_avg = market_stats["market_avg"]
    market_comparison = None  # stays None when no comparable listings exist
    if market_avg:
        market_comparison = {
            "market_average": round(market_avg, 2),
            "market_range": f"{round(market_stats['market_min'], 2)}-{round(market_stats['market_max'], 2)}",
            "prediction_vs_market": round((predicted_rent - market_avg) / market_avg * 100, 2)
        }
    return {
        "predicted_rent": round(predicted_rent, 2),
        "rent_per_sqm": round(predicted_rent / area, 2),
        "model_accuracy": {"rmse": round(metrics["rmse"], 2), "mae": round(metrics["mae"], 2), "r2": round(metrics["r2"], 4)},
        "feature_importance": importance_dict,
        "market_comparison": market_comparison
    }
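The three numbers reported under "model_accuracy" are standard regression metrics. As a reading aid, the sketch below computes RMSE, MAE, and R² from their usual definitions in plain Python. The function name `regression_metrics` and the sample rents are assumptions for illustration, and the code is not part of the system itself.

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Return RMSE, MAE, and R^2 for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    return {
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Hypothetical actual and predicted monthly rents in yuan.
actual = [4000, 6000, 8000]
predicted = [4300, 5700, 8000]
metrics = regression_metrics(actual, predicted)
```

RMSE and MAE are in the same unit as the rent, so they read directly as a typical error in yuan, while R² is the share of rent variance the model explains on the test split.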

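Available on request: project source code, custom development, documentation and reports, slide decks, and code Q&A.
I look forward to exchanging ideas with everyone.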

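Hands-on big data project: a Spark- and Python-based rental data analysis, visualization, and price prediction system, a study of data mining techniques applied to rental data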