Beijing Medical Insurance Drug Data Analysis System Based on Big Data (source code, thesis, and slides included)

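Beijing Medical Insurance Drug Data Analysis System - System Overview

This is a big-data analysis system for Beijing's medical insurance drug data. The backend is written in Python with Django, and data processing runs on a Hadoop+Spark platform. HDFS provides distributed storage, Spark SQL handles queries, and Pandas and NumPy handle numerical computation, covering collection, storage, processing, and analysis of the city's medical insurance drug data at scale. The frontend uses Vue+ElementUI+Echarts and provides these modules: drug core attribute analysis, drug manufacturer analysis, a system-wide data dashboard, drug data mining analysis, medical insurance reimbursement strategy analysis, a user-facing drug data dashboard, and analysis of traditional Chinese medicines and granules. Results are presented as charts along several dimensions, including drug price distribution, reimbursement ratio trends, manufacturer market share, and usage patterns of traditional Chinese medicine, so that users can examine how insured drugs are used in Beijing and what effect policies have. The architecture separates frontend and backend: the backend exposes data through RESTful API endpoints, and the frontend fetches analysis results with asynchronous Ajax calls, which keeps the system extensible and maintainable.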
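
The backend-to-frontend contract described above can be sketched without Django. This is a minimal illustration, assuming the envelope shape (`code`, `message`, plus named payload fields) that the views in the code section return. The helper name `build_envelope` and the sample numbers are hypothetical and are not part of the project.

```python
import json

def build_envelope(message, code=200, **payload):
    """Assemble the JSON body a view would hand to the frontend (hypothetical helper)."""
    body = {"code": code, "message": message}
    body.update(payload)  # e.g. price_ranges=[...], categories=[...]
    # ensure_ascii=False keeps Chinese labels readable in the serialized text
    return json.dumps(body, ensure_ascii=False)

# Made-up sample payload, shaped like the price-range result of the analysis views
sample = build_envelope(
    "ok",
    price_ranges=[{"range": "under 50 yuan", "count": 3, "avg_price": 21.5}],
)

# What the frontend would see after parsing the response of its Ajax call
decoded = json.loads(sample)
print(decoded["code"], decoded["price_ranges"][0]["count"])
```

Keeping one envelope shape across endpoints lets the frontend check `code` in a single place before passing the payload to a chart.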

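Beijing Medical Insurance Drug Data Analysis System - Background

As China's medical security system matures and insurance reform deepens, medical insurance drug data has become large in volume, varied in type, and complex in its relationships. Beijing is one of the regions where the country's medical resources are most concentrated, which makes its insurance drug usage data valuable for research and reference. Traditional analysis tends to rely on manual statistics and simple database queries; faced with large volumes of drug procurement records, reimbursement data, and patient medication information, it is slow and struggles to reveal deeper patterns. Insurance administrators and policymakers need data analysis to understand drug usage trends, the effect of reimbursement policies, and how drugs from different manufacturers perform in the market, so that they can refine policy and use funds more efficiently. Meanwhile, distributed computing frameworks such as Hadoop and Spark provide the technical means to process this data at scale, and Python's wide adoption in data science makes it practical to build a dedicated analysis system. Against this background, a big-data-based analysis system for Beijing's medical insurance drug data meets a real need for deeper data mining and fits the current trend of applying big data technology in healthcare.

From a technical learning perspective, this project ties big data theory to a concrete medical insurance analysis scenario. Building a complete processing and analysis pipeline deepens understanding of the Hadoop ecosystem, the Spark compute engine, and Python's data analysis libraries. The project covers the full chain of data collection, cleaning, storage, analysis, and visualization, which helps develop systematic thinking about big data processing and the ability to solve complex problems. In practical terms, the system gives insurance administrators and policy researchers a convenient analysis tool for understanding drug usage patterns and policy effects in Beijing, and offers some reference value for improving insurance management and policymaking. In terms of personal growth, the project combines mainstream big data processing with frontend visualization, and the development work builds engineering experience that lays a foundation for related technical work later. This is only a graduation project, limited in scale and complexity, but with careful design and implementation it can still contribute to technical learning, hands-on skills, and solving real problems.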

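Beijing Medical Insurance Drug Data Analysis System - Technology Stack

Big data framework: Hadoop+Spark (Hive is not used in this build; customization is available)
Languages: Python+Java (both versions are available)
Backend framework: Django+Spring Boot(Spring+SpringMVC+Mybatis) (both versions are available)
Frontend: Vue+ElementUI+Echarts+HTML+CSS+JavaScript+jQuery
Detailed technologies: Hadoop, HDFS, Spark, Spark SQL, Pandas, NumPy
Database: MySQL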

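Beijing Medical Insurance Drug Data Analysis System - Screenshots

药品核心属性分析.png (drug core attribute analysis)

药品生产厂家分析.png (drug manufacturer analysis)

药品数据挖掘分析.png (drug data mining analysis)

医保报销策略分析.png (medical insurance reimbursement strategy analysis)

用户中药及颗粒分析.png (user analysis of traditional Chinese medicines and granules)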

Beijing Medical Insurance Drug Data Analysis System - Code
from pyspark.sql import SparkSession
from django.http import JsonResponse
from django.views import View
import pandas as pd
import numpy as np
# Note: importing sum from pyspark.sql.functions shadows Python's built-in sum() in this module.
from pyspark.sql.functions import col, count, sum, avg, desc, when, stddev
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.clustering import KMeans
import json

# One shared SparkSession for all views, with adaptive query execution enabled.
spark = (
    SparkSession.builder
    .appName("BeijingMedicalInsuranceAnalysis")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)

def drug_core_attribute_analysis(request):
    if request.method == 'GET':
        df = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/medical_db").option("dbtable", "drug_info").option("user", "root").option("password", "password").load()
        price_distribution = df.select("drug_price").describe().collect()
        price_ranges = df.withColumn("price_range", 
            when(col("drug_price") < 50, "低价药(50元以下)")
            .when((col("drug_price") >= 50) & (col("drug_price") < 200), "中价药(50-200元)")
            .when((col("drug_price") >= 200) & (col("drug_price") < 500), "较高价药(200-500元)")
            .otherwise("高价药(500元以上)")
        ).groupBy("price_range").agg(count("*").alias("drug_count"), avg("drug_price").alias("avg_price")).collect()
        category_analysis = df.groupBy("drug_category").agg(
            count("*").alias("category_count"),
            avg("drug_price").alias("avg_category_price"),
            sum("annual_usage").alias("total_usage")
        ).orderBy(desc("category_count")).collect()
        dosage_form_stats = df.groupBy("dosage_form").agg(
            count("*").alias("form_count"),
            avg("drug_price").alias("avg_form_price")
        ).orderBy(desc("form_count")).collect()
        prescription_type_analysis = df.groupBy("prescription_type").agg(
            count("*").alias("type_count"),
            avg("drug_price").alias("avg_type_price"),
            sum("annual_usage").alias("type_total_usage")
        ).collect()
        efficacy_distribution = df.groupBy("main_efficacy").agg(
            count("*").alias("efficacy_count"),
            avg("drug_price").alias("avg_efficacy_price")
        ).orderBy(desc("efficacy_count")).limit(10).collect()
        price_efficacy_correlation = df.select("drug_price", "main_efficacy").toPandas()
        correlation_matrix = price_efficacy_correlation.groupby('main_efficacy')['drug_price'].agg(['mean', 'count', 'std']).round(2)
        price_stats = [{"stat": row['summary'], "value": float(row['drug_price'])} for row in price_distribution]
        range_data = [{"range": row['price_range'], "count": int(row['drug_count']), "avg_price": round(float(row['avg_price']), 2)} for row in price_ranges]
        category_data = [{"category": row['drug_category'], "count": int(row['category_count']), "avg_price": round(float(row['avg_category_price']), 2), "usage": int(row['total_usage'])} for row in category_analysis]
        form_data = [{"form": row['dosage_form'], "count": int(row['form_count']), "avg_price": round(float(row['avg_form_price']), 2)} for row in dosage_form_stats]
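        return JsonResponse({'code': 200, 'price_stats': price_stats, 'price_ranges': range_data, 'categories': category_data, 'dosage_forms': form_data, 'message': '药品核心属性分析完成'})
    # Any method other than GET would otherwise fall through and return None, which Django rejects.
    return JsonResponse({'code': 405, 'message': 'Method not allowed'}, status=405)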
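
The price bucketing in `drug_core_attribute_analysis` can be checked on a handful of values without a Spark cluster. The sketch below mirrors the boundaries of the `when` chain (50, 200, 500) in plain Python. The function names, the English bucket labels, and the sample prices are made up for illustration.

```python
from collections import defaultdict

def price_range(price):
    """Map a unit price (yuan) to the same four buckets as the Spark when() chain."""
    if price < 50:
        return "under 50"
    if price < 200:
        return "50-200"
    if price < 500:
        return "200-500"
    return "500 and above"

def summarize(prices):
    """Return {bucket: (count, average price)} for a list of prices."""
    grouped = defaultdict(list)
    for p in prices:
        grouped[price_range(p)].append(p)
    return {bucket: (len(vals), round(sum(vals) / len(vals), 2))
            for bucket, vals in grouped.items()}

sample_prices = [10.0, 30.0, 50.0, 199.0, 200.0, 650.0]  # made-up values
summary = summarize(sample_prices)
print(summary)
```

Each lower bound is inclusive, so a price of exactly 50 lands in the 50-200 bucket. In the Spark version, a null `drug_price` fails every `when` test and falls through to `otherwise`, so it may be worth filtering nulls out first.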

def medical_insurance_reimbursement_analysis(request):
    if request.method == 'POST':
        data = json.loads(request.body)
        analysis_year = data.get('year', 2023)
        region = data.get('region', '全市')
        df = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/medical_db").option("dbtable", "reimbursement_data").option("user", "root").option("password", "password").load()
        yearly_data = df.filter(col("reimb_year") == analysis_year)
        if region != '全市':
            yearly_data = yearly_data.filter(col("region") == region)
        reimbursement_ratio_analysis = yearly_data.groupBy("drug_category").agg(
            avg("reimbursement_ratio").alias("avg_reimb_ratio"),
            sum("total_cost").alias("total_category_cost"),
            sum("reimbursed_amount").alias("total_reimbursed"),
            count("*").alias("prescription_count")
        ).withColumn("actual_reimb_ratio", col("total_reimbursed") / col("total_category_cost")).orderBy(desc("total_category_cost")).collect()
        monthly_trends = yearly_data.groupBy("reimb_month").agg(
            sum("total_cost").alias("monthly_cost"),
            sum("reimbursed_amount").alias("monthly_reimbursed"),
            avg("reimbursement_ratio").alias("avg_monthly_ratio")
        ).orderBy("reimb_month").collect()
        age_group_analysis = yearly_data.withColumn("age_group",
            when(col("patient_age") < 18, "未成年(18岁以下)")
            .when((col("patient_age") >= 18) & (col("patient_age") < 60), "成年人(18-60岁)")
            .otherwise("老年人(60岁以上)")
        ).groupBy("age_group").agg(
            avg("reimbursement_ratio").alias("avg_age_reimb_ratio"),
            sum("total_cost").alias("age_total_cost"),
            count("*").alias("age_prescription_count")
        ).collect()
        hospital_level_analysis = yearly_data.groupBy("hospital_level").agg(
            avg("reimbursement_ratio").alias("avg_hospital_reimb_ratio"),
            sum("total_cost").alias("hospital_total_cost"),
            count("*").alias("hospital_prescription_count")
        ).orderBy(desc("hospital_total_cost")).collect()
        policy_effectiveness = yearly_data.agg(
            avg("reimbursement_ratio").alias("overall_avg_ratio"),
            sum("total_cost").alias("total_medical_cost"),
            sum("reimbursed_amount").alias("total_fund_expenditure")
        ).collect()[0]
        # SUM over an empty selection yields NULL, so fall back to 0 and avoid dividing by zero.
        total_medical_cost = float(policy_effectiveness['total_medical_cost'] or 0)
        total_fund_expenditure = float(policy_effectiveness['total_fund_expenditure'] or 0)
        fund_utilization_rate = total_fund_expenditure / total_medical_cost if total_medical_cost else 0.0
        # actual_reimb_ratio can be NULL when a category's total cost is 0, hence the "or 0" guard.
        category_results = [{"category": row['drug_category'], "avg_ratio": round(float(row['avg_reimb_ratio']), 3), "total_cost": float(row['total_category_cost']), "actual_ratio": round(float(row['actual_reimb_ratio'] or 0), 3)} for row in reimbursement_ratio_analysis]
        monthly_results = [{"month": int(row['reimb_month']), "cost": float(row['monthly_cost']), "reimbursed": float(row['monthly_reimbursed']), "ratio": round(float(row['avg_monthly_ratio']), 3)} for row in monthly_trends]
        age_results = [{"age_group": row['age_group'], "avg_ratio": round(float(row['avg_age_reimb_ratio']), 3), "total_cost": float(row['age_total_cost'])} for row in age_group_analysis]
        return JsonResponse({'code': 200, 'category_analysis': category_results, 'monthly_trends': monthly_results, 'age_analysis': age_results, 'fund_utilization': round(fund_utilization_rate, 3), 'message': '医保报销策略分析完成'})
    # Any method other than POST would otherwise fall through and return None, which Django rejects.
    return JsonResponse({'code': 405, 'message': 'Method not allowed'}, status=405)
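
The reimbursement view reports two ratios per category: `avg_reimb_ratio` (the mean of per-record ratios) and `actual_reimb_ratio` (total reimbursed divided by total cost). The two can differ noticeably when record sizes vary. Here is a small sketch with made-up numbers, assuming the stored `reimbursement_ratio` equals reimbursed amount over cost for each record. The function name `ratio_summary` is hypothetical.

```python
def ratio_summary(records):
    """records: list of (total_cost, reimbursed_amount) pairs for one drug category."""
    # Mean of per-record ratios: every record counts equally, whatever its size
    per_record = [reimbursed / cost for cost, reimbursed in records if cost > 0]
    mean_of_ratios = sum(per_record) / len(per_record) if per_record else 0.0
    # Pooled ratio: total reimbursed over total cost, so large records dominate
    total_cost = sum(cost for cost, _ in records)
    total_reimbursed = sum(reimbursed for _, reimbursed in records)
    pooled_ratio = total_reimbursed / total_cost if total_cost else 0.0
    return mean_of_ratios, pooled_ratio

# One small record reimbursed at 90% and one large record reimbursed at 50% (made-up)
records = [(100.0, 90.0), (900.0, 450.0)]
mean_of_ratios, pooled_ratio = ratio_summary(records)
print(round(mean_of_ratios, 3), round(pooled_ratio, 3))
```

The pooled ratio reflects how much of the money spent was actually reimbursed, so it is probably the better figure for fund-level questions. The mean of ratios describes the typical record.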

def drug_data_mining_analysis(request):
    if request.method == 'POST':
        data = json.loads(request.body)
        mining_type = data.get('type', 'clustering')
        df = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/medical_db").option("dbtable", "drug_usage_data").option("user", "root").option("password", "password").load()
        # Initialise every result container so the response can be built whichever branch runs.
        cluster_analysis = []
        top_drugs_per_cluster = []
        association_rules = []
        if mining_type == 'clustering':
            feature_cols = ["drug_price", "annual_usage", "reimbursement_ratio", "patient_count"]
            assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
            # VectorAssembler rejects nulls by default, so missing numeric values are filled with 0.
            feature_df = assembler.transform(df.na.fill(0))
            cluster_total = 5
            kmeans = KMeans(k=cluster_total, seed=42, featuresCol="features", predictionCol="cluster")
            model = kmeans.fit(feature_df)
            clustered_df = model.transform(feature_df)
            cluster_analysis = clustered_df.groupBy("cluster").agg(
                count("*").alias("cluster_size"),
                avg("drug_price").alias("avg_cluster_price"),
                avg("annual_usage").alias("avg_cluster_usage"),
                avg("reimbursement_ratio").alias("avg_cluster_reimb")
            ).collect()
            # Five most-used drugs in each cluster.
            for i in range(cluster_total):
                cluster_drugs = clustered_df.filter(col("cluster") == i).orderBy(desc("annual_usage")).limit(5).select("drug_name", "drug_price", "annual_usage").collect()
                top_drugs_per_cluster.append({"cluster_id": i, "top_drugs": [{"name": drug['drug_name'], "price": float(drug['drug_price']), "usage": int(drug['annual_usage'])} for drug in cluster_drugs]})
        elif mining_type == 'association':
            # The row count is computed once and reused for every support value.
            total_rows = df.count()
            frequent_combinations = df.groupBy("drug_category", "prescription_type").agg(
                count("*").alias("combination_count"),
                avg("drug_price").alias("avg_combination_price")
            ).filter(col("combination_count") > 100).orderBy(desc("combination_count")).limit(20).collect()
            for combo in frequent_combinations:
                # Support = share of all rows that carry this category/prescription-type pair.
                support = float(combo['combination_count']) / total_rows
                rule_info = {
                    'category': combo['drug_category'],
                    'prescription_type': combo['prescription_type'],
                    'count': int(combo['combination_count']),
                    'support': round(support, 4),
                    'avg_price': round(float(combo['avg_combination_price']), 2)
                }
                association_rules.append(rule_info)
        usage_pattern_analysis = df.groupBy("season", "drug_category").agg(
            sum("annual_usage").alias("seasonal_usage"),
            avg("drug_price").alias("seasonal_avg_price")
        ).collect()
        # Three-sigma rule: flag values more than three standard deviations above the mean.
        # All four statistics come from a single Spark job.
        stats = df.selectExpr("avg(drug_price) AS price_mean", "stddev(drug_price) AS price_std", "avg(annual_usage) AS usage_mean", "stddev(annual_usage) AS usage_std").first()
        price_threshold = float(stats['price_mean'] or 0) + 3 * float(stats['price_std'] or 0)
        usage_threshold = float(stats['usage_mean'] or 0) + 3 * float(stats['usage_std'] or 0)
        anomaly_detection = df.filter((col("drug_price") > price_threshold) | (col("annual_usage") > usage_threshold)).limit(10).collect()
        cluster_results = [{"cluster": int(row['cluster']), "size": int(row['cluster_size']), "avg_price": round(float(row['avg_cluster_price']), 2), "avg_usage": round(float(row['avg_cluster_usage']), 2)} for row in cluster_analysis]
        pattern_results = [{"season": row['season'], "category": row['drug_category'], "usage": int(row['seasonal_usage']), "avg_price": round(float(row['seasonal_avg_price']), 2)} for row in usage_pattern_analysis]
        # The anomaly label uses the computed price threshold rather than a fixed cut-off.
        anomaly_results = [{"drug_name": anomaly['drug_name'], "price": float(anomaly['drug_price']), "usage": int(anomaly['annual_usage']), "anomaly_type": "价格异常" if float(anomaly['drug_price']) > price_threshold else "用量异常"} for anomaly in anomaly_detection]
        return JsonResponse({'code': 200, 'clusters': cluster_results, 'top_drugs_clusters': top_drugs_per_cluster, 'association_rules': association_rules, 'usage_patterns': pattern_results, 'anomalies': anomaly_results, 'message': '药品数据挖掘分析完成'})
    # Any method other than POST would otherwise fall through and return None, which Django rejects.
    return JsonResponse({'code': 405, 'message': 'Method not allowed'}, status=405)
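
The anomaly step in `drug_data_mining_analysis` flags values more than three standard deviations above the mean. The rule can be tried on a small list with the standard library alone. This is a sketch with made-up prices, and `three_sigma_threshold` is a hypothetical helper name. It uses the sample standard deviation, which is also what Spark's `stddev` computes.

```python
from statistics import mean, stdev

def three_sigma_threshold(values):
    """Upper cut-off: mean plus three sample standard deviations."""
    return mean(values) + 3 * stdev(values)

# Nineteen ordinary prices and one extreme price (made-up data)
prices = [10.0] * 19 + [1000.0]
threshold = three_sigma_threshold(prices)

# Only values strictly above the cut-off count as anomalies
outliers = [p for p in prices if p > threshold]
print(round(threshold, 2), outliers)
```

The extreme value itself inflates both the mean and the standard deviation, so on very small samples no single point can exceed the cut-off. The rule works better on the large tables this system targets.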

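Beijing Medical Insurance Drug Data Analysis System - Documentation

文档.png (documentation screenshot)

Getting the Source Code - Closing Remarks

This big-data-based Beijing medical insurance drug data analysis system combines theory and practice reasonably well. It spans big data processing with Hadoop+Spark, backend development with Python+Django, and data visualization with Vue+Echarts, so the technology stack is fairly complete. It is only a graduation project, but building it teaches a good deal, particularly about medical data analysis and the big data processing workflow. Core features such as drug attribute analysis and reimbursement strategy analysis also have some practical value. If you are struggling to pick a graduation project topic, or want more technical detail on this one, feel free to leave a comment. If this helped, please give it a like, and anyone who needs the complete materials can send me a private message!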
