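DB4AI (Database for AI) applies database and data management techniques to improve the performance of the entire AI pipeline, including up-front data preparation, training and inference acceleration, model cost reduction, and production deployment.
Now that model architectures have largely matured, data has become a key factor in model performance. "Garbage in, garbage out" means that if the data does not meet the requirements of model training, even the best model cannot learn anything useful. Some researchers have gone further and proposed the concept of Data-centric AI, which shifts model training from a model-centric view (parameter tuning) to a data-centric one that focuses on how data affects the model. Concretely, this covers data acquisition and integration, data labeling, data cleaning and preparation, and data reduction and augmentation. The editor recommends the following two articles for reference.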
- Jarrahi, Mohammad Hossein, Ali Memariani, and Shion Guha. "The principles of data-centric AI." Communications of the ACM 66.8 (2023): 84-92.
- Zha, Daochen, et al. "Data-centric artificial intelligence: A survey." ACM Computing Surveys 57.5 (2025): 1-42.
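Database research focuses on data management, and many Data-centric AI problems, such as data integration, data cleaning, and label crowdsourcing, are themselves classic database problems. More importantly, the core concern of database research is efficiency: speeding up model training and inference and reducing cost. Now that the AI pipeline has largely matured, how to use database methods to improve the cost-effectiveness of model production and the user experience has become a hot topic.
Over the past three years (2022-2025), DB4AI has drawn steadily growing interest at the top database conferences (SIGMOD, VLDB, ICDE), and its share of the total number of papers has been increasing. For this article, the editor went through the relevant papers from these three conferences over that period and organized them by research direction into six main areas: data cleaning, data preparation, model training and inference acceleration (algorithmic level), multimodal data management, efficient machine learning algorithms, and explainability. Each area is broken down into specific research problems, and each problem lists the latest top-conference papers, each with a brief note on its research problem or technical approach, for readers' reference.
Limited by the editor's own ability, the areas identified and the papers listed under them may be incomplete; some gaps are marked explicitly and others are not. In particular, journal papers and papers from non-database conferences have not been covered. Readers are welcome to offer corrections and additions in the comments or by private message so that this article can be kept up to date.
In addition, because the collection is not necessarily complete, no per-area paper counts are given, to avoid misleading readers.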
I. Data Cleaning
1.1 Missing Value Imputation
- Missing value imputation 【SIGMOD'24】 (UNSW) Missing Data Imputation with Uncertainty-Driven Network
- 【ICDE'24】 (Spain, Belgium) Mitigating Data Sparsity in Integrated Data through Text Conceptualization
1.2 Duplication Detection
- Inter-table redundancy detection 【SIGMOD'24】 (Italy, Germany) Determining the Largest Overlap between Tables
- Effect of duplicate data on different models 【VLDB'24】 (IBM) How do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses
1.3 Imbalanced Class Distribution
- Feature importance estimation 【SIGMOD'24】 (Tsinghua, Ant Group) FeatureLTE: Learning to Estimate Feature Importance
- Effect of data distribution shift on model accuracy 【VLDB'24, SDS】 (Université Paris Cité) Efficiently Mitigating the Impact of Data Drift on Machine Learning Pipelines
1.4 Mislabel Detection
- Data cleaning based on latent space analysis 【VLDB'24】 (TUD, Germany) Generalizable Data Cleaning of Tabular Data in Latent Space
- 【ICDE'24】 (Beijing Normal University, CAS Institute of Computing Technology) Label Noise Correction for Federated Learning: A Secure, Efficient and Reliable Realization
- 【ICDE'24】 (University of Arkansas) Contrastive Learning for Fraud Detection from Noisy Labels
1.5 Cleaning Pipeline
- LLM-based data cleaning 【SIGMOD'25】 GEIL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models
- Data cleaning + feature augmentation pipeline 【SIGMOD'25】 CtxPipe: Context-aware Data Preparation Pipeline Construction for Machine Learning
- Assembling cleaning algorithms into an end-to-end data cleaning pipeline 【SIGMOD'24】 (Austria, Germany) SAGA: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications
- Data preprocessing pipeline 【SIGMOD'23】 (Renmin University, Beijing Institute of Technology, Tsinghua) HAIPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation
- 【ICDE'24】 (Shenzhen Institute of Computing Sciences) BClean: A Bayesian Data Cleaning System
1.6 Data Integration (Entity matching)
- Entity matching 【SIGMOD'23】 (Renmin University, Tsinghua) Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration
【TBD】
1.7 Others
- Data independence checking 【SIGMOD'24】 (Western University, UCSD) OTClean: Data Cleaning for Conditional Independence Violations using Optimal Transport
- Multi-dimensional time series cleaning 【VLDB'24】 (Harbin Institute of Technology, Tsinghua) MTSClean: Efficient Constraint-based Cleaning for Multi-Dimensional Time Series Data
- Multi-dimensional time series cleaning 【SIGMOD'25】 (Beijing Institute of Technology) Multivariate Time Series Cleaning under Speed Constraints
- Denoising 【ICDE'24】 (HKUST) Triple-d: Denoising Distant Supervision for High-quality Data Creation
- Data quality assessment 【ICDE'23】 (ETHZ, MSR) Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise
II. Data Preparation
2.1 Data Augmentation (Generation)
- Generating LLM dialogue benchmarks from knowledge graphs and RAG 【SIGMOD'25】 (Canada) Dialogue Benchmark Generation from Knowledge Graphs with Cost-Effective Retrieval-Augmented LLMs
- Constraint-conditioned tabular data generation 【SIGMOD'24】 (Renmin University, Tsinghua) Controllable Tabular Data Synthesis Using Diffusion Models
- Text-table pair data augmentation 【SIGMOD'24】 (France, Italy) Generation of Training Examples for Tabular Natural Language Inference
- Time series generation benchmark 【VLDB'24】 (NUS) TSGBench: Time Series Generation Benchmark
- Improving label data for weakly supervised learning 【VLDB'23】 (University of Washington) Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming
- GAN-based tabular data generation 【VLDBJ'24】 (Renmin University, Tsinghua) Tabular data synthesis with generative adversarial networks: design space and optimizations
- Reinforcement-learning-guided data generation 【ICDE'24】 (Beijing Institute of Technology, Renmin University) Mitigating Data Scarcity in Supervised Machine Learning through Reinforcement Learning Guided Data Generation
- Tabular data synthesis with diffusion models 【ICDE'24】 (Netherlands, Switzerland) SiloFuse: Cross-silo Synthetic Data Generation with Latent Tabular Diffusion Models
- Feature augmentation (one-to-many table) 【ICDE'24】 (SFU) FeatAug: Automatic Feature Augmentation From One-to-Many Relationship Tables
- Join-path-based feature augmentation 【ICDE'24】 (Netherlands) AutoFeat: Transitive Feature Discovery over Join Paths
- Form-like document data generation 【ICDE'24】 (Google Research, JHU) FieldSwap: Data Augmentation for Effective Form-Like Document Extraction
- Ambiguous data 【ICDE'23】 (Eurecom) Data Ambiguity Profiling for the Generation of Training Examples
2.2 Data Selection (Coreset selection, Data acquisition, active learning)
- Selecting data to improve model confidence 【SIGMOD'24】 (York University, University of Toronto) Data Acquisition for Improving Model Confidence
- Diversified coreset 【SIGMOD'24】 (UIUC, Cornell) Faster Algorithms for Fair Max-Min Diversification in Rd
- Optimizing coreset selection 【VLDB'24】 (Australia) Optimizing Data Acquisition to Enhance Machine Learning Performance
- Active learning for improved model fairness 【VLDB'24】 (KAIST, Georgia Tech) Falcon: Fair Active Learning using Multi-armed Bandits
- Coreset selection 【SIGMOD'23】 (Beijing Institute of Technology, Renmin University, Tsinghua) GoodCore: Data-effective and Data-efficient Machine Learning through Coreset Selection over Incomplete Data
- Feature augmentation via multi-table JOIN followed by coreset selection 【VLDB'23】 (Tsinghua) Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning
- AutoML training acceleration 【VLDB'23, SDS】 (UCL, Israel) SubStrat: A Subset-Based Optimization Strategy for Faster AutoML
- Active learning that tolerates annotator mistakes 【SIGMOD'23】 (Oregon State University, Eurecom) Exploratory Training: When Annotators Learn About Data
- 【ICDE'24】 (HKUST) Effective Data Selection and Replay for Unsupervised Continual Learning
- 【VLDB'21】 (York University, University of Toronto) Data acquisition for improving machine learning models
2.3 Data Extraction/Discovery
- Extracting tabular data from text 【SIGMOD'24】 (University of Toronto) Unstructured Data Fusion for Schema and Data Extraction
- Dataset search 【SIGMOD'24】 (University of Chicago) Solo: Data Discovery Using Natural Language Questions Via A Self-Supervised Approach
- Extracting structured data from semi-structured web pages 【VLDB'23】 (Amazon) Self-Training for Label-Efficient Information Extraction from Semi-Structured Web-Pages
- Dataset search 【VLDB'23】 (Northeastern University, Megagon Labs) Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning
- 【VLDB'23】 (Qatar) Cross Modal Data Discovery over Structured and Unstructured Data Lakes
- Extracting tabular data from GitHub 【SIGMOD'23, short paper】 (University of Amsterdam) GitTables: A Large-Scale Corpus of Relational Tables
- Multi-source data preprocessing and knowledge graph construction 【ICDE'24】 (Canada) KGLiDS: A Platform for Semantic Abstraction, Linking, and Automation of Data Science
2.4 Crowdsourcing
【TBD】
III. Improving Model Training/Inference (Algorithmic level)
3.1 Inference Acceleration
3.1.1 End-to-End Pipeline Acceleration
- Accelerating inference by exploiting the stability of model outputs 【VLDB'24】 (CUHK) Biathlon: Harnessing Model Resilience for Accelerating ML Inference Pipelines
- Inference framework for dynamic tensor shapes 【SIGMOD'24】 (Alibaba, Renmin University) BladeDISC: Optimizing Dynamic Shape Machine Learning Workloads via Compiler Approach
- Accelerating ensemble inference by identifying redundancy 【ICDE'23】 (USTC) Efficient Deep Ensemble Inference via Query Difficulty-dependent Task Scheduling
3.1.2 (Sparse) Matrix Multiplication Acceleration
- Pruning-based acceleration of matrix multiplication for decoder-only models (reducing GPU memory traffic) 【VLDB'24】 (Alibaba, University of Sydney) Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
- Sparse matrix chain multiplication 【SIGMOD'24】 (CUHK-Shenzhen, Huawei) On Efficient Large Sparse Matrix Chain Multiplication
3.1.3 Transformer Acceleration
- Time series Transformer computation 【SIGMOD'24】 (UPenn, MIT, Tsinghua) RITA: Group Attention is All You Need for Timeseries Analytics
- Time series Transformer computation 【VLDB'24】 (HKBU, Saudi Arabia) DARKER: Efficient Transformer with Data-driven Attention Mechanism for Time Series
- Multi-GPU Transformer training 【VLDB'23】 (CMU, Peking University) Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism
3.1.4 CPU Inference Acceleration
- New inference architecture for accelerating in-database (CPU) inference 【VLDB'24】 (HPI, UIC) InferDB: In-Database Machine Learning Inference Using Indexes
- Inference acceleration under resource constraints (speeding up tensor operations for database CPU operations) 【VLDB'24】 (Zhejiang University, Alibaba) SmartLite: A DBMS-Based Serving System for DNN Inference in Resource-Constrained Environments
3.1.5 Others
- Accelerating basic data preprocessing operations (image cropping, resizing, etc.) 【VLDB'24】 (UNIST, South Korea) FusionFlow: Accelerating Data Preprocessing for Machine Learning with CPU-GPU Cooperation
- CPU resource allocation during data preprocessing 【ICDE'24】 (Samsung) FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline
- Spatio-temporal data preprocessing 【SIGMOD'23】 (NTU, Alibaba) ST4ML: Machine Learning Oriented Spatio-Temporal Data Processing at Scale
3.2 Training Acceleration
3.2.1 Distributed Training
- Scheduling multi-model training across multiple GPUs 【VLDB'24】 (UCSD) Saturn: An Optimized Data System for Multi-Large-Model Deep Learning Workloads
- Training across multiple cloud providers 【VLDB'24】 (TUM, University of Toronto) How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study
- Reducing the communication cost of multi-GPU training on public clouds 【VLDB'23】 (JHU, AWS) MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud
- Fault tolerance for distributed recommendation model training 【VLDB'23】 (CMU) Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding
- Distributed training 【ICDE'24】 (Zhejiang University) SparDL: Distributed Deep Learning Training with Efficient Sparse Communication
- Reducing communication cost in distributed training 【ICDE'23】 (Peking University) SK-Gradient: Efficient Communication for Distributed Machine Learning with Data Sketch
3.2.2 Recommendation Model Training (Embedding Handling)
- Recommendation model training acceleration 【SIGMOD'23】 (CUHK, SUSTech, Meta) FEC: Efficient Deep Recommendation Model Training with Flexible Embedding Communication
3.2.3 Others
- Data selection and trigger frequency selection for model updates 【SIGMOD'25】 (ETHZ, Copenhagen, TUM) Modyn: Data-Centric Machine Learning Pipeline Orchestration
- Accelerating accuracy calibration of quantized models 【VLDB'24】 (Aalborg University in Denmark, East China Normal University) Core: Data-Efficient, On-Device Continual Calibration for Quantized Models
- Equivalence-based optimization of ML pipelines 【ICDE'24】 (Spain, Belgium, Greece) HYPPO: Using Equivalences to Optimize Pipelines in Exploratory Machine Learning
3.3 AI Tooling Optimization
- Pandas code rewriting 【SIGMOD'24】 (UIUC) Dias: Dynamic Rewriting of Pandas Code
- Notebook checkpointing / stateful code migration 【VLDB'24】 (UIUC) ElasticNotebook: Enabling Live Migration for Computational Notebooks
- Dataframe deduplication to reduce memory requirements 【VLDB'24】 (UW-Madison, CMU) SplitDF: Splitting Dataframes for Memory-Efficient Data Analysis
- 【VLDB'23】 (UCB, UIUC, UPenn) Bolt-on, Compact, and Rapid Program Slicing for Notebooks
- 【ICDE'24】 (University of Edinburgh) PyTond: Efficient Python Data Science on the Shoulders of Databases
3.4 GNN Training Acceleration
- Temporal GNN 【SIGMOD'24】 (HKUST) SIMPLE: Efficient Temporal Graph Neural Network Training at Scale with Dynamic Data Placement
- Multi-GPU training 【SIGMOD'24】 (NUS) HongTu: Scalable Full-Graph GNN Training on Multiple GPUs
- Distributed GNN training 【VLDB'24】 (Northeastern University) DynaHB: A Communication-Avoiding Asynchronous Distributed Framework with Hybrid Batches for Dynamic GNN Training
- Dynamic graphs 【SIGMOD'24】 (Japan) DGC: Training Dynamic Graphs with Spatio-Temporal Non-Uniformity using Graph
- Streaming GNN 【VLDB'24】 (University of Warwick) D3-GNN: Dynamic Distributed Dataflow for Streaming Graph Neural Networks
- Data- and hardware-aware execution planning 【VLDB'24】 (HKUST) DAHA: Accelerating GNN Training with Data and Hardware Aware Execution Planning
- Hardware-accelerated GNN operations 【VLDB'24】 (UIUC, NVIDIA) Accelerating Sampling and Aggregation Operations in GNN Frameworks with GPU Initiated Direct Storage Accesses
- Reducing memory requirements 【VLDB'24】 (Tsinghua, NYU, Amazon) FreshGNN: Reducing Memory Access via Stable Historical Embeddings for Graph Neural Network Training
【TBD】
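IV. Multimodal Data Management
4.1 Multimodal Data Retrieval
4.1.1 Data Retrieval Paradigms
- Semantic alignment and query optimization based on user feedback 【SIGMOD'24】 (Cornell) ThalamusDB: Approximate Query Processing on Multi-Modal Data
- Joint query optimization over text and tables 【VLDB'24】 (TUD, Germany) ELEET: Efficient Learned Query Execution over Text and Tables
- Refining query embeddings with user feedback to improve the semantic accuracy of image search 【SIGMOD'23】 (MIT) SeeSaw: Interactive Ad-hoc Search Over Image Databases
- KNN where distance computation requires invoking a model (accelerated with a small proxy model) 【VLDB'23】 (UBC, CNRS) On Efficient Approximate Queries over Machine Learning Models
- Semantic retrieval under arbitrary query semantics 【VLDB'23】 (University of Munich, University of Copenhagen) Fast Search-by-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests
- Semantic retrieval based on a small proxy model 【SIGMOD'22】 (Stanford) TASTI: Semantic Indexes for Machine Learning-based Queries over Unstructured Data
- Multi-metric space search 【ICDE'24】 (Zhejiang University) HJG: An Effective Hierarchical Joint Graph for ANNS in Multi-Metric Spaces
- Image retrieval with combined image and text input 【ICDE'24】 (Zhejiang University, Hangzhou Dianzi University) MUST: An Effective and Scalable Framework for Multimodal Search of Target Modality
- Answering queries jointly over images and graphs 【ICDE'24】 (Zhejiang University of Science and Technology, Newcastle, Beijing Institute of Technology) Across Images and Graphs for Question Answering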
4.1.2 Vector Nearest Neighbor Search
4.1.2.1 New Index Designs
- Index combining quantization and graphs 【SIGMOD'25】 (NTU) SymphonyQG: towards Symphonious Integration of Quantization and Graph for Approximate Nearest Neighbor Search
- Hash index 【SIGMOD'25】 (Chinese Academy of Sciences, Université Paris Cité) Subspace Collision: An Efficient and Accurate Framework for High-dimensional Approximate Nearest Neighbor Search
- Vector-quantization-based index 【SIGMOD'24】 (NTU) RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound for Approximate Nearest Neighbor Search
- Disk-resident graph index 【SIGMOD'24】 (Zhejiang University, Zilliz, Hangzhou Dianzi University) Starling: An I/O-Efficient Disk-Resident Graph Index Framework for High-Dimensional Vector Similarity Search on Data Segment
- Combined tree-graph vector index 【VLDB'24】 (Harbin Institute of Technology) DIDS: Double Indices and Double Summarizations for Fast Similarity Search
- Graph index 【SIGMOD'23】 (Guangzhou University, HKBU) Efficient Approximate Nearest Neighbor Search in Multi-dimensional Databases
- LSH + graph index 【VLDB'23】 (HKUST) Towards Efficient Index Construction and Approximate Nearest Neighbor Search in High-Dimensional Spaces
- Tree + graph index for large datasets 【VLDB'23, SDS】 (Morocco, Université Paris Cité) Elpis: Graph-Based Similarity Search for Scalable Data Science
- LSH index 【VLDB'23】 (University of Florida) LIDER: an efficient high-dimensional learned index for large-scale dense passage retrieval
4.1.2.2 Query Analysis and Surveys
- Hardness definition and benchmark generation for graph-based ANN 【VLDB'25】 (Fudan, Université Paris Cité) -Hardness: A Query Hardness Measure for Graph-Based ANN Indexes
- Graph index survey 【SIGMOD'25】 (Morocco, Université Paris Cité) Graph-Based Vector Search: An Experimental Evaluation of the State-of-the-Art
- Vector databases 【ICDE'24】 (Purdue University) Are There Fundamental Limitations in Supporting Vector Data Management in Relational Databases? A Case Study of PostgreSQL
- Parameter tuning 【ICDE'24】 (Nankai University, Ant Group) VDTuner: Automated Performance Tuning for Vector Data Management Systems
4.1.2.3 Query Optimization Algorithms
- Query pruning based on the triangle inequality 【SIGMOD'25】 (Renmin University, Tsinghua) Tribase: A Vector Data Query Engine for Reliable and Lossless Pruning Compression using Triangle Inequalities
- Graph index pruning 【SIGMOD'23】 (NTU) High-Dimensional Approximate Nearest Neighbor Search: with Reliable and Efficient Distance Comparison Operations
- Scalar quantization 【VLDB'23】 (Intel) Similarity search in the blink of an eye with compressed indices
- Optimizing quantization guidance in disk-based indexes 【ICDE'24】 (Hangzhou Dianzi University) Routing-Guided Learned Product Quantization for Graph-Based Approximate Nearest Neighbor Search
4.1.2.4 Filter Constraints
- Vector search under arbitrary predicates 【SIGMOD'25】 (Fudan) Navigating Labels and Vectors: A Unified Approach to Filtered Approximate Nearest Neighbor Search
- Vector queries with scalar range constraints 【SIGMOD'25】 (NTU) iRangeGraph: Improvising Range-dedicated Graphs for Range-filtering Nearest Neighbor Search
- Vector queries with scalar range constraints 【SIGMOD'24】 (Rutgers, Alibaba) Range-Filtering Approximate Nearest Neighbor Search
- Vector queries under arbitrary scalar predicates 【SIGMOD'24】 (Stanford, UCB) ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data
4.1.2.5 Vector Search in Special Environments/Conditions
- ANN with dual-vector representations 【SIGMOD'25】 (NTU) DEG: Efficient Hybrid Vector Search Using the Dynamic Edge Navigation Graph
- Cross-modal ANN 【VLDB'24】 (Fudan) RoarGraph: A Projected Bipartite Graph for Efficient Cross-Modal Approximate Nearest Neighbor Search
- Vector database in the cloud 【SIGMOD'24】 (Purdue University, MSR) Vexless: A Serverless Vector Data Management System Using Cloud Functions
- Near-duplicate detection in text 【SIGMOD'25】 (Rutgers) Near-Duplicate Text Alignment with One Permutation Hashing
- Near-duplicate detection in text 【SIGMOD'23】 (Rutgers) Near-Duplicate Sequence Search at Scale for Neural Language Model Memorization Evaluation
- Secure federated vector search 【SIGMOD'24】 (HKBU) FedKNN: Secure Federated k-Nearest Neighbor Search
- Tree-embedding-based similarity search under arbitrary metrics 【SIGMOD'23】 (Beihang University, HKUST) LiteHST: A Tree Embedding based Method for Similarity Search
- Inner product search 【VLDB'23】 (Huazhong University of Science and Technology, Zilliz, HKUST) FARGO: Fast Maximum Inner Product Search via Global Multi-Probing
- High-dimensional metric space search 【VLDB'23】 (Greece, Denmark) Adaptive Indexing in High-Dimensional Metric Spaces
- Inner product search 【ICDE'24】 (Chinese Academy of Sciences, Shenzhen) Reconsidering Tree based Methods for k-Maximum Inner-Product Search: The LRUS-CoverTree
- Graph construction on GPUs 【ICDE'24】 (NVIDIA) CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs
- Reverse KNN 【ICDE'24】 (Shanghai Jiao Tong University, Alibaba) Efficient Reverse k Approximate Nearest Neighbor Search over High-Dimensional Vectors
- Reverse KNN 【ICDE'24】 (HKBU, SUSTech, Huawei) QSRP: Efficient Reverse k-Ranks Query Processing on High-dimensional Embeddings
- Inner product search over sparse vectors 【ICDE'24】 (HKUST, Huazhong University of Science and Technology, Macau) Efficient Approximate Maximum Inner Product Search over Sparse Vectors
4.1.3 Retrieval-Augmented Generation (RAG)
- Overlapping the vector database with the prefill stage to improve RAG efficiency 【SIGMOD'25】 (Huazhong University of Science and Technology) AquaPipe: A Quality-Aware Pipeline for Knowledge Retrieval and Large Language Models
- Accelerating the RAG pipeline with new hardware 【VLDB'25】 (ETHZ) Chameleon: a Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models
4.1.4 Video Data Queries
- Video representation and query optimization 【VLDB'25】 (Beijing Institute of Technology) TVM: A Tile-based Video Management Framework
- Video query acceleration 【VLDB'23】 (Georgia Tech, Adobe) Seiden: Revisiting Query Processing in Video Database Systems
- Complex video query optimization 【VLDB'23】 (Stanford) Optimizing Video Analytics with Declarative Model Relationships
- Video queries under fuzzy semantics 【VLDB'23】 (University of Washington) EQUI-VOCAL: Synthesizing Queries for Compositional Video Events from Limited User Interactions
- Geolocation-aware video queries 【VLDB'25】 (UCB) Spatialyze: A Geospatial Video Analytics System with Spatial-Aware Optimizations
- Video query optimization using user-provided hints 【VLDB'25】 (University of Toronto, York University) Optimizing Video Queries with Declarative Clues
- Optimizing video LIMIT queries 【VLDB'25】 (Michigan, USC, MIT) Optimizing Video Selection LIMIT Queries With Commonsense Knowledge
- On-device video analytics 【ICDE'24】 (Sun Yat-sen University) COUPLE: Orchestrating Video Analytics on Heterogeneous Mobile Processors
- 【ICDE'23】 (University of Toronto, York University) Track Merging for Effective Video Query Processing
- 【ICDE'23】 (University of Toronto, York University) Marshalling Model Inference in Video Streams
4.2 Multimodal Data Storage for AI
4.2.1 Images/Video
- Image format for model training 【SIGMOD'24】 (Harvard) The Image Calculator: 10x Faster Image-AI Inference by Replacing JPEG with Self-designing Storage Format
- Image compression for faster data loading 【ICDE'23】 The Art of Losing to Win: Using Lossy Image Compression to Improve Data Loading in Deep Learning Pipelines
- Video stream processing 【VLDB'23】 (MIT) Extract-Transform-Load for Video Streams
4.2.2 Embedding Vectors and Tensors
- Embedding compression for categorical features 【SIGMOD'24】 (Peking University) CAFE: Towards Compact, Adaptive, and Fast Embedding for Large-scale Recommendation Models
- Survey of vector compression techniques in recommender systems and RAG 【VLDB'24】 (Peking University) Experimental Analysis of Large-scale Learnable Vector Storage Compression
- Distributed storage of embedding tables on persistent memory 【ICDE'23】 (4Paradigm, NUS, Alibaba, Intel) OpenEmbedding: A Distributed Parameter Server for Deep Learning Recommendation Models using Persistent Memory
- Large-scale, low-cost embedding mapping computation on persistent memory (CPU) 【VLDB'23, SDS】 (Tsinghua, Kuaishou) PetPS: Supporting Huge Embedding Models with Persistent Memory
- Sparse tensor representation 【SIGMOD'24】 (HKUST, Shanghai Jiao Tong University) STile: Searching Hybrid Sparse Formats for Sparse Deep Learning Operators Automatically
- Tensor representation in arbitrary formats 【SIGMOD'23】 (RelationalAI, University of Edinburgh, University of Washington) Optimizing Tensor Programs on Flexible Storage
- Tensor compression 【ICDE'24】 (Hunan University, Chinese Academy of Sciences) A Robust Low-rank Tensor Decomposition and Quantization based Compression Method
4.2.3 Model Compression
- Survey and new method 【VLDB'24】 (Virginia, Minnesota) Everything You Always Wanted to Know About Storage Compressibility of Pre-Trained ML Models but Were Afraid to Ask
- Adaptive search for model compression methods 【ICDE'24】 (Harbin Institute of Technology) AutoMC: Automated Model Compression Based on Domain Knowledge and Progressive Search
4.2.4 Other Data Types
- Activation management during LLM training to reduce memory requirements 【SIGMOD'25】 (Peking University, Tencent) MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training
- Gradient storage and analysis 【VLDB'24】 (WPI, MIT) MetaStore: Analyzing Deep Learning Meta-Data at Scale
4.2.5 Feature Store
- Feature store with update support 【VLDB'24】 (UCB) RALF: Accuracy-Aware Scheduling for Feature Store Maintenance
- Feature store query optimization 【VLDB'24, SDS】 (University of Chicago, Microsoft, LinkedIn) Optimizing Data Pipelines for Machine Learning in Feature Stores
V. More Efficient Implementations of Machine Learning Algorithms
5.1 Clustering
- k-means 【SIGMOD'24】 (Denmark, France) Settling Time vs. Accuracy Tradeoffs for Clustering Big Data
- k-means 【SIGMOD'24】 (Wuhan University, OceanBase) F3KM: Federated, Fair, and Fast k-means
- DBSCAN 【SIGMOD'24】 (USTC) Towards Metric DBSCAN: Exact, Approximate, and Streaming Algorithms
- k-multiple-means 【SIGMOD'24】 (Japan) Efficient Algorithm for K-Multiple-Means
【TBD】
5.2 Regression
- XGBoost 【SIGMOD'25】 (BUPT) SecureXGB: A Secure and Efficient Multi-party Protocol for Vertical Federated XGBoost
- GBDT 【SIGMOD'23】 (NUS) DeltaBoost: Gradient Boosting Decision Trees with Efficient Machine Unlearning
5.3 Others
- In-database training of tree models 【VLDB'23】 (Columbia University, Microsoft) JoinBoost: Grow Trees Over Normalized Data Using Only SQL
- Outlier detection 【SIGMOD'23】 (CMU) TOD: GPU-accelerated Outlier Detection via Tensor Operations
- Attribute recommendation 【SIGMOD'24】 (CUHK) Efficient Approximation Framework for Attribute Recommendation
- Frequent itemset mining 【SIGMOD'24】 (Spain, France) Language-Model Based Informed Partition of Databases to Speed Up Pattern Mining
- Markov decision processes 【VLDB'23, SDS】 (Denmark, Greece) SIFTER: Space-Efficient Value Iteration for Finite-Horizon MDPs
- Decision trees 【ICDE'24】 (Johns Hopkins University, Rice University) T-Rex (Tree-Rectangles): Reformulating Decision Tree Traversal as Hyperrectangle Enclosure
VI. Model Explainability
6.1 Causal Inference (Counterfactual Explanations)
- Provenance-enabled explainable AI 【SIGMOD'25】 (Alibaba, Georgetown University) Provenance-Enabled Explainable AI
- Classification boundary probing 【SIGMOD'24】 (WPI) FACET: Robust Counterfactual Explanation Analytics
- Client-side causal inference 【SIGMOD'24】 (University of Edinburgh) Counterfactual Explanation at Will, with Zero Privacy Leakage
- Explaining aggregate query results 【SIGMOD'24】 (Israel, MIT, Duke) Summarized Causal Explanations For Aggregate Views
- Feature-level causal explanation 【SIGMOD'24】 (University of Edinburgh) Relative Keys: Putting Feature Explanation into Context
6.2 Embedding Explanation
- Table embedding explanation 【VLDB'25】 (Michigan, University of Amsterdam) Observatory: Characterizing Embeddings of Relational Tables
- Table embedding explanation 【SIGMOD'24】 (Israel) TabEE: Tabular Embeddings Explanations
6.3 GNN Explanation
- 【SIGMOD'24】 (Zhejiang University, Aalborg University in Denmark) View-based Explanations for Graph Neural Networks
- 【VLDB'23】 (HKUST, HKPU) HENCE-X: Toward Heterogeneity-agnostic Multi-level Explainability for Deep Graph Networks
- 【VLDB'23】 (HKUST) On Data-Aware Global Explainability of Graph Neural Networks
6.4 Others
- CNN explanation 【VLDB'23】 (University of Waterloo, AT&T) POEM: Pattern-Oriented Explanations of Convolutional Neural Networks
VII. Others
7.1 Table Understanding/Question Answering
- Column attribute annotation based on table learning 【SIGMOD'24】 (Megagon Labs, USA) Watchog: A Light-weight Contrastive Learning based Framework for Column Annotation
- LLM-based table question answering 【VLDB'24】 (UW-Madison, Microsoft) ReAcTable: Enhancing ReAct for Table Question Answering
- Column semantic type annotation 【VLDB'23】 (HKUST) RECA: Related Tables Enhanced Column Semantic Type Annotation Framework
- Column type annotation 【ICDE'24】 (HKUST) KGLink: A column type annotation method that combines knowledge graph and pre-trained language model
- Reducing overfitting in table learning by regularizing pairwise column relationships 【SIGMOD'23】 Regularized Pairwise Relationship based Analytics for Structured Data
7.2 Model Selection
- Efficient training-free model selection inside the database 【VLDB'24】 (NUS, Zhejiang University, Duke) Database Native Model Selection: Harnessing Deep Neural Networks in Database Systems
- Model selection in service of transfer learning 【VLDB'23】 (ETHZ) SHiFT: An Efficient, Flexible Search Engine for Transfer Learning
- 【ICDE'24】 (Renmin University) A Two-Phase Recall-and-Select Framework for Fast Model Selection
7.3 Data Visualization
【TBD】