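DB4AI (Database for AI) refers to the use of database and data management techniques to improve performance across the entire AI pipeline, including up-front data preparation, training and inference acceleration, model cost reduction, and production deployment.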

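Now that model architectures have largely matured, data has become a key factor in model performance. "Garbage in, garbage out" captures the point: if the data does not meet the requirements of model training, even the best model cannot learn from it. Some researchers have gone further and proposed the concept of Data-centric AI, which shifts model development from a model-centric focus (parameter tuning) to a data-centric one that concentrates on how data affects the model. Concretely, this covers data acquisition and integration, data labeling, data cleaning and preparation, and data reduction and augmentation. The editor recommends the following two articles for reference.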

Jarrahi, Mohammad Hossein, Ali Memariani, and Shion Guha. "The principles of data-centric AI." Communications of the ACM 66.8 (2023): 84-92.
Zha, Daochen, et al. "Data-centric artificial intelligence: A survey." ACM Computing Surveys 57.5 (2025): 1-42.

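Database research focuses on data management, and many Data-centric AI problems, such as data integration, data cleaning, and label crowdsourcing, are themselves classic database problems. More importantly, database research is at its core about efficiency, including faster model training and inference and lower cost. Now that AI pipelines have largely matured, how to use database methods to improve the cost-effectiveness of model production and the user experience has become a hot topic.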

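Over the past three years (2022-2025), research interest in DB4AI at the top database conferences (SIGMOD, VLDB, ICDE) has risen year by year, and DB4AI papers account for a growing share of all published papers. For this article, the editor went through the relevant papers from these three conferences over that period and organized them by research direction. There are six main directions: data cleaning, data preparation, model training and inference acceleration (at the algorithmic level), multimodal data management, efficient machine learning algorithms, and explainability. Each direction is further broken down into specific research problems, and each problem lists recent top-conference papers, each with a brief summary of its research problem or technical approach, for readers' reference.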

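Given the limits of the editor's own ability, the directions and the papers listed under them may be incomplete; some gaps are marked explicitly and others are not. In particular, journal papers and papers from non-database conferences have not been collected. Readers are welcome to offer corrections and additions in the comments or by private message so that this article can be kept up to date.
Also, because the collection is not necessarily complete, no per-direction paper counts are given, to avoid misleading readers.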

I. Data Cleaning

1.1 Missing Value Imputation

Missing value imputation 【SIGMOD'24】 (UNSW) Missing Data Imputation with Uncertainty-Driven Network
【ICDE'24】 (Spain, Belgium) Mitigating Data Sparsity in Integrated Data through Text Conceptualization

1.2 Duplication Detection

Redundancy detection across tables 【SIGMOD'24】 (Italy, Germany) Determining the Largest Overlap between Tables
Impact of duplicate data on different models 【VLDB'24】 (IBM) How do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses

1.3 Imbalanced Class Distribution

Feature importance estimation 【SIGMOD'24】 (Tsinghua, Ant Group) FeatureLTE: Learning to Estimate Feature Importance
Impact of data distribution drift on model accuracy 【VLDB'24, SDS】 (Université Paris Cité) Efficiently Mitigating the Impact of Data Drift on Machine Learning Pipelines

1.4 Mislabel Detection

Data cleaning based on latent-space analysis 【VLDB'24】 (TUD, Germany) Generalizable Data Cleaning of Tabular Data in Latent Space
【ICDE'24】 (Beijing Normal University, ICT CAS) Label Noise Correction for Federated Learning: A Secure, Efficient and Reliable Realization
【ICDE'24】 (University of Arkansas) Contrastive Learning for Fraud Detection from Noisy Labels

1.5 Cleaning Pipeline

LLM-based data cleaning 【SIGMOD'25】 GEIL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models
Data cleaning + feature augmentation pipeline 【SIGMOD'25】 CtxPipe: Context-aware Data Preparation Pipeline Construction for Machine Learning
Assembling cleaning algorithms into an end-to-end data cleaning pipeline 【SIGMOD'24】 (Austria, Germany) SAGA: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications
Data preprocessing pipeline 【SIGMOD'23】 (RUC, BIT, Tsinghua) HAIPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation
【ICDE'24】 (Shenzhen Institute of Computing Sciences) BClean: A Bayesian Data Cleaning System

1.6 Data Integration (Entity Matching)

Entity matching 【SIGMOD'23】 (RUC, Tsinghua) Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration
【TBD】

1.7 Others

Data independence checking 【SIGMOD'24】 (Western University, UCSD) OTClean: Data Cleaning for Conditional Independence Violations using Optimal Transport
Multi-dimensional time series cleaning 【VLDB'24】 (HIT, Tsinghua) MTSClean: Efficient Constraint-based Cleaning for Multi-Dimensional Time Series Data
Multi-dimensional time series cleaning 【SIGMOD'25】 (BIT) Multivariate Time Series Cleaning under Speed Constraints
Denoising 【ICDE'24】 (HKUST) Triple-d: Denoising Distant Supervision for High-quality Data Creation
Data quality assessment 【ICDE'23】 (ETHZ, MSR) Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise

II. Data Preparation

2.1 Data Augmentation (Generation)

Generating LLM dialogue benchmarks from knowledge graphs and RAG 【SIGMOD'25】 (Canada) Dialogue Benchmark Generation from Knowledge Graphs with Cost-Effective Retrieval-Augmented LLMs
Constrained tabular data generation 【SIGMOD'24】 (RUC, Tsinghua) Controllable Tabular Data Synthesis Using Diffusion Models
Text-table pair data augmentation 【SIGMOD'24】 (France, Italy) Generation of Training Examples for Tabular Natural Language Inference
Time series generation benchmark 【VLDB'24】 (NUS) TSGBench: Time Series Generation Benchmark
Improving labeled data for weakly supervised learning 【VLDB'23】 (University of Washington) Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming
Tabular data generation with adversarial models 【VLDBJ'24】 (RUC, Tsinghua) Tabular data synthesis with generative adversarial networks: design space and optimizations
Reinforcement-learning-guided data generation 【ICDE'24】 (BIT, RUC) Mitigating Data Scarcity in Supervised Machine Learning through Reinforcement Learning Guided Data Generation
Tabular data synthesis with diffusion models 【ICDE'24】 (Netherlands, Switzerland) SiloFuse: Cross-silo Synthetic Data Generation with Latent Tabular Diffusion Models
Feature augmentation (one-to-many tables) 【ICDE'24】 (SFU) FeatAug: Automatic Feature Augmentation From One-to-Many Relationship Tables
Feature augmentation based on join paths 【ICDE'24】 (Netherlands) AutoFeat: Transitive Feature Discovery over Join Paths
Data generation for form-like documents 【ICDE'24】 (Google Research, JHU) FieldSwap: Data Augmentation for Effective Form-Like Document Extraction
Ambiguous data 【ICDE'23】 (Eurecom) Data Ambiguity Profiling for the Generation of Training Examples

2.2 Data Selection (Coreset selection, Data acquisition, active learning)

Selecting data to improve model confidence 【SIGMOD'24】 (York University, University of Toronto) Data Acquisition for Improving Model Confidence
Diversified coreset 【SIGMOD'24】 (UIUC, Cornell) Faster Algorithms for Fair Max-Min Diversification in R^d
Optimizing coreset selection 【VLDB'24】 (Australia) Optimizing Data Acquisition to Enhance Machine Learning Performance
Active learning for improving model fairness 【VLDB'24】 (KAIST, Georgia Tech) Falcon: Fair Active Learning using Multi-armed Bandits
Coreset selection 【SIGMOD'23】 (BIT, RUC, Tsinghua) GoodCore: Data-effective and Data-efficient Machine Learning through Coreset Selection over Incomplete Data
Feature augmentation via multi-table JOINs followed by coreset selection 【VLDB'23】 (Tsinghua) Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning
AutoML training acceleration 【VLDB'23, SDS】 (UCL, Israel) SubStrat: A Subset-Based Optimization Strategy for Faster AutoML
Active learning that tolerates annotator mistakes 【SIGMOD'23】 (Oregon State University, Eurecom) Exploratory Training: When Annotators Learn About Data
【ICDE'24】 (HKUST) Effective Data Selection and Replay for Unsupervised Continual Learning
【VLDB'21】 (York University, University of Toronto) Data acquisition for improving machine learning models

2.3 Data Extraction/Discovery

Extracting tabular data from text 【SIGMOD'24】 (University of Toronto) Unstructured Data Fusion for Schema and Data Extraction
Dataset search 【SIGMOD'24】 (University of Chicago) Solo: Data Discovery Using Natural Language Questions Via A Self-Supervised Approach
Extracting structured data from semi-structured web pages 【VLDB'23】 (Amazon) Self-Training for Label-Efficient Information Extraction from Semi-Structured Web-Pages
Dataset search 【VLDB'23】 (Northeastern University, Megagon Labs) Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning
【VLDB'23】 (Qatar) Cross Modal Data Discovery over Structured and Unstructured Data Lakes
Extracting tabular data from GitHub 【SIGMOD'23, short paper】 (University of Amsterdam) GitTables: A Large-Scale Corpus of Relational Tables
Multi-source data preprocessing and knowledge graph construction 【ICDE'24】 (Canada) KGLiDS: A Platform for Semantic Abstraction, Linking, and Automation of Data Science

2.4 Crowdsourcing

【TBD】

III. Improving Model Training/Inference (Algorithmic Level)

3.1 Inference Acceleration

3.1.1 End-to-End Acceleration

Accelerating inference by exploiting the stability of model outputs 【VLDB'24】 (CUHK) Biathlon: Harnessing Model Resilience for Accelerating ML Inference Pipelines
Inference framework for dynamic tensor shapes 【SIGMOD'24】 (Alibaba, RUC) BladeDISC: Optimizing Dynamic Shape Machine Learning Workloads via Compiler Approach
Speeding up ensemble inference by identifying redundancy 【ICDE'23】 (USTC) Efficient Deep Ensemble Inference via Query Difficulty-dependent Task Scheduling

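3.1.2 (Sparse) Matrix Multiplication Acceleration

Pruning-based acceleration of matrix multiplication for decoder-only models (reducing GPU memory traffic) 【VLDB'24】 (Alibaba, University of Sydney) Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
Sparse matrix chain multiplication 【SIGMOD'24】 (CUHK-Shenzhen, Huawei) On Efficient Large Sparse Matrix Chain Multiplication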

3.1.3 Transformer Acceleration

Time series Transformer computation 【SIGMOD'24】 (UPenn, MIT, Tsinghua) RITA: Group Attention is All You Need for Timeseries Analytics
Time series Transformer computation 【VLDB'24】 (HKBU, Saudi Arabia) DARKER: Efficient Transformer with Data-driven Attention Mechanism for Time Series
Multi-GPU Transformer training 【VLDB'23】 (CMU, PKU) Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism

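3.1.4 CPU Inference Acceleration

A new inference architecture for accelerating in-database (CPU) inference 【VLDB'24】 (HPI, UIC) InferDB: In-Database Machine Learning Inference Using Indexes
Model inference acceleration under resource constraints (accelerating tensor operations via DBMS CPU operations) 【VLDB'24】 (ZJU, Alibaba) SmartLite: A DBMS-Based Serving System for DNN Inference in Resource-Constrained Environments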

3.1.5 Others

Accelerating basic data preprocessing operations (image cropping, scaling, etc.) 【VLDB'24】 (UNIST, Korea) FusionFlow: Accelerating Data Preprocessing for Machine Learning with CPU-GPU Cooperation
CPU resource allocation during data preprocessing 【ICDE'24】 (Samsung) FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline
Spatio-temporal data preprocessing 【SIGMOD'23】 (NTU, Alibaba) ST4ML: Machine Learning Oriented Spatio-Temporal Data Processing at Scale

3.2 Training Acceleration

3.2.1 Distributed Training

Scheduling multi-model training across multiple GPUs 【VLDB'24】 (UCSD) Saturn: An Optimized Data System for Multi-Large-Model Deep Learning Workloads
Training across multiple cloud providers 【VLDB'24】 (TUM, University of Toronto) How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study
Reducing the communication cost of multi-GPU training on public clouds 【VLDB'23】 (JHU, AWS) MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud
Fault tolerance for distributed recommendation model training 【VLDB'23】 (CMU) Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding
Distributed training 【ICDE'24】 (ZJU) SparDL: Distributed Deep Learning Training with Efficient Sparse Communication
Reducing communication cost in distributed training 【ICDE'23】 (PKU) SK-Gradient: Efficient Communication for Distributed Machine Learning with Data Sketch

3.2.2 Recommendation Model Training (Embedding Handling)

Recommendation model training acceleration 【SIGMOD'23】 (CUHK, SUSTech, Meta) FEC: Efficient Deep Recommendation Model Training with Flexible Embedding Communication

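3.2.3 Others

Choosing training data and trigger frequency for model updates 【SIGMOD'25】 (ETHZ, Copenhagen, TUM) Modyn: Data-Centric Machine Learning Pipeline Orchestration
Accelerating accuracy calibration of quantized models 【VLDB'24】 (Aalborg University, Denmark; ECNU) Core: Data-Efficient, On-Device Continual Calibration for Quantized Models
Equivalence-based optimization of ML pipelines 【ICDE'24】 (Spain, Belgium, Greece) HYPPO: Using Equivalences to Optimize Pipelines in Exploratory Machine Learning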

3.3 AI Tooling Optimization

Pandas code rewriting 【SIGMOD'24】 (UIUC) Dias: Dynamic Rewriting of Pandas Code
Notebook checkpointing / stateful code migration 【VLDB'24】 (UIUC) ElasticNotebook: Enabling Live Migration for Computational Notebooks
Dataframe deduplication to reduce memory requirements 【VLDB'24】 (UW-Madison, CMU) SplitDF: Splitting Dataframes for Memory-Efficient Data Analysis
【VLDB'23】 (UCB, UIUC, UPenn) Bolt-on, Compact, and Rapid Program Slicing for Notebooks
【ICDE'24】 (University of Edinburgh) PyTond: Efficient Python Data Science on the Shoulders of Databases

3.4 GNN Training Acceleration

Temporal GNN 【SIGMOD'24】 (HKUST) SIMPLE: Efficient Temporal Graph Neural Network Training at Scale with Dynamic Data Placement
Multi-GPU training 【SIGMOD'24】 (NUS) HongTu: Scalable Full-Graph GNN Training on Multiple GPUs
Distributed GNN training 【VLDB'24】 (Northeastern University) DynaHB: A Communication-Avoiding Asynchronous Distributed Framework with Hybrid Batches for Dynamic GNN Training
Dynamic graphs 【SIGMOD'24】 (Japan) DGC: Training Dynamic Graphs with Spatio-Temporal Non-Uniformity using Graph
Streaming GNN 【VLDB'24】 (University of Warwick) D3-GNN: Dynamic Distributed Dataflow for Streaming Graph Neural Networks
Data- and hardware-aware execution planning 【VLDB'24】 (HKUST) DAHA: Accelerating GNN Training with Data and Hardware Aware Execution Planning
Hardware-accelerated GNN operations 【VLDB'24】 (UIUC, NVIDIA) Accelerating Sampling and Aggregation Operations in GNN Frameworks with GPU Initiated Direct Storage Accesses
Reducing memory requirements 【VLDB'24】 (Tsinghua, NYU, Amazon) FreshGNN: Reducing Memory Access via Stable Historical Embeddings for Graph Neural Network Training
【TBD】

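IV. Multimodal Data Management

4.1 Multimodal Data Retrieval

4.1.1 Retrieval Paradigms

Semantic alignment and query optimization based on user feedback 【SIGMOD'24】 (Cornell) ThalamusDB: Approximate Query Processing on Multi-Modal Data
Joint query optimization over text and tables 【VLDB'24】 (TUD, Germany) ELEET: Efficient Learned Query Execution over Text and Tables
Refining query embeddings with user feedback to improve the semantic accuracy of image search 【SIGMOD'23】 (MIT) SeeSaw: Interactive Ad-hoc Search Over Image Databases
KNN where distance computation requires invoking a model (accelerated with a small proxy model) 【VLDB'23】 (UBC, CNRS) On Efficient Approximate Queries over Machine Learning Models
Semantic retrieval under arbitrary query semantics 【VLDB'23】 (University of Munich, University of Copenhagen) Fast Search-by-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests
Semantic retrieval based on small proxy models 【SIGMOD'22】 (Stanford) TASTI: Semantic Indexes for Machine Learning-based Queries over Unstructured Data
Multi-metric space search 【ICDE'24】 (ZJU) HJG: An Effective Hierarchical Joint Graph for ANNS in Multi-Metric Spaces
Image retrieval combining images and text 【ICDE'24】 (ZJU, Hangzhou Dianzi University) MUST: An Effective and Scalable Framework for Multimodal Search of Target Modality
Answering queries jointly over images and graphs 【ICDE'24】 (浙科, Newcastle, BIT) Across Images and Graphs for Question Answering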

4.1.2 Vector Nearest Neighbor Search

4.1.2.1 New Index Designs

Index combining quantization and graphs 【SIGMOD'25】 (NTU) SymphonyQG: towards Symphonious Integration of Quantization and Graph for Approximate Nearest Neighbor Search
Hash index 【SIGMOD'25】 (CAS, Université Paris Cité) Subspace Collision: An Efficient and Accurate Framework for High-dimensional Approximate Nearest Neighbor Search
Vector-quantization-based index 【SIGMOD'24】 (NTU) RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound for Approximate Nearest Neighbor Search
Disk-resident graph index 【SIGMOD'24】 (ZJU, Zilliz, Hangzhou Dianzi University) Starling: An I/O-Efficient Disk-Resident Graph Index Framework for High-Dimensional Vector Similarity Search on Data Segment
Vector index combining trees and graphs 【VLDB'24】 (HIT) DIDS: Double Indices and Double Summarizations for Fast Similarity Search
Graph index 【SIGMOD'23】 (Guangzhou University, HKBU) Efficient Approximate Nearest Neighbor Search in Multi-dimensional Databases
LSH + graph index 【VLDB'23】 (HKUST) Towards Efficient Index Construction and Approximate Nearest Neighbor Search in High-Dimensional Spaces
Tree + graph index for large datasets 【VLDB'23, SDS】 (Morocco, Université Paris Cité) Elpis: Graph-Based Similarity Search for Scalable Data Science
LSH index 【VLDB'23】 (University of Florida) LIDER: an efficient high-dimensional learned index for large-scale dense passage retrieval

4.1.2.2 Query Analysis and Surveys

Hardness definition and benchmark generation for graph-based ANN 【VLDB'25】 (Fudan, Université Paris Cité) Steiner-Hardness: A Query Hardness Measure for Graph-Based ANN Indexes
Graph index survey 【SIGMOD'25】 (Morocco, Université Paris Cité) Graph-Based Vector Search: An Experimental Evaluation of the State-of-the-Art
Vector databases 【ICDE'24】 (Purdue) Are There Fundamental Limitations in Supporting Vector Data Management in Relational Databases? A Case Study of PostgreSQL
Parameter tuning 【ICDE'24】 (Nankai, Ant Group) VDTuner: Automated Performance Tuning for Vector Data Management Systems

4.1.2.3 Query Optimization Algorithms

Query pruning based on the triangle inequality 【SIGMOD'25】 (RUC, Tsinghua) Tribase: A Vector Data Query Engine for Reliable and Lossless Pruning Compression using Triangle Inequalities
Graph index pruning 【SIGMOD'23】 (NTU) High-Dimensional Approximate Nearest Neighbor Search: with Reliable and Efficient Distance Comparison Operations
Scalar quantization 【VLDB'23】 (Intel) Similarity search in the blink of an eye with compressed indices
Optimizing guided quantization for disk-based indexes 【ICDE'24】 (Hangzhou Dianzi University) Routing-Guided Learned Product Quantization for Graph-Based Approximate Nearest Neighbor Search
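
As a rough illustration of the triangle-inequality pruning idea that appears in this list, the sketch below skips exact distance computations during a brute-force nearest neighbor scan. It is a generic single-pivot lower bound, not the algorithm of any paper listed here; the function names (`build_pivot_table`, `nn_with_pruning`), the single pivot, and the toy data are all assumptions made for illustration.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_pivot_table(data, pivot):
    """Precompute d(x, pivot) for every data vector (done once, offline)."""
    return [dist(x, pivot) for x in data]

def nn_with_pruning(query, data, pivot, pivot_dists):
    """Exact 1-NN scan that skips vectors ruled out by the triangle inequality.

    For any x: d(q, x) >= |d(q, p) - d(x, p)|. If this lower bound is
    already >= the best distance found so far, x cannot improve the result.
    Returns (best_index, best_distance, exact_distance_computations).
    """
    d_qp = dist(query, pivot)
    best_idx, best_dist, computed = -1, math.inf, 0
    for i, x in enumerate(data):
        lower_bound = abs(d_qp - pivot_dists[i])
        if lower_bound >= best_dist:
            continue  # pruned without computing d(q, x)
        d = dist(query, x)
        computed += 1
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx, best_dist, computed

# Toy data (hypothetical): four 2-D points, with the first point as the pivot.
data = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
pivot = (0.0, 0.0)
pivot_dists = build_pivot_table(data, pivot)
idx, d, computed = nn_with_pruning((0.9, 0.0), data, pivot, pivot_dists)
print(idx, computed)
```

On this toy input the two far-away points are pruned by the bound alone, so only two of the four exact distances are computed. The pruning is lossless because the bound never excludes a vector that could be strictly closer than the current best.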

4.1.2.4 Filter Constraints

Vector search under arbitrary predicates 【SIGMOD'25】 (Fudan) Navigating Labels and Vectors: A Unified Approach to Filtered Approximate Nearest Neighbor Search
Vector queries under scalar range constraints 【SIGMOD'25】 (NTU) iRangeGraph: Improvising Range-dedicated Graphs for Range-filtering Nearest Neighbor Search
Vector queries under scalar range constraints 【SIGMOD'24】 (Rutgers, Alibaba) Range-Filtering Approximate Nearest Neighbor Search
Vector queries under arbitrary scalar predicates 【SIGMOD'24】 (Stanford, UCB) ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data
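
To make the filtered-search setting in this list concrete, the sketch below contrasts two baseline strategies using a brute-force scan: pre-filtering (apply the scalar predicate, then rank by distance) and post-filtering (rank by distance, then drop non-matching results). It is only a baseline illustration, not the method of any paper listed here; the record layout (`id`, `vec`, `price`), the function names, and the `fetch` parameter are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pre_filter_knn(query, records, predicate, k):
    """Apply the scalar predicate first, then take the k nearest survivors."""
    candidates = [r for r in records if predicate(r)]
    return sorted(candidates, key=lambda r: dist(query, r["vec"]))[:k]

def post_filter_knn(query, records, predicate, k, fetch):
    """Take the `fetch` nearest vectors first, then drop those failing the predicate.

    May return fewer than k results when the predicate is selective.
    """
    nearest = sorted(records, key=lambda r: dist(query, r["vec"]))[:fetch]
    return [r for r in nearest if predicate(r)][:k]

def in_range(record):
    """Scalar range predicate on the hypothetical `price` attribute."""
    return 10 <= record["price"] <= 100

# Toy records (hypothetical): a vector plus one scalar attribute each.
records = [
    {"id": 0, "vec": (0.0, 0.0), "price": 50},
    {"id": 1, "vec": (0.1, 0.0), "price": 500},
    {"id": 2, "vec": (0.2, 0.0), "price": 700},
    {"id": 3, "vec": (5.0, 5.0), "price": 20},
]
query = (0.0, 0.0)

pre = [r["id"] for r in pre_filter_knn(query, records, in_range, k=2)]
post = [r["id"] for r in post_filter_knn(query, records, in_range, k=2, fetch=2)]
print(pre, post)
```

On this toy input, post-filtering with `fetch=2` finds only one matching record while pre-filtering finds two. Pre-filtering, in turn, has to scan every record that passes the predicate, and this tension is broadly the trade-off that filtered-search indexes try to avoid.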

4.1.2.5 Vector Search in Special Environments/Conditions

ANN with dual-vector representations 【SIGMOD'25】 (NTU) DEG: Efficient Hybrid Vector Search Using the Dynamic Edge Navigation Graph
Cross-modal ANN 【VLDB'24】 (Fudan) RoarGraph: A Projected Bipartite Graph for Efficient Cross-Modal Approximate Nearest Neighbor Search
Vector database in the cloud 【SIGMOD'24】 (Purdue, MSR) Vexless: A Serverless Vector Data Management System Using Cloud Functions
Near-duplicate detection in text 【SIGMOD'25】 (Rutgers) Near-Duplicate Text Alignment with One Permutation Hashing
Near-duplicate detection in text 【SIGMOD'23】 (Rutgers) Near-Duplicate Sequence Search at Scale for Neural Language Model Memorization Evaluation
Secure federated vector search 【SIGMOD'24】 (HKBU) FedKNN: Secure Federated k-Nearest Neighbor Search
Tree-embedding-based similarity search under arbitrary metrics 【SIGMOD'23】 (Beihang, HKUST) LiteHST: A Tree Embedding based Method for Similarity Search
Inner product search 【VLDB'23】 (HUST, Zilliz, HKUST) FARGO: Fast Maximum Inner Product Search via Global Multi-Probing
High-dimensional metric space search 【VLDB'23】 (Greece, Denmark) Adaptive Indexing in High-Dimensional Metric Spaces
Inner product search 【ICDE'24】 (CAS Shenzhen) Reconsidering Tree based Methods for k-Maximum Inner-Product Search: The LRUS-CoverTree
Graph construction on GPUs 【ICDE'24】 (NVIDIA) CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs
Reverse KNN 【ICDE'24】 (Jiao Tong University, Alibaba) Efficient Reverse k Approximate Nearest Neighbor Search over High-Dimensional Vectors
Reverse KNN 【ICDE'24】 (HKBU, SUSTech, Huawei) QSRP: Efficient Reverse k-Ranks Query Processing on High-dimensional Embeddings
Inner product search over sparse vectors 【ICDE'24】 (HKUST, HUST, Macau) Efficient Approximate Maximum Inner Product Search over Sparse Vectors

4.1.3 Retrieval-Augmented Generation (RAG)

Overlapping vector DB retrieval with the prefill stage to improve RAG efficiency 【SIGMOD'25】 (HUST) AquaPipe: A Quality-Aware Pipeline for Knowledge Retrieval and Large Language Models
Accelerating the RAG pipeline with new hardware 【VLDB'25】 (ETHZ) Chameleon: a Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models

4.1.4 Video Data Queries

Video representation and query optimization 【VLDB'25】 (BIT) TVM: A Tile-based Video Management Framework
Video query acceleration 【VLDB'23】 (Georgia Tech, Adobe) Seiden: Revisiting Query Processing in Video Database Systems
Complex video query optimization 【VLDB'23】 (Stanford) Optimizing Video Analytics with Declarative Model Relationships
Video queries under fuzzy semantics 【VLDB'23】 (University of Washington) EQUI-VOCAL: Synthesizing Queries for Compositional Video Events from Limited User Interactions
Location-aware video queries 【VLDB'25】 (UCB) Spatialyze: A Geospatial Video Analytics System with Spatial-Aware Optimizations
Video query optimization using user-provided clues 【VLDB'25】 (University of Toronto, York University) Optimizing Video Queries with Declarative Clues
Optimizing video LIMIT queries 【VLDB'25】 (Michigan, USC, MIT) Optimizing Video Selection LIMIT Queries With Commonsense Knowledge
On-device video analytics 【ICDE'24】 (Sun Yat-sen University) COUPLE: Orchestrating Video Analytics on Heterogeneous Mobile Processors
【ICDE'23】 (University of Toronto, York University) Track Merging for Effective Video Query Processing
【ICDE'23】 (University of Toronto, York University) Marshalling Model Inference in Video Streams

4.2 AI-Oriented Multimodal Data Storage

4.2.1 Images/Video

Image format for model training 【SIGMOD'24】 (Harvard) The Image Calculator: 10x Faster Image-AI Inference by Replacing JPEG with Self-designing Storage Format
Image compression to improve data loading efficiency 【ICDE'23】 The Art of Losing to Win: Using Lossy Image Compression to Improve Data Loading in Deep Learning Pipelines
Video stream processing 【VLDB'23】 (MIT) Extract-Transform-Load for Video Streams

4.2.2 Embeddings and Tensors

Embedding compression for categorical features 【SIGMOD'24】 (PKU) CAFE: Towards Compact, Adaptive, and Fast Embedding for Large-scale Recommendation Models
Survey of vector compression techniques in recommender systems and RAG 【VLDB'24】 (PKU) Experimental Analysis of Large-scale Learnable Vector Storage Compression
Distributed storage of embedding tables on persistent memory 【ICDE'23】 (4Paradigm, NUS, Alibaba, Intel) OpenEmbedding: A Distributed Parameter Server for Deep Learning Recommendation Models using Persistent Memory
Large-scale, low-cost embedding mapping on persistent memory (CPU) 【VLDB'23, SDS】 (Tsinghua, Kuaishou) PetPS: Supporting Huge Embedding Models with Persistent Memory
Sparse tensor representation 【SIGMOD'24】 (HKUST, SJTU) STile: Searching Hybrid Sparse Formats for Sparse Deep Learning Operators Automatically
Tensor representation in arbitrary formats 【SIGMOD'23】 (RelationalAI, University of Edinburgh, University of Washington) Optimizing Tensor Programs on Flexible Storage
Tensor compression 【ICDE'24】 (Hunan University, CAS networking institute) A Robust Low-rank Tensor Decomposition and Quantization based Compression Method

4.2.3 Model Compression

Survey and new method 【VLDB'24】 (Virginia, Minnesota) Everything You Always Wanted to Know About Storage Compressibility of Pre-Trained ML Models but Were Afraid to Ask
Adaptive search for model compression methods 【ICDE'24】 (HIT) AutoMC: Automated Model Compression Based on Domain Knowledge and Progressive Search

4.2.4 Other Data Types

Activation management during LLM training to reduce memory requirements 【SIGMOD'25】 (PKU, Tencent) MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training
Gradient storage and analysis 【VLDB'24】 (WPI, MIT) MetaStore: Analyzing Deep Learning Meta-Data at Scale

4.2.5 Feature Store

Feature store with update support 【VLDB'24】 (UCB) RALF: Accuracy-Aware Scheduling for Feature Store Maintenance
Feature store query optimization 【VLDB'24, SDS】 (University of Chicago, Microsoft, LinkedIn) Optimizing Data Pipelines for Machine Learning in Feature Stores

V. More Efficient Machine Learning Algorithm Implementations

5.1 Clustering

k-means 【SIGMOD'24】 (Denmark, France) Settling Time vs. Accuracy Tradeoffs for Clustering Big Data
k-means 【SIGMOD'24】 (Wuhan University, OceanBase) F3KM: Federated, Fair, and Fast k-means
DBSCAN 【SIGMOD'24】 (USTC) Towards Metric DBSCAN: Exact, Approximate, and Streaming Algorithms
k-multiple-means 【SIGMOD'24】 (Japan) Efficient Algorithm for K-Multiple-Means
【TBD】

5.2 Regression

XGBoost 【SIGMOD'25】 (BUPT) SecureXGB: A Secure and Efficient Multi-party Protocol for Vertical Federated XGBoost
GBDT 【SIGMOD'23】 (NUS) DeltaBoost: Gradient Boosting Decision Trees with Efficient Machine Unlearning

5.3 Others

In-database training of tree models 【VLDB'23】 (Columbia, Microsoft) JoinBoost: Grow Trees Over Normalized Data Using Only SQL
Anomaly detection 【SIGMOD'23】 (CMU) TOD: GPU-accelerated Outlier Detection via Tensor Operations
Attribute recommendation 【SIGMOD'24】 (CUHK) Efficient Approximation Framework for Attribute Recommendation
Frequent itemset mining 【SIGMOD'24】 (Spain, France) Language-Model Based Informed Partition of Databases to Speed Up Pattern Mining
Markov decision processes 【VLDB'23, SDS】 (Denmark, Greece) SIFTER: Space-Efficient Value Iteration for Finite-Horizon MDPs
Decision trees 【ICDE'24】 (JHU, Rice University) T-Rex (Tree-Rectangles): Reformulating Decision Tree Traversal as Hyperrectangle Enclosure

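VI. Model Explainability

6.1 Causal Inference (Counterfactual Explanations)

Provenance-enabled explainable AI 【SIGMOD'25】 (Alibaba, Georgetown University) Provenance-Enabled Explainable AI
Probing classification boundaries 【SIGMOD'24】 (WPI) FACET: Robust Counterfactual Explanation Analytics
Client-side causal inference 【SIGMOD'24】 (University of Edinburgh) Counterfactual Explanation at Will, with Zero Privacy Leakage
Explaining aggregate query results 【SIGMOD'24】 (Israel, MIT, Duke) Summarized Causal Explanations For Aggregate Views
Feature-level causal explanations 【SIGMOD'24】 (University of Edinburgh) Relative Keys: Putting Feature Explanation into Context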

6.2 Embedding Explanation

Table embedding explanation 【VLDB'25】 (Michigan, University of Amsterdam) Observatory: Characterizing Embeddings of Relational Tables
Table embedding explanation 【SIGMOD'24】 (Israel) TabEE: Tabular Embeddings Explanations

6.3 GNN Explanation

【SIGMOD'24】 (ZJU, Aalborg University, Denmark) View-based Explanations for Graph Neural Networks
【VLDB'23】 (HKUST, HKPU) HENCE-X: Toward Heterogeneity-agnostic Multi-level Explainability for Deep Graph Networks
【VLDB'23】 (HKUST) On Data-Aware Global Explainability of Graph Neural Networks

6.4 Others

CNN explanation 【VLDB'23】 (University of Waterloo, AT&T) POEM: Pattern-Oriented Explanations of Convolutional Neural Networks

VII. Others

7.1 Table Understanding / Question Answering

Column annotation based on table learning 【SIGMOD'24】 (Megagon Labs, USA) Watchog: A Light-weight Contrastive Learning based Framework for Column Annotation
LLM-based table question answering 【VLDB'24】 (UW-Madison, Microsoft) ReAcTable: Enhancing ReAct for Table Question Answering
Column semantic type annotation 【VLDB'23】 (HKUST) RECA: Related Tables Enhanced Column Semantic Type Annotation Framework
Column type annotation 【ICDE'24】 (HKUST) KGLink: A column type annotation method that combines knowledge graph and pre-trained language model
Reducing overfitting in tabular learning by regularizing pairwise column relationships 【SIGMOD'23】 Regularized Pairwise Relationship based Analytics for Structured Data

7.2 Model Selection

Efficient training-free model selection inside the database 【VLDB'24】 (NUS, ZJU, Duke) Database Native Model Selection: Harnessing Deep Neural Networks in Database Systems
Model selection for transfer learning 【VLDB'23】 (ETHZ) SHiFT: An Efficient, Flexible Search Engine for Transfer Learning
【ICDE'24】 (RUC) A Two-Phase Recall-and-Select Framework for Fast Model Selection

7.3 Data Visualization