DB4AI Research Directions and Paper Roundup (Top Database Venues 2023-2025: SIGMOD, VLDB, ICDE)

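DB4AI (Database for AI) applies database and data management techniques to improve performance across the entire AI pipeline, including upfront data preparation, training and inference acceleration, model cost reduction, and production deployment.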

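Now that model architectures have largely matured, data has become a key factor in model performance. "Garbage in, garbage out" means that if the data does not meet the requirements of training, even the best model cannot learn useful knowledge. Some researchers have gone further and proposed Data-centric AI, which shifts model development from a model-centric focus (parameter tuning) to a data-centric one that concentrates on how data affects the model. Concretely, this covers data acquisition and integration, data labeling, data cleaning and preparation, and data reduction and augmentation. The editor recommends the following two articles for reference.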

  • Jarrahi, Mohammad Hossein, Ali Memariani, and Shion Guha. "The principles of data-centric ai." Communications of the ACM 66.8 (2023): 84-92
  • Zha, Daochen, et al. "Data-centric artificial intelligence: A survey." ACM Computing Surveys 57.5 (2025): 1-42.

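Database research focuses on data management, and many Data-centric AI problems, such as data integration, data cleaning, and label crowdsourcing, are themselves classic database problems. More importantly, efficiency is at the core of database research: speeding up model training and inference and reducing cost. Now that the AI pipeline has largely matured, how to use database methods to improve the cost-effectiveness of model production and the user experience has become a hot topic.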

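At the three top database conferences (SIGMOD, VLDB, ICDE) over the past three years (2023-2025), research interest in DB4AI has grown year over year, and its share of all published papers has kept increasing. In this article, the editor went through the relevant papers from these three conferences and organized them by research direction into six main areas: data cleaning, data preparation, model training and inference acceleration (algorithmic level), multimodal data management, efficient machine learning algorithms, and explainability. Each area is broken down into specific research problems, each problem lists recent top-venue papers, and each paper comes with a brief summary of its research problem or technique for the reader's reference.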

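Given the limits of the editor's own knowledge, the areas and the papers listed under them are likely incomplete. Some gaps are marked explicitly and others are not; in particular, journal papers and papers from non-database conferences have not been systematically covered. Readers are welcome to send corrections and additions in the comments or by private message so that this article can keep being updated.
Also, because the collection may not be comprehensive, no per-area paper counts are given, to avoid misleading readers.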

1. Data Cleaning

1.1 Missing Value Imputation

  • Missing value imputation [SIGMOD'24] (UNSW) Missing Data Imputation with Uncertainty-Driven Network
  • [ICDE'24] (Spain, Belgium) Mitigating Data Sparsity in Integrated Data through Text Conceptualization
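
For readers new to the imputation problem, the sketch below shows the simplest baseline that learned methods are usually compared against: filling each missing cell with its column mean. It is an illustration only and does not reflect the methods of the papers listed above; the function name `impute_column_means` and the toy table are hypothetical.

```python
def impute_column_means(rows):
    """Return a copy of `rows` where each None is replaced by its column mean.

    Baseline only: a column with no observed values stays None.
    """
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        means.append(sum(observed) / len(observed) if observed else None)
    # Build new rows so the input table is left untouched.
    return [[means[c] if v is None else v for c, v in enumerate(r)] for r in rows]


# Toy table (hypothetical data): two numeric columns with one gap each.
table = [[1.0, 10.0], [None, 20.0], [3.0, None]]
filled = impute_column_means(table)
```

Mean imputation ignores correlations between columns and gives no estimate of uncertainty, which is the gap that model-based imputation methods aim to close.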

1.2 Duplication Detection

  • Cross-table redundancy detection [SIGMOD'24] (Italy, Germany) Determining the Largest Overlap between Tables
  • Impact of duplicate data on different models [VLDB'24] (IBM) How do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses

1.3 Imbalanced Class Distribution

  • Feature importance estimation [SIGMOD'24] (Tsinghua, Ant Group) FeatureLTE: Learning to Estimate Feature Importance
  • Impact of data distribution drift on model accuracy [VLDB'24, SDS] (Université Paris Cité) Efficiently Mitigating the Impact of Data Drift on Machine Learning Pipelines

1.4 Mislabel Detection

  • Data cleaning via latent-space analysis [VLDB'24] (TUD, Germany) Generalizable Data Cleaning of Tabular Data in Latent Space
  • [ICDE'24] (Beijing Normal University, ICT CAS) Label Noise Correction for Federated Learning: A Secure, Efficient and Reliable Realization
  • [ICDE'24] (University of Arkansas) Contrastive Learning for Fraud Detection from Noisy Labels

1.5 Cleaning Pipeline

  • LLM-based data cleaning [SIGMOD'25] GEIL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models
  • Data cleaning + feature augmentation pipeline [SIGMOD'25] CtxPipe: Context-aware Data Preparation Pipeline Construction for Machine Learning
  • Composing cleaning algorithms into an end-to-end data cleaning pipeline [SIGMOD'24] (Austria, Germany) SAGA: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications
  • Data preprocessing pipeline [SIGMOD'23] (Renmin University, BIT, Tsinghua) HAIPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation
  • [ICDE'24] (Shenzhen Institute of Computing Sciences) BClean: A Bayesian Data Cleaning System

1.6 Data Integration (Entity matching)

  • Entity matching [SIGMOD'23] (Renmin University, Tsinghua) Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration
    [TBD]

1.7 Others

  • Conditional independence violation repair [SIGMOD'24] (Western University, UCSD) OTClean: Data Cleaning for Conditional Independence Violations using Optimal Transport
  • Multi-dimensional time series cleaning [VLDB'24] (HIT, Tsinghua) MTSClean: Efficient Constraint-based Cleaning for Multi-Dimensional Time Series Data
  • Multi-dimensional time series cleaning [SIGMOD'25] (BIT) Multivariate Time Series Cleaning under Speed Constraints
  • Denoising [ICDE'24] (HKUST) Triple-d: Denoising Distant Supervision for High-quality Data Creation
  • Data quality assessment [ICDE'23] (ETHZ, MSR) Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise

2. Data Preparation

2.1 Data Augmentation (Generation)

  • Knowledge-graph- and RAG-based LLM dialogue benchmark generation [SIGMOD'25] (Canada) Dialogue Benchmark Generation from Knowledge Graphs with Cost-Effective Retrieval-Augmented LLMs
  • Constraint-conditioned tabular data generation [SIGMOD'24] (Renmin University, Tsinghua) Controllable Tabular Data Synthesis Using Diffusion Models
  • Text-table pair data augmentation [SIGMOD'24] (France, Italy) Generation of Training Examples for Tabular Natural Language Inference
  • Time series generation benchmark [VLDB'24] (NUS) TSGBench: Time Series Generation Benchmark
  • Improving labeled data for weak supervision [VLDB'23] (University of Washington) Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming
  • GAN-based tabular data generation [VLDBJ'24] (Renmin University, Tsinghua) Tabular data synthesis with generative adversarial networks: design space and optimizations
  • Reinforcement-learning-guided data generation [ICDE'24] (BIT, Renmin University) Mitigating Data Scarcity in Supervised Machine Learning through Reinforcement Learning Guided Data Generation
  • Tabular data synthesis with diffusion models [ICDE'24] (Netherlands, Switzerland) SiloFuse: Cross-silo Synthetic Data Generation with Latent Tabular Diffusion Models
  • Feature augmentation (one-to-many tables) [ICDE'24] (SFU) FeatAug: Automatic Feature Augmentation From One-to-Many Relationship Tables
  • Join-path-based feature augmentation [ICDE'24] (Netherlands) AutoFeat: Transitive Feature Discovery over Join Paths
  • Form-like document data generation [ICDE'24] (Google Research, JHU) FieldSwap: Data Augmentation for Effective Form-Like Document Extraction
  • Ambiguous data [ICDE'23] (Eurecom) Data Ambiguity Profiling for the Generation of Training Examples

2.2 Data Selection (Coreset selection, Data acquisition, active learning)

  • Selecting data to improve model confidence [SIGMOD'24] (York University, University of Toronto) Data Acquisition for Improving Model Confidence
  • Diversified coreset [SIGMOD'24] (UIUC, Cornell) Faster Algorithms for Fair Max-Min Diversification in R^d
  • Optimizing coreset selection [VLDB'24] (Australia) Optimizing Data Acquisition to Enhance Machine Learning Performance
  • Active learning for model fairness [VLDB'24] (KAIST, Georgia Tech) Falcon: Fair Active Learning using Multi-armed Bandits
  • Coreset selection [SIGMOD'23] (BIT, Renmin University, Tsinghua) GoodCore: Data-effective and Data-efficient Machine Learning through Coreset Selection over Incomplete Data
  • Feature augmentation via multi-table JOIN plus coreset selection [VLDB'23] (Tsinghua) Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning
  • AutoML training acceleration [VLDB'23, SDS] (UCL, Israel) SubStrat: A Subset-Based Optimization Strategy for Faster AutoML
  • Active learning that tolerates annotator mistakes [SIGMOD'23] (Oregon State University, Eurecom) Exploratory Training: When Annotators Learn About Data
  • [ICDE'24] (HKUST) Effective Data Selection and Replay for Unsupervised Continual Learning
  • [VLDB'21] (York University, University of Toronto) Data acquisition for improving machine learning models

2.3 Data Extraction/Discovery

  • Extracting tabular data from text [SIGMOD'24] (University of Toronto) Unstructured Data Fusion for Schema and Data Extraction
  • Dataset search [SIGMOD'24] (University of Chicago) Solo: Data Discovery Using Natural Language Questions Via A Self-Supervised Approach
  • Extracting structured data from semi-structured web pages [VLDB'23] (Amazon) Self-Training for Label-Efficient Information Extraction from Semi-Structured Web-Pages
  • Dataset search [VLDB'23] (Northeastern University, Megagon Labs) Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning
  • [VLDB'23] (Qatar) Cross Modal Data Discovery over Structured and Unstructured Data Lakes
  • Extracting tabular data from GitHub [SIGMOD'23, short paper] (University of Amsterdam) GitTables: A Large-Scale Corpus of Relational Tables
  • Multi-source data preprocessing and knowledge graph construction [ICDE'24] (Canada) KGLiDS: A Platform for Semantic Abstraction, Linking, and Automation of Data Science

2.4 Crowdsourcing

[TBD]

3. Improving Model Training/Inference (Algorithmic level)

3.1 Inference Acceleration

3.1.1 End-to-End Acceleration

  • Accelerating inference by exploiting the stability of model outputs [VLDB'24] (CUHK) Biathlon: Harnessing Model Resilience for Accelerating ML Inference Pipelines
  • Inference framework for dynamic tensor shapes [SIGMOD'24] (Alibaba, Renmin University) BladeDISC: Optimizing Dynamic Shape Machine Learning Workloads via Compiler Approach
  • Inference acceleration in ensemble learning via redundancy identification [ICDE'23] (USTC) Efficient Deep Ensemble Inference via Query Difficulty-dependent Task Scheduling

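3.1.2 (Sparse) Matrix Multiplication Acceleration

  • Pruning-based matrix multiplication acceleration for decoder-only models (reducing GPU memory traffic) [VLDB'24] (Alibaba, University of Sydney) Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
  • Sparse matrix chain multiplication [SIGMOD'24] (CUHK-Shenzhen, Huawei) On Efficient Large Sparse Matrix Chain Multiplication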

3.1.3 Transformer Acceleration

  • Time-series Transformer computation [SIGMOD'24] (UPenn, MIT, Tsinghua) RITA: Group Attention is All You Need for Timeseries Analytics
  • Time-series Transformer computation [VLDB'24] (HKBU, Saudi Arabia) DARKER: Efficient Transformer with Data-driven Attention Mechanism for Time Series
  • Multi-GPU Transformer training [VLDB'23] (CMU, Peking University) Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism

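3.1.4 CPU Inference Acceleration

  • New inference architecture for accelerating in-database (CPU) inference [VLDB'24] (HPI, UIC) InferDB: In-Database Machine Learning Inference Using Indexes
  • Inference acceleration under resource constraints (speeding up tensor operations expressed as database CPU operations) [VLDB'24] (Zhejiang University, Alibaba) SmartLite: A DBMS-Based Serving System for DNN Inference in Resource-Constrained Environments

3.1.5 Others

  • Accelerating basic data preprocessing operations (image cropping, scaling, etc.) [VLDB'24] (UNIST, Korea) FusionFlow: Accelerating Data Preprocessing for Machine Learning with CPU-GPU Cooperation
  • CPU resource allocation for data preprocessing [ICDE'24] (Samsung) FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline
  • Spatio-temporal data preprocessing [SIGMOD'23] (NTU, Alibaba) ST4ML: Machine Learning Oriented Spatio-Temporal Data Processing at Scale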

3.2 Training Acceleration

3.2.1 Distributed Training

  • Multi-GPU, multi-model training scheduling [VLDB'24] (UCSD) Saturn: An Optimized Data System for Multi-Large-Model Deep Learning Workloads
  • Training across multiple cloud providers [VLDB'24] (TUM, University of Toronto) How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study
  • Reducing communication cost of multi-GPU training on public clouds [VLDB'23] (JHU, AWS) MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud
  • Fault tolerance for distributed recommendation model training [VLDB'23] (CMU) Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding
  • Distributed training [ICDE'24] (Zhejiang University) SparDL: Distributed Deep Learning Training with Efficient Sparse Communication
  • Reducing communication cost of distributed training [ICDE'23] (Peking University) SK-Gradient: Efficient Communication for Distributed Machine Learning with Data Sketch

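3.2.2 Recommendation Model Training (Embedding Handling)

  • Recommendation model training acceleration [SIGMOD'23] (CUHK, SUSTech, Meta) FEC: Efficient Deep Recommendation Model Training with Flexible Embedding Communication

3.2.3 Others

  • Data selection and trigger frequency selection for model updates [SIGMOD'25] (ETHZ, Copenhagen, TUM) Modyn: Data-Centric Machine Learning Pipeline Orchestration
  • Accelerating accuracy calibration of quantized models [VLDB'24] (Aalborg University, Denmark; East China Normal University) Core: Data-Efficient, On-Device Continual Calibration for Quantized Models
  • Equivalence-based ML pipeline optimization [ICDE'24] (Spain, Belgium, Greece) HYPPO: Using Equivalences to Optimize Pipelines in Exploratory Machine Learning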

3.3 AI Tooling Optimization

  • Pandas code rewriting [SIGMOD'24] (UIUC) Dias: Dynamic Rewriting of Pandas Code
  • Notebook checkpointing / stateful code migration [VLDB'24] (UIUC) ElasticNotebook: Enabling Live Migration for Computational Notebooks
  • Deduplicating dataframe data to reduce memory requirements [VLDB'24] (UW-Madison, CMU) SplitDF: Splitting Dataframes for Memory-Efficient Data Analysis
  • [VLDB'23] (UCB, UIUC, UPenn) Bolt-on, Compact, and Rapid Program Slicing for Notebooks
  • [ICDE'24] (University of Edinburgh) PyTond: Efficient Python Data Science on the Shoulders of Databases

3.4 GNN Training Acceleration

  • Temporal GNN [SIGMOD'24] (HKUST) SIMPLE: Efficient Temporal Graph Neural Network Training at Scale with Dynamic Data Placement
  • Multi-GPU training [SIGMOD'24] (NUS) HongTu: Scalable Full-Graph GNN Training on Multiple GPUs
  • Distributed GNN training [VLDB'24] (Northeastern University) DynaHB: A Communication-Avoiding Asynchronous Distributed Framework with Hybrid Batches for Dynamic GNN Training
  • Dynamic graphs [SIGMOD'24] (Japan) DGC: Training Dynamic Graphs with Spatio-Temporal Non-Uniformity using Graph
  • Streaming GNN [VLDB'24] (University of Warwick) D3-GNN: Dynamic Distributed Dataflow for Streaming Graph Neural Networks
  • Data- and hardware-aware execution planning [VLDB'24] (HKUST) DAHA: Accelerating GNN Training with Data and Hardware Aware Execution Planning
  • Hardware-accelerated GNN operations [VLDB'24] (UIUC, NVIDIA) Accelerating Sampling and Aggregation Operations in GNN Frameworks with GPU Initiated Direct Storage Accesses
  • Reducing memory access [VLDB'24] (Tsinghua, NYU, Amazon) FreshGNN: Reducing Memory Access via Stable Historical Embeddings for Graph Neural Network Training
    [TBD]

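4. Multimodal Data Management

4.1 Multimodal Data Retrieval

4.1.1 Retrieval Paradigms

  • Semantic alignment and query optimization based on user feedback [SIGMOD'24] (Cornell) ThalamusDB: Approximate Query Processing on Multi-Modal Data
  • Joint query optimization over text and tables [VLDB'24] (TUD, Germany) ELEET: Efficient Learned Query Execution over Text and Tables
  • Refining query embeddings with user feedback to improve the semantic accuracy of image search [SIGMOD'23] (MIT) SeeSaw: Interactive Ad-hoc Search Over Image Databases
  • KNN where distance computation requires invoking a model (accelerated with small proxy models) [VLDB'23] (UBC, CNRS) On Efficient Approximate Queries over Machine Learning Models
  • Semantic retrieval under arbitrary query semantics [VLDB'23] (University of Munich, University of Copenhagen) Fast Search-by-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests
  • Semantic retrieval based on small proxy models [SIGMOD'22] (Stanford) TASTI: Semantic Indexes for Machine Learning-based Queries over Unstructured Data
  • Multi-metric space search [ICDE'24] (Zhejiang University) HJG: An Effective Hierarchical Joint Graph for ANNS in Multi-Metric Spaces
  • Image retrieval combining image and text [ICDE'24] (Zhejiang University, Hangzhou Dianzi University) MUST: An Effective and Scalable Framework for Multimodal Search of Target Modality
  • Query answering combining images and graphs [ICDE'24] (Zhejiang University of Science and Technology, Newcastle, BIT) Across Images and Graphs for Question Answering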

4.1.2 Vector Nearest Neighbor Search

4.1.2.1 New Index Designs

  • Index combining quantization and graphs [SIGMOD'25] (NTU) SymphonyQG: towards Symphonious Integration of Quantization and Graph for Approximate Nearest Neighbor Search
  • Hash index [SIGMOD'25] (CAS, Université Paris Cité) Subspace Collision: An Efficient and Accurate Framework for High-dimensional Approximate Nearest Neighbor Search
  • Vector-quantization-based index [SIGMOD'24] (NTU) RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound for Approximate Nearest Neighbor Search
  • Disk-based graph index [SIGMOD'24] (Zhejiang University, Zilliz, Hangzhou Dianzi University) Starling: An I/O-Efficient Disk-Resident Graph Index Framework for High-Dimensional Vector Similarity Search on Data Segment
  • Combined tree-graph vector index [VLDB'24] (HIT) DIDS: Double Indices and Double Summarizations for Fast Similarity Search
  • Graph index [SIGMOD'23] (Guangzhou University, HKBU) Efficient Approximate Nearest Neighbor Search in Multi-dimensional Databases
  • LSH + graph index [VLDB'23] (HKUST) Towards Efficient Index Construction and Approximate Nearest Neighbor Search in High-Dimensional Spaces
  • Tree + graph index for large datasets [VLDB'23, SDS] (Morocco, Université Paris Cité) Elpis: Graph-Based Similarity Search for Scalable Data Science
  • LSH index [VLDB'23] (University of Florida) LIDER: an efficient high-dimensional learned index for large-scale dense passage retrieval

4.1.2.2 Query Analysis and Surveys

  • Hardness definition and benchmark generation for graph-based ANN [VLDB'25] (Fudan, Université Paris Cité) Steiner-Hardness: A Query Hardness Measure for Graph-Based ANN Indexes
  • Graph index survey [SIGMOD'25] (Morocco, Université Paris Cité) Graph-Based Vector Search: An Experimental Evaluation of the State-of-the-Art
  • Vector databases [ICDE'24] (Purdue University) Are There Fundamental Limitations in Supporting Vector Data Management in Relational Databases? A Case Study of PostgreSQL
  • Parameter tuning [ICDE'24] (Nankai University, Ant Group) VDTuner: Automated Performance Tuning for Vector Data Management Systems

4.1.2.3 Query Optimization Algorithms

  • Triangle-inequality-based query pruning [SIGMOD'25] (Renmin University, Tsinghua) Tribase: A Vector Data Query Engine for Reliable and Lossless Pruning Compression using Triangle Inequalities
  • Graph index pruning [SIGMOD'23] (NTU) High-Dimensional Approximate Nearest Neighbor Search: with Reliable and Efficient Distance Comparison Operations
  • Scalar quantization [VLDB'23] (Intel) Similarity search in the blink of an eye with compressed indices
  • Routing-guided quantization optimization for disk-based indexes [ICDE'24] (Hangzhou Dianzi University) Routing-Guided Learned Product Quantization for Graph-Based Approximate Nearest Neighbor Search
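
To make "triangle-inequality-based pruning" concrete, here is a minimal textbook sketch using a single pivot point. It is not the algorithm of Tribase or any other paper listed above. It only illustrates the underlying bound |d(q, p) - d(p, x)| <= d(q, x), which lets a search skip a distance computation whenever that lower bound already exceeds the best distance found so far. The function name, the pivot choice, and the toy points are all hypothetical.

```python
import math


def nn_with_pivot_pruning(points, pivot, query):
    """Exact nearest neighbor with single-pivot triangle-inequality pruning.

    Returns (best index, best distance, number of query-to-point distances computed).
    """
    # d(pivot, x) for every point; a real index would precompute this offline.
    pivot_dists = [math.dist(pivot, x) for x in points]
    d_qp = math.dist(query, pivot)

    best_idx, best_dist, computed = -1, math.inf, 0
    for idx, x in enumerate(points):
        # Triangle inequality: d(query, x) >= |d(query, pivot) - d(pivot, x)|.
        lower_bound = abs(d_qp - pivot_dists[idx])
        if lower_bound >= best_dist:
            continue  # x cannot beat the current best, so skip the real distance
        computed += 1
        d = math.dist(query, x)
        if d < best_dist:
            best_idx, best_dist = idx, d
    return best_idx, best_dist, computed


# Toy data (hypothetical): two points near the query, three far away.
points = [(1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
best_idx, best_dist, computed = nn_with_pivot_pruning(
    points, pivot=(0.0, 0.0), query=(1.0, 0.1)
)
```

Because a point is skipped only when its lower bound rules it out, the result is identical to a full scan; in this toy run the three far-away points are never compared against the query.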

4.1.2.4 Filtered Search

  • Vector search under arbitrary predicates [SIGMOD'25] (Fudan) Navigating Labels and Vectors: A Unified Approach to Filtered Approximate Nearest Neighbor Search
  • Vector queries under scalar range constraints [SIGMOD'25] (NTU) iRangeGraph: Improvising Range-dedicated Graphs for Range-filtering Nearest Neighbor Search
  • Vector queries under scalar range constraints [SIGMOD'24] (Rutgers, Alibaba) Range-Filtering Approximate Nearest Neighbor Search
  • Vector queries under arbitrary scalar predicates [SIGMOD'24] (Stanford, UCB) ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data
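
As a reference point for the problem these papers address, the sketch below states filtered nearest-neighbor search in its simplest exact form: apply the scalar predicate first, then rank the surviving vectors by distance. This brute-force "pre-filtering" baseline is an illustration only and is not the method of any listed paper; the function name `filtered_knn` and the toy data are hypothetical.

```python
import heapq
import math


def filtered_knn(vectors, attrs, query, k, predicate):
    """Exact filtered k-NN by brute force.

    Keeps only items whose scalar attribute satisfies `predicate`,
    then returns the indices of the k closest ones (Euclidean distance).
    """
    candidates = (
        (math.dist(query, vec), idx)
        for idx, (vec, attr) in enumerate(zip(vectors, attrs))
        if predicate(attr)
    )
    return [idx for _, idx in heapq.nsmallest(k, candidates)]


# Toy data (hypothetical): 2-D vectors with a scalar "price" attribute.
vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
prices = [10, 50, 20, 5]

# Range filter: price <= 20. Index 1 is the closest vector overall but is filtered out.
result = filtered_knn(vectors, prices, query=(0.9, 0.1), k=2, predicate=lambda p: p <= 20)
unfiltered = filtered_knn(vectors, prices, query=(0.9, 0.1), k=2, predicate=lambda p: True)
```

A full scan like this costs time linear in the dataset size for every query, while filtering only after an unfiltered index lookup can return fewer than k valid results. Combining the filter with the index traversal is what makes the problem hard.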

4.1.2.5 Vector Search in Special Settings/Conditions

  • ANN with dual-vector representations [SIGMOD'25] (NTU) DEG: Efficient Hybrid Vector Search Using the Dynamic Edge Navigation Graph
  • Cross-modal ANN [VLDB'24] (Fudan) RoarGraph: A Projected Bipartite Graph for Efficient Cross-Modal Approximate Nearest Neighbor Search
  • Cloud vector database [SIGMOD'24] (Purdue University, MSR) Vexless: A Serverless Vector Data Management System Using Cloud Functions
  • Near-duplicate detection in text [SIGMOD'25] (Rutgers) Near-Duplicate Text Alignment with One Permutation Hashing
  • Near-duplicate detection in text [SIGMOD'23] (Rutgers) Near-Duplicate Sequence Search at Scale for Neural Language Model Memorization Evaluation
  • Secure federated vector retrieval [SIGMOD'24] (HKBU) FedKNN: Secure Federated k-Nearest Neighbor Search
  • Tree-embedding-based similarity search under arbitrary metrics [SIGMOD'23] (Beihang University, HKUST) LiteHST: A Tree Embedding based Method for Similarity Search
  • Inner product search [VLDB'23] (HUST, Zilliz, HKUST) FARGO: Fast Maximum Inner Product Search via Global Multi-Probing
  • High-dimensional metric space search [VLDB'23] (Greece, Denmark) Adaptive Indexing in High-Dimensional Metric Spaces
  • Inner product search [ICDE'24] (CAS Shenzhen) Reconsidering Tree based Methods for k-Maximum Inner-Product Search: The LRUS-CoverTree
  • Graph construction on GPUs [ICDE'24] (NVIDIA) CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs
  • Reverse KNN [ICDE'24] (Jiao Tong University, Alibaba) Efficient Reverse k Approximate Nearest Neighbor Search over High-Dimensional Vectors
  • Reverse KNN [ICDE'24] (HKBU, SUSTech, Huawei) QSRP: Efficient Reverse k-Ranks Query Processing on High-dimensional Embeddings
  • Sparse-vector inner product search [ICDE'24] (HKUST, HUST, Macau) Efficient Approximate Maximum Inner Product Search over Sparse Vectors

4.1.3 Retrieval-Augmented Generation (RAG)

  • Overlapping vector database retrieval with the prefill stage to improve RAG efficiency [SIGMOD'25] (HUST) AquaPipe: A Quality-Aware Pipeline for Knowledge Retrieval and Large Language Models
  • Accelerating RAG pipelines with new hardware [VLDB'25] (ETHZ) Chameleon: a Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models

4.1.4 Video Data Queries

  • Video representation and query optimization [VLDB'25] (BIT) TVM: A Tile-based Video Management Framework
  • Video query acceleration [VLDB'23] (Georgia Tech, Adobe) Seiden: Revisiting Query Processing in Video Database Systems
  • Complex video query optimization [VLDB'23] (Stanford) Optimizing Video Analytics with Declarative Model Relationships
  • Video queries under ambiguous semantics [VLDB'23] (University of Washington) EQUI-VOCAL: Synthesizing Queries for Compositional Video Events from Limited User Interactions
  • Geospatially aware video queries [VLDB'25] (UCB) Spatialyze: A Geospatial Video Analytics System with Spatial-Aware Optimizations
  • Video query optimization with user-provided clues [VLDB'25] (University of Toronto, York University) Optimizing Video Queries with Declarative Clues
  • Optimizing video LIMIT queries [VLDB'25] (Michigan, USC, MIT) Optimizing Video Selection LIMIT Queries With Commonsense Knowledge
  • On-device video analytics [ICDE'24] (Sun Yat-sen University) COUPLE: Orchestrating Video Analytics on Heterogeneous Mobile Processors
  • [ICDE'23] (University of Toronto, York University) Track Merging for Effective Video Query Processing
  • [ICDE'23] (University of Toronto, York University) Marshalling Model Inference in Video Streams

4.2 Multimodal Data Storage for AI

4.2.1 Images/Video

  • Image formats for model training [SIGMOD'24] (Harvard) The Image Calculator: 10x Faster Image-AI Inference by Replacing JPEG with Self-designing Storage Format
  • Image compression to improve data loading efficiency [ICDE'23] The Art of Losing to Win: Using Lossy Image Compression to Improve Data Loading in Deep Learning Pipelines
  • Video stream processing [VLDB'23] (MIT) Extract-Transform-Load for Video Streams

4.2.2 Embeddings and Tensors

  • Embedding compression for categorical features [SIGMOD'24] (Peking University) CAFE: Towards Compact, Adaptive, and Fast Embedding for Large-scale Recommendation Models
  • Survey of vector compression techniques in recommender systems and RAG [VLDB'24] (Peking University) Experimental Analysis of Large-scale Learnable Vector Storage Compression
  • Distributed embedding table storage on persistent memory [ICDE'23] (4Paradigm, NUS, Alibaba, Intel) OpenEmbedding: A Distributed Parameter Server for Deep Learning Recommendation Models using Persistent Memory
  • Large-scale, low-cost embedding lookup on persistent memory (CPU) [VLDB'23, SDS] (Tsinghua, Kuaishou) PetPS: Supporting Huge Embedding Models with Persistent Memory
  • Sparse tensor representation [SIGMOD'24] (HKUST, SJTU) STile: Searching Hybrid Sparse Formats for Sparse Deep Learning Operators Automatically
  • Tensor representations in arbitrary formats [SIGMOD'23] (RelationalAI, University of Edinburgh, University of Washington) Optimizing Tensor Programs on Flexible Storage
  • Tensor compression [ICDE'24] (Hunan University, CNIC CAS) A Robust Low-rank Tensor Decomposition and Quantization based Compression Method

4.2.3 Model Compression

  • Survey and new methods [VLDB'24] (Virginia, Minnesota) Everything You Always Wanted to Know About Storage Compressibility of Pre-Trained ML Models but Were Afraid to Ask
  • Adaptive search for model compression methods [ICDE'24] (HIT) AutoMC: Automated Model Compression Based on Domain Knowledge and Progressive Search

4.2.4 Other Data Types

  • Activation management during LLM training to reduce memory requirements [SIGMOD'25] (Peking University, Tencent) MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training
  • Gradient storage and analysis [VLDB'24] (WPI, MIT) MetaStore: Analyzing Deep Learning Meta-Data at Scale

4.2.5 Feature Store

  • Feature store supporting updates [VLDB'24] (UCB) RALF: Accuracy-Aware Scheduling for Feature Store Maintenance
  • Feature store query optimization [VLDB'24, SDS] (University of Chicago, Microsoft, LinkedIn) Optimizing Data Pipelines for Machine Learning in Feature Stores

5. More Efficient Implementations of Machine Learning Algorithms

5.1 Clustering

  • k-means [SIGMOD'24] (Denmark, France) Settling Time vs. Accuracy Tradeoffs for Clustering Big Data
  • k-means [SIGMOD'24] (Wuhan University, OceanBase) F3KM: Federated, Fair, and Fast k-means
  • DBSCAN [SIGMOD'24] (USTC) Towards Metric DBSCAN: Exact, Approximate, and Streaming Algorithms
  • K-multiple-means [SIGMOD'24] (Japan) Efficient Algorithm for K-Multiple-Means
    [TBD]

5.2 Regression

  • XGBoost [SIGMOD'25] (BUPT) SecureXGB: A Secure and Efficient Multi-party Protocol for Vertical Federated XGBoost
  • GBDT [SIGMOD'23] (NUS) DeltaBoost: Gradient Boosting Decision Trees with Efficient Machine Unlearning

5.3 Others

  • In-database tree model training [VLDB'23] (Columbia University, Microsoft) JoinBoost: Grow Trees Over Normalized Data Using Only SQL
  • Outlier detection [SIGMOD'23] (CMU) TOD: GPU-accelerated Outlier Detection via Tensor Operations
  • Attribute recommendation [SIGMOD'24] (CUHK) Efficient Approximation Framework for Attribute Recommendation
  • Frequent pattern mining [SIGMOD'24] (Spain, France) Language-Model Based Informed Partition of Databases to Speed Up Pattern Mining
  • Markov decision processes [VLDB'23, SDS] (Denmark, Greece) SIFTER: Space-Efficient Value Iteration for Finite-Horizon MDPs
  • Decision trees [ICDE'24] (Johns Hopkins University, Rice University) T-Rex (Tree-Rectangles): Reformulating Decision Tree Traversal as Hyperrectangle Enclosure

6. Model Explainability

6.1 Causal Inference (Counterfactual Explanations)

  • Provenance-enabled explainable AI [SIGMOD'25] (Alibaba, Georgetown University) Provenance-Enabled Explainable AI
  • Decision boundary exploration [SIGMOD'24] (WPI) FACET: Robust Counterfactual Explanation Analytics
  • Client-side causal inference [SIGMOD'24] (University of Edinburgh) Counterfactual Explanation at Will, with Zero Privacy Leakage
  • Explaining aggregate query results [SIGMOD'24] (Israel, MIT, Duke) Summarized Causal Explanations For Aggregate Views
  • Feature-level causal explanations [SIGMOD'24] (University of Edinburgh) Relative Keys: Putting Feature Explanation into Context

6.2 Embedding Explanation

  • Table embedding explanation [VLDB'25] (Michigan, University of Amsterdam) Observatory: Characterizing Embeddings of Relational Tables
  • Table embedding explanation [SIGMOD'24] (Israel) TabEE: Tabular Embeddings Explanations

6.3 GNN Explanation

  • [SIGMOD'24] (Zhejiang University; Aalborg University, Denmark) View-based Explanations for Graph Neural Networks
  • [VLDB'23] (HKUST, PolyU) HENCE-X: Toward Heterogeneity-agnostic Multi-level Explainability for Deep Graph Networks
  • [VLDB'23] (HKUST) On Data-Aware Global Explainability of Graph Neural Networks

6.4 Others

  • CNN explanation [VLDB'23] (University of Waterloo, AT&T) POEM: Pattern-Oriented Explanations of Convolutional Neural Networks

7. Others

7.1 Table Understanding/Question Answering

  • Column annotation via table learning [SIGMOD'24] (Megagon Labs, USA) Watchog: A Light-weight Contrastive Learning based Framework for Column Annotation
  • LLM-based table question answering [VLDB'24] (UW-Madison, Microsoft) ReAcTable: Enhancing ReAct for Table Question Answering
  • Column semantic type annotation [VLDB'23] (HKUST) RECA: Related Tables Enhanced Column Semantic Type Annotation Framework
  • Column type annotation [ICDE'24] (HKUST) KGLink: A column type annotation method that combines knowledge graph and pre-trained language model
  • Reducing overfitting in table learning by regularizing pairwise column relationships [SIGMOD'23] Regularized Pairwise Relationship based Analytics for Structured Data

7.2 Model Selection

  • Efficient training-free model selection inside the database [VLDB'24] (NUS, Zhejiang University, Duke) Database Native Model Selection: Harnessing Deep Neural Networks in Database Systems
  • Model selection for transfer learning [VLDB'23] (ETHZ) SHiFT: An Efficient, Flexible Search Engine for Transfer Learning
  • [ICDE'24] (Renmin University) A Two-Phase Recall-and-Select Framework for Fast Model Selection

7.3 Data Visualization

[TBD]
