A collection of papers and resources on applying LLM techniques to data management (e.g., data processing, database optimization, and data analysis).
Kindly let us know if we have missed any great papers. Thank you!
- 0. System and Review
- 1. LLM for Data Processing
- 2. LLM for Database Optimization
- 3. LLM for Data Analysis
- 4. Data Management for LLM
How Large Language Models Will Disrupt Data Management
Raul Castro Fernandez, Aaron J. Elmore, Michael J. Franklin, Sanjay Krishnan, Chenhao Tan. VLDB 2023. [pdf]
From Large Language Models to Databases and Back: A Discussion on Research and Education
Sihem Amer-Yahia, Angela Bonifati, Lei Chen, Guoliang Li, Kyuseok Shim, Jianliang Xu, Xiaochun Yang. SIGMOD Record. [pdf]
DB-GPT: Large Language Model Meets Database
Xuanhe Zhou, Zhaoyan Sun, Guoliang Li. Data Science and Engineering 2023. [pdf]
LLM-Enhanced Data Management
Xuanhe Zhou, Xinyang Zhao, Guoliang Li. arXiv 2024. [pdf]
Can Foundation Models Wrangle Your Data?
Avanika Narayan, Ines Chami, Laurel J. Orr, Christopher Ré. VLDB 2022. [pdf]
Multimodal Table Understanding
Zheng M, Feng X, Si Q, et al. ACL 2024. [pdf]
TableVLM: Multi-modal Pre-training for Table Structure Recognition
Chen L, Huang C, Zheng X, et al. ACL 2023. [pdf]
LLM for Data Management
Guoliang Li, Xuanhe Zhou, Xinyang Zhao. VLDB 2024 Tutorial. [pdf]
Data Management For Training Large Language Models: A Survey
Zige Wang, Wanjun Zhong, Yufei Wang, Qi Zhu, Fei Mi, Baojun Wang, Lifeng Shang, Xin Jiang, Qun Liu. arXiv 2024. [pdf]
When Large Language Models Meet Vector Databases: A Survey
Zhi Jing, Yongye Su, Yikun Han, Bo Yuan, Haiyun Xu, Chunjiang Liu, Kehai Chen, Min Zhang. arXiv 2024. [pdf]
From BERT to GPT-3 Codex: Harnessing the Potential of Very Large Language Models for Data Management
Immanuel Trummer. VLDB 2022. [pdf]
Jellyfish: A Large Language Model for Data Preprocessing
Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada. arXiv 2024. [pdf]
LLMs with User-defined Prompts as Generic Data Operators for Reliable Data Processing
Luyi Ma, Nikhil Thakurdesai, Jiao Chen, Jianpeng Xu, Evren Körpeoglu, Sushant Kumar, Kannan Achan. IEEE Big Data 2023. [pdf]
CleanAgent: Automating Data Standardization with LLM-based Agents
Danrui Qi, Jiannan Wang. arXiv 2024. [pdf]
LLMClean: Context-Aware Tabular Data Cleaning via LLM-Generated OFDs
Fabian Biester, Mohamed Abdelaal, Daniel Del Gaudio. arXiv 2024. [pdf]
SEED: Domain-Specific Data Curation With Large Language Models
Zui Chen, Lei Cao, Sam Madden, Tim Kraska, Zeyuan Shang, Ju Fan, Nan Tang, Zihui Gu, Chunwei Liu, Michael Cafarella. arXiv 2023. [pdf]
Large Language Models as Data Preprocessors
Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada. arXiv 2023. [pdf]
Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration
Meihao Fan, Xiaoyue Han, Ju Fan, Chengliang Chai, Nan Tang, Guoliang Li, Xiaoyong Du. ICDE 2024. [pdf]
Match, Compare, or Select? An Investigation of Large Language Models for Entity Matching
Tianshu Wang, Hongyu Lin, Xiaoyang Chen, Xianpei Han, Hao Wang, Zhenyu Zeng, Le Sun. arXiv 2024. [pdf]
ZeroEA: A Zero-Training Entity Alignment Framework via Pre-Trained Language Model
Nan Huo, Reynold Cheng, Ben Kao, Wentao Ning, Nur Al Hasan Haldar, Xiaodong Li, Jinyang Li, Mohammad Matin Najafi, Tian Li, Ge Qu. VLDB 2024. [pdf]
Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration
Jianhong Tu, Ju Fan, Nan Tang, Peng Wang, Guoliang Li, Xiaoyong Du. SIGMOD 2023. [pdf]
Entity Matching Using Large Language Models
Ralph Peeters, Christian Bizer. arXiv 2023. [pdf]
Deep Entity Matching with Pre-Trained Language Models
Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, Wang-Chiew Tan. VLDB 2021. [pdf]
Dual-Objective Fine-Tuning of BERT for Entity Matching
Ralph Peeters, Christian Bizer. VLDB 2021. [pdf]
Schema Matching with Large Language Models: An Experimental Study
Marcel Parciak, Brecht Vandevoort, Frank Neven, Liesbet M. Peeters, Stijn Vansummeren. arXiv 2024. [pdf]
Schema Matching using Pre-Trained Language Models
Yunjia Zhang, Avrilia Floratou, Joyce Cahoon, Subru Krishnan, Andreas C. Müller, Dalitso Banda, Fotis Psallidas, Jignesh M. Patel. ICDE 2023. [pdf]
CHORUS: Foundation Models for Unified Data Discovery and Exploration
Moe Kayali, Anton Lykov, Ilias Fountalis, Nikolaos Vasiloglou, Dan Olteanu, Dan Suciu. VLDB 2024. [pdf]
Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes
Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, Christopher Ré. VLDB 2024. [pdf]
DeepJoin: Joinable Table Discovery with Pre-trained Language Models
Yuyang Dong, Chuan Xiao, Takuma Nozawa, Masafumi Enomoto, Masafumi Oyamada. VLDB 2023. [pdf]
Is Large Language Model Good at Database Knob Tuning? A Comprehensive Experimental Evaluation
Yiyan Li, Haoyang Li, Zhao Pu, Jing Zhang, Xinyi Zhang, Tao Ji, Luming Sun, Cuiping Li, Hong Chen. [pdf]
GPTuner: A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization
Jiale Lao, Yibo Wang, Yufei Li, Jianping Wang, Yunjia Zhang, Zhiyuan Cheng, Wanghu Chen, Mingjie Tang, Jianguo Wang. VLDB 2024. [pdf]
DB-BERT: a Database Tuning Tool that “Reads the Manual”
Immanuel Trummer. SIGMOD 2022. [pdf]
LLMTune: Accelerate Database Knob Tuning with Large Language Models
Huang X, Li H, Zhang J, et al. arXiv 2024. [pdf]
LATuner: An LLM-Enhanced Database Tuning System Based on Adaptive Surrogate Model
Fan C, Pan Z, Sun W, et al. Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD) 2024. [pdf]
Panda: Performance Debugging for Databases Using LLM Agents
Singh V, Vaidya K E, Kumar V B, et al. CIDR 2024. [pdf]
LLM As DBA
Xuanhe Zhou, Guoliang Li, Zhiyuan Liu. arXiv 2023. [pdf]
D-Bot: Database Diagnosis System using Large Language Models
Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, Zhiyuan Liu, Weize Chen, et al. VLDB 2024. [pdf] [code]
The Dawn of Natural Language to SQL: Are We Fully Ready?
Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, Nan Tang. VLDB 2024. [pdf]
Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation
Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, Jingren Zhou. VLDB 2024. [pdf]
CodeS: Towards Building Open-source Language Models for Text-to-SQL
Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, Hong Chen. SIGMOD 2024. [pdf]
Combining Small Language Models and Large Language Models for Zero-Shot NL2SQL
Ju Fan, Zihui Gu, Songyue Zhang, Yuxin Zhang, Zui Chen, Lei Cao, Guoliang Li, Samuel Madden, Xiaoyong Du, Nan Tang. VLDB 2024. [pdf]
From BERT to GPT-3 Codex: Harnessing the Potential of Very Large Language Models for Data Management
Immanuel Trummer. VLDB 2022. [pdf]
Few-shot Text-to-SQL Translation using Structure and Content Prompt Learning
Zihui Gu, Ju Fan, Nan Tang, et al. SIGMOD 2023. [pdf]
DB-GPT: Empowering Database Interactions with Private Large Language Models
Siqiao Xue, Caigao Jiang, Wenhui Shi, Fangyin Cheng, Keting Chen, Hongjun Yang, Zhiping Zhang, Jianshan He, Hongyang Zhang, Ganglin Wei, Wang Zhao, Fan Zhou, Danrui Qi, Hong Yi, Shaodong Liu, Faqiang Chen. arXiv 2023. [pdf]
Automated Data Visualization from Natural Language via Large Language Models: An Exploratory Study
Yang Wu, Yao Wan, Hongyu Zhang, Yulei Sui, Wucai Wei, Wei Zhao, Guandong Xu, Hai Jin. SIGMOD 2024. [pdf]
LLM4Vis: Explainable Visualization Recommendation using ChatGPT
Lei Wang, Songheng Zhang, Yun Wang, Ee-Peng Lim, Yong Wang. EMNLP 2023. [pdf]
Data-Juicer: A One-Stop Data Processing System for Large Language Models
Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, Yaliang Li, Bolin Ding, Jingren Zhou. SIGMOD 2024. [pdf]
Relational Database Augmented Large Language Model
Zongyue Qin, Chen Luo, Zhengyang Wang, Haoming Jiang, Yizhou Sun. arXiv 2024. [pdf]
Survey of Vector Database Management Systems
James Jie Pan, Jianguo Wang, Guoliang Li. arXiv 2023. [pdf]