- Chang, Ernie, Matteo Paltenghi, Yang Li, et al. 2024. Scaling Parameter-Constrained Language Models with Quality Data. arXiv.
- Chardet. n.d. Chardet: The Universal Character Encoding Detector.
- Codd, E. F. 1970. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, 13(6): 377–387.
- Common Crawl. n.d. Common Crawl.
- Dodge, Jesse, Maarten Sap, Ana Marasović, et al. 2021. Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. arXiv.
- Gao, Leo, Stella Biderman, Sid Black, et al. 2020. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv.
- Gao, Yunfan, Yun Xiong, Xinyu Gao, et al. 2024. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv.
- Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, et al. 2022. Training Compute-Optimal Large Language Models. arXiv.
- Kaplan, Jared, Sam McCandlish, Tom Henighan, et al. 2020. Scaling Laws for Neural Language Models. arXiv.
- Lee, Cinoo, Kristina Gligorić, Pratyusha Ria Kalluri, et al. 2024. People Who Share Encounters with Racism Are Silenced Online by Humans and Machines, but a Guideline-Reframing Intervention Holds Promise. PNAS, 121(38).
- LlamaIndex. n.d. Vector Stores.
- Ma, Yingwei, Yue Liu, Yue Yu, et al. 2023. At Which Training Stage Does Code Data Help LLMs Reasoning? arXiv.
- Nguyen, Thuat, Chien Van Nguyen, Viet Dac Lai, et al. 2023. CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages. arXiv.
- OpenAI Platform. n.d. Vector Embeddings.
- Pemistahl. n.d. lingua-py.
- Penedo, Guilherme, Quentin Malartic, Daniel Hesslow, et al. 2023. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only. arXiv.
- Salley, Columbus, and E. F. Codd. 1998. Providing OLAP to User-Analysts: An IT Mandate.
- Wang, Zige, Wanjun Zhong, Yufei Wang, et al. 2024. Data Management for Large Language Models: A Survey. arXiv.
- WARC Specifications. n.d. The WARC Format 1.0.
- Xu, Yipei, Dakuan Lu, Jiaqing Liang, et al. 2023. Source Prompt: Coordinated Pre-Training of Language Models on Diverse Corpora from Multiple Sources. arXiv.
- Xue, Fuzhao, Yao Fu, Wangchunshu Zhou, Zangwei Zheng, and Yang You. 2023. To Repeat or Not to Repeat: Insights from Scaling LLM Under Token-Crisis. arXiv.
- Yang, Rui, Michael Fu, Chakkrit Tantithamthavorn, et al. 2025. RAGVA: Engineering Retrieval Augmented Generation-Based Virtual Assistants in Practice. arXiv.