LLMOps/Chapter04 at main · corazzon/LLMOps

Chapter04

References

Chang, Ernie, Matteo Paltenghi, Yang Li, et al. 2024. Scaling Parameter-Constrained Language Models with Quality Data. arXiv.
Chardet. n.d. Chardet: The Universal Character Encoding Detector.
Codd, E. F. 1970. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, 13(6): 377–387.
Common Crawl. n.d. Common Crawl.
Dodge, Jesse, Maarten Sap, Ana Marasović, et al. 2021. Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. arXiv.
Gao, Yunfan, Yun Xiong, Xinyu Gao, et al. 2024. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv.
Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, et al. 2022. Training Compute-Optimal Large Language Models. arXiv.
Kaplan, Jared, Sam McCandlish, Tom Henighan, et al. 2020. Scaling Laws for Neural Language Models. arXiv.
Lee, Cinoo, Kristina Gligorić, Pratyusha Ria Kalluri, et al. 2024. People Who Share Encounters with Racism Are Silenced Online by Humans and Machines, but a Guideline-Reframing Intervention Holds Promise. PNAS, 121(38).
LlamaIndex. n.d. Vector Stores.
Ma, Yingwei, Yue Liu, Yue Yu, et al. 2023. At Which Training Stage Does Code Data Help LLMs Reasoning? arXiv.
Nguyen, Thuat, Chien Van Nguyen, Viet Dac Lai, et al. 2023. CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages. arXiv.
OpenAI Platform. n.d. Vector Embeddings.
Pemistahl. n.d. lingua-py.
Penedo, Guilherme, Quentin Malartic, Daniel Hesslow, et al. 2023. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only. arXiv.
Salley, Columbus, and E. F. Codd. 1998. Providing OLAP to User-Analysts: An IT Mandate.
Wang, Zige, Wanjun Zhong, Yufei Wang, et al. 2024. Data Management for Large Language Models: A Survey. arXiv.
WARC Specifications. n.d. The WARC Format 1.0.
Xu, Yipei, Dakuan Lu, Jiaqing Liang, et al. 2023. Source Prompt: Coordinated Pre-Training of Language Models on Diverse Corpora from Multiple Sources. arXiv.
Xue, Fuzhao, Yao Fu, Wangchunshu Zhou, Zangwei Zheng, and Yang You. 2023. To Repeat or Not to Repeat: Insights from Scaling LLM Under Token-Crisis. arXiv.
Yang, Rui, Michael Fu, Chakkrit Tantithamthavorn, et al. 2025. RAGVA: Engineering Retrieval Augmented Generation-Based Virtual Assistants in Practice. arXiv.

Further Reading

Gao, Leo, Stella Biderman, Sid Black, et al. 2020. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv.
