### **III. Dataset**
We have two types of data:

* data that was used to train the main LLM model;
* data to perform RAG on.

We don't control the data used to train the main LLM: we cope with the LLM's limitations rather than influence them, at least until we conclude that we would really benefit from fine-tuning, which would become a completely different project.
#### i. Data to perform RAG on - description

We don't distinguish between client roles for data access: every client has access to every document, so we effectively have a single shared dataset for the whole system.

The dataset includes the set of Documents available on the Platform. Documents can be in text form (Markdown) or in scanned/image form.
1. Markdown documents:
   - Expected document size: up to 500 pages.
   - Structure: documents longer than 10 pages typically include a table of contents and dedicated sections, such as an introduction or a glossary.
   - Content: documents may include text with any Markdown features (e.g., quotes, headings, formulas, tables).
2. Documents in image form (scans/images): no additional description is available; they are simply files containing a scan or image.
3. Each document has origination metadata.
4. Documents may have versions v1, v2, v3, and so on. For some documents we know the diff between vX and vY; for others we do not know the diff and only know how the document looked at each version.
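For versions where only the snapshots are stored, a diff between vX and vY can still be derived on demand. A minimal sketch using Python's standard `difflib` (the function name is illustrative; the `vX`/`vY` labels follow the notation above):

```python
import difflib

def reconstruct_diff(old_text: str, new_text: str) -> str:
    """Build a unified diff between two stored snapshots of a document."""
    diff_lines = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile="vX",
        tofile="vY",
    )
    return "".join(diff_lines)

v1 = "Introduction\nOld paragraph.\n"
v2 = "Introduction\nNew paragraph.\n"
print(reconstruct_diff(v1, v2))
```

This makes "we only know how the document looked at each version" equivalent, for downstream purposes, to "we know the diff", at the cost of computing it.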
Clients can edit documents online via the platform or upload documents from their local machines. Each document receives a version number upon:

- Saving
- Uploading a new document
- Uploading a new version of an existing document

Clients can access all versions of each document.
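The versioning rule above can be sketched with a minimal in-memory record; the `Document` class and `save` method are illustrative stand-ins, not the platform's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Illustrative document record: every save appends a full snapshot."""
    title: str
    versions: list = field(default_factory=list)  # snapshot per version, v1 first

    def save(self, content: str) -> str:
        """Store content as a new version and return its version label."""
        self.versions.append(content)
        return f"v{len(self.versions)}"

doc = Document("Sample Document")
doc.save("first draft")    # -> "v1"
doc.save("edited draft")   # -> "v2"; both versions remain accessible
```

Because every version is kept, clients can always retrieve any `doc.versions[i]`, matching the access guarantee above.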
#### ii. Data cleaning

The data cleaning process should be automated with reproducible scripts. The script runs once for each new document uploaded to the system, and again for each new version of that document.

**1. Markdown Documents:**

- Inter-document links (links that reference another part of the same document or another document in the system): enrich such links with aliases before using them for RAG purposes, i.e. add a descriptive label or name that makes the link more meaningful. This helps RAG understand the context and content of the link. Example: "Section 2" -> "Section 2: Data Cleaning Procedures"
- Plain URLs (links to external web resources): don't modify them; keep them as-is for LLM consumption.
- Table of Contents (ToC): generate a ToC, or validate its presence, for documents longer than 10 pages.
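The link-enrichment step can be sketched as a regex pass over Markdown links; the `SECTION_TITLES` lookup table and the function name are illustrative assumptions, not part of the platform:

```python
import re

# Hypothetical mapping from link targets to descriptive section titles.
SECTION_TITLES = {
    "#section-2": "Section 2: Data Cleaning Procedures",
}

def enrich_links(markdown: str, titles: dict) -> str:
    """Replace bare inter-document link labels with descriptive aliases.

    Links whose target is not in the lookup table (e.g. external URLs)
    are left untouched, as required for plain URLs.
    """
    def repl(match: re.Match) -> str:
        target = match.group(2)
        alias = titles.get(target)
        return f"[{alias}]({target})" if alias else match.group(0)

    # Matches [label](target); good enough for a sketch, not a full parser.
    return re.sub(r"\[([^\]]+)\]\(([^)]+)\)", repl, markdown)

text = "See [Section 2](#section-2) and [the spec](https://example.com)."
print(enrich_links(text, SECTION_TITLES))
```

In practice the lookup table would be built from each document's parsed heading structure rather than hard-coded.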
**2. Scanned/Image Documents:**

- Enhance the quality of scans (e.g., adjust brightness/contrast, remove noise).
- Perform Optical Character Recognition (OCR) on scans. Store both the initial scan and its recognized content.

We don't perform duplicate removal, for either Markdown or image/scan documents: if a client uploads several duplicate documents, we assume they have a reason to do so, and that is as it should be.

Cleaned documents and images should be stored separately from the original files, in a `cleaned_data` directory. This preserves the original versions for reference and debugging.
#### iii. Metadata

**Document Metadata:**

- Document title
- Author
- Creation date
- Last modified date
- Table of Contents (for text documents)
- Summary (need to discuss whether it's necessary)
- Version history
  - Version number
  - Editor
  - Version creation date
  - Changes made in the version (if available)
  - Diff information (if available)

**Handling Metadata:**

For Markdown documents, embed metadata as YAML front matter at the top of each document. For images, metadata can be stored in a separate JSON file with the same name as the image.
#### Example Metadata Structure for a Markdown Document:

```yaml
---
title: "Sample Document"
author: "John Doe"
created_at: "2023-01-01"
last_modified: "2024-06-30"
toc:
  - chapter: Introduction
    page_start: 2
    starts_with: In this article we're about to introduce RAG implementation system for high-load cases.
    chapter_summary: Introduction to RAG implementation system for high-load cases with author's motivation and real-world examples
  - chapter: Chapter 1
    page_start: 3
    starts_with: Let's consider a situation where we have a platform designed for collaborative work and document sharing among clients.
    chapter_summary: Problem statement and available data are described.
  - chapter: Chapter 2
    page_start: 6
    starts_with: In order to perform quality RAG, we need the data to be prepared for this.
    chapter_summary: Data cleaning schema and other aspects.
  - chapter: Conclusion
    page_start: 10
    starts_with: Now let's move on to conclusion.
    chapter_summary: Conclusion about the ways we can build a system
summary: "This document provides an overview of..."
version_info:
  - version: "v1"
    editor: "Jane Smith"
    change_date: "2023-02-01"
    diff: "Initial creation of the document."
  - version: "v2"
    editor: "John Doe"
    change_date: "2023-06-15"
    diff: "Added new chapter on advanced topics."
  - version: "v3"
    editor: "Jane Smith"
    change_date: "2024-06-30"
    diff: "Updated the introduction and conclusion sections."
---
```