Design_Doc_Examples/RAG_Q&A_for collaborative_work_platform.md
Lines changed: 23 additions & 17 deletions
@@ -131,19 +131,18 @@ Clients can access all versions of each document.

 #### ii. Data cleaning

-Data cleaning process should be automated with reproducible scripts. They need to run on a regular schedule and whenever a large batch of new documents is uploaded.
-Question: is version control needed? We already have one in the system. How to deal with that?
+The data cleaning process should be automated with reproducible scripts. The script runs once for each new document uploaded to the system, and again for each new version of that document.

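One possible reading of the "once per document, once per version" rule is to key cleaning runs by a (document, version) pair. The sketch below is a minimal illustration under that assumption; `CleaningTracker`, `on_upload`, and the in-memory dictionary are hypothetical names for this example, and a real system would presumably persist this state.

```python
from typing import Callable, Dict, Tuple


class CleaningTracker:
    """Runs a cleaning function at most once per (doc_id, version) pair."""

    def __init__(self, clean: Callable[[str], str]) -> None:
        self._clean = clean
        # Hypothetical in-memory store of already cleaned versions.
        self._cleaned: Dict[Tuple[str, str], str] = {}

    def on_upload(self, doc_id: str, version: str, content: str) -> str:
        key = (doc_id, version)
        # A new document or a new version triggers exactly one cleaning run.
        if key not in self._cleaned:
            self._cleaned[key] = self._clean(content)
        return self._cleaned[key]
```

Re-uploading an already processed version returns the stored result instead of cleaning again, which keeps the script reproducible and cheap to re-run.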
 **1. Markdown Documents:**
-- Links: do we need to save links? If yes, check that links within the document are not broken. Update or remove broken links.
-- Duplicate removal. Need to ensure that different versions of one document are not treated as duplicates despite being very similar.
-- Table of Contents (ToC): Generate or validate the presence of a ToC for documents longer than 10 pages. Tools like Pandoc or custom scripts can automate this.
-- Extract and clean the text from Markdown files, removing any Markdown syntax.
+- Inter-document links (links that reference another part of the same document or another document in the system): enrich such links with aliases before using them for RAG, i.e. add a descriptive label or name that makes the link more meaningful. This helps the RAG system understand the context and content of the link. Example: "Section 2" -> "Section 2: Data Cleaning Procedures"
+- Plain URLs (links to external web resources): don't modify them; keep them as-is for LLM consumption.
+- Table of Contents (ToC): Generate or validate the presence of a ToC for documents longer than 10 pages.

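The link rules in the added lines could be sketched roughly as follows. This is a minimal illustration, not the project's actual script: `enrich_links` and the `titles` lookup (link target to section or document title) are hypothetical names, and only inline `[label](target)` links are handled.

```python
import re
from typing import Dict

# Matches inline Markdown links of the form [label](target).
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def enrich_links(markdown: str, titles: Dict[str, str]) -> str:
    """Append a descriptive alias to internal links; keep external URLs as-is.

    `titles` is an assumed lookup from a link target to the title of the
    section or document it points to.
    """

    def replace(match: "re.Match[str]") -> str:
        label, target = match.group(1), match.group(2)
        # Plain external URLs are not modified.
        if target.startswith(("http://", "https://")):
            return match.group(0)
        title = titles.get(target)
        # Unknown targets, or labels that already contain the title, stay unchanged.
        if title is None or title in label:
            return match.group(0)
        return f"[{label}: {title}]({target})"

    return LINK_RE.sub(replace, markdown)
```

With `titles = {"#data-cleaning": "Data Cleaning Procedures"}`, a link labelled "Section 2" that points to `#data-cleaning` gets the label "Section 2: Data Cleaning Procedures", matching the example in the list above.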
 **2. Scanned/Image Documents:**
 - Enhance the quality of scans (e.g., adjusting brightness/contrast, removing noise).
-- Optical Character Recognition (OCR) for scans. Is it necessary? If yes, convert scans into text with OCR (Tesseract).
-- Duplicate removal
+- Perform Optical Character Recognition (OCR) on scans. Store both the original scan and its recognized content.
+
+We don't perform duplicate removal for either Markdown documents or images/scans: if a client uploaded several duplicate documents, we assume they had a reason to do so and keep them as they are.

 Cleaned documents and images should be stored separately from the original files in a `cleaned_data` directory. This preserves the original versions for reference and debugging.

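As a rough sketch of keeping the original scan next to its recognized text, the helper below writes OCR output under a separate `cleaned_data` root and never touches the source file. The function name, the `<scan name>.txt` naming scheme, and the injected `ocr` callable are assumptions made for illustration; in practice `ocr` might wrap an OCR engine such as Tesseract.

```python
from pathlib import Path
from typing import Callable


def store_ocr_result(
    scan_path: Path,
    cleaned_root: Path,
    ocr: Callable[[Path], str],
) -> Path:
    """Write the recognized text for a scan under cleaned_root.

    The original scan is only read, never modified or moved.
    """
    cleaned_root.mkdir(parents=True, exist_ok=True)
    # Keep the full file name (e.g. "invoice.png.txt") so that scans that
    # differ only by extension do not overwrite each other.
    text_path = cleaned_root / f"{scan_path.name}.txt"
    text_path.write_text(ocr(scan_path), encoding="utf-8")
    return text_path
```

Passing the OCR step in as a callable keeps the storage layout testable without an OCR engine installed.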
@@ -167,8 +166,6 @@ Cleaned documents and images should be stored separately from the original files

 For Markdown documents, embed metadata in YAML format at the top of each document. For images, metadata can be stored in a separate JSON file with the same name as the image.

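For the image case, a sidecar file could be derived from the image path by swapping the extension. This is a minimal sketch assuming that "same name" means the same file stem with a `.json` suffix; `write_image_metadata` and the metadata fields passed to it are illustrative, not part of the design doc.

```python
import json
from pathlib import Path
from typing import Any, Dict


def sidecar_path(image_path: Path) -> Path:
    """Return the JSON sidecar path for an image, e.g. scan_001.png -> scan_001.json."""
    return image_path.with_suffix(".json")


def write_image_metadata(image_path: Path, metadata: Dict[str, Any]) -> Path:
    """Store metadata next to the image in a JSON file with the same stem."""
    path = sidecar_path(image_path)
    path.write_text(
        json.dumps(metadata, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return path
```

One caveat of this naming scheme: two images that share a stem but differ in extension would map to the same sidecar, so a real pipeline may prefer a different convention.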
-Do we need to add cleaning scripts information into the yaml?
-
 #### Example Metadata Structure for a Markdown Document:

```yaml
@@ -178,10 +175,22 @@ author: "John Doe"
 created_at: "2023-01-01"
 last_modified: "2024-06-30"
 toc:
-  - Introduction
-  - Chapter 1
-  - Chapter 2
-  - Conclusion
+  - chapter: Introduction
+    page_start: 2
+    starts_with: In this article we're about to introduce RAG implementation system for high-load cases.
+    chapter_summary: Introduction to RAG implementation system for high-load cases with author's motivation and real-world examples
+  - chapter: Chapter 1
+    page_start: 3
+    starts_with: Let's consider a situation where we have a platform designed for collaborative work and document sharing among clients.
+    chapter_summary: Problem statement and available data are described.
+  - chapter: Chapter 2
+    page_start: 6
+    starts_with: In order to perform quality RAG, we need the data to be prepared for this.
+    chapter_summary: Data cleaning schema and other aspects.
+  - chapter: Conclusion
+    page_start: 10
+    starts_with: Now let's move on to conclusion.
+    chapter_summary: Conclusion about the ways we can build a system
 summary: "This document provides an overview of..."
 version_info:
   - version: "v1"
@@ -196,9 +205,6 @@ version_info:
     editor: "Jane Smith"
     change_date: "2024-06-30"
     diff: "Updated the introduction and conclusion sections."