
Commit 68efef8

Addressed the comment from review
1 parent 907846c commit 68efef8

File tree

1 file changed: +23 -17 lines changed


Design_Doc_Examples/RAG_Q&A_for collaborative_work_platform.md

Lines changed: 23 additions & 17 deletions
@@ -131,19 +131,18 @@ Clients can access all versions of each document.

 #### ii. Data cleaning

-Data cleaning process should be automatized with reproducible scripts. They need to run with some schedule, regularly + on huge new document upload.
-Question: is version control needed? We already have one in the system. How to deal with that?
+The data cleaning process should be automated with reproducible scripts. The script runs once for each new document uploaded to the system and again for each new version of that document.

 **1. Markdown Documents:**
-- Links: do we need to save links? If yes, check that links within the document are not broken. Update or remove broken links.
-- Duplicate removal. Need to ensure that different versions of one document are not treated as duplicates despite being very similar.
-- Table of Contents (ToC): Generate or validate the presence of a ToC for documents longer than 10 pages. Tools like Pandoc or custom scripts can automate this.
-- Extract and clean the text from Markdown files, removing any Markdown syntax.
+- Inter-document links (links that reference another part of the same document or another document in the system): enrich such links with aliases before using them for RAG purposes, i.e. add a descriptive label or name to the link that makes it more meaningful. This helps the RAG system understand the context and content of the link. Example: "Section 2" -> "Section 2: Data Cleaning Procedures" (see the sketch after this list).
+- Plain URLs (links to external web resources): don't modify them; keep them as-is for LLM consumption.
+- Table of Contents (ToC): Generate or validate the presence of a ToC for documents longer than 10 pages.
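As an illustration of the link-alias step added above, here is a minimal sketch, assuming the cleaning scripts are written in Python; the `SECTION_TITLES` mapping, the link targets, and the function name are hypothetical stand-ins for whatever document/section index the platform actually keeps:

```python
import re

# Hypothetical mapping from an internal link target (an in-document anchor or a
# path to another document in the system) to a descriptive title. In practice
# this would be built from the platform's document index, not hard-coded.
SECTION_TITLES = {
    "#section-2": "Section 2: Data Cleaning Procedures",
    "docs/ocr_pipeline.md": "OCR Pipeline for Scanned Documents",
}

# Matches Markdown links of the form [label](target).
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def enrich_inter_document_links(markdown_text: str) -> str:
    """Replace terse labels of inter-document links with descriptive aliases."""
    def _replace(match: re.Match) -> str:
        label, target = match.group(1), match.group(2)
        if target.startswith(("http://", "https://")):
            return match.group(0)          # plain external URL: keep as-is
        alias = SECTION_TITLES.get(target)
        if alias and alias != label:
            return f"[{alias}]({target})"  # enrich the label with a descriptive alias
        return match.group(0)
    return LINK_RE.sub(_replace, markdown_text)
```

For example, `enrich_inter_document_links('See [Section 2](#section-2).')` returns `'See [Section 2: Data Cleaning Procedures](#section-2).'`, while plain external URLs pass through unchanged.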

 **2. Scanned/Image Documents:**
 - Enhance the quality of scans (e.g., adjusting brightness/contrast, removing noise).
-- Optical Character Recognition (OCR) for scans. Is it necessary? If yes, convert scans into text with OCR (Tesseract).
-- Duplicate removal
+- Perform Optical Character Recognition (OCR) on scans and store both the initial scan and its recognized content (a sketch follows below).
+
+We don't perform duplicate removal for either Markdown documents or image scans: if the client uploaded several duplicate documents, we assume they had a reason to do so, and the system should preserve them.
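A hedged sketch of the scan-handling bullets above, assuming Tesseract as the OCR engine (named in the earlier revision of this section) via the `pytesseract` wrapper, and Pillow for the brightness/contrast and noise adjustments; the enhancement parameters and output layout are placeholders, not a prescribed implementation:

```python
from pathlib import Path

from PIL import Image, ImageEnhance, ImageFilter
import pytesseract  # assumes the Tesseract binary is installed on the host


def clean_and_ocr_scan(scan_path: Path, out_dir: Path) -> str:
    """Enhance a scanned page, store the enhanced scan, and store its recognized text."""
    image = Image.open(scan_path).convert("L")          # grayscale simplifies OCR
    image = ImageEnhance.Contrast(image).enhance(1.5)   # placeholder contrast boost
    image = image.filter(ImageFilter.MedianFilter(3))   # simple noise removal

    out_dir.mkdir(parents=True, exist_ok=True)
    image.save(out_dir / scan_path.name)                # keep the scan alongside its text

    text = pytesseract.image_to_string(image)           # recognized content
    (out_dir / f"{scan_path.stem}.txt").write_text(text, encoding="utf-8")
    return text
```

The original upload is never modified; only the enhanced copy and the recognized text land in the cleaned output directory.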

 Cleaned documents and images should be stored separately from the original files in a `cleaned_data` directory. This ensures keeping the original versions for reference and debugging.
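Tying these steps together, a minimal sketch of the per-document entry point described at the top of this hunk: the script is meant to run once for each uploaded document and once for each new version of it, writing results under `cleaned_data` and leaving the originals untouched. The directory layout and function names are assumptions; `enrich_inter_document_links` and `clean_and_ocr_scan` refer to the sketches above:

```python
from pathlib import Path

CLEANED_ROOT = Path("cleaned_data")  # assumed location, kept separate from the originals


def clean_document(original_path: Path, version: str) -> Path:
    """Run the cleaning pipeline for one uploaded document or one new version of it."""
    out_dir = CLEANED_ROOT / original_path.stem / version
    out_dir.mkdir(parents=True, exist_ok=True)

    if original_path.suffix.lower() in {".md", ".markdown"}:
        text = original_path.read_text(encoding="utf-8")
        text = enrich_inter_document_links(text)        # Markdown cleaning, sketch above
        cleaned_path = out_dir / original_path.name
        cleaned_path.write_text(text, encoding="utf-8")
    else:
        clean_and_ocr_scan(original_path, out_dir)      # scan enhancement + OCR, sketch above
        cleaned_path = out_dir / original_path.name
    return cleaned_path


# Invoked from the upload hook, e.g. clean_document(Path("uploads/design_spec.md"), version="v2")
```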

@@ -167,8 +166,6 @@ Cleaned documents and images should be stored separately from the original files

 For Markdown documents, embed metadata in a YAML format at the top of each document. For images, metadata can be stored in a separate JSON file with the same name as the image.

-Do we need to add cleaning scripts information into the yaml?
-
 #### Example Metadata Structure for a Markdown Document:

 ```yaml
@@ -178,10 +175,22 @@ author: "John Doe"
 created_at: "2023-01-01"
 last_modified: "2024-06-30"
 toc:
-  - Introduction
-  - Chapter 1
-  - Chapter 2
-  - Conclusion
+  - chapter: Introduction
+    page_start: 2
+    starts_with: In this article we're about to introduce RAG implementation system for high-load cases.
+    chapter_summary: Introduction to RAG implementation system for high-load cases with author's motivation and real-world examples
+  - chapter: Chapter 1
+    page_start: 3
+    starts_with: Let's consider a situation where we have a platform designed for collaborative work and document sharing among clients.
+    chapter_summary: Problem statement and available data are described.
+  - chapter: Chapter 2
+    page_start: 6
+    starts_with: In order to perform quality RAG, we need the data to be prepared for this.
+    chapter_summary: Data cleaning schema and other aspects.
+  - chapter: Conclusion
+    page_start: 10
+    starts_with: Now let's move on to conclusion.
+    chapter_summary: Conclusion about the ways we can build a system
 summary: "This document provides an overview of..."
 version_info:
   - version: "v1"
@@ -196,9 +205,6 @@ version_info:
     editor: "Jane Smith"
     change_date: "2024-06-30"
     diff: "Updated the introduction and conclusion sections."
-cleaning_info:
-  cleaned_at: "2024-07-03"
-  tools_used: ["Markdown Linter", "Link Checker"]
 ---
 ```
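To show how the enriched front matter above could be read back during indexing, here is a short sketch assuming PyYAML and the usual `---`-delimited front-matter layout; the file path is hypothetical:

```python
from pathlib import Path

import yaml  # PyYAML


def load_front_matter(markdown_text: str) -> dict:
    """Parse the YAML metadata block at the top of a cleaned Markdown document."""
    if not markdown_text.startswith("---"):
        return {}
    # The front matter sits between the first two "---" delimiters.
    _, front_matter, _body = markdown_text.split("---", 2)
    return yaml.safe_load(front_matter) or {}


metadata = load_front_matter(
    Path("cleaned_data/design_spec/v2/design_spec.md").read_text(encoding="utf-8")
)
for entry in metadata.get("toc", []):
    # Each toc entry carries chapter, page_start, starts_with and chapter_summary,
    # which can be attached to document chunks as retrieval context.
    print(entry["chapter"], "->", entry["chapter_summary"])
```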
