Knowledge doc ingestion #148

Open
wants to merge 2 commits into
base: main
from

Conversation

aakankshaduggal
Member

@aakankshaduggal aakankshaduggal commented Oct 25, 2024


### 3.3 Introducing the Document Chunking Command

- **Command Overview**: We propose a new command, `ilab document format`, which will:

Is there a particular reason for using the word format? The first time I read the command without reading the whole document, it gave me the impression the command was for transforming a document from one format to another format (e.g. pdf to md, or pdf to json). Should we consider a word or verb that more closely resembles what is actually happening in this step? For example:

ilab docs import --input path/to/document.pdf --output path/to/schema
ilab docs ingest --input path/to/document.pdf --output path/to/schema
ilab docs process --input path/to/document.pdf --output path/to/schema

Member

I propose using ilab data instead of ilab docs or ilab document. We already have this command group implemented in the CLI, and it would be good if users can keep their data manipulation in a single command group. I'd like some folks from UX to weigh in here as well.

@JustinXHale JustinXHale Oct 30, 2024

If we think the document-related commands are going to expand considerably, then this might be the time to create a new command group. If document processing will remain a small subset, then integrating the commands under the data group would simplify the CLI structure.

  • ingest: Provide the document path to ingest and process into the desired schema.
  • process: Provide the document path to process into the schema.
  • import: Specify the document path to import and format according to the schema requirements.
  • chunk: Enter the document path to split and structure for the schema.

‘ilab data [verb] [path]’

  • Pro: Keeps all data-related commands in one place, which potentially gives users a unified experience and makes tasks easier to discover. This simplifies the CLI structure.
  • Con: The broad scope may clutter the group with varied tasks. Users who are focused on document-specific actions might find them harder to locate.

‘ilab docs/document’

  • Pro: Creates a clear, dedicated space for document-specific commands, which potentially makes it easier for users working with document-related functions. This leaves a lot of room for scalability.
  • Con: Adds another command group, fragmenting the CLI, especially if the document tasks/commands are minimal. Users might need to switch between command groups if they are working with documents and other data types.
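
To make the two layouts above concrete, here is a minimal sketch of how the same `ingest` verb could hang off either group. It uses the standard-library `argparse` only to stay self-contained; it is not meant to mirror how the ilab CLI is actually built, and `build_parser` is a hypothetical helper, not an existing function.

```python
import argparse


def build_parser(group: str = "data") -> argparse.ArgumentParser:
    """Build a toy `ilab <group> ingest <path>` parser (illustrative only)."""
    parser = argparse.ArgumentParser(prog="ilab")
    groups = parser.add_subparsers(dest="group", required=True)

    # The only difference between the two proposals is this group name.
    group_parser = groups.add_parser(group)
    verbs = group_parser.add_subparsers(dest="verb", required=True)

    ingest = verbs.add_parser(
        "ingest",
        help="Provide the document path to ingest and process into the desired schema.",
    )
    ingest.add_argument("path", help="path to the document")
    return parser


if __name__ == "__main__":
    # Option 1: reuse the existing data group.
    print(build_parser("data").parse_args(["data", "ingest", "doc.pdf"]))
    # Option 2: a dedicated docs group.
    print(build_parser("docs").parse_args(["docs", "ingest", "doc.pdf"]))
```

Either way the verb and its arguments are identical, so the choice is mostly about discoverability rather than implementation effort.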

## 5. InstructLab Schema Overview

### Key Components:
- **Docling JSON Output**: The output from Docling will be the instructlab schema, which serves as the backbone for both SDG and RAG workflows. For specific details around the leaf node path or timestamp, we will include that as a part of the file nomenclature.

As part of the ingestion command, should we consider a flag where the pipeline could augment the metadata of the final output like --metadata ./path-to-metadata.json to add information such as attribution, timestamps, ilab version, schema version, etc.?
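
A rough sketch of what such a flag could do once the file has been read. Everything here is an assumption for illustration: the `metadata` key, the `ingested_at` and `schema_version` field names, and both helper functions are hypothetical, not part of Docling or the proposed schema.

```python
from __future__ import annotations

import json
from datetime import datetime, timezone


def load_metadata(path: str) -> dict:
    """Read the user-supplied metadata file and require a JSON object."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, dict):
        raise ValueError("metadata file must contain a JSON object")
    return data


def augment_metadata(document: dict, extra: dict, schema_version: str = "0.0.0") -> dict:
    """Return a copy of `document` with a merged "metadata" section.

    Defaults (timestamp, schema version) are filled in first, then the
    user-supplied values are applied so that they take precedence.
    """
    merged = dict(document)
    metadata = dict(document.get("metadata", {}))
    metadata.setdefault("ingested_at", datetime.now(timezone.utc).isoformat())
    metadata.setdefault("schema_version", schema_version)
    metadata.update(extra)
    merged["metadata"] = metadata
    return merged
```

The input document is left untouched, so the converter output can still be written out unmodified if needed.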

@aakankshaduggal aakankshaduggal marked this pull request as ready for review October 28, 2024 20:17
@nathan-weinberg
Member

cc @juliadenham

Member

@nathan-weinberg nathan-weinberg left a comment

This is great @aakankshaduggal, thank you for writing it up!

I would like to see this shared in #sdg in upstream Slack and maybe also sent out to dev@instructlab.ai so community members can weigh in as well.


## 3. Proposed Approach

### 3.1 Custom InstructLab Schema Design
Member

It is worth pointing out we have an existing schema package maintained by the @instructlab/schema-maintainers: https://github.com/instructlab/schema

Right now this is only for the Taxonomy schema, but we could extend it with classes designed specifically for this use case.


### 3.2 PDF and Document Conversion via Docling

- **Docling Integration**: We will leverage **Docling** to convert files into structured JSON, which will be the starting point for the instructlab schema. Individual components will post-process this JSON as per the requirements of the specific SDG and RAG workflows.
Member

Could we include a link here to somewhere where folks can read more about Docling, and perhaps a bit as to why Docling is the chosen solution here?


### 3.3 Introducing the Document Chunking Command

- **Command Overview**: We propose a new command, `ilab document format`, which will:
- Take a document path (defined in `qna.yaml`).
- Format and chunk the document into the desired schema.

- **Implementation Details**: Initially, this functionality will be integrated into the existing SDG repository. Over time, it can evolve into a standalone utility, allowing external integrations and wider usage.
Member

What would be the motivation for moving this out of the SDG repository? "allowing external integrations and wider usage" doesn't really tell me much


- **Current Challenge**: Knowledge documents are stored in Git-based repositories, which may be unfamiliar to many users.
- **Proposed Solution**:
- Allow users to input a local directory and provide an automated script that:
Member

Rather than a "script," why not just have this be part of the code? We can detect if a given directory is git-tracked (e.g. by checking for a .git subdirectory) and do the manipulation described if not
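
A minimal sketch of that check, assuming a plain `.git` lookup is good enough. `is_git_tracked` is a hypothetical helper name, and the presence of `.git` is only a heuristic for "this directory is a Git working tree"; it says nothing about whether individual files are committed.

```python
from __future__ import annotations

from pathlib import Path


def is_git_tracked(directory: str | Path, search_parents: bool = False) -> bool:
    """Heuristically detect a Git working tree by looking for a `.git` entry.

    `.git` is a directory in a normal clone but a file in worktrees and
    submodules, so existence is checked rather than `is_dir()`.
    With `search_parents=True`, subdirectories of a repository also match.
    """
    path = Path(directory).resolve()
    candidates = [path, *path.parents] if search_parents else [path]
    return any((candidate / ".git").exists() for candidate in candidates)
```

If this returns `False`, the code could then do the manipulation described in the proposal instead of asking the user to run a separate script.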


Here is a conceptual diagram illustrating the workflow from document ingestion to schema conversion and chunking:

![Knowledge_Document_Ingestion_Workflow](https://github.com/user-attachments/assets/06504b1b-bc8f-4909-b6a2-732a056613c5)
Member

Very nice diagram!

@nathan-weinberg nathan-weinberg linked an issue Oct 29, 2024 that may be closed by this pull request
@nathan-weinberg
Member

Is this related? #120 cc @makelinux

@nathan-weinberg
Member

This one from @jjasghar also seems related? #106

@nathan-weinberg
Member

One more #64

@relyt0925

relyt0925 commented Oct 31, 2024

@aakankshaduggal (trying to envision the overall flow in my head): I almost view this as proposing two independent yet related enhancements. One is the ability to define references to "documents" in a variety of ways rather than just through git references. The other is about new document formats and how they would be ingested.

So do you envision that a user will still declaratively define "pointers" in their taxonomy to the backing doc storage, similar to what is done today in knowledge, as in the following example:

document:
  repo: https://github.com/relyt0925/rbc-knowledge
  commit: 99dae176de4927940aee4faaeb0f645b3ee4582b
  patterns:
    - pdf_chunk*.md

However, this "declarative definition" is now more flexible in the sense that it no longer has to be just repo, commit, pattern. It could be something like a filepath within the base of a taxonomy, which could look something like this:

document:
  local_directory: documents/docchunks/ 
  patterns:
    - pdf_chunk0.md

Then, when ilab data generate processes the leaf node, would this lead the SDG process to look in a local path relative to the "taxonomy base" path for the documents to use in SDG?
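
For illustration, the lookup described in that question could be sketched roughly as follows. The `local_directory` and `patterns` keys come from the example above; `resolve_documents` is a hypothetical helper, and rejecting paths outside the taxonomy base is an added assumption rather than something the proposal specifies.

```python
from __future__ import annotations

from pathlib import Path


def resolve_documents(taxonomy_base: str | Path, document: dict) -> list[Path]:
    """Expand a `document:` entry into concrete files under the taxonomy base."""
    base = Path(taxonomy_base).resolve()
    doc_dir = (base / document["local_directory"]).resolve()

    # Assumption: a leaf node should not be able to reach outside the taxonomy.
    if not doc_dir.is_relative_to(base):
        raise ValueError(f"{doc_dir} is outside the taxonomy base {base}")

    matches: set[Path] = set()
    for pattern in document.get("patterns", []):
        matches.update(p for p in doc_dir.glob(pattern) if p.is_file())
    return sorted(matches)
```

`Path.is_relative_to` needs Python 3.9 or newer.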

(Scoping this comment to topic one, which is really a document-independent topic.) Are there more specifics on the general number of formats that we want to introduce? Do we have specifics on what that document section enhancement would look like? I ask about the other formats to see whether we are bringing in formats that introduce implicit dependencies (for example, an S3 bucket, where the schema would then need a flexible way for the user to define how they want to interact with the COS bucket, which could differ between environments).

@relyt0925

relyt0925 commented Oct 31, 2024

Then 2: the document type enhancement

First question: would it also be accurate to say that as we add new document types (independent of the ways we reference them), we are still going to keep the declarative nature of the taxonomy, where a user explicitly references the document in the taxonomy section? SDG would then, when looking at the document, determine its type and whether it needs to be processed by Docling and chunked. It would then produce the chunks (in the example of a 3 MB PDF file, about 250 md chunks are produced) and ensure those are processed as the "set" of documents for SDG? This would continue if multiple PDF files were defined?

I am curious whether you envision things remaining in that flow, versus what I would call a "pre-processing" flow, where users have to explicitly use the tooling to convert the PDF docs as a prerequisite step to setting up a taxonomy, then create a knowledge repo (that would always contain only markdown documents), and then create a leaf node that points to the markdown documents only. Does the difference I am describing make sense at a high level?

So basically in option 1 which I think is what we are after:
SDG would see as it's parsing the leaf node something like

document:
  local_directory: documents/pdfs/ 
  patterns:
    - pdf1.pdf

Then know in processing: ok this doc is type PDF: first I need to go through and convert the pdf document to markdown chunks. Let me automatically do that. Then ok: now I know all these chunks are the full set of "documents" I am running for the leaf node. Ok let me then take that and run that for the leaf node and now we are off to the races same flow we have currently. Same idea for docx files or any other file type we add.
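
A small sketch of what "option 1" could look like in code: pick a converter per file extension, then hand SDG one flat set of chunks. All names here (`CONVERTERS`, `load_chunks`, `convert_pdf_placeholder`) are hypothetical, and the PDF step is deliberately left as a placeholder rather than guessing at the Docling API.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable, Dict, Iterable, List

Converter = Callable[[Path], List[str]]


def read_markdown(path: Path) -> List[str]:
    """Markdown needs no conversion; treat the whole file as one chunk."""
    return [path.read_text(encoding="utf-8")]


def convert_pdf_placeholder(path: Path) -> List[str]:
    """Placeholder: a real PDF-to-Markdown converter (e.g. Docling) plugs in here."""
    raise NotImplementedError(f"no PDF converter configured for {path.name}")


# Adding docx or another type would mean adding one more entry here.
CONVERTERS: Dict[str, Converter] = {
    ".md": read_markdown,
    ".pdf": convert_pdf_placeholder,
}


def load_chunks(paths: Iterable[Path], converters: Dict[str, Converter] = CONVERTERS) -> List[str]:
    """Convert every referenced document and return the combined chunk set."""
    chunks: List[str] = []
    for path in map(Path, paths):
        suffix = path.suffix.lower()
        if suffix not in converters:
            raise ValueError(f"unsupported document type: {suffix or path.name}")
        chunks.extend(converters[suffix](path))
    return chunks
```

With this shape, the leaf node stays declarative and the user never runs a separate conversion step; the "pre-processing" flow would instead move the converter call out of SDG entirely.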

Successfully merging this pull request may close these issues.

Add Knowledge Document Ingestion Pipeline Design Proposal
5 participants