Docs2DB

A content-focused RAG system.

Docs2DB builds a RAG database from a directory of content files. It:

- Retrieves data from a given location on disk.
- Stores source data in working folders in Docling format.
- Processes data into chunks and embeddings (Granite; other models are possible).
- Loads data into a PostgreSQL database and produces pg_dump files (Milvus is also possible).
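
As a rough sketch, the steps above can be driven one stage at a time from Python using the CLI subcommands documented later in this README (ingest, chunks, embed, load, db-dump). The helper names pipeline_commands and run_pipeline are hypothetical. The sketch assumes the database is already running (make db-up), and it is not necessarily what make docs2db does internally.

```python
import shlex
import subprocess
from pathlib import Path


def pipeline_commands(source: Path) -> list[list[str]]:
    """Return the docs2db CLI invocations for one full run, in order."""
    base = ["uv", "run", "docs2db"]
    return [
        base + ["ingest", str(source)],  # source files -> Docling JSON
        base + ["chunks"],               # Docling JSON -> chunk files
        base + ["embed"],                # chunk files -> embedding files
        base + ["load"],                 # everything -> PostgreSQL
        base + ["db-dump"],              # PostgreSQL -> ragdb_dump.sql
    ]


def run_pipeline(source: Path, dry_run: bool = True) -> None:
    """Print each command; also execute it unless dry_run is set."""
    for cmd in pipeline_commands(source):
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the run at the first failing stage.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(Path("path/to/source/files"))
```

By default the script only prints the commands; pass dry_run=False to execute them.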

Quickstart

make docs2db SOURCE=/Users/me/Documents/my-pdfs

This creates a ragdb_dump.sql file that you can use for RAG in PostgreSQL.

Test your RAG database with:

make db-up
- restarts the database you just created; its data is still there
uv run python ./scripts/rag_demo_client.py --interactive

Ingestion

The ingestion process populates /content with Docling doc files in JSON format.

Ingest documents with uv run docs2db ingest path/to/source/files

Source files can be in a directory structure; that structure is recreated in the /content directory that gets created. Source files may be any type that Docling can ingest: .html, .htm, .pdf, .docx, .pptx, .xlsx, .md, .csv
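
To illustrate the directory mirroring, here is a minimal sketch that maps each supported source file to a location under a content directory. The function names and the "same name with a .json suffix" naming rule are assumptions made for illustration; the actual naming used by docs2db may differ.

```python
from pathlib import Path

# Extensions listed in this README as ingestible by Docling.
SUPPORTED = {".html", ".htm", ".pdf", ".docx", ".pptx", ".xlsx", ".md", ".csv"}


def content_path(source_root: Path, source_file: Path, content_root: Path) -> Path:
    """Mirror source_file's relative location under content_root.

    Assumption: the output keeps the file name and swaps the suffix for .json.
    """
    relative = source_file.relative_to(source_root)
    return content_root / relative.with_suffix(".json")


def plan_ingest(source_root: Path, content_root: Path) -> dict[Path, Path]:
    """Map every supported file under source_root to its mirrored target."""
    return {
        f: content_path(source_root, f, content_root)
        for f in sorted(source_root.rglob("*"))
        if f.is_file() and f.suffix.lower() in SUPPORTED
    }
```

Under this assumed naming rule, two sources that differ only by extension (report.pdf and report.docx) would map to the same target, so a real implementation needs some way to tell them apart.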

Processing

Before a database can be made or RAG can be served, the source documents need embeddings.

The /content directory holds Docling docs in .json format. It also holds chunk and embedding files alongside each of those doc files.

uv run docs2db chunks
- creates a .chunks.json file for each source file
uv run docs2db embed
- creates a .gran.json Granite embedding file for each of these chunks files
uv run docs2db audit
- reports the number of source, chunk and embedding files
- logs warnings

Use uv run docs2db chunks --help or uv run docs2db embed --help to learn more.
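
The file layout described above can be checked with a few lines of Python. This is a minimal sketch of the kind of counting the audit command reports, not its actual implementation. It assumes that a doc named report.json sits next to report.chunks.json and report.gran.json; that naming pattern is inferred from the suffixes above and may not match the real layout exactly.

```python
from pathlib import Path

CHUNKS_SUFFIX = ".chunks.json"
EMBED_SUFFIX = ".gran.json"


def audit(content_root: Path) -> dict[str, int]:
    """Count docs, chunk files, and embedding files, and note missing ones."""
    counts = {
        "docs": 0,
        "chunks": 0,
        "embeddings": 0,
        "missing_chunks": 0,
        "missing_embeddings": 0,
    }
    for path in content_root.rglob("*.json"):
        name = path.name
        if name.endswith(CHUNKS_SUFFIX):
            counts["chunks"] += 1
            # Assumed sibling name: report.chunks.json -> report.gran.json
            embed = path.with_name(name[: -len(CHUNKS_SUFFIX)] + EMBED_SUFFIX)
            if not embed.exists():
                counts["missing_embeddings"] += 1
        elif name.endswith(EMBED_SUFFIX):
            counts["embeddings"] += 1
        else:
            counts["docs"] += 1
            # Assumed sibling name: report.json -> report.chunks.json
            chunks = path.with_name(path.stem + CHUNKS_SUFFIX)
            if not chunks.exists():
                counts["missing_chunks"] += 1
    return counts
```

A non-zero missing_chunks or missing_embeddings count suggests that uv run docs2db chunks or uv run docs2db embed still needs to be run.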

Database

Docs2DB uses PostgreSQL with the pgvector extension for storing documents, chunks, and embeddings.

make db-up
- creates the database if it doesn't exist
- uses existing version of the database if it exists in Docker volumes
make db-down
- stops the container
- data persists across container restarts
make db-drop
- drops the database and all contents
- use when you need a clean slate
make load (or uv run docs2db load)
- loads all documents, chunks, and embeddings into the database
- initializes the database schema
- loads pgvector
make db-status (or uv run docs2db db-status)
- reports the state of the database
  - running
  - initialized
  - contains data
  - detects configuration errors
make db-dump (or uv run docs2db db-dump)
- creates ragdb_dump.sql from the current PostgreSQL database
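
For orientation, here is a minimal sketch of what a similarity lookup against pgvector can look like. The table name chunks and the columns text and embedding are hypothetical, since this README does not describe the actual docs2db schema. The <=> operator is pgvector's cosine distance operator.

```python
# Hypothetical table and column names; the real docs2db schema may differ.
QUERY = """
SELECT text, embedding <=> %(q)s::vector AS distance
FROM chunks
ORDER BY embedding <=> %(q)s::vector
LIMIT %(limit)s
"""


def to_pgvector(values: list[float]) -> str:
    """Format floats as a pgvector text literal such as "[0.5,1.0]"."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def search_params(query_embedding: list[float], limit: int = 3) -> dict:
    """Build the named parameters expected by QUERY."""
    return {"q": to_pgvector(query_embedding), "limit": limit}
```

With a driver that supports named placeholders, such as psycopg, the query would be run as cursor.execute(QUERY, search_params(vector)), where vector is the embedding of the user's question produced by the same model used for the chunks.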

Testing

Try out your RAG database with the demo client:

uv run python scripts/rag_demo_client.py --query "wind energy" --limit 3
uv run python scripts/rag_demo_client.py --interactive

Automated testing requires its own PostgreSQL database. Start one with make db-up-test and run tests with make test (or uv run docs2db test).
