
Open source ETL framework for retrieval augmented generation (RAG). Sync data from your SaaS tools to a vector store, where they can be easily queried by GPT apps


vijaykramesh/sidekick

 
 


Connect your SaaS tools to a vector database and keep your data synced


Sidekick is a framework for integrating with SaaS tools like Salesforce, GitHub, Notion, and Zendesk, and for syncing data between these tools and a vector store. You can use the integrations and chunkers built by the community to get started quickly, or build new integrations and write custom chunkers for different content types based on Sidekick's DataConnector and DataChunker specs.

Demo

To test out a hosted version, join our Slack community and post in the #api-keys channel to request an API key. You can try it on some pre-ingested developer docs by tagging the Sidekick bot in the #sidekick-demo channel.


💎 Features

  • Scrape HTML pages and chunk them
  • Load Markdown files from a GitHub repo and chunk them
  • Connect to Weaviate vector store and load chunks
  • FastAPI endpoints to query vector store directly, or perform Q&A with OpenAI models
  • Slackbot interface to perform Q&A with OpenAI models

Upcoming

  • DataConnector and DataChunker abstractions to make it easier to contribute new connectors/chunkers
  • Connect to Pinecone, Milvus, and Qdrant vector stores

Getting Started - 15 min

To run Sidekick locally:

  1. Install Python 3.10, if not already installed.

  2. Clone the repository: git clone https://github.com/ai-sidekick/sidekick.git

  3. Navigate to the sidekick-server directory: cd /path/to/sidekick/sidekick-server

  4. Install poetry: pip install poetry

  5. Create a new virtual environment with Python 3.10: poetry env use python3.10

  6. Install poetry-dotenv: poetry self add poetry-dotenv

  7. Activate the virtual environment: poetry shell

  8. Install app dependencies: poetry install

  9. Set the required environment variables in a .env file in sidekick-server:

    DATASTORE=weaviate
    BEARER_TOKEN=<your_bearer_token> # Can be any string when running locally, e.g. 22c443d6-0653-43de-9490-450cd4a9836f
    OPENAI_API_KEY=<your_openai_api_key>
    WEAVIATE_HOST=<Your Weaviate instance host address> # Optional, defaults to http://127.0.0.1
    WEAVIATE_PORT=<Your Weaviate port number> # Optional, defaults to 8080. Should be set to 443 for Weaviate Cloud
    WEAVIATE_INDEX=<Your chosen Weaviate class/collection name to store your chunks> # e.g. MarkdownChunk
    

    Note that Weaviate is currently the only supported data store. You can run Weaviate locally with Docker or set up a sandbox cluster to get a Weaviate host address.

  10. Create a file app_config.py in the sidekick-server directory. It should define an object app_config that maps each bearer token to a product_id:

    app_config = {
      "22c443d6-0653-43de-9490-450cd4a9836f": {
        "product_id": "salesforce"
      }
    }
    

    The product_id should be a unique identifier for the source of your data.

  11. Run the API locally: poetry run start

  12. Access the API documentation at http://0.0.0.0:8000/docs and test the API endpoints (make sure to add your bearer token).
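
As a rough illustration of how the settings from steps 9 and 10 fit together, the sketch below resolves the Weaviate address from the environment using the defaults listed above, and looks up the product_id for a bearer token. The helper names weaviate_url and product_id_for are hypothetical and not part of Sidekick; the server's actual loading code may differ.

```python
import os
from typing import Mapping, Optional

# Same shape as the app_config.py example in step 10.
app_config = {
    "22c443d6-0653-43de-9490-450cd4a9836f": {"product_id": "salesforce"},
}


def weaviate_url(env: Mapping[str, str] = os.environ) -> str:
    """Build the Weaviate address, falling back to the documented defaults."""
    host = env.get("WEAVIATE_HOST", "http://127.0.0.1")
    port = env.get("WEAVIATE_PORT", "8080")
    return f"{host}:{port}"


def product_id_for(token: str, config: Mapping[str, dict] = app_config) -> Optional[str]:
    """Return the product_id mapped to a bearer token, or None if unknown."""
    entry = config.get(token)
    return entry["product_id"] if entry else None


if __name__ == "__main__":
    # Pass an empty mapping to see the defaults regardless of the real environment.
    print(weaviate_url({}))
    print(product_id_for("22c443d6-0653-43de-9490-450cd4a9836f"))
```

Passing the environment as an argument keeps the helper easy to check without touching real environment variables.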

For support and questions, join our Slack community.

API Endpoints

The server is built on FastAPI, so when you run it locally you can view the interactive API documentation at <local_host_url>/docs, e.g. http://0.0.0.0:8000/docs.

These are the available API endpoints:

  • /upsert-web-data: Takes a URL as input, uses Playwright to crawl the webpage (and any linked webpages), and loads the pages into the vector store.

  • /query: Endpoint to query the vector database with a string. You can filter by source type (web, markdown, etc.) and set the max number of chunks returned.

  • /ask-llm: Endpoint to get an answer to a question from an LLM, based on the data in the vector store. The response includes the sources used in the answer, the user's intent, and whether the question is answerable from the content in your vector store.
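
A minimal sketch of preparing an authenticated call to one of these endpoints, using only the Python standard library. It assumes the endpoints accept a JSON body via POST and read the bearer token from an Authorization header; neither is confirmed here, and the payload field in the usage example is a guess. Check the /docs page for the actual request schema. The helper build_request is hypothetical.

```python
import json
import urllib.request


def build_request(base_url: str, endpoint: str, token: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST request with a bearer token."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}",
        data=body,
        method="POST",  # assumption: the endpoints accept POST
        headers={
            "Authorization": f"Bearer {token}",  # assumption: standard bearer header
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # The "query" field name is a placeholder, not the documented schema.
    req = build_request(
        "http://0.0.0.0:8000",
        "/query",
        "22c443d6-0653-43de-9490-450cd4a9836f",
        {"query": "How do I reset my password?"},
    )
    print(req.get_method(), req.full_url)
    # Sending requires a running server:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp))
```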

Contributing

Sidekick is open for contribution! To add a new data connector, follow these steps:

  1. Create a new folder under connectors named <data-source>-connector where <data-source> is the name of the source you are connecting to.
  2. Add a file load.py to this folder with a function load_data that returns List[DocumentChunk].
  3. Create a new endpoint in /server/main.py that calls load_data.
  4. Add the new source type in models/models.py.
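
The shape of a connector's load.py might look roughly like the sketch below. The DocumentChunk dataclass here is a hypothetical stand-in so the example runs on its own; the real class lives in models/models.py and its fields may differ. The arguments of load_data and the fixed-size splitting are likewise assumptions for illustration, not Sidekick's chunking logic.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DocumentChunk:
    """Hypothetical stand-in; the real class is defined in models/models.py."""
    text: str
    source_id: str


def load_data(documents: Dict[str, str], max_chars: int = 500) -> List[DocumentChunk]:
    """Split each document into fixed-size chunks (a deliberately naive chunker)."""
    chunks: List[DocumentChunk] = []
    for source_id, text in documents.items():
        # Step through the text in windows of max_chars characters.
        for start in range(0, len(text), max_chars):
            chunks.append(
                DocumentChunk(text=text[start:start + max_chars], source_id=source_id)
            )
    return chunks


if __name__ == "__main__":
    for chunk in load_data({"faq": "How do I log in? " * 50}, max_chars=200):
        print(chunk.source_id, len(chunk.text))
```

A real connector would fetch the documents from the data source before chunking them; the dictionary argument stands in for that step.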

Acknowledgments

