cohere-ai/quick-start-connectors: code for integrating workplace datastores with Cohere's LLMs to perform RAG #664

irthomasthomas opened this issue Feb 28, 2024 · 1 comment
Labels

  • data-validation: Validating data structures and formats
  • llm: Large Language Models
  • llm-applications: Topics related to practical applications of Large Language Models in various fields
  • llm-function-calling: Function Calling with Large Language Models
  • RAG: Retrieval Augmented Generation for LLMs
  • source-code: Code snippets

TITLE

cohere-ai/quick-start-connectors: This open-source repository offers reference code for integrating workplace datastores with Cohere's LLMs, enabling developers and businesses to perform seamless retrieval-augmented generation (RAG) on their own data.

DESCRIPTION

Quick Start Connectors

Table of Contents

  • Overview
  • Features
  • Getting Started
  • Contributing

Overview
Cohere's Build-Your-Own-Connector framework allows you to integrate Cohere's Command LLM, via the Chat API endpoint, with any datastore or software that holds text information and exposes a corresponding search endpoint in its API. This allows the Command model to generate responses to user queries that are grounded in proprietary information.

Some examples of the use cases you can enable with this framework:

  • Generic question answering over broad internal company docs
  • Knowledge work with a specific subset of internal knowledge
  • Internal comms summarization and search
  • Research using external information providers, allowing researchers and writers to explore information from third parties

This open-source repository contains code that will let you get started integrating with some of the most popular datastores. There is also an empty template connector which you can expand to use any datasource. Note that different datastores may have different requirements or limitations that need to be addressed in order to get good-quality responses. While some of the quickstart code has been enhanced to address these limitations, other connectors only provide the basics of the integration, and you will need to develop them further to fit your specific use case and the limitations of the underlying datastore. A minimal sketch of what a connector amounts to is shown below.
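At its core, a connector is a small web service with a search endpoint that forwards the incoming query to your datastore and returns matching documents as flat JSON records. The following is a sketch only: the use of Flask, the in-memory document list, and the exact request/response field names are assumptions to verify against the connectors documentation linked below.

from flask import Flask, request, jsonify

app = Flask(__name__)

# Stand-in for a real datastore; replace with a call to your system's search API.
DOCS = [
    {
        "title": "Vacation policy",
        "text": "Employees accrue 20 days of paid vacation per year.",
        "url": "https://intranet.example.com/hr/vacation",
    },
]

@app.post("/search")
def search():
    # The Chat endpoint POSTs a JSON body containing the search query.
    query = request.get_json().get("query", "")
    # Naive substring match for illustration; a real connector would call
    # the datastore's own search endpoint here.
    results = [d for d in DOCS if query.lower() in d["text"].lower()]
    return jsonify({"results": results})

if __name__ == "__main__":
    app.run(port=5000)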

Please read more about our connectors framework here: https://docs.cohere.com/docs/connectors
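Once a connector is deployed and registered, the Chat endpoint can ground its answers in the documents the connector returns. A hedged sketch of that flow with the Python SDK follows; method names and response fields vary across SDK versions, so treat them as assumptions to check against the API reference.

import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

# Register the deployed connector so the Chat endpoint can call its search route
# (assumed v5-style SDK method; older SDKs expose a similar create call).
created = co.connectors.create(
    name="internal-wiki",
    url="https://connector.example.com/search",
)

# Ask a question grounded in the datastore behind the connector.
response = co.chat(
    message="What is our vacation policy?",
    connectors=[{"id": created.connector.id}],
)
print(response.text)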

URL

https://github.com/cohere-ai/quick-start-connectors

Suggested labels

{'label-name': 'data-integration', 'label-description': "Involves integrating Cohere's LLMs with various data sources for retrieval-augmented generation.", 'confidence': 63.24}


Related issues

#553: Cohere llm api: Retrieval Augmented Generation (RAG)

### Details

Similarity score: 0.88

- [ ] [Retrieval Augmented Generation (RAG)](https://docs.cohere.com/docs/retrieval-augmented-generation-rag)

Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) is a method for generating text using additional information fetched from an external data source. Providing relevant documents to the model can greatly increase the accuracy of the response. The Chat API in combination with the Command model makes it easy to generate text that is grounded in supplementary information.

For example, the code snippet below will produce an answer to "Where do the tallest penguins live?" along with inline citations based on the provided documents.
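The snippet itself is not reproduced in this excerpt; a minimal sketch in the spirit of the linked docs (the document contents, field names, and client call shape are assumptions to check against the current Cohere reference) looks like this:

import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

response = co.chat(
    message="Where do the tallest penguins live?",
    documents=[
        {"title": "Tall penguins", "snippet": "Emperor penguins are the tallest."},
        {"title": "Penguin habitats", "snippet": "Emperor penguins only live in Antarctica."},
    ],
)

print(response.text)       # the grounded answer
print(response.citations)  # spans linking the answer back to the documents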

More about Retrieval Augmented Generation (RAG)

Suggested labels

{'label-name': 'text-generation-method', 'label-description': 'Method for generating text using external information', 'confidence': 54.99}

#396: astra-assistants-api: A backend implementation of the OpenAI beta Assistants API

### Details

Similarity score: 0.85

- [ ] [datastax/astra-assistants-api: A backend implementation of the OpenAI beta Assistants API](https://github.com/datastax/astra-assistants-api)

Astra Assistant API Service

A drop-in compatible service for the OpenAI beta Assistants API with support for persistent threads, files, assistants, messages, retrieval, function calling and more, using AstraDB (DataStax's database-as-a-service offering, powered by Apache Cassandra and jvector).

Compatible with existing OpenAI apps via the OpenAI SDKs by changing a single line of code.

Getting Started

  1. Create an Astra DB Vector database
  2. Replace the following code:
client = OpenAI(
    api_key=OPENAI_API_KEY,
)

with:

client = OpenAI(
    base_url="https://open-assistant-ai.astra.datastax.com/v1", 
    api_key=OPENAI_API_KEY,
    default_headers={
        "astra-api-token": ASTRA_DB_APPLICATION_TOKEN,
    }
)

Or, if you have an existing Astra DB, you can pass your db_id in a second header:

client = OpenAI(
    base_url="https://open-assistant-ai.astra.datastax.com/v1", 
    api_key=OPENAI_API_KEY,
    default_headers={
        "astra-api-token": ASTRA_DB_APPLICATION_TOKEN,
        "astra-db-id": ASTRA_DB_ID
    }
)
  3. Create an assistant
assistant = client.beta.assistants.create(
  instructions="You are a personal math tutor. When asked a math question, write and run code to answer the question.",
  model="gpt-4-1106-preview",
  tools=[{"type": "retrieval"}]
)
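
As a usage sketch (assuming the standard OpenAI beta Assistants flow, which this service mirrors), you can then run the assistant on a thread:

thread = client.beta.threads.create()

client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="What is 3 + 11^2?",
)

run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
)
print(run.status)  # poll until the run completes, then read the thread's messages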

By default, the service uses AstraDB as the database/vector store and OpenAI for embeddings and chat completion.

Third party LLM Support

We now support many third-party models for both embeddings and completions, thanks to litellm. Pass your provider's API key and embedding model using the api-key and embedding-model headers, as in the sketch below.
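
For example, for a generic provider (the token variable and model name here are illustrative, not from the README):

client = OpenAI(
    base_url="https://open-assistant-ai.astra.datastax.com/v1",
    api_key="NONE",
    default_headers={
        "astra-api-token": ASTRA_DB_APPLICATION_TOKEN,
        "api-key": THIRD_PARTY_API_KEY,          # your model provider's key (illustrative)
        "embedding-model": "embed-english-v3.0", # illustrative embedding model name
    },
)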

For AWS Bedrock, you can pass additional custom headers:

client = OpenAI(
    base_url="https://open-assistant-ai.astra.datastax.com/v1", 
    api_key="NONE",
    default_headers={
        "astra-api-token": ASTRA_DB_APPLICATION_TOKEN,
        "embedding-model": "amazon.titan-embed-text-v1",
        "LLM-PARAM-aws-access-key-id": BEDROCK_AWS_ACCESS_KEY_ID,
        "LLM-PARAM-aws-secret-access-key": BEDROCK_AWS_SECRET_ACCESS_KEY,
        "LLM-PARAM-aws-region-name": BEDROCK_AWS_REGION,
    }
)

And again, specify the custom model for the assistant:

assistant = client.beta.assistants.create(
    name="Math Tutor",
    instructions="You are a personal math tutor. Answer questions briefly, in a sentence or less.",
    model="meta.llama2-13b-chat-v1",
)

Additional examples including third party LLMs (bedrock, cohere, perplexity, etc.) can be found under examples.

To run the examples using poetry:

  1. Create a .env file in this directory with your secrets.
  2. Run:
poetry install
poetry run python examples/completion/basic.py
poetry run python examples/retreival/basic.py
poetry run python examples/function-calling/basic.py

Coverage

See our coverage report here.

Roadmap

  • Support for other embedding models and LLMs
  • Function calling
  • Pluggable RAG strategies
  • Streaming support

Suggested labels

{ "key": "llm-function-calling", "value": "Integration of function calling with Large Language Models (LLMs)" }

#644: cohereai_classify table | CohereAI plugin | Steampipe Hub

### Details

Similarity score: 0.85

- [ ] [cohereai_classify table | CohereAI plugin | Steampipe Hub](https://hub.steampipe.io/plugins/mr-destructive/cohereai/tables/cohereai_classify)

TITLE: cohereai_classify table | CohereAI plugin | Steampipe Hub

DESCRIPTION:
steampipe plugin install mr-destructive/cohereai

Tables (8):

cohereai_classify
cohereai_detect_language
cohereai_detokenize
cohereai_embed
cohereai_generation
cohereai_summaraize
cohereai_summarize
cohereai_tokenize


Table: cohereai_classify

Get classifications for a given list of input strings and examples.

Notes:

  • inputs is a list of strings to classify (maximum 96 strings).
  • examples is a list of {"text": "apple", "label": "fruit"} objects of type Example.
  • At least 2 examples must be provided; the maximum is 2500 examples, each at most 512 tokens.

Examples

Basic classification with a given set of inputs and examples

select
  classification
from
  cohereai_classify
where
  inputs = '["apple", "blue", "pineapple"]'
  and examples = '[{"text": "apple", "label": "fruit"}, {"text": "green", "label": "color"}, {"text": "grapes", "label": "fruit"}, {"text": "purple", "label": "color"}]';

Classification with specific settings (model, preset)

select
  classification
from
  cohereai_classify
where
  settings = '{"model": "embed-multilingual-v2.0"}'
  and inputs = '["Help!", "Call me when you can"]'
  and examples = '[{"text": "Help!", "label": "urgent"}, {"text": "SOS", "label": "urgent"}, {"text": "Call me when you can", "label": "not urgent"}, {"text": "Talk later?", "label": "not urgent"}]';

Email Spam Classification

select
  classification
from
  cohereai_classify
where
  inputs = '["Confirm your email address", "hey i need u to send some $"]'
  and examples = '[{"label": "Spam", "text": "Dermatologists don't like her!"}, {"label": "Spam", "text": "Hello, open to this?"}, {"label": "Spam", "text": "I need help please wire me $1000 right now"}, {"label": "Spam", "text": "Hot new investment, don't miss this!"}, {"label": "Spam", "text": "Nice to know you ;)"}, {"label": "Spam", "text": "Please help me?"}, {"label": "Not spam", "text": "Your parcel will be delivered today"}, {"label": "Not spam", "text": "Review changes to our Terms and Conditions"}, {"label": "Not spam", "text": "Weekly sync notes"}, {"label": "Not spam", "text": "Re: Follow up from today's meeting"}, {"label": "Not spam", "text": "Pre-read for tomorrow"}]';

Schema for cohereai_classify

  • _ctx (jsonb): Steampipe context in JSON form, e.g. connection_name.
  • classification (text): The classification results for the given input text(s).
  • confidence (double precision): The confidence score of the classification.
  • examples (text): The example text classified.
  • id (text): The ID of the classification.
  • inputs (text): The input text that was classified.
  • labels (jsonb): The labels of the classification.
  • settings (jsonb): A JSONB object that accepts any of the classify API request parameters.

URL: cohereai_classify table | CohereAI plugin | Steampipe Hub

Suggested labels

#315: A Cheat Sheet and Some Recipes For Building Advanced RAG | by Andrei | Jan, 2024 | LlamaIndex Blog

### Details

Similarity score: 0.84

- [ ] [A Cheat Sheet and Some Recipes For Building Advanced RAG | by Andrei | Jan, 2024 | LlamaIndex Blog](https://blog.llamaindex.ai/a-cheat-sheet-and-some-recipes-for-building-advanced-rag-803a9d94c41b)

A comprehensive RAG Cheat Sheet detailing motivations for RAG as well as techniques and strategies for progressing beyond Basic or Naive RAG builds. (high-resolution version)
It’s the start of a new year and perhaps you’re looking to break into the RAG scene by building your very first RAG system. Or, maybe you’ve built Basic RAG systems and are now looking to enhance them to something more advanced in order to better handle your user’s queries and data structures.
In either case, knowing where or how to begin may be a challenge in and of itself! If that’s true, then hopefully this blog post points you in the right direction for your next steps, and moreover, provides for you a mental model for you to anchor your decisions when building advanced RAG systems.
The RAG cheat sheet shared above was greatly inspired by a recent RAG survey paper (“Retrieval-Augmented Generation for Large Language Models: A Survey” Gao, Yunfan, et al. 2023).
Basic RAG
Mainstream RAG as defined today involves retrieving documents from an external knowledge database and passing these along with the user’s query to an LLM for response generation. In other words, RAG involves a Retrieval component, an External Knowledge database and a Generation component.
LlamaIndex Basic RAG Recipe:
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# load data
documents = SimpleDirectoryReader(input_dir="...").load_data()

# build VectorStoreIndex that takes care of chunking documents
# and encoding chunks to embeddings for future retrieval
index = VectorStoreIndex.from_documents(documents=documents)

# the QueryEngine class is equipped with the generator
# and facilitates the retrieval and generation steps
query_engine = index.as_query_engine()

# use your default RAG
response = query_engine.query("A user's query")

Suggested labels

{ "key": "RAG-Building", "value": "Techniques and strategies for building advanced Retrieval Augmented Generation systems for language models" }

#311: Introduction | Mistral AI Large Language Models

### Details

Similarity score: 0.84

- [ ] [Introduction | Mistral AI Large Language Models](https://docs.mistral.ai/)

Mistral AI currently provides two types of access to Large Language Models:

  • An API providing pay-as-you-go access to our latest models
  • Open-source models under the Apache 2.0 License, available on Hugging Face or directly from the documentation
Where to start?

API Access

Our API is currently in beta to ramp up the load and provide good quality of service. Access the platform to join the waitlist. Once your subscription is active, you can immediately use our chat endpoint:

curl --location "https://api.mistral.ai/v1/chat/completions"
--header 'Content-Type: application/json'
--header 'Accept: application/json'
--header "Authorization: Bearer $MISTRAL_API_KEY"
--data '{
"model": "mistral-tiny",
"messages": [{"role": "user", "content": "Who is the most renowned French painter?"}]
}'

Or our embeddings endpoint:

curl --location "https://api.mistral.ai/v1/embeddings"
--header 'Content-Type: application/json'
--header 'Accept: application/json'
--header "Authorization: Bearer $MISTRAL_API_KEY"
--data '{
"model": "mistral-embed",
"input": ["Embed this sentence.", "As well as this one."]
}'

For a full description of the models offered on the API, head on to the model docs.

For more examples on how to use our platform, head on to our platform docs.

Raw model weights

Raw model weights can be used in several ways:

  • For self-deployment, on cloud or on premise, using either TensorRT-LLM or vLLM, head on to Deployment
  • For research, head on to our reference implementation repository
  • For local deployment on consumer-grade hardware, check out the llama.cpp project or Ollama
Get Help

Join our Discord community to discuss our models and talk to our engineers. Alternatively, reach out to our business team if you have enterprise needs, want more information about our products or if there are missing features you would like us to add.

Contributing

Mistral AI is committed to open source software development and welcomes external contributions. Please open a PR!

Suggested labels

{ "key": "llm-api", "value": "Accessing Large Language Models through the Mistral AI API" }

#386: SciPhi/AgentSearch-V1 · Datasets at Hugging Face

### Details

Similarity score: 0.84

- [ ] [SciPhi/AgentSearch-V1 · Datasets at Hugging Face](https://huggingface.co/datasets/SciPhi/AgentSearch-V1)

Getting Started

The AgentSearch-V1 dataset is a comprehensive collection of over one billion embeddings, produced using jina-v2-base. It includes more than 50 million high-quality documents and over 1 billion passages, covering a vast range of content from sources such as Arxiv, Wikipedia, Project Gutenberg, and includes carefully filtered Creative Commons (CC) data. Our team is dedicated to continuously expanding and enhancing this corpus to improve the search experience. We welcome your thoughts and suggestions – please feel free to reach out with your ideas!

To access and utilize the AgentSearch-V1 dataset, you can stream it via HuggingFace with the following Python code:

from datasets import load_dataset
import json
import numpy as np

# To stream the entire dataset:
ds = load_dataset("SciPhi/AgentSearch-V1", data_files="**/*", split="train", streaming=True)

# Optional: stream just the "arxiv" subset
# ds = load_dataset("SciPhi/AgentSearch-V1", data_files="arxiv/*", split="train", streaming=True)

# To process the entries:
for entry in ds:
    embeddings = np.frombuffer(
        entry['embeddings'], dtype=np.float32
    ).reshape(-1, 768)
    text_chunks = json.loads(entry['text_chunks'])
    metadata = json.loads(entry['metadata'])
    print(f'Embeddings:\n{embeddings}\n\nChunks:\n{text_chunks}\n\nMetadata:\n{metadata}')
    break

A full set of scripts to recreate the dataset from scratch can be found here. Further, you may check the docs for details on how to perform RAG over AgentSearch.

Languages

English.

Dataset Structure

The raw dataset structure is as follows:

{
    "url": ...,
    "title": ...,
    "metadata": {"url": "...", "timestamp": "...", "source": "...", "language": "..."},
    "text_chunks": ...,
    "embeddings": ...,
    "dataset": "book" | "arxiv" | "wikipedia" | "stack-exchange" | "open-math" | "RedPajama-Data-V2"
}

Dataset Creation

This dataset was created as a step towards making humanity's most important knowledge openly searchable and LLM-optimal. It was created by filtering, cleaning, and locally augmenting publicly available datasets.

To cite our work, please use the following:

@software{SciPhi2023AgentSearch,
  author = {SciPhi},
  title = {AgentSearch [ΨΦ]: A Comprehensive Agent-First Framework and Dataset for Webscale Search},
  year = {2023},
  url = {https://github.com/SciPhi-AI/agent-search}
}

Source Data

@online{wikidump,
  author = "Wikimedia Foundation",
  title = "Wikimedia Downloads",
  url = "https://dumps.wikimedia.org"
}

@misc{paster2023openwebmath,
  title = {OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text},
  author = {Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba},
  year = {2023},
  eprint = {2310.06786},
  archivePrefix = {arXiv},
  primaryClass = {cs.AI}
}

@software{together2023redpajama,
  author = {Together Computer},
  title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
  month = {April},
  year = {2023},
  url = {https://github.com/togethercomputer/RedPajama-Data}
}

License

Please refer to the licenses of the data subsets you use.

  • Open-Web (Common Crawl Foundation Terms of Use)
  • Books: the_pile_books3 license and pg19 license
  • ArXiv Terms of Use
  • Wikipedia License
  • StackExchange license on the Internet Archive

Suggested labels

{ "key": "knowledge-dataset", "value": "A dataset with one billion embeddings from various sources, such as Arxiv, Wikipedia, Project Gutenberg, and carefully filtered Creative Commons data" }
