# chroma/README.md at main · chroma-core/chroma #678
## Related issues

### #325: simonw/llm pattern to migrate llm cli embeddings to chromadb

Similarity score: 0.93

> **Note:**
>
> - [ ] [Support for plugins that implement vector indexes · Issue #216 · simonw/llm](https://github.com/simonw/llm/issues/216)

@simonw happy to help here.

Here is how Chroma's roadmap aligns with your goals.

The internals of Chroma are set up to have pluggable indexes on top of collections - we haven't yet exposed this to end users, but we will fairly soon. We also plan to have a "smart index" that does brute-force KNN and then cuts over to ANN.

Indexes over multiple collections - while I do understand the use case, we've chosen not to prioritize this, as it adds a lot of DX complexity. We instead encourage users to eat the "read amplification": query multiple collections/indexes and then cull/rerank the results client side.

You may also enjoy reading this Chroma proposal, where we have put a lot of thought into the pipelines to support index/collection creation and access: [chroma-core/chroma#1110](https://github.com/chroma-core/chroma/issues/1110)

@dave1010 mentioned this issue: Add option for RAG-style augmentation [dave1010/clipea#1](https://github.com/dave1010/clipea/issues/1) (open).

@IvanVas commented:

If someone ever needs to move data from llm to Chroma, below is a simple script to do so. It needs a little more work to productise it, though.

@simonw, hope it helps if you ever need to create something like `llm embed-multi migrate --chroma`.

```python
import sqlite3
import struct

import chromadb


def decode(binary):
    """Unpack a little-endian float32 blob into a tuple of floats."""
    if not isinstance(binary, bytes):
        raise ValueError("Binary data must be of bytes type")
    return struct.unpack("<" + "f" * (len(binary) // 4), binary)


client = chromadb.PersistentClient(path="chroma.db")  # FIXME

collection_name = "collection"  # FIXME
collection = client.get_or_create_collection(collection_name)

# Path to your SQLite database
db_path = "/Users/username/Library/Application Support/io.datasette.llm/embeddings.db"  # FIXME

# Query to retrieve embeddings
llm_collection = "fixme"  # FIXME
query = f"""
SELECT id, embedding, content FROM embeddings WHERE collection_id = (
    SELECT id FROM collections WHERE name = '{llm_collection}'
)
"""


def parse_id(id_str):
    # FIXME: parse metadata out of the id string
    return {
        "meta1": "meta1",
    }


def main(db_path, batch_size=1000):
    # Connect to the SQLite database
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    # Query the embeddings table
    cursor.execute(query)
    rows = cursor.fetchall()  # FIXME: assumes there is not much data; works fine for ~100k records

    # Initialize batch variables
    batch_embeddings = []
    batch_documents = []  # You'll need to adjust how you handle documents
    batch_metadatas = []  # You'll need to adjust how you handle metadatas
    batch_ids = []

    for i, (row_id, embedding, content) in enumerate(rows):
        id_parsed = parse_id(row_id)

        # Decode the binary data
        decoded_embedding = decode(embedding)

        # Append to the batch
        batch_embeddings.append(list(decoded_embedding))
        batch_documents.append(content)
        batch_metadatas.append({"meta1": id_parsed["meta1"]})  # FIXME
        batch_ids.append(row_id)

        # When batch size is reached or end of rows, flush the batch to Chroma
        if len(batch_embeddings) == batch_size or i == len(rows) - 1:
            collection.add(
                embeddings=batch_embeddings,
                documents=batch_documents,
                metadatas=batch_metadatas,
                ids=batch_ids,
            )
            print(f"Added {len(batch_embeddings)} rows to ChromaDB (Total: {i + 1})")

            # Reset batch variables
            batch_embeddings = []
            batch_documents = []
            batch_metadatas = []
            batch_ids = []

    # Close the database connection
    conn.close()


if __name__ == "__main__":
    main(db_path)
```

#### Suggested labels

- { "key": "llm-inference-engines", "value": "Software and tools for running inference on Large Language Models (LLMs)" }
- { "key": "llm-quantization", "value": "All about quantized LLM models and their serving" }

### #625: unsloth/README.md at main · unslothai/unsloth

Similarity score: 0.91

- [ ] [unsloth/README.md at main · unslothai/unsloth](https://github.com/unslothai/unsloth/blob/main/README.md?plain=1)

✨ Finetune for Free - All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM, or uploaded to Hugging Face.
- 🦥 Unsloth.ai News
- 🔗 Links and Resources
- ⭐ Key Features
- 🥇 Performance Benchmarking
#### Suggested labels

### #396: astra-assistants-api: A backend implementation of the OpenAI beta Assistants API

Similarity score: 0.89

- [ ] [datastax/astra-assistants-api: A backend implementation of the OpenAI beta Assistants API](https://github.com/datastax/astra-assistants-api)

**Astra Assistant API Service**

A drop-in compatible service for the OpenAI beta Assistants API with support for persistent threads, files, assistants, messages, retrieval, function calling, and more, using AstraDB (DataStax's db-as-a-service offering powered by Apache Cassandra and jvector). Compatible with existing OpenAI apps via the OpenAI SDKs by changing a single line of code.

**Getting Started**

Replace:
```python
client = OpenAI(
    api_key=OPENAI_API_KEY,
)
```

with:

```python
client = OpenAI(
    base_url="https://open-assistant-ai.astra.datastax.com/v1",
    api_key=OPENAI_API_KEY,
    default_headers={
        "astra-api-token": ASTRA_DB_APPLICATION_TOKEN,
    }
)
```

Or, if you have an existing Astra DB, you can pass your db_id in a second header:

```python
client = OpenAI(
    base_url="https://open-assistant-ai.astra.datastax.com/v1",
    api_key=OPENAI_API_KEY,
    default_headers={
        "astra-api-token": ASTRA_DB_APPLICATION_TOKEN,
        "astra-db-id": ASTRA_DB_ID
    }
)
```
Then create an assistant:

```python
assistant = client.beta.assistants.create(
    instructions="You are a personal math tutor. When asked a math question, write and run code to answer the question.",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}]
)
```

By default, the service uses AstraDB as the database/vector store and OpenAI for embeddings and chat completion.

**Third party LLM support**

We now support many third-party models for both embeddings and completion thanks to litellm; pass the API key of your service using the appropriate custom header. For AWS Bedrock, for example, you can pass additional custom headers:

```python
client = OpenAI(
    base_url="https://open-assistant-ai.astra.datastax.com/v1",
    api_key="NONE",
    default_headers={
        "astra-api-token": ASTRA_DB_APPLICATION_TOKEN,
        "embedding-model": "amazon.titan-embed-text-v1",
        "LLM-PARAM-aws-access-key-id": BEDROCK_AWS_ACCESS_KEY_ID,
        "LLM-PARAM-aws-secret-access-key": BEDROCK_AWS_SECRET_ACCESS_KEY,
        "LLM-PARAM-aws-region-name": BEDROCK_AWS_REGION,
    }
)
```

and, again, specify the custom model for the assistant:

```python
assistant = client.beta.assistants.create(
    name="Math Tutor",
    instructions="You are a personal math tutor. Answer questions briefly, in a sentence or less.",
    model="meta.llama2-13b-chat-v1",
)
```

Additional examples, including third-party LLMs (Bedrock, Cohere, Perplexity, etc.), can be found under the `examples` directory. To run the examples using poetry:
```bash
poetry install
poetry run python examples/completion/basic.py
poetry run python examples/retreival/basic.py
poetry run python examples/function-calling/basic.py
```

**Coverage**

See our coverage report here.

**Roadmap**
Suggested labels{ "key": "llm-function-calling", "value": "Integration of function calling with Large Language Models (LLMs)" }#386: SciPhi/AgentSearch-V1 · Datasets at Hugging Face### DetailsSimilarity score: 0.88 - [ ] [SciPhi/AgentSearch-V1 · Datasets at Hugging Face](https://huggingface.co/datasets/SciPhi/AgentSearch-V1)Getting StartedThe AgentSearch-V1 dataset is a comprehensive collection of over one billion embeddings, produced using jina-v2-base. It includes more than 50 million high-quality documents and over 1 billion passages, covering a vast range of content from sources such as Arxiv, Wikipedia, Project Gutenberg, and includes carefully filtered Creative Commons (CC) data. Our team is dedicated to continuously expanding and enhancing this corpus to improve the search experience. We welcome your thoughts and suggestions – please feel free to reach out with your ideas! To access and utilize the AgentSearch-V1 dataset, you can stream it via HuggingFace with the following Python code: from datasets import load_dataset
import json
import numpy as np
# To stream the entire dataset:
ds = load_dataset("SciPhi/AgentSearch-V1", data_files="**/*", split="train", streaming=True)
# Optional, stream just the "arxiv" dataset
# ds = load_dataset("SciPhi/AgentSearch-V1", data_files="**/*", split="train", data_files="arxiv/*", streaming=True)
# To process the entries:
for entry in ds:
embeddings = np.frombuffer(
entry['embeddings'], dtype=np.float32
).reshape(-1, 768)
text_chunks = json.loads(entry['text_chunks'])
metadata = json.loads(entry['metadata'])
print(f'Embeddings:\n{embeddings}\n\nChunks:\n{text_chunks}\n\nMetadata:\n{metadata}')
break A full set of scripts to recreate the dataset from scratch can be found here. Further, you may check the docs for details on how to perform RAG over AgentSearch. LanguagesEnglish. Dataset StructureThe raw dataset structure is as follows: {
"url": ...,
"title": ...,
"metadata": {"url": "...", "timestamp": "...", "source": "...", "language": "..."},
"text_chunks": ...,
"embeddings": ...,
"dataset": "book" | "arxiv" | "wikipedia" | "stack-exchange" | "open-math" | "RedPajama-Data-V2"
}
```

**Dataset Creation**

This dataset was created as a step towards making humanity's most important knowledge openly searchable and LLM-optimal. It was created by filtering, cleaning, and augmenting locally publicly available datasets.

To cite our work, please use the SciPhi2023AgentSearch BibTeX entry; the source-data citations include wikidump, paster2023openwebmath, and together2023redpajama.

**License**

Please refer to the licenses of the data subsets you use.
Suggested labels{ "key": "knowledge-dataset", "value": "A dataset with one billion embeddings from various sources, such as Arxiv, Wikipedia, Project Gutenberg, and carefully filtered Creative Commons data" }#74: Sqlite WASM and Github Pages### DetailsSimilarity score: 0.88 - [ ] [merge sqlite - Kagi Search](https://kagi.com/search?q=merge+sqlite&from_date=2021-01-01) - [ ] [sql - Fastest Way merge two SQLITE Databases - Stack Overflow](https://stackoverflow.com/questions/9349659/fastest-way-merge-two-sqlite-databases) - [ ] [simonw/sqlite-utils: Python CLI utility and library for manipulating SQLite databases](https://github.com/simonw/sqlite-utils) - [ ] [sqlite-utils](https://sqlite-utils.datasette.io/en/stable/) - [ ] [GitHub Pages | Websites for you and your projects, hosted directly from your GitHub repository. Just edit, push, and your changes are live.](https://pages.github.com/) - [ ] [sqlite wasm sql.js github pages - Kagi Search](https://kagi.com/search?q=sqlite+wasm+sql.js+github+pages) - [ ] [phiresky/sql.js-httpvfs: Hosting read-only SQLite databases on static file hosters like Github Pages](https://github.com/phiresky/sql.js-httpvfs) - [ ] [sqlite3 WebAssembly & JavaScript Documentation Index](https://sqlite.org/wasm/doc/trunk/index.md) - [ ] [phiresky/youtube-sponsorship-stats/](https://github.com/phiresky/youtube-sponsorship-stats/) - [ ] [phiresky/world-development-indicators-sqlite/](https://github.com/phiresky/world-development-indicators-sqlite/) - [ ] [Show HN: Fully-searchable Library Genesis on IPFS | Hacker News](https://news.ycombinator.com/item?id=28585208) - [ ] [nalgeon/sqlime: Online SQLite playground](https://github.com/nalgeon/sqlime) - [ ] [nalgeon/sqlean.js: Browser-based SQLite with extensions](https://github.com/nalgeon/sqlean.js) - [ ] [Sqlime - SQLite Playground](https://sqlime.org/) - [ ] [sqlite-wasm-http - npm](https://www.npmjs.com/package/sqlite-wasm-http) - [ ] [Postgres WASM by Snaplet and Supabase](https://supabase.com/blog/postgres-wasm) - [ ] [rhashimoto/wa-sqlite: WebAssembly SQLite with experimental support for browser storage extensions](https://github.com/rhashimoto/wa-sqlite) - [ ] [jlongster/absurd-sql: sqlite3 in ur indexeddb (hopefully a better backend soon)](https://github.com/jlongster/absurd-sql) - [ ] [help: basic example for the web browser · Issue #23 · phiresky/sql.js-httpvfs](https://github.com/phiresky/sql.js-httpvfs/issues/23) - [ ] [phiresky/sql.js-httpvfs: Hosting read-only SQLite databases on static file hosters like Github Pages](https://github.com/phiresky/sql.js-httpvfs#minimal-example-from-scratch) - [ ] [sql.js-httpvfs/example at master · phiresky/sql.js-httpvfs](https://github.com/phiresky/sql.js-httpvfs/tree/master/example) - [ ] [Code search results](https://github.com/search?utf8=%E2%9C%93&q=sql.js-httpvfs&type=code) - [ ] [knowledge/docs/databases/sqlite.md at 7c4bbc755c64368a82ca22b76566e9153cd2e377 · nikitavoloboev/knowledge](https://github.com/nikitavoloboev/knowledge/blob/7c4bbc755c64368a82ca22b76566e9153cd2e377/docs/databases/sqlite.md?plain=1#L96) - [ ] [static-wiki/src/sqlite.js at b0ae18ed02ca64263075c3516c38a504f46e10c4 · segfall/static-wiki](https://github.com/segfall/static-wiki/blob/b0ae18ed02ca64263075c3516c38a504f46e10c4/src/sqlite.js#L2)#134: marker: Convert PDF to markdown quickly with high accuracy### DetailsSimilarity score: 0.88 - [ ] [https://github.com/VikParuchuri/marker#readme](https://github.com/VikParuchuri/marker#readme)MarkerMarker converts PDF, EPUB, and MOBI to markdown. 
It's 10x faster than nougat, more accurate on most documents, and has low hallucination risk.
**How it works**

Marker is a pipeline of deep learning models.

Relying on autoregressive forward passes to generate text is slow and prone to hallucination/repetition, as discussed in the nougat paper. Nougat is an amazing model, but I wanted a faster and more general-purpose solution. Marker is 10x faster and has low hallucination risk because it only passes equation blocks through an LLM forward pass.

**Examples**
**Performance**

The above results are with marker and nougat set up so they each take ~3GB of VRAM on an A6000. See below for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.

**Limitations**

PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:
**Installation**

This has been tested on Mac and Linux (Ubuntu and Debian). You'll need Python 3.9+ and poetry. First, clone the repo:
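Assuming the repo URL from the link at the top of this section, the clone step is presumably:

```bash
git clone https://github.com/VikParuchuri/marker.git
cd marker
```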
**Linux**

**Mac**
**Usage**

First, some configuration:
**Convert a single file**

**Convert multiple files**

**Convert multiple files on multiple GPUs**
**Benchmarks**

Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a PDF version and a LaTeX source. I convert the LaTeX to text and compare the reference to the output of the text extraction methods. Benchmarks show that marker is 10x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data). We show naive text extraction (pulling text out of the PDF with no processing) for comparison.
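As a toy illustration of that reference comparison (a hypothetical scoring helper using difflib, not marker's actual benchmark code):

```python
from difflib import SequenceMatcher


def extraction_score(reference: str, extracted: str) -> float:
    """Rough similarity between the LaTeX-derived reference and an extractor's output."""
    return SequenceMatcher(None, reference, extracted).ratio()


reference = "Theorem 1. The quick brown fox jumps over the lazy dog."
naive = "Theorem1.The quick brown fox jumps overthe lazy dog"       # naive PDF text dump
converted = "Theorem 1. The quick brown fox jumps over the lazy dog."  # converter output

print(extraction_score(reference, naive))      # < 1.0: spacing and extraction errors
print(extraction_score(reference, converted))  # 1.0: exact match
```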
**Speed**

**Accuracy**

First 3 are non-arXiv books, last 3 are arXiv papers.
Peak GPU memory usage during the benchmark is ~3GB per model (as noted above).

**Throughput**

Marker takes about 2GB of VRAM on average per task, so you can convert 24 documents in parallel on an A6000.

**Running your own benchmarks**

You can benchmark the performance of marker on your machine. First, download the benchmark data here and unzip it. Then run the benchmark script.
This will benchmark marker against other text extraction methods. It sets up batch sizes for nougat and marker to use a similar amount of GPU RAM for each.

**Commercial usage**

Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage. I'm building a version that can be used commercially by stripping out the dependencies below. If you would like to get early access, email me at marker@vikas.sh.

Here are the non-commercial/restrictive dependencies. Other dependencies/datasets are openly licensed (doclaynet, byt5) or used in a way that is compatible with commercial usage (ghostscript).

**Thanks**

This work would not have been possible without amazing open source models and datasets, including (but not limited to):
Thank you to the authors of these models and datasets for making them available to the community!
## chroma/README.md at main · chroma-core/chroma
Chroma - the open-source embedding database.
The fastest way to build Python or JavaScript LLM apps with memory!
Docs | Homepage
The core API is only 4 functions (run our 💡 Google Colab or Replit template):
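A minimal sketch of those four calls, based on the chromadb Python client (the collection name and documents here are placeholders):

```python
import chromadb

# 1. Create an in-memory client (PersistentClient adds on-disk persistence)
client = chromadb.Client()

# 2. Create a collection (get_or_create_collection and delete_collection also exist)
collection = client.create_collection("all-my-documents")

# 3. Add docs; Chroma embeds them for you with Sentence Transformers by default
collection.add(
    documents=["This is document1", "This is document2"],
    metadatas=[{"source": "notion"}, {"source": "google-docs"}],
    ids=["doc1", "doc2"],
)

# 4. Query the two nearest documents to a text query
results = collection.query(
    query_texts=["This is a query document"],
    n_results=2,
)
print(results["documents"])
```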
**Features**

Integrations: 🦜️🔗 LangChain (Python and JS), 🦙 LlamaIndex, and more soon.

**Use case: ChatGPT for ______**

For example, the "Chat your data" use case: add your documents to a collection, query for the ones most relevant to a question, and pass them to GPT3 for additional summarization or analysis.

**What are embeddings?**

An embedding turns a document into a list of numbers like [1.2, 2.1, ....]. This process makes documents "understandable" to a machine learning model. Embeddings databases (also known as vector databases) store embeddings and allow you to search by nearest neighbors rather than by substrings like a traditional database. By default, Chroma uses Sentence Transformers to embed for you, but you can also use OpenAI embeddings, Cohere (multilingual) embeddings, or your own.
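For instance, swapping the embedder is a single constructor argument in the Python client; a sketch assuming an OPENAI_API_KEY environment variable:

```python
import os

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.Client()

# Use OpenAI embeddings instead of the default Sentence Transformers model
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-ada-002",
)

# Every add() and query() on this collection now embeds via OpenAI
collection = client.create_collection(
    "my-documents", embedding_function=openai_ef
)
```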
Suggested labels