GitHub - Oxen-AI/Oxen: Lightning fast data version control system for structured and unstructured machine learning datasets. We aim to make versioning datasets as easy as versioning code.

🐂 What is Oxen?

Oxen is a lightning fast data version control system for large datasets. We aim to make versioning data as easy as versioning code.

The interface mirrors git, but shines in many areas where git or git-lfs fall short. Oxen is built from the ground up for any data type, is optimized to handle repositories with millions of files, and scales to terabytes of data.

oxen init
oxen add images/
oxen add annotations/*.parquet
oxen commit -m "Adding 200k images and their corresponding annotations"
oxen push origin main

Oxen consists of a command line interface plus Rust 🦀, Python 🐍, and HTTP 🌎 interfaces, making it easy to integrate into your workflow.
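As a rough illustration of scripting the add/commit/push cycle shown above, the sketch below shells out to the `oxen` CLI from Python. It assumes `oxen` is on your `PATH`, and it assumes the commit message is passed with a git-style `-m` flag (check `oxen commit --help` for your version). The helper names `oxen_commands` and `run_all` are hypothetical and not part of Oxen. The Python bindings are the more direct route; this only wraps the CLI.

```python
import subprocess


def oxen_commands(paths, message, remote="origin", branch="main"):
    """Build the argv lists for one add/commit/push cycle.

    Hypothetical helper: it only assembles the CLI calls shown above.
    Paths are passed through as-is; no shell is involved, so glob
    patterns are not expanded here.
    """
    cmds = [["oxen", "add", path] for path in paths]
    # Assumption: the commit message is given with a git-style -m flag.
    cmds.append(["oxen", "commit", "-m", message])
    cmds.append(["oxen", "push", remote, branch])
    return cmds


def run_all(cmds, cwd="."):
    """Run each command in order, stopping at the first failure."""
    for argv in cmds:
        subprocess.run(argv, cwd=cwd, check=True)


# Example usage (requires an initialized Oxen repository in the current directory):
# run_all(oxen_commands(["images/"], "Adding images"))
```

Building the argument lists separately from running them keeps the sketch easy to inspect or log before anything touches the repository.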

🌾 What kind of data?

Oxen is designed to efficiently manage large data in any format, including images, audio, video, text, or tabular data like parquet files with millions of rows. Behind the scenes Oxen can store any blob type, but it has specialized metadata extractors for certain file types and caches this information in the Merkle tree for fast access later.
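To give a feel for why caching metadata in a Merkle tree makes later lookups cheap, here is a minimal, generic sketch. It is not Oxen's implementation: the `Node` class, the `build_tree` function, the choice of SHA-256, and the use of byte size as the cached metadata are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Node:
    name: str
    digest: str      # hash of file contents, or of the children for a directory
    num_bytes: int   # example of cached metadata, aggregated up the tree
    children: list["Node"] = field(default_factory=list)


def build_tree(path: Path) -> Node:
    """Recursively hash a file or directory into a Merkle-style tree."""
    if path.is_file():
        data = path.read_bytes()
        return Node(path.name, hashlib.sha256(data).hexdigest(), len(data))

    # Sort children by name so the directory hash is deterministic.
    children = [build_tree(p) for p in sorted(path.iterdir(), key=lambda p: p.name)]
    hasher = hashlib.sha256()
    for child in children:
        hasher.update(child.name.encode("utf-8"))
        hasher.update(child.digest.encode("ascii"))
    total = sum(child.num_bytes for child in children)
    return Node(path.name, hasher.hexdigest(), total, children)
```

Because a directory's digest depends only on its children's names and digests, a changed file alters the digests on its path to the root while unchanged subtrees keep theirs. Per-directory metadata such as total size can then be read from a node without rescanning the files beneath it.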

🚀 Built for speed

One of the main reasons datasets are hard to maintain is the sheer cost of indexing the data and transferring it over the network. We wanted to be able to index hundreds of thousands of images, videos, audio files, and text files in seconds.

Watch below as we version hundreds of thousands of images in seconds 🔥

But speed is only the beginning.

✅ Features

Oxen is built to be ergonomic, easy to use, and easy to learn. If you know how to use git, you know how to use Oxen.

🔥 Fast (efficient indexing and syncing of data)
🧠 Easy to learn (same commands as git)
💪 Handles large files (images, videos, audio, text, parquet, arrow, json, models, etc)
🗄️ Index lots of files (millions of images? no problem)
📊 Native DataFrame processing (index, compare and serve up DataFrames)
📈 Tracks changes over time (never worry about losing the state of your data)
🤝 Collaborate with your team (sync to an oxen-server)
🌎 Workspaces to interact with the data without downloading it
👀 Better data visualization on OxenHub

🐮 Learn The Basics

To learn everything Oxen can do, see the full documentation at https://docs.oxen.ai.

🧑‍💻 Getting Started

You can install Oxen through Homebrew, through pip, or from our releases page.

🐂 Install Command Line Tool

Install via Homebrew:

brew install oxen

🐍 Install Python Library

pip install oxenai

⬇️ Clone Dataset

Clone your first Oxen repository from the OxenHub.

oxen clone https://hub.oxen.ai/ox/CatDogBBox

🤝 Support

If you have any questions, comments, suggestions, or just want to get in contact with the team, feel free to email us at hello@oxen.ai

👥 Contributing

This repository contains the Python library that wraps the core Rust codebase. We would love help extending the Python interfaces, the documentation, or the core Rust library.

Code bases to contribute to:

If you are building anything with Oxen.ai or have any questions, we would love to hear from you on our Discord.

Build 🔨

Set up virtual environment:

# Set up your python virtual environment
$ python -m venv ~/.venv_oxen # could be python3
$ source ~/.venv_oxen/bin/activate
$ pip install -r requirements.txt

# Install rust
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Run maturin
$ cd oxen
$ maturin develop

Test

$ pytest -s tests/

Why build Oxen?

Oxen was built by a team of machine learning engineers who have spent countless hours in their careers managing datasets. We have used many different tools, but none of them were as easy to use or as ergonomic as we would like.

If you have ever tried git-lfs to version large datasets and become frustrated, we feel your pain. Solutions like git-lfs are too slow at the scale of data we need for machine learning.

If you have ever uploaded a large dataset of images, audio, video, or text to a cloud storage bucket with the name:

s3://data/images_july_2022_final_2_no_really_final.tar.gz

you know the problem firsthand. We built Oxen to be the tool we wish we had.

Why the name Oxen?

"Oxen" 🐂 comes from the fact that the tooling will plow, maintain, and version your data like a good farmer tends to their fields 🌾. Let Oxen take care of the grunt work of your infrastructure so you can focus on the higher-level problems that matter to your product.

