GitHub - Datalore-ai/datalore.ai: Development with research

Overview

Datalore is a platform for creating synthetic datasets using AI. It supports multiple file formats including PDF, DOCX, TXT, JSON, and images, automatically extracting and organizing content into text-to-text datasets in the desired format. Users can also generate datasets directly from websites, making it easier to gather and structure information from online sources.

Datalore also includes a deep research workflow that expands simple user ideas into structured datasets by building meaningful context through multi-step research. This allows the generation of high-quality datasets even without existing content.

Features

Schema Suggestion Based on Query: Automatically suggests an editable schema by analyzing the user's query for the kind of dataset they want.
Customizable Subtopics: Provides a list of related subtopics that users can edit, nest, or remove to structure the dataset generation process.
Sample Data Generation: Generates up to 30 representative sample rows per subtopic to simulate realistic datasets.
Flexible Data Export: Allows users to download the final dataset with schema, subtopics, and sample rows in preferred formats.
File-Based Dataset Generation: Accepts uploaded files (PDF, DOCX, etc.), analyzes the content, suggests a schema, and generates structured data from the document.
Link-Based Dataset Generation: Analyzes content from pasted URLs to suggest a schema and generate relevant datasets.
Deep Research Mode: Gathers and synthesizes relevant data from across the web, then structures it into a meaningful dataset with an auto-generated schema.
Auto Documentation: Generates a concise, editable explanation of the dataset structure and content for easy understanding and integration.
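
One way to picture how the schema, subtopics, and sample rows described above fit together is the hypothetical data model below. It is only a sketch: the names `Subtopic`, `add_row`, and `MAX_SAMPLE_ROWS` are assumptions made for illustration and are not taken from the Datalore codebase. Only the 30-row cap per subtopic comes from the feature list.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Per-subtopic cap stated in the feature list; the constant name is hypothetical.
MAX_SAMPLE_ROWS = 30


@dataclass
class Subtopic:
    """Hypothetical subtopic node: it can be nested and holds its own sample rows."""

    name: str
    children: List["Subtopic"] = field(default_factory=list)
    rows: List[Dict[str, str]] = field(default_factory=list)

    def add_row(self, row: Dict[str, str], schema_fields: List[str]) -> None:
        """Append a sample row if it matches the schema and the cap is not reached."""
        if len(self.rows) >= MAX_SAMPLE_ROWS:
            raise ValueError(f"'{self.name}' already has {MAX_SAMPLE_ROWS} rows")
        if set(row) != set(schema_fields):
            raise ValueError("row keys do not match the schema fields")
        self.rows.append(row)
```

In this sketch the schema is just a list of field names, so renaming or removing a field in the schema editor would correspond to changing that list before any rows are generated.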

Project Setup

You can either use the hosted version of this project or follow the steps below to set it up locally on your machine.

Try It Online (No Setup Needed)

Access the live version here:
🔗 https://dataloreai.eastus2.cloudapp.azure.com/

Local Setup Instructions

Follow the steps below to run the project locally.

1. Clone the Repository

git clone https://github.com/Datalore-ai/Datalore.ai
cd Datalore.ai

2. Create Virtual Environment (Optional but Recommended)

python -m venv venv
source venv/bin/activate  # For Linux/Mac
# OR
venv\Scripts\activate  # For Windows

3. Install Dependencies

Make sure you have Python 3.8+ installed.

pip install -r requirements.txt

4. Setup Environment Variables

Create a .env file in the root directory:

cp .env.example .env

Then open .env and fill in a value for each variable listed in .env.example.
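
Since the required variables are defined by .env.example, a quick way to confirm that nothing was dropped while editing is to compare the variable names in both files. The helper below is a minimal sketch, not part of the project: the function names `env_keys` and `missing_keys` are made up here, and it assumes both files use plain `NAME=value` lines.

```python
from pathlib import Path


def env_keys(path):
    """Return the set of variable names defined in a dotenv-style file."""
    keys = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Tolerate an optional leading "export ".
        if line.startswith("export "):
            line = line[len("export "):]
        name, sep, _value = line.partition("=")
        if sep:
            keys.add(name.strip())
    return keys


def missing_keys(example_path=".env.example", env_path=".env"):
    """List names present in the example file but absent from the env file."""
    return sorted(env_keys(example_path) - env_keys(env_path))
```

Calling `missing_keys()` from the repository root returns an empty list when every name in .env.example also appears in .env. It does not check whether the values are filled in or valid.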

5. Pull Docker Image (If using Docker)

docker pull your-dockerhub-username/your-image-name:latest

To run the container (replace the placeholder image name with the actual one):

docker run -d -p 8000:8000 --env-file .env your-dockerhub-username/your-image-name:latest

6. Run the Project

To run the main script:

python main.py

Or if using a framework like FastAPI or Flask:

uvicorn app.main:app --reload  # FastAPI example
# or
python app.py  # Flask example

✅ You're all set!

The project should now be running. Check logs or the console output for further instructions.

How to Use

Choose your path

Choose Vanilla Mode if you want a simple flow with direct AI-based sample data generation.
Choose Research Mode if you need advanced options like document/image input or deep web research for dataset creation.

Input Options

This walkthrough follows Research Mode; Vanilla Mode is equally easy to use.
Choose how you want to generate the dataset: describe your idea, then upload a file, paste a link, or use deep research mode.
Hit send to move to the schema editor and start building your dataset.

Schema Editor

The schema is generated based on your idea and is fully editable, so you can rename, remove, or add fields as needed.
Once you are happy with the structure, proceed to generate the dataset with a single click.

Create Datasets

Your dataset is now created and ready to explore in table or JSON view.
The documentation tab gives a short explanation of the dataset and how to use it.
You can export the dataset as JSON, CSV, or TXT, or download all formats at once.
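
To illustrate how the same rows relate across the JSON and CSV export formats mentioned above, here is a small standard-library sketch. It is not Datalore's exporter: the function names are hypothetical, and it assumes a dataset is a list of rows that all share the same field names. The exact layout of the files the app produces may differ.

```python
import csv
import io
import json


def rows_to_json(rows):
    """Serialize rows (a list of dicts) as a pretty-printed JSON array."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


def rows_to_csv(rows):
    """Serialize rows as CSV, using the first row's keys as the header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    # Assumes every row has exactly the header's keys; extra keys raise ValueError.
    writer.writerows(rows)
    return buffer.getvalue()
```

JSON keeps each row as a self-describing object, while CSV states the field names once in the header, which is why a uniform schema matters more for the CSV form.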

Create Datasets (Vanilla mode)

In Vanilla Mode, the AI generates subtopics based on your idea, which you can edit or extend as needed.
For each subtopic, you can generate dataset samples independently.
You can download samples for each subtopic, or the complete dataset when ready.

Contributing

If something here could be improved, please open an issue or submit a pull request.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for more details.
