Datalore-ai/datalore.ai

Description

Overview

Datalore is a platform for creating synthetic datasets using AI. It supports multiple file formats including PDF, DOCX, TXT, JSON, and images, automatically extracting and organizing content into text-to-text datasets in the desired format. Users can also generate datasets directly from websites, making it easier to gather and structure information from online sources.

Datalore also includes a deep research workflow that expands simple user ideas into structured datasets by building meaningful context through multi-step research. This allows the generation of high-quality datasets even without existing content.

Features

  • Schema Suggestion Based on Query: Automatically suggests an editable schema by analyzing the user's query for the kind of dataset they want.
  • Customizable Subtopics: Provides a list of related subtopics that users can edit, nest, or remove to structure the dataset generation process.
  • Sample Data Generation: Generates up to 30 representative sample rows per subtopic to simulate realistic datasets.
  • Flexible Data Export: Allows users to download the final dataset with schema, subtopics, and sample rows in preferred formats.
  • File-Based Dataset Generation: Accepts uploaded files (PDF, DOCX, etc.), analyzes the content, suggests a schema, and generates structured data from the document.
  • Link-Based Dataset Generation: Analyzes content from pasted URLs to suggest a schema and generate relevant datasets.
  • Deep Research Mode: Gathers and synthesizes relevant data from across the web, then structures it into a meaningful dataset with an auto-generated schema.
  • Auto Documentation: Generates a concise, editable explanation of the dataset structure and content for easy understanding and integration.
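To make the relationship between schema, subtopics, and sample rows concrete, here is a minimal sketch of how such a dataset could be modeled in Python. The class names (`Subtopic`, `Dataset`) and the field layout are assumptions for illustration only and are not taken from the Datalore codebase; only the 30-row cap comes from the feature list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The feature list mentions up to 30 sample rows per subtopic.
MAX_ROWS_PER_SUBTOPIC = 30


@dataclass
class Subtopic:
    """Hypothetical container for one subtopic and its sample rows."""
    name: str
    rows: List[Dict[str, str]] = field(default_factory=list)

    def add_row(self, row: Dict[str, str], schema: List[str]) -> None:
        # Reject rows whose keys do not match the schema exactly.
        if set(row) != set(schema):
            raise ValueError(f"row keys {sorted(row)} do not match schema {sorted(schema)}")
        if len(self.rows) >= MAX_ROWS_PER_SUBTOPIC:
            raise ValueError("subtopic already holds the maximum number of sample rows")
        self.rows.append(row)


@dataclass
class Dataset:
    """Hypothetical dataset: a flat schema plus a list of subtopics."""
    schema: List[str]
    subtopics: List[Subtopic] = field(default_factory=list)

    def all_rows(self) -> List[Dict[str, str]]:
        # Flatten the rows of every subtopic, tagging each with its subtopic name.
        return [{"subtopic": s.name, **row} for s in self.subtopics for row in s.rows]


dataset = Dataset(schema=["question", "answer"])
basics = Subtopic(name="basics")
basics.add_row({"question": "What is a list?", "answer": "An ordered collection."}, dataset.schema)
dataset.subtopics.append(basics)
print(dataset.all_rows())
```

Keeping the schema on the dataset and validating each row against it means a schema edit only has to be applied in one place.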

Project Setup

You can either use the hosted version of this project or follow the steps below to set it up locally on your machine.


Try It Online (No Setup Needed)

Access the live version here:
🔗 https://dataloreai.eastus2.cloudapp.azure.com/


Local Setup Instructions

Follow the steps below to run the project locally.


1. Clone the Repository

git clone https://github.com/Datalore-ai/Datalore.ai
cd Datalore.ai

2. Create Virtual Environment (Optional but Recommended)

python -m venv venv
source venv/bin/activate  # For Linux/Mac
# OR
venv\Scripts\activate  # For Windows

3. Install Dependencies

Make sure you have Python 3.8+ installed.

pip install -r requirements.txt
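If you are unsure which interpreter is active in your virtual environment, a short check like the one below can confirm it meets the 3.8 minimum stated above. The function name `python_is_supported` is made up for this sketch and is not part of the project.

```python
import sys

MIN_VERSION = (3, 8)  # minimum version stated in this README


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the given version meets the README's minimum."""
    return tuple(version_info[:2]) >= MIN_VERSION


if python_is_supported():
    print("Python version OK")
else:
    found = sys.version.split()[0]
    print(f"Python {MIN_VERSION[0]}.{MIN_VERSION[1]}+ is required, found {found}")
```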

4. Setup Environment Variables

Create a .env file in the root directory:

cp .env.example .env

Open .env and fill in a value for each variable listed in .env.example.
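For reference, the sketch below shows one way a Python application can read simple KEY=VALUE pairs from a .env file using only the standard library. It is an illustration, not the project's actual loading code (which may rely on a library such as python-dotenv), and the helper names `load_env_file` and `apply_env` are hypothetical.

```python
import os
from pathlib import Path
from typing import Dict


def load_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def apply_env(values: Dict[str, str]) -> None:
    """Copy parsed values into os.environ without overriding existing ones."""
    for key, value in values.items():
        os.environ.setdefault(key, value)


if __name__ == "__main__":
    apply_env(load_env_file())
```

Using `setdefault` lets variables already exported in the shell take precedence over the file, which is convenient when testing a different value temporarily.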


5. Pull Docker Image (If using Docker)

docker pull your-dockerhub-username/your-image-name:latest

To run the container:

docker run -d -p 8000:8000 --env-file .env your-dockerhub-username/your-image-name:latest

6. Run the Project

To run the main script:

python main.py

Or if using a framework like FastAPI or Flask:

uvicorn app.main:app --reload  # FastAPI example
# or
python app.py  # Flask example

✅ You're all set!

The project should now be running. Check logs or the console output for further instructions.


How to Use

Choose your path

  1. Choose Vanilla Mode if you want a simple flow with direct AI-based sample data generation.
  2. Choose Research Mode if you need advanced options like document/image input or deep web research for dataset creation.

Input Options

  1. This walkthrough uses Research Mode; Vanilla Mode is equally easy to use.
  2. Choose how you want to generate the dataset: describe your idea, then upload a file, paste a link, or use deep research mode.
  3. Hit send to move to the schema editor and start building your dataset.

Schema Editor

  1. The schema is generated from your idea and is fully editable, so you can rename, remove, or add fields as needed.
  2. Once you are happy with the structure, generate the dataset with a single click.
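The three schema edits described above (rename, remove, add) can be pictured as simple transformations over a list of row dictionaries. The sketch below is illustrative only: the function names and the row representation are assumptions, not the editor's actual implementation.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def rename_field(rows: List[Row], old: str, new: str) -> List[Row]:
    """Return new rows with the key `old` renamed to `new`, keeping field order."""
    return [{(new if key == old else key): value for key, value in row.items()} for row in rows]


def remove_field(rows: List[Row], name: str) -> List[Row]:
    """Return new rows without the field `name`."""
    return [{key: value for key, value in row.items() if key != name} for row in rows]


def add_field(rows: List[Row], name: str, default: Optional[str] = None) -> List[Row]:
    """Return new rows with `name` appended, filled with `default`."""
    return [{**row, name: default} for row in rows]


# Example: rename "question" to "prompt", drop "notes", add a "source" column.
sample = [{"question": "Q1", "answer": "A1", "notes": "draft"}]
edited = add_field(remove_field(rename_field(sample, "question", "prompt"), "notes"), "source", "manual")
print(edited)
```

Each helper returns new rows instead of mutating its input, so an edit can be undone by keeping the previous list.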

Create Datasets

  1. Your dataset is now created and ready to explore in table or JSON view.
  2. The documentation tab gives a short explanation of the dataset and how to use it.
  3. You can export the dataset as JSON, CSV, or TXT, or download all formats at once.
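If you want to reproduce a similar export from your own scripts, the sketch below writes the same rows to JSON, CSV, and TXT using only the Python standard library. The function name `export_dataset`, the file names, and the tab-separated TXT layout are assumptions; the exact formats produced by the app may differ. The sketch also assumes every row has the same keys as the first one.

```python
import csv
import json
from pathlib import Path
from typing import Dict, List


def export_dataset(rows: List[Dict[str, str]], out_dir: str, stem: str = "dataset") -> List[Path]:
    """Write rows as JSON, CSV, and tab-separated TXT; return the created paths."""
    if not rows:
        raise ValueError("nothing to export")
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    fields = list(rows[0].keys())  # column order follows the first row

    json_path = target / f"{stem}.json"
    json_path.write_text(json.dumps(rows, indent=2, ensure_ascii=False), encoding="utf-8")

    csv_path = target / f"{stem}.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)

    txt_path = target / f"{stem}.txt"
    lines = ["\t".join(str(row[name]) for name in fields) for row in rows]
    txt_path.write_text("\n".join(lines) + "\n", encoding="utf-8")

    return [json_path, csv_path, txt_path]
```

Calling `export_dataset(rows, "exports")` would create `exports/dataset.json`, `exports/dataset.csv`, and `exports/dataset.txt`.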

Create Datasets (Vanilla mode)

  1. In Vanilla Mode, the AI generates subtopics from your idea; you can edit them or add more as needed.
  2. For each subtopic, you can generate dataset samples independently.
  3. You can download the samples for each subtopic, or the complete dataset, when ready.

Contributing

If something here could be improved, please open an issue or submit a pull request.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for more details.
