Datalore is a platform for creating synthetic datasets using AI. It supports multiple file formats including PDF, DOCX, TXT, JSON, and images, automatically extracting and organizing content into text-to-text datasets in the desired format. Users can also generate datasets directly from websites, making it easier to gather and structure information from online sources.
Datalore also includes a deep research workflow that expands simple user ideas into structured datasets by building meaningful context through multi-step research. This allows the generation of high-quality datasets even without existing content.
- Schema Suggestion Based on Query: Automatically suggests an editable schema by analyzing the user's query for the kind of dataset they want.
- Customizable Subtopics: Provides a list of related subtopics that users can edit, nest, or remove to structure the dataset generation process.
- Sample Data Generation: Generates up to 30 representative sample rows per subtopic to simulate realistic datasets.
- Flexible Data Export: Allows users to download the final dataset with schema, subtopics, and sample rows in preferred formats.
- File-Based Dataset Generation: Accepts uploaded files (PDF, DOCX, etc.), analyzes the content, suggests a schema, and generates structured data from the document.
- Link-Based Dataset Generation: Analyzes content from pasted URLs to suggest a schema and generate relevant datasets.
- Deep Research Mode: Gathers and synthesizes relevant data from across the web, then structures it into a meaningful dataset with an auto-generated schema.
- Auto Documentation: Generates a concise, editable explanation of the dataset structure and content for easy understanding and integration.
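To make the schema and sample-row features above more concrete, here is a minimal Python sketch of how a downloaded schema and its sample rows could be checked on the consumer side. The field names (`question`, `answer`, `difficulty`) and the helpers `validate_row` and `validate_subtopic` are hypothetical and not part of Datalore itself; only the limit of 30 rows per subtopic comes from the feature list.

```python
from typing import Any, Dict, List

# Hypothetical schema for illustration: field name -> expected Python type.
SCHEMA: Dict[str, type] = {
    "question": str,
    "answer": str,
    "difficulty": int,
}

# Upper bound on sample rows per subtopic, as stated in the feature list.
MAX_ROWS_PER_SUBTOPIC = 30


def validate_row(row: Dict[str, Any], schema: Dict[str, type]) -> List[str]:
    """Return a list of problems; an empty list means the row fits the schema."""
    problems: List[str] = []
    for field, expected in schema.items():
        if field not in row:
            problems.append(f"missing field: {field}")
        elif not isinstance(row[field], expected):
            actual = type(row[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    for field in row:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


def validate_subtopic(rows: List[Dict[str, Any]], schema: Dict[str, type]) -> List[str]:
    """Check the row-count limit and every row of one subtopic."""
    problems: List[str] = []
    if len(rows) > MAX_ROWS_PER_SUBTOPIC:
        problems.append(f"too many rows: {len(rows)} > {MAX_ROWS_PER_SUBTOPIC}")
    for index, row in enumerate(rows):
        problems.extend(f"row {index}: {p}" for p in validate_row(row, schema))
    return problems
```

A check like this is useful after editing the schema by hand, since renamed or removed fields show up immediately as missing or unexpected entries.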
You can either use the hosted version of this project or follow the steps below to set it up locally on your machine.
Access the live version here:
🔗 https://dataloreai.eastus2.cloudapp.azure.com/
Follow the steps below to run the project locally.
git clone https://github.com/Datalore-ai/Datalore.ai
cd Datalore.ai
python -m venv venv
source venv/bin/activate # For Linux/Mac
# OR
venv\Scripts\activate # For Windows
Make sure you have Python 3.8+ installed.
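If you are unsure which interpreter the virtual environment picked up, a quick check along these lines can confirm the 3.8+ requirement before installing dependencies. The helper name `python_is_supported` is illustrative only.

```python
import sys

# Minimum version stated in this README.
MIN_VERSION = (3, 8)


def python_is_supported(version=sys.version_info, minimum=MIN_VERSION) -> bool:
    """Compare the (major, minor) part of a version against the minimum."""
    return tuple(version[:2]) >= minimum


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_is_supported() else "too old"
    print(f"Python {current}: {status}")
```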
pip install -r requirements.txt
Create a .env file in the root directory:
cp .env.example .env
Add your environment variables to .env, using .env.example as a reference for the required keys.
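As a sanity check before starting the app, the sketch below compares `.env` against `.env.example` and reports keys that are still missing or empty. It assumes both files use plain `KEY=VALUE` lines; the helpers `parse_env` and `unset_keys` are hypothetical and not part of this repository.

```python
from pathlib import Path
from typing import Dict, List


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def unset_keys(example_text: str, env_text: str) -> List[str]:
    """Return keys listed in the example file that are missing or empty in .env."""
    env = parse_env(env_text)
    return [key for key in parse_env(example_text) if not env.get(key)]


if __name__ == "__main__":
    example_file, env_file = Path(".env.example"), Path(".env")
    if example_file.exists() and env_file.exists():
        missing = unset_keys(example_file.read_text(), env_file.read_text())
        print("Unset keys:", ", ".join(missing) if missing else "none")
```

Reading the key names from `.env.example` avoids hard-coding variable names that may change between versions of the project.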
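Alternatively, to run with Docker, pull the image (replace the placeholder image name with the actual one):
docker pull your-dockerhub-username/your-image-name:latest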
To run the container:
docker run -d -p 8000:8000 --env-file .env your-dockerhub-username/your-image-name:latest
To run the main script:
python main.py
Or if using a framework like FastAPI or Flask:
uvicorn app.main:app --reload # FastAPI example
# or
python app.py # Flask example
The project should now be running. Check logs or the console output for further instructions.
- Choose Vanilla Mode if you want a simple flow with direct AI-based sample data generation.
- Choose Research Mode if you need advanced options like document/image input or deep web research for dataset creation.
- Let’s start with Research Mode; Vanilla Mode is equally easy to use
- Choose how you want to generate the dataset: describe your idea, then upload a file, paste a link, or use deep research mode
- Hit send to move to the schema editor and start building your dataset
- The schema is generated based on your idea and is fully editable, so you can rename, remove, or add fields as needed
- Once you are happy with the structure, proceed to generate the dataset with a single click
- Your dataset is now created and ready to explore in table or JSON view
- The documentation tab gives a short explanation of the dataset and how to use it
- You can export the dataset as JSON, CSV, or TXT, or download all formats at once
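Once exported, the JSON file can be post-processed with a few lines of Python. The sketch below assumes the JSON export is a list of flat row objects, which is an assumption about the file layout rather than a documented guarantee, and converts it to JSON Lines, a format many fine-tuning tools accept. The helper names `load_rows` and `to_jsonl` are illustrative.

```python
import json
from typing import Any, Dict, List


def load_rows(json_text: str) -> List[Dict[str, Any]]:
    """Parse exported JSON and verify it is a list of row objects."""
    data = json.loads(json_text)
    if not isinstance(data, list) or not all(isinstance(row, dict) for row in data):
        raise ValueError("expected a JSON list of row objects")
    return data


def to_jsonl(rows: List[Dict[str, Any]]) -> str:
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


if __name__ == "__main__":
    # Inline sample standing in for the contents of a downloaded export file.
    sample = '[{"question": "What is 2 + 2?", "answer": "4"}]'
    print(to_jsonl(load_rows(sample)), end="")
```

If your export nests rows under subtopics or wraps them in another object, adjust `load_rows` to unwrap that layer first.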
- In Vanilla Mode, the AI generates subtopics based on your idea, which you can edit or extend as needed
- For each subtopic you can generate dataset samples independently
- You can download samples for each subtopic or the complete dataset when ready
If something here could be improved, please open an issue or submit a pull request.
This project is licensed under the Apache License 2.0. See the LICENSE file for more details.