
A repository containing several scripts (optimised for use with Obsidian and the Python Scripter plugin) to automate research.


piromagnus/researchNautomation


Vault Scripts Repository

This repository contains a collection of Python scripts for automating research workflows, paper summaries, and data synchronization.

Quick Start

MinerU

MinerU is a tool that automates the conversion of a PDF into Markdown. It recovers the full layout, including images, LaTeX formulas, and tables.
See MinerU for more information.

  1. Build the Docker image:
docker build -t mineru:latest .
  2. Modify ROOT_DIR in run_minerU.sh to point at the directory containing your PDFs. The filepath argument is interpreted relative to this root directory.
  3. Run the Docker container:
  bash run_minerU.sh <filepath> <output_dir>
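
For example, with ROOT_DIR pointing at your vault (the paths below are hypothetical):

  bash run_minerU.sh papers/my_paper.pdf Papers/extracted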

Install the dependencies

pip install -r requirements.txt

AiSearch

Based on GPT-Researcher. It takes the content of a Markdown file and generates a report on it with the help of an LLM, searching the Internet with Tavily.

To run it

python aisearch.py <dir_path> <markdown_file> <report_type>

report_type is one of "research_report" | "resource_report" | "outline_report" | "custom_report" | "detailed_report" | "subtopic_report"; see GPT-Researcher for more details.
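
For example, to generate a research report on a note (the paths are hypothetical):

python aisearch.py ~/Vault notes/transformer_survey.md research_report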

It will append the report to the Markdown file and generate a log report in a report folder in the same directory.

It can be used in Obsidian with the Python Scripter plugin by adding an argument for the report type and passing the vault path and the active file path.

You must fill the .ai.env file (using the example file) with the API keys for Gemini (the default) and Tavily. For small queries, Ollama is used by default; if you have less memory, you can try a smaller LLM, though it will produce less accurate results.

For Gemini, you can get free API access on AI Studio.

You can change the model in aisearch, but take care: GPT-Researcher uses langchain LLM models, which can be incompatible with aisuite (e.g. google-genai vs genai).

Summary

The summary.py script is designed to process PDF documents and Markdown files, generating summaries and extracting relevant tags. It can also compile detailed analyses of academic papers. I use MinerU to extract the PDF and then aisuite to generate the summary.
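
As a rough illustration of the aisuite side of that pipeline, a minimal call could look like the sketch below; the model identifier, prompt, and file path are placeholder assumptions, not the actual code of summary.py.

# Sketch: summarise MinerU's Markdown output with an LLM via aisuite.
# Model string, prompt, and path are assumptions, not summary.py's values.
import aisuite as ai

client = ai.Client()  # provider API keys are read from the environment

with open("paper.md") as f:  # Markdown produced by MinerU (hypothetical path)
    paper_md = f.read()

response = client.chat.completions.create(
    model="google:gemini-1.5-flash",  # aisuite expects "provider:model" identifiers
    messages=[
        {"role": "system", "content": "Summarise the paper and propose tags."},
        {"role": "user", "content": paper_md},
    ],
)
print(response.choices[0].message.content)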

Setup

  1. Ensure you have the required dependencies installed. You can do this by running:
pip install -r requirements.txt
  2. Build the Docker image for MinerU:
docker build -t mineru:latest .
  3. Modify the .env to add your API keys for Gemini or any other LLM you want to use.
    You can use the example file (.env.example) as a starting point:
cp .env.example .env
  4. Modify utils/summary_path.py to set up the paths of your structure. All the paths are relative to the vault_path given on the command line.
  5. Modify utils/summary_prompt to adapt the prompt to your needs. By default there is a set of tags, tasks, and rules to follow. It is mostly set up for computer vision, but you can adapt it to your needs.
  6. Provide a template for the summary. You can use the example demo file as a starting point. The first part of the file is metadata in an Obsidian format, so you can change it if you need (a hypothetical example follows below).
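
For illustration only, such a template could begin with YAML frontmatter like the following; the field names are hypothetical, not those of the demo file:

---
tags: []
authors:
year:
status: to-read
---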

To run

python summary.py <vault_path> <markdown_path>
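
For example (the paths are hypothetical):

python summary.py ~/Vault Papers/my_paper.md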

Scrape arXiv

The scrapt_arxiv.py script is designed to fetch recent papers from arXiv using specific queries. It performs relevance scoring, GitHub repository detection, and data filtering to ensure the retrieved papers are relevant to your research interests.

Setup

  1. Ensure you have the required dependencies installed. You can do this by running:
pip install -r requirements.txt
  2. Modify the .env file to add your API keys for OpenAI, Anthropic, Tavily, DeepSeek, and Gemini.
    You can use the example file (.env.example) as a starting point:
cp .env.example .env
  3. Also use the .env to set up the paths of your structure; since the command takes no input, everything is configured there. You will also need a query file that contains the settings of each newsletter; see demo for an example (a hypothetical sketch follows this list). You can add as many queries as you want, using a different title as the name of each. The script will fetch the papers from arXiv and filter them with the queries in the config file.
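
As a purely hypothetical sketch (the demo file defines the real format), a query file could map one newsletter title to its settings, for example:

{
  "vision-newsletter": {
    "query": "cat:cs.CV AND (pose estimation OR tracking)",
    "keywords": ["pose", "tracking"]
  }
}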

Run

python scrapt_arxiv.py <nb_days>

where nb_days is the number of days you want to go back in the past. The script will fetch all the papers submitted to arXiv in the last nb_days days and filter them with the queries in the config file; for example, python scrapt_arxiv.py 7 covers the last week.
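
A minimal sketch of the fetching step, using the arxiv Python package; the query string is a placeholder, and the actual relevance scoring and filtering of scrapt_arxiv.py are not shown:

# Sketch: fetch arXiv papers submitted in the last nb_days days.
# The query string and max_results are placeholder assumptions.
from datetime import datetime, timedelta, timezone
import arxiv

nb_days = 7
cutoff = datetime.now(timezone.utc) - timedelta(days=nb_days)

client = arxiv.Client()
search = arxiv.Search(
    query="cat:cs.CV",  # placeholder category
    sort_by=arxiv.SortCriterion.SubmittedDate,
    max_results=200,
)
recent = [r for r in client.results(search) if r.published >= cutoff]
for paper in recent:
    print(paper.published.date(), paper.title)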

Key Files

  • aisearch.py
    Contains functionality to conduct AI research using custom prompt templates and interact with large language models (LLMs) to generate research reports.
  • summary.py
    Processes PDF documents and Markdown files to generate summaries, extract tags, and compile detailed analyses of academic papers.
  • md_format.py
    Provides functions to format Markdown reports, including generating ASCII histograms and scatter plots that visualize paper score distributions (a sketch follows this list).
  • scrapt_arxiv.py
    Fetches recent papers from arXiv using queries and performs relevance scoring, GitHub repository detection, and data filtering.
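
As an illustration of the ASCII-histogram idea (the integer bucketing and bar glyph are assumptions, not md_format.py's actual implementation):

# Sketch: render a list of paper scores as an ASCII histogram.
from collections import Counter

def ascii_histogram(scores, width=40):
    counts = Counter(int(s) for s in scores)  # bucket scores by integer part
    peak = max(counts.values())
    lines = []
    for bucket in range(min(counts), max(counts) + 1):
        bar = "#" * round(width * counts[bucket] / peak)
        lines.append(f"{bucket:>3} | {bar} ({counts[bucket]})")
    return "\n".join(lines)

print(ascii_histogram([7.2, 8.5, 8.9, 6.1, 8.0, 9.4, 7.7]))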

Environment Setup

There are two environment configuration files:

  1. .env
    Contains API keys and configuration for various services (OpenAI, Anthropic, Tavily, DeepSeek, Gemini).
    You can use the example file (.env.example) as a starting point:
    cp .env.example .env
    Mostly used for scrapt_arxiv.py and summary.py

  2. .ai.env
    Contains AI API endpoints and model configurations such as OpenAI and Ollama.
    Similarly, use .ai.env.example as your template:
    cp .ai.env.example .ai.env
    Mostly used for aisearch.py

    .ai.env Information

    • OPENAI_API_KEY, OPENAI_API_BASE: API key and base URL for OpenAI services.
    • OLLAMA_BASE_URL: Base URL for accessing Ollama API endpoints.
    • GOOGLE_API_KEY, GEMINI_API_KEY: API keys for Google services and Gemini.
    • FAST_LLM, SMART_LLM, STRATEGIC_LLM: Identifiers for fast, smart, and strategic LLM models.
    • EMBEDDING: Model identifier for embedding services.
    • TAVILY_API_KEY: API key for Tavily services.

    Ensure that both .env and .ai.env are updated with the correct keys and endpoints for your environment.
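
    For orientation, a filled-in .ai.env might look like the sketch below; the model identifiers are placeholders to adapt to your setup (GPT-Researcher expects "provider:model" strings):

    OPENAI_API_KEY=sk-...
    OPENAI_API_BASE=https://api.openai.com/v1
    OLLAMA_BASE_URL=http://localhost:11434
    GOOGLE_API_KEY=...
    GEMINI_API_KEY=...
    FAST_LLM=google_genai:gemini-1.5-flash
    SMART_LLM=google_genai:gemini-1.5-pro
    STRATEGIC_LLM=google_genai:gemini-1.5-pro
    EMBEDDING=ollama:nomic-embed-text
    TAVILY_API_KEY=tvly-...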

Quickstart

  • Clone the repository.
  • Install the dependencies listed in requirements.txt.
  • Install the aisuite fork with cd packages/aisuite && pip install .
  • Update the .env and .ai.env files with your credentials.
  • Run the desired script (e.g., aisearch.py, summary.py) as needed.

Additional Notes

  • Each script contains its own error handling and logging.
  • You may adjust query parameters, API endpoints, or file paths depending on your use case.
