A comprehensive Python package for extracting composition-property data from scientific articles for building databases
ComProScanner is a multi-agent framework designed to extract composition-property relationships from scientific articles in materials science. It automates the entire workflow from metadata collection to data extraction, evaluation, and visualization.
Key Features:
- Multi-publisher support (Elsevier, Springer, Wiley, IOP, local PDFs)
- Agentic extraction using the CrewAI framework
- RAG-powered context retrieval for cost-effective, accurate automation
- Comprehensive evaluation and visualization tools
- Customizable extraction workflows
- Knowledge graph generation
Install from PyPI:
pip install comproscanner
Or install from source:
git clone https://github.com/slimeslab/ComProScanner.git
cd ComProScanner
pip install -e .
Here's a complete example extracting the piezoelectric coefficient (d33):
from comproscanner import ComProScanner

# Initialize scanner
scanner = ComProScanner(main_property_keyword="piezoelectric")

# Collect metadata
scanner.collect_metadata(
    base_queries=["piezoelectric", "piezoelectricity"],
    extra_queries=["ceramics", "applications"]
)

# Process articles
property_keywords = {
    "exact_keywords": ["d33"],
    "substring_keywords": [" d 33 "]
}
scanner.process_articles(
    property_keywords=property_keywords,
    source_list=["elsevier", "springer"]
)

# Extract composition-property data
scanner.extract_composition_property_data(
    main_extraction_keyword="d33"
)

The ComProScanner workflow consists of four main stages:
- Metadata Retrieval - Find relevant scientific articles
- Article Collection - Extract full-text from various publishers
- Information Extraction - Use LLM agents to extract structured data
- Post Processing & Dataset Creation - Evaluate, clean, and visualize results
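The first three stages map onto the three scanner calls used in the complete example above. The sketch below is an illustration, not part of the package: it wraps those calls in one helper so the stage order is explicit. `run_pipeline` and `DryRunScanner` are hypothetical names introduced here; only the method and parameter names passed through come from the example above. Stage four (post-processing) is left to the evaluation and visualization tools.

```python
from typing import Any


class DryRunScanner:
    """Hypothetical stand-in that records calls instead of contacting any API."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, dict[str, Any]]] = []

    def collect_metadata(self, **kwargs: Any) -> None:
        self.calls.append(("collect_metadata", kwargs))

    def process_articles(self, **kwargs: Any) -> None:
        self.calls.append(("process_articles", kwargs))

    def extract_composition_property_data(self, **kwargs: Any) -> None:
        self.calls.append(("extract_composition_property_data", kwargs))


def run_pipeline(
    scanner: Any,
    base_queries: list[str],
    extra_queries: list[str],
    property_keywords: dict[str, list[str]],
    source_list: list[str],
    main_extraction_keyword: str,
) -> None:
    """Run stages 1-3 in order on an already-initialized scanner object."""
    # Stage 1: metadata retrieval
    scanner.collect_metadata(base_queries=base_queries, extra_queries=extra_queries)
    # Stage 2: article collection
    scanner.process_articles(
        property_keywords=property_keywords, source_list=source_list
    )
    # Stage 3: information extraction
    scanner.extract_composition_property_data(
        main_extraction_keyword=main_extraction_keyword
    )


# Dry run: swap in a real ComProScanner instance to execute the actual workflow.
dry_run = DryRunScanner()
run_pipeline(
    dry_run,
    base_queries=["piezoelectric", "piezoelectricity"],
    extra_queries=["ceramics", "applications"],
    property_keywords={"exact_keywords": ["d33"], "substring_keywords": [" d 33 "]},
    source_list=["elsevier", "springer"],
    main_extraction_keyword="d33",
)
print([name for name, _ in dry_run.calls])
```

Passing the scanner in as an argument keeps the helper testable without API keys, since any object exposing the same three methods will do.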
Full documentation is available at slimeslab.github.io/ComProScanner
Supported sources:
- Elsevier (via TDM API)
- Springer Nature (via TDM API)
- Wiley (via TDM API)
- IOP Publishing (via SFTP bulk access)
- Local PDFs (any publication)
Extracted information:
- Composition-property relationships
- Material families
- Synthesis methods and precursors
- Characterization techniques
- Synthesis steps
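To make the categories above concrete, here is one way a single extracted record could be held in memory and serialized. This is a hedged sketch: `ExtractedRecord` and all of its field names are hypothetical and do not reflect the package's actual output schema, and the values are placeholders rather than data from any article.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ExtractedRecord:
    """Hypothetical container; field names are illustrative, not the package's schema."""

    compositions: dict[str, float]  # composition formula -> property value
    property_unit: str
    material_family: str
    synthesis_method: str
    precursors: list[str] = field(default_factory=list)
    characterization_techniques: list[str] = field(default_factory=list)
    synthesis_steps: list[str] = field(default_factory=list)


# Placeholder values only, not taken from any article.
record = ExtractedRecord(
    compositions={"BaTiO3": 100.0},
    property_unit="pC/N",
    material_family="perovskite",
    synthesis_method="solid-state reaction",
    precursors=["BaCO3", "TiO2"],
    characterization_techniques=["XRD"],
    synthesis_steps=["mix precursors", "calcine", "sinter"],
)

# Round-trip through JSON to show the record stays plain, serializable data.
serialized = json.dumps(asdict(record), indent=2)
restored = json.loads(serialized)
print(serialized)
```

Keeping each record as plain dictionaries, lists, and strings makes it straightforward to store results as JSON files.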
Evaluation methods:
- Semantic Evaluation - Using semantic similarity measures
- Agentic Evaluation - LLM-powered contextual analysis
Visualization:
- Data Visualization
- Evaluation Visualization
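The semantic evaluation listed above scores an extracted value against a reference by similarity rather than exact string equality. The sketch below shows the general idea with a bag-of-words cosine similarity. This is a deliberately simple stand-in, assumed here for illustration; it is not the similarity measure ComProScanner uses, and `cosine_similarity`, `is_match`, and the 0.8 threshold are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between lowercase word-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty string shares nothing with anything
    return dot / (norm_a * norm_b)


def is_match(extracted: str, reference: str, threshold: float = 0.8) -> bool:
    """Treat an extracted value as correct when it is similar enough to the reference."""
    return cosine_similarity(extracted, reference) >= threshold


# Two of three words overlap, so the score is 2/3 and falls below the threshold.
score = cosine_similarity("solid state reaction", "solid state synthesis")
print(round(score, 3), is_match("solid state reaction", "solid state synthesis"))
```

Word overlap misses synonyms such as "reaction" and "synthesis", which is why embedding-based similarity or an LLM judge is usually preferred for this kind of comparison.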
# Process articles from several publishers
scanner.process_articles(
    property_keywords=property_keywords,
    source_list=["elsevier", "springer", "wiley"]
)

# Extract data with custom RAG settings
scanner.extract_composition_property_data(
    main_extraction_keyword="d33",
    rag_chat_model="gemini-2.5-pro",
    rag_max_tokens=2048,
    rag_top_k=5
)

from comproscanner import data_visualizer, eval_visualizer

# Create knowledge graph
data_visualizer.create_knowledge_graph(result_file="results.json")

# Plot evaluation metrics
eval_visualizer.plot_multiple_radar_charts(
    result_sources=["model1.json", "model2.json"],
    model_names=["GPT-4o", "Claude-3.5"]
)

Requirements:
- Python 3.12 or 3.13
- TDM API keys for desired publishers (Elsevier, Springer, Wiley)
- LLM API keys (OpenAI, Anthropic, Google, etc.)
- Optional: Neo4j for knowledge graph visualization
If you use ComProScanner in your research, please cite:
@misc{roy2025comproscannermultiagentbasedframework,
    title={ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature},
    author={Aritra Roy and Enrico Grisan and John Buckeridge and Chiara Gattinoni},
    year={2025},
    eprint={2510.20362},
    archivePrefix={arXiv},
    primaryClass={physics.comp-ph},
    url={https://arxiv.org/abs/2510.20362},
}
See the CHANGELOG for details on what has changed in each version.
We welcome contributions! Please see our Contributing Guidelines for details.
This project is licensed under the MIT License - see the LICENSE file for details.
Copyright © 2025 SLIMES Lab
Author: Aritra Roy
- Website: aritraroy.live
- Email: contact@aritraroy.live
- GitHub: @aritraroy24
- Twitter: @aritraroy24
Project Links:
- PyPI: pypi.org/project/comproscanner
- Documentation: slimeslab.github.io/ComProScanner
- Issues: github.com/slimeslab/ComProScanner/issues
Made with ❤️ by SLIMES Lab

