SciSciGPT is an AI collaborator designed to advance human-AI collaboration in the science of science (SciSci) field. As scientific research becomes increasingly complex and data-intensive, researchers face growing technical barriers that limit accessibility and efficiency. SciSciGPT addresses these challenges by leveraging recent advances in large language models and AI agents to create a comprehensive research assistant.
The system offers both a public chat interface at sciscigpt.com and a fully open-source implementation on GitHub. SciSciGPT integrates multiple research capabilities: retrieving relevant SciSci publications, extracting data from complex databases, conducting advanced analytics, creating visualizations, and evaluating its own outputs. This seamless AI-powered workflow lowers technical barriers, enhances research efficiency, and enables new modes of human-AI collaboration.
SciSciGPT is a multi-agent AI system comprising a ResearchManager and four specialist agents, each designed to handle a distinct aspect of the science of science research process:
- ResearchManager - Orchestrates workflows by breaking complex questions into tasks and coordinating specialist agents
- LiteratureSpecialist - Uses retrieval-augmented generation (RAG) to process SciSci publications with metadata filters and progressive analysis of abstracts, methodology, and results
- DatabaseSpecialist - Navigates the SciSci data lake using SQL tools and embedding-based entity matching to extract and preprocess scholarly data
- AnalyticsSpecialist - Conducts analysis using Python, R, and Julia in isolated sandboxes to generate insights and visualizations
- EvaluationSpecialist - Performs multi-level quality assessments of tools, visualizations, and workflows, assigning reward scores to guide system improvements and ensure reliable research outputs
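The delegation pattern above can be sketched in miniature. This is an illustrative toy only, not the actual LangGraph-based implementation: the specialist functions, their return strings, and the task plan are all hypothetical placeholders.

```python
# Illustrative sketch: a manager routes subtasks to specialist handlers.
# The real system orchestrates LLM agents via LangGraph; these plain
# functions are stand-ins for illustration only.
from typing import Callable, Dict, List, Tuple

def literature_specialist(task: str) -> str:
    return f"[literature] retrieved papers for: {task}"

def database_specialist(task: str) -> str:
    return f"[database] extracted data for: {task}"

def analytics_specialist(task: str) -> str:
    return f"[analytics] produced analysis for: {task}"

SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "literature": literature_specialist,
    "database": database_specialist,
    "analytics": analytics_specialist,
}

def research_manager(plan: List[Tuple[str, str]]) -> List[str]:
    """Execute a plan of (specialist, task) steps in order and
    collect each specialist's output."""
    return [SPECIALISTS[name](task) for name, task in plan]

results = research_manager([
    ("literature", "prior work on citation dynamics"),
    ("database", "papers published 2010-2020"),
    ("analytics", "fit a citation decay curve"),
])
```

In the real system the plan is produced by the ResearchManager's LLM rather than hard-coded, and each specialist is itself an agent with its own tools.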
Software Service Ecosystem: The SciSciGPT backend is built on a comprehensive software stack that integrates multiple services and technologies to enable seamless AI-powered research workflows. The entire SciSciGPT system is developed and deployed on Ubuntu 24.04 LTS with Anaconda 2025.06, providing a robust foundation for multi-agent coordination and scientific computing.
- AI Orchestration: LangChain & LangGraph for multi-agent workflow coordination
- LLM Provider: Google Cloud Vertex AI with Claude models (Anthropic)
- Text Embedding: OpenAI text-embedding models powering vector search in the LiteratureSpecialist and DatabaseSpecialist
- Application Framework: FastAPI with LangServe for API endpoints
- Vector Database: Pinecone for semantic search in RAG
- Relational Database: Google BigQuery for hosting SciSci scholarly relational database
- File System: Google Cloud Storage for file uploads, artifacts, and visualizations
- Python: Jupyter kernel with pre-installed scientific packages
- R: Comprehensive statistical computing environment with 250+ packages
- Julia: High-performance scientific computing
The system executes code in isolated sandboxes with persistent state management within sessions.
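A minimal sketch of session-persistent state, assuming a shared per-session namespace. The real sandbox runs a Jupyter kernel per session; a plain dictionary passed to `exec` stands in here to show why later executions can see earlier results.

```python
# Illustrative sketch: state persists across executions because every
# run() call shares the same namespace dict, mimicking how a Jupyter
# kernel keeps variables alive within a session.
class Session:
    def __init__(self):
        self.namespace = {}  # persists between run() calls

    def run(self, code: str):
        exec(code, self.namespace)

s = Session()
s.run("x = 21")
s.run("y = x * 2")       # sees x from the earlier call
print(s.namespace["y"])  # → 42
```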
- Memory: 16GB+ RAM recommended for full functionality
- Storage: 100GB+ for dependencies and datasets
- Network: Internet connection for API calls to LLM providers
Our dependency installation is divided into two parts:
The minimal setup is based on Anaconda and includes only the essential packages needed to launch and run SciSciGPT, including LangChain-related packages for agent orchestration, Google BigQuery client libraries for database access, Pinecone client for vector database operations, and R and Julia environments configured via conda-forge.
Within the Anaconda environment, the AnalyticsSpecialist uses a unified jupyter_client interface with rpy2 and juliacall, allowing it to execute R and Julia code within the same Python-based environment for seamless multi-language scientific computing. Completing Part 1 is sufficient to get SciSciGPT up and running.
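The multi-language routing can be sketched as follows. This is a hedged illustration, not the production executor: the comments show the rpy2 and juliacall calls that would be used, but those branches are stubbed so the example runs without either package installed.

```python
# Illustrative dispatcher: route code strings to a language backend.
# In the real backend, the R branch would use rpy2 and the Julia branch
# juliacall (see comments); both are stubbed in this sketch.
def execute(language: str, code: str, namespace: dict) -> object:
    if language == "python":
        return eval(code, namespace)
    if language == "r":
        # with rpy2: import rpy2.robjects as ro; return ro.r(code)
        raise NotImplementedError("rpy2 not installed in this sketch")
    if language == "julia":
        # with juliacall: from juliacall import Main; return Main.seval(code)
        raise NotImplementedError("juliacall not installed in this sketch")
    raise ValueError(f"unsupported language: {language}")

ns = {}
exec("total = sum(range(10))", ns)       # stateful Python execution
result = execute("python", "total + 5", ns)
print(result)  # → 50
```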
To install the minimal dependencies, navigate to the backend directory and run:
cd backend
conda create -n sciscigpt python=3.11 -y
conda activate sciscigpt
# Install Python dependencies
pip install -r requirements.txt
# Install R using conda-forge
conda install -c conda-forge r-essentials r-base -y
conda install rpy2 -y
# Install Julia using conda-forge
conda install -c conda-forge julia -y
pip install juliacall

The second part aims to ensure the AnalyticsSpecialist can handle a wide variety of computational tasks. Since the AnalyticsSpecialist may generate arbitrary code during execution, we want to minimize the likelihood of runtime failures due to missing packages. Therefore, we install a comprehensive collection of Python, R, and Julia packages that cover most common scientific computing scenarios.
For users who simply want to verify that SciSciGPT's architecture works correctly or for demonstration purposes, Part 2 dependencies are optional. However, for production use where the AnalyticsSpecialist needs to handle diverse analytical tasks, installing the full dependency set is recommended.
To install the comprehensive dependencies, navigate to the backend directory and run:
cd backend
bash setup-sandbox.sh

This script will automatically install all packages within the existing sciscigpt environment:
- Python packages from requirements-sandbox-python.txt
- R packages from requirements-sandbox-r.R
- Julia packages from requirements-sandbox-julia.jl
SciSciGPT leverages three core Google Cloud Platform services to provide comprehensive research capabilities:
- Vertex AI provides the backbone LLM API service, enabling SciSciGPT to utilize Anthropic Claude models through Google's Vertex AI platform for natural language understanding, reasoning, and generation across all specialist agents
- BigQuery serves as the data warehouse for SciSciNet, allowing SciSciGPT's DatabaseSpecialist to view, read, and execute SQL queries against large-scale scholarly datasets containing publications, citations, authors, and institutional data
- Cloud Storage hosts and manages file artifacts including BigQuery query results, AnalyticsSpecialist's intermediate computational outputs, and generated visualizations, enabling seamless file communication between frontend and backend systems through public URLs
GCP Project: Create a Google Cloud Platform Project (GCP_PROJECT_ID)
BigQuery Database: Enable the BigQuery API in your GCP project and create a BigQuery dataset (BIGQUERY_DATASET_ID) to store your SciSci research data.
Cloud Storage:
- Create a Google Cloud Storage bucket (GCS_BUCKET_NAME) in your GCP project
- Configure the bucket for public access (refer to Make data public for more details):
  - Go to Cloud Storage > Browser in GCP Console
  - Select your bucket > Permissions tab
  - Add principal: allUsers and assign role: Storage Object Viewer
- Files, including SQL query results, intermediate outputs, and visualizations, will be uploaded to GCS and served via public URLs
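With allUsers granted Storage Object Viewer, every object in the bucket becomes readable at a predictable public URL. A small sketch of that URL convention (the bucket name and object path below are placeholders, not values from this deployment):

```python
# Sketch of GCS public-URL construction: once the bucket is publicly
# readable, objects are served at storage.googleapis.com/<bucket>/<path>.
from urllib.parse import quote

def public_url(bucket: str, object_path: str) -> str:
    # Object paths may contain spaces or unicode; percent-encode them
    # while keeping "/" separators intact (quote's default behavior).
    return f"https://storage.googleapis.com/{bucket}/{quote(object_path)}"

url = public_url("my-sciscigpt-bucket", "figures/citation trend.png")
print(url)  # → https://storage.googleapis.com/my-sciscigpt-bucket/figures/citation%20trend.png
```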
Vertex AI: Enable the Vertex AI API in your GCP_PROJECT_ID project and enable the required Anthropic Claude models (for example, claude-3-5-sonnet-v2@20241022) in Vertex AI's Model Garden.
**Service Account:** Create a service account (refer to Create service accounts) and grant the necessary IAM roles (refer to Grant an IAM role). The service account requires the following roles:
- BigQuery Data Viewer: permission to read datasets and tables in Google BigQuery
- BigQuery Job User: permission to create query jobs via the Google BigQuery API
- BigQuery Read Session User: permission to use the BigQuery Storage API
- Storage Object Admin: permission to read and upload files in the GCS bucket
- Vertex AI User: permission for LLM querying and text embedding
Then, create and download the service account's JSON credential key file and save it as .google_application_credentials.json in the backend directory; it should match the format of the example google_application_credentials.json key.
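A downloaded service-account key is a JSON file with a fixed set of fields; a quick sanity check like the sketch below can catch a malformed or truncated key before the backend tries to authenticate with it. All values in the sample are dummies, not real credentials.

```python
# Sanity-check sketch for a GCP service-account key file: verify the
# standard fields are present and the type is "service_account".
import json

REQUIRED_FIELDS = {
    "type", "project_id", "private_key_id", "private_key",
    "client_email", "client_id", "token_uri",
}

def validate_key(raw: str) -> bool:
    data = json.loads(raw)
    missing = REQUIRED_FIELDS - data.keys()
    return data.get("type") == "service_account" and not missing

# Placeholder key (dummy values for illustration only):
sample = json.dumps({
    "type": "service_account",
    "project_id": "GCP_PROJECT_ID",
    "private_key_id": "dummy",
    "private_key": "-----BEGIN PRIVATE KEY-----\\ndummy\\n-----END PRIVATE KEY-----\\n",
    "client_email": "sciscigpt@GCP_PROJECT_ID.iam.gserviceaccount.com",
    "client_id": "0",
    "token_uri": "https://oauth2.googleapis.com/token",
})
print(validate_key(sample))  # → True
```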
SciSciGPT uses Pinecone as a vector database to host the SciSciCorpus, a comprehensive collection of scientific literature. The corpus is indexed using vector embeddings to enable semantic search capabilities, allowing the LiteratureSpecialist agent to perform Retrieval-Augmented Generation (RAG) operations for Science of Science research. Moreover, SciSciGPT uses Pinecone as a vector database to host standardized institution and field names from SciSciNet, enabling semantic search capabilities for name disambiguation and standardization. This allows the DatabaseSpecialist agent to perform fuzzy matching and find the most appropriate standardized names when users provide variations or partial matches.
Setup Steps
- Create a Pinecone account at https://www.pinecone.io;
- Create an index named NAME_SEARCH_INDEX with the following configuration:
  - Dimensions: 1536 (compatible with OpenAI's text-embedding-3-small model)
  - Metric: Cosine similarity
- Create an index named SCISCICORPUS_INDEX with the following configuration:
  - Dimensions: 3072 (compatible with OpenAI's text-embedding-3-large model)
  - Metric: Cosine similarity
- Generate a Pinecone API key with full permissions from your Pinecone dashboard (Manage API keys);
- Obtain an OpenAI API key for text embedding generation (OpenAI API keys).
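Each index's dimension must equal the output dimension of the embedding model that populates it (1536 for text-embedding-3-small, 3072 for text-embedding-3-large), or upserts will fail. A small consistency-check sketch; the index configuration dictionaries below mirror the setup steps above and are illustrative, not read from a live Pinecone account:

```python
# Consistency-check sketch: verify that each planned Pinecone index's
# dimension matches its embedding model's output dimension.
EMBEDDING_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

# Hypothetical configurations mirroring the setup steps above.
INDEXES = {
    "NAME_SEARCH_INDEX":  {"model": "text-embedding-3-small", "dimension": 1536, "metric": "cosine"},
    "SCISCICORPUS_INDEX": {"model": "text-embedding-3-large", "dimension": 3072, "metric": "cosine"},
}

def check(indexes: dict) -> list:
    """Return the names of indexes whose dimension mismatches their model."""
    return [name for name, cfg in indexes.items()
            if cfg["dimension"] != EMBEDDING_DIMS[cfg["model"]]]

print(check(INDEXES))  # → []
```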
Within the backend directory, create a .env file using .env.example as template:
GOOGLE_APPLICATION_CREDENTIALS=.google_application_credentials.json
GOOGLE_BIGQUERY_URI=bigquery://GCP_PROJECT_ID/BIGQUERY_DATASET_ID
LOCAL_STORAGE_PATH=/tmp/sandbox
GCS_BUCKET_NAME=GCS_BUCKET_NAME
GCS_BUCKET_URL=https://storage.googleapis.com/GCS_BUCKET_NAME
PINECONE_API_KEY=PINECONE_API_KEY
OPENAI_API_KEY=OPENAI_API_KEY
SCISCICORPUS_INDEX=SCISCICORPUS_INDEX
SCISCICORPUS_NAMESPACE=SCISCICORPUS_NAMESPACE
NAME_SEARCH_INDEX=NAME_SEARCH_INDEX

Execute the backend/SciSciNet-Relational.ipynb notebook, which will automatically download all SciSciNet tables from the SciSciGPT-SciSciNet HuggingFace repository and upload them to Google BigQuery with proper schema definitions, table descriptions, and column descriptions. The final dataset comprises 19 comprehensive tables: authors, fields, institutions, nct, newsfeed, nih, nsf, paper_author_affiliations, paper_citations, paper_fields, paper_nct, paper_newsfeed, paper_nih, paper_nsf, paper_patents, paper_twitter, papers, patents, and twitter. This collection provides extensive coverage of the innovation ecosystem, enabling comprehensive Science of Science research capabilities.
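Once these tables are in BigQuery, the DatabaseSpecialist composes SQL against them. A sketch of the kind of query it might build; note that the column names (paper_id, field_id, field_name, year) are assumptions for illustration, not the verified SciSciNet schema, and the dataset path is a placeholder:

```python
# Illustrative query builder only: column names are assumed, not taken
# from the actual SciSciNet schema. The dataset path is a placeholder.
DATASET = "GCP_PROJECT_ID.BIGQUERY_DATASET_ID"

def papers_per_field_sql(year: int) -> str:
    """Count papers per field for a given year, joining the papers,
    paper_fields, and fields tables listed above."""
    assert isinstance(year, int)  # guard: never interpolate raw strings
    return f"""
    SELECT f.field_name, COUNT(*) AS n_papers
    FROM `{DATASET}.papers` p
    JOIN `{DATASET}.paper_fields` pf ON p.paper_id = pf.paper_id
    JOIN `{DATASET}.fields` f ON pf.field_id = f.field_id
    WHERE p.year = {year}
    GROUP BY f.field_name
    ORDER BY n_papers DESC
    """

sql = papers_per_field_sql(2015)
```

In the running system this string would be submitted through the BigQuery client library; here it is only constructed.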
Execute the backend/SciSciNet-Vector.ipynb notebook, which will automatically download the fields and institutions tables from the SciSciGPT-SciSciNet HuggingFace repository. The notebook will automatically generate embeddings for field_name and institution_name using OpenAI's text-embedding-3-small model, indexing each entity with its embedding vector along with other metadata. The resulting NAME_SEARCH_INDEX vector database serves the name disambiguation and standardization capabilities of the DatabaseSpecialist.
Execute the backend/SciSciCorpus.ipynb notebook to build your vector database. The notebook downloads the SciSciCorpus dataset from the SciSciGPT-SciSciCorpus HuggingFace repository, generates vector embeddings for research paper abstracts using OpenAI's text-embedding-3-large model, indexes each document with its embedding vector along with metadata like full textual content and bibliometric information, and uploads the indexed data to your Pinecone vector database. The resulting SCISCICORPUS_INDEX vector database will contain semantically searchable scientific literature with associated metadata, enabling the LiteratureSpecialist to perform sophisticated RAG-based research queries and literature discovery.
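The semantic search underlying these RAG queries amounts to ranking documents by cosine similarity between a query embedding and document embeddings. A minimal sketch, using tiny 3-dimensional vectors purely for illustration (real embeddings are 3072-dimensional text-embedding-3-large vectors, and the ranking runs inside Pinecone rather than in Python):

```python
# Minimal cosine-similarity retrieval sketch. The document names and
# toy 3-dim vectors are made up for illustration.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

docs = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.9, 0.1, 0.0],
    "paper_c": [0.0, 0.0, 1.0],
}
query = [1.0, 0.05, 0.0]

# Rank documents by similarity to the query, most similar first.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # → paper_a
```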
Deploy the Backend of SciSciGPT:
bash start.sh

The API will be available at:
- LangServe API endpoints: http://localhost:8080
- Main SciSciGPT agent: http://localhost:8080/sciscigpt
- Individual tools: http://localhost:8080/redoc
The backend implements SciSciGPT and uses LangServe to package the SciSciGPT agent as a FastAPI application that can be connected to and invoked remotely, with support for event streaming. http://localhost:8080/sciscigpt is the entry point of the main SciSciGPT agent, while the entry points of all other tools can be found at http://localhost:8080/redoc.
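LangServe exposes each runnable at POST {base}/invoke and POST {base}/stream with a JSON body of the form {"input": ...}. The sketch below only builds such a request; the exact input schema of the SciSciGPT agent is an assumption here, and actually sending the request (commented line) requires the backend to be running:

```python
# Hedged sketch of invoking a LangServe endpoint with the stdlib.
# The {"input": {"messages": [...]}} payload shape is an assumption
# for illustration; check the backend's /redoc page for the real schema.
import json
import urllib.request

def build_invoke_request(base_url: str, question: str) -> urllib.request.Request:
    body = json.dumps({"input": {"messages": [{"role": "user", "content": question}]}})
    return urllib.request.Request(
        f"{base_url}/invoke",
        data=body.encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_invoke_request("http://localhost:8080/sciscigpt",
                           "How has team size in science evolved?")
# resp = urllib.request.urlopen(req)  # requires a running backend
print(req.full_url)  # → http://localhost:8080/sciscigpt/invoke
```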
This is the frontend for SciSciGPT, an advanced LLM-agent research assistant for the field of the science of science (SciSci). This project is adapted from the Vercel AI Chatbot and is designed to integrate with an LLM agent API backend served by LangServe.
Features:
- User-friendly interface for interacting with the SciSciGPT agent
- Designed to connect with a LangServe-powered backend for LLM agent capabilities
- Streaming output from SciSciGPT with live response updates
- Access to and downloading of generated artifacts, including tables, code outputs, and visualizations
- Chat history management with persistent conversations
Create a Redis KV database using Upstash at https://console.upstash.com/redis. The frontend code will automatically handle user authentication and chat history management. SciSciGPT uses Upstash Redis, a hash-table-based key-value store, to store and retrieve information keyed by user ID and chat history ID.
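The storage layout can be sketched as follows: chats keyed by chat ID, plus a per-user index of chat IDs. In Upstash these would be Redis hashes and lists; a plain dictionary stands in here purely to illustrate the access pattern (all IDs and messages are made up):

```python
# Illustrative chat-history layout sketch (dict standing in for Redis).
store = {"chats": {}, "user_chats": {}}

def save_chat(user_id: str, chat_id: str, messages: list):
    """Store a chat under its ID and index it under its owner."""
    store["chats"][chat_id] = {"user_id": user_id, "messages": messages}
    store["user_chats"].setdefault(user_id, []).append(chat_id)

def get_user_chats(user_id: str) -> list:
    """Return all chats belonging to a user, via the per-user index."""
    return [store["chats"][cid] for cid in store["user_chats"].get(user_id, [])]

save_chat("user-1", "chat-42", [{"role": "user", "content": "hello"}])
print(len(get_user_chats("user-1")))  # → 1
```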
Follow these steps:
- Visit https://console.upstash.com/redis and create a new Redis database
- Obtain the API credentials (URL and token) from your Upstash dashboard
- Update your environment variables (KV_URL, KV_REST_API_URL, KV_REST_API_TOKEN, KV_REST_API_READ_ONLY_TOKEN) in the .env.local file (using .env.example as a template) with the credentials provided by Upstash
- Set up the LangServe backend API endpoint:
  - LANGSERVE_URL: the URL where your LangServe backend is running (default: http://localhost:8080/sciscigpt for local development)
  - LOCAL_STORAGE_PATH: the local path of the SciSciGPT workspace, which must match the backend configuration
- Configure authentication and deployment URLs:
  - NEXTAUTH_SECRET: secret key for NextAuth.js authentication (default: sciscigpt)
  - NEXTAUTH_URL: the URL where your frontend will be deployed (default: http://localhost:3000 for local development); for production deployment, update this to your actual domain, e.g. https://your-domain.com
KV_REST_API_READ_ONLY_TOKEN="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx_xx-xxxxxxxxxxxxxx"
KV_REST_API_TOKEN="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
KV_REST_API_URL="https://abcde-fghij-12345.upstash.io"
KV_URL="redis://default:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx@abcde-fghij-12345.upstash.io:1234"
LANGSERVE_URL="http://localhost:8080/sciscigpt"
LOCAL_STORAGE_PATH="/tmp/sandbox"
NEXTAUTH_SECRET="sciscigpt"
NEXTAUTH_URL="http://localhost:3000"
ENABLE_REDIRECT="false"
REDIRECT_TARGET="https://localhost:3000"

Install Node.js, pnpm, and all other required packages (listed in frontend/package.json) within the sciscigpt environment using conda-forge:
cd frontend
bash setup.sh

Deploy the SciSciGPT frontend:
cd frontend
bash start.sh

The user chat interface will be available at http://localhost:3000.
Important: Do not modify the versions of packages that are explicitly specified in backend/requirements.txt and frontend/package.json. These version constraints are critical for system stability:
Changing the versions of BigQuery-related packages may cause loss of access to BigQuery databases, incomplete table displays (especially missing table column descriptions), and DatabaseSpecialist functionality failures. Modifying Pinecone package versions may result in vector database connection or formatting issues and cause the name search and literature search tools to malfunction. For packages without explicit version constraints, versions can be adjusted based on your system requirements and compatibility needs.
If the DatabaseSpecialist is not functioning properly, verify that your Google Cloud service account has the proper IAM permissions, which should include at least all of the IAM roles mentioned above. Also, ensure that the .google_application_credentials.json file for the corresponding service account is properly placed in the backend directory.
Google Cloud Storage Configuration: Ensure that your Google Cloud Storage bucket has public access enabled by granting the allUsers principal access to your storage bucket. This allows the frontend to display figures generated by the SciSciGPT backend via public URLs and enables download links for generated tables and results. Without public access, generated visualizations and downloadable content will not be accessible from the frontend.
LOCAL_STORAGE_PATH Configuration: Ensure that the LOCAL_STORAGE_PATH environment variable is set to the same path in both the frontend and backend configurations. This synchronization is critical for proper file sharing and workspace management between the frontend interface and backend processing components.
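This synchronization can be checked mechanically by parsing both .env files and comparing the variable. The sketch below simulates the two files with inline strings (the parser is a simplified illustration, not the loader the system actually uses):

```python
# Hedged sketch: parse .env-style text and confirm LOCAL_STORAGE_PATH
# agrees between backend and frontend configurations. Inline strings
# stand in for the real .env and .env.local files.
def parse_env(text: str) -> dict:
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip().strip('"')
    return env

backend_env = parse_env('LOCAL_STORAGE_PATH="/tmp/sandbox"\n')
frontend_env = parse_env(
    'LOCAL_STORAGE_PATH="/tmp/sandbox"\nNEXTAUTH_URL="http://localhost:3000"\n'
)

consistent = backend_env["LOCAL_STORAGE_PATH"] == frontend_env["LOCAL_STORAGE_PATH"]
print(consistent)  # → True
```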
Because SciSciGPT's dependencies are built on Anaconda, the setup process described above can be replicated across different operating systems. We have successfully tested the installation procedures above on Ubuntu 24.04 LTS, CentOS Stream 10, macOS Sequoia 15.5, and Windows 11 using Anaconda; SciSciGPT operates normally on all tested platforms.
If you use SciSciGPT in your research, please cite:
@article{sciscigpt2025,
  title={SciSciGPT: Advancing Human-AI Collaboration in the Science of Science},
  author={Erzhuo Shao and Yifang Wang and Yifan Qian and Zhenyu Pan and Han Liu and Dashun Wang},
  journal={arXiv preprint arXiv:2504.05559},
  year={2025}
}