An automated research paper and AI news digest pipeline that collects, deduplicates, ranks, and renders daily reports from multiple sources.
- arXiv - Academic papers via RSS and API
- RSS/Atom Feeds - Blog posts and news from any RSS source
- GitHub Releases - Track releases from repositories
- Hugging Face - Model releases by organization
- OpenReview - Conference paper submissions
- Papers With Code - Trending papers and implementations
- HTML Scraping - Custom HTML list and profile extraction
- Story Linking - Automatically links related items across sources
- Deduplication - Identifies and merges duplicate content
- Entity Matching - Associates items with tracked entities (companies, labs, researchers)
- Topic Matching - Categorizes content by configurable topic patterns
- Configurable Scoring - Weight factors for tier, recency, entity relevance, and topic hits (see the sketch after this list)
- Quota Management - Control output distribution across sections
- Section Assignment - Organizes content into Top 5, Model Releases, Papers, and Radar sections
- Responsive HTML - Mobile-friendly daily digest pages
- Archive Pages - Historical daily reports
- Source Status - Health monitoring dashboard for all sources
- JSON API - Machine-readable daily output
- GitHub Actions - Automated daily pipeline execution
- GitHub Pages - Zero-config static site deployment
- State Persistence - SQLite database with incremental updates
- Structured Logging - JSON logs with run context for observability
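
To make the scoring factors concrete, here is a minimal sketch of how tier, recency, entity relevance, and topic hits might be combined. The weight values, tier scores, and function name are illustrative assumptions, not the project's actual API; the real pipeline reads its weights from configuration.

```python
from datetime import datetime, timezone

# Hypothetical weights; in this project they would come from configuration.
WEIGHTS = {"tier": 3.0, "recency": 2.0, "entity": 1.5, "topic": 1.0}
TIER_SCORES = {0: 1.0, 1: 0.6, 2: 0.3}  # primary > secondary > tertiary


def score_item(tier: int, published: datetime,
               entity_hits: int, topic_hits: int) -> float:
    """Toy weighted sum over the four factors listed above."""
    age_days = (datetime.now(timezone.utc) - published).days
    recency = max(0.0, 1.0 - age_days / 7.0)  # linear decay over one week
    return (
        WEIGHTS["tier"] * TIER_SCORES.get(tier, 0.0)
        + WEIGHTS["recency"] * recency
        + WEIGHTS["entity"] * min(entity_hits, 3) / 3
        + WEIGHTS["topic"] * min(topic_hits, 3) / 3
    )
```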
```
┌─────────────────────────────────────────────────────────────────┐
│                          Configuration                          │
│           (sources.yaml, entities.yaml, topics.yaml)            │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                           Collectors                            │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐   │
│  │  arXiv  │ │   RSS   │ │ GitHub  │ │   HF    │ │  HTML   │   │
│  └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘   │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                          Story Linker                           │
│            (Deduplication, Entity Matching, Linking)            │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                             Ranker                              │
│         (Scoring, Quota Filtering, Section Assignment)          │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                            Renderer                             │
│               (HTML Templates, JSON API, Archive)               │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                             Output                              │
│                 (GitHub Pages / Static Files)                   │
└─────────────────────────────────────────────────────────────────┘
```
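
As a concrete illustration of the Story Linker stage, the sketch below deduplicates items by normalized title. The normalization rule is an assumption for illustration, not necessarily the algorithm the project uses.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace so near-identical
    titles from different feeds map to the same key."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", title.lower())).strip()


def dedupe(items: list[dict]) -> list[dict]:
    """Keep the first item seen for each normalized title."""
    seen: set[str] = set()
    unique: list[dict] = []
    for item in items:
        key = normalize_title(item["title"])
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```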
- Python 3.13+
- uv package manager
```bash
# Clone the repository
git clone https://github.com/DennySORA/auto_paper_report.git
cd auto_paper_report

# Install dependencies
uv sync
```

Create your configuration files:
`sources.yaml` - Define data sources:

```yaml
version: "1.0"
defaults:
  max_items: 50
sources:
  - id: openai-blog
    name: OpenAI Blog
    url: https://openai.com/blog/rss.xml
    tier: 0
    method: rss_atom
    kind: blog
    timezone: America/Los_Angeles
  - id: arxiv-cs-ai
    name: arXiv cs.AI
    url: https://rss.arxiv.org/rss/cs.AI
    tier: 1
    method: rss_atom
    kind: paper
    timezone: UTC
```
`entities.yaml` - Define tracked entities:

```yaml
version: "1.0"
entities:
  - id: openai
    name: OpenAI
    aliases: ["OpenAI", "open-ai"]
    prefer_links: [official, github, arxiv]
```
`topics.yaml` - Define topic patterns and scoring:

```yaml
version: "1.0"
topics:
  - id: llm
    name: Large Language Models
    patterns: ["LLM", "language model", "GPT", "transformer"]
```
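
Entity aliases and topic patterns are plain strings; a minimal sketch of how they might be matched against item text is shown below. The case-insensitive word-boundary regex is an illustrative assumption, not necessarily the project's matching rule.

```python
import re


def count_hits(text: str, patterns: list[str]) -> int:
    """Count how many patterns occur in the text as whole words, ignoring case."""
    return sum(
        1
        for pattern in patterns
        if re.search(rf"\b{re.escape(pattern)}\b", text, flags=re.IGNORECASE)
    )


# count_hits("A GPT-style language model", ["LLM", "language model", "GPT"]) == 2
```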
```bash
# Validate configuration
uv run python main.py validate \
  --config config/sources.yaml \
  --entities config/entities.yaml \
  --topics config/topics.yaml

# Run the full pipeline
uv run python main.py run \
  --config config/sources.yaml \
  --entities config/entities.yaml \
  --topics config/topics.yaml \
  --state state.sqlite \
  --out public \
  --tz Asia/Taipei
```
| Command | Description |
|---|---|
| `run` | Execute the full digest pipeline |
| `validate` | Validate configuration files |
| `render` | Render static pages from test data |
| `db-stats` | Display state database statistics |
```bash
# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src --cov-report=html

# Run specific test file
uv run pytest tests/unit/test_ranker/test_scorer.py

# Linting
uv run ruff check .
uv run ruff check . --fix

# Formatting
uv run ruff format .

# Type checking
uv run mypy .

# Security scanning
uv run bandit -r src/
```

The project includes a GitHub Actions workflow for automated daily execution:
- Fork this repository
- Enable GitHub Pages in repository settings
- Configure secrets (if using authenticated APIs):
  - `HF_TOKEN` - Hugging Face API token
  - `OPENREVIEW_TOKEN` - OpenReview API token
- The workflow runs daily at 07:00 Asia/Taipei time
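
Note that GitHub Actions cron schedules are evaluated in UTC, so 07:00 Asia/Taipei (UTC+8) corresponds to 23:00 UTC the previous day. A schedule block for such a workflow might look like this sketch; the repository's actual workflow file may differ:

```yaml
on:
  schedule:
    - cron: "0 23 * * *"  # 23:00 UTC == 07:00 Asia/Taipei (UTC+8)
  workflow_dispatch: {}   # also allow manual runs
```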
```
auto_paper_report/
├── src/
│   ├── cli/              # Command-line interface
│   ├── collectors/       # Data source collectors
│   │   ├── arxiv/        # arXiv API and RSS
│   │   ├── platform/     # GitHub, HuggingFace, OpenReview
│   │   └── html_profile/ # HTML scraping profiles
│   ├── config/           # Configuration loading and schemas
│   ├── evidence/         # Audit trail capture
│   ├── fetch/            # HTTP client with caching
│   ├── linker/           # Story linking and deduplication
│   ├── ranker/           # Scoring and ranking
│   ├── renderer/         # HTML/JSON generation
│   ├── status/           # Source health monitoring
│   └── store/            # SQLite state persistence
├── tests/
│   ├── unit/             # Unit tests
│   ├── integration/      # Integration tests
│   └── fixtures/         # Test data
├── public/               # Generated static site
└── .github/workflows/    # CI/CD pipelines
```
| Method | Description |
|---|---|
| `rss_atom` | RSS/Atom feed parsing |
| `arxiv_api` | arXiv API queries |
| `github_releases` | GitHub repository releases |
| `hf_org` | Hugging Face organization models |
| `hf_daily_papers` | Hugging Face Daily Papers |
| `openreview_venue` | OpenReview venue submissions |
| `papers_with_code` | Papers With Code trending |
| `html_list` | HTML page link extraction |
| Tier | Description |
|---|---|
| 0 | Primary sources (official blogs, releases) |
| 1 | Secondary sources (aggregators, news) |
| 2 | Tertiary sources (social media, forums) |
MIT License - see LICENSE for details.
Contributions are welcome! Please read the CLAUDE.md file for coding guidelines and development standards.