jdkdb-scraper - JDK Metadata DB Scraper

A Java-based application for scraping JDK metadata from various vendors. This project replaces the original bash scripts with a robust, parallel Java implementation.

This project is based on Joschi's Java Metadata project and incorporates ideas from the Foojay's Disco API project.

Features

Parallel Execution: Run multiple vendor scrapers concurrently for improved performance
Selective Scraping: Run all scrapers or select specific vendors
Central Reporting: Thread-safe progress reporting with real-time status updates
Extensible Architecture: Easy to add new vendor scrapers
Generic Base Classes: Reduces code duplication for similar vendors (e.g., Semeru versions, Trava versions)
Comprehensive Logging: SLF4J/Logback integration with both console and file output
Multi-command CLI: Separate commands for updating metadata, generating indexes, downloading checksums, and cleaning up old releases
Archive Extraction: Automatically extracts release information from JDK archives

Running using JBang

Running the released version of the scraper application can simply be done with JBang:

# Show available commands
jbang scraper@jbangdev/jdkdb-scraper --help

# Update: Run all scrapers
jbang scraper@jbangdev/jdkdb-scraper update

# Update: List available scrapers
jbang scraper@jbangdev/jdkdb-scraper update --list

# Update: Run specific scrapers
jbang scraper@jbangdev/jdkdb-scraper update --scrapers microsoft,semeru,temurin

# Update: Specify custom directories and control parallelism
jbang scraper@jbangdev/jdkdb-scraper update \
--metadata-dir /path/to/metadata \
--checksum-dir /path/to/checksums \
--threads 4

# Update: Scrape from start (ignore existing metadata)
jbang scraper@jbangdev/jdkdb-scraper update --from-start

# Update: Limit progress for testing
jbang scraper@jbangdev/jdkdb-scraper update --limit-progress 5

# Index: Generate all.json files for all vendors
jbang scraper@jbangdev/jdkdb-scraper index

# Index: Regenerate all.json for specific vendors
jbang scraper@jbangdev/jdkdb-scraper index --vendors temurin,zulu

# Download: Download and compute missing checksums for all vendors
jbang scraper@jbangdev/jdkdb-scraper download

# Download: Process specific vendors
jbang scraper@jbangdev/jdkdb-scraper download --vendors microsoft

# Download: Show statistics only (dry-run)
jbang scraper@jbangdev/jdkdb-scraper download --stats-only

# Clean: Remove incomplete metadata files and prune old EA releases (dry-run)
jbang scraper@jbangdev/jdkdb-scraper clean --dry-run

# Clean: Actually remove incomplete files and prune EA releases older than 6 months
jbang scraper@jbangdev/jdkdb-scraper clean --remove-incomplete=all --prune-ea=6m

# Clean: Remove orphaned checksum files
jbang scraper@jbangdev/jdkdb-scraper clean --prune-checksums

Usage

The application provides four main commands:

update - Scrape JDK metadata from various vendors and update metadata files
index - Generate aggregated all.json files for vendor directories
download - Download and compute checksums for metadata files with missing checksums
clean - Clean up metadata by removing incomplete files and pruning old EA releases

GitHub Authentication

Many scrapers fetch data from GitHub APIs, which have rate limits. To avoid hitting these limits, the application supports GitHub authentication through:

Environment Variable: Set GITHUB_TOKEN to your GitHub personal access token
GitHub CLI: If you have the GitHub CLI (gh) installed and authenticated, the token will be automatically obtained from gh auth token

# Option 1: Using environment variable
export GITHUB_TOKEN=your_github_token_here
java -jar build/libs/jdkdb-scraper-1.0.0-SNAPSHOT-standalone.jar update

# Option 2: Using GitHub CLI (automatic if gh is installed and authenticated)
gh auth login
java -jar build/libs/jdkdb-scraper-1.0.0-SNAPSHOT-standalone.jar update

The application checks for tokens in this order: environment variable first, then GitHub CLI.

Typical usage

You can simply run update in the root of the data repository (where the docs/ folder is located) and let it do its work. It will scrape all the vendor sites, obtain the latest metadata, download the jdk distributions, calculate checksums and update all the indices. Nothing else to be done. But this can take some time.
You can split the work into two steps:

You run update --no-download which will do the scraping and will make sure that we have all the latest distributions cataloged. It will write all the metadata but with missing checksums (and release info).
You then run the much heavier and time-consuming download process that will look for any metadata file that has missing checksums (or release info) and download the archive to do the calculations. You can limit this step to only process a maximum amount of downloads before stopping. And then you can repeat it after a certain wait time to avoid rate limits and such.

The index command can be run any time you think the indices might be out of date and it will generate them all from scratch. NB: if you use the second of the two strategies mentioned above then you will have metadata files with missing checksums, in that case you will need to add the option --allow-incomplete or they won't show up in the indices!

And finally the clean command can be used to get rid of any invalid or orphaned data. If you run it without any flags it will give an overview of everything that could be cleaned up without actually doing any of the cleaning yet.

Command Line Options

Main Command

Usage: jdkdb-scraper [-hV] [COMMAND]
Scrapes JDK metadata from various vendors and generates index files
-h, --help      Show this help message and exit.
-V, --version   Print version information and exit.
Commands:
update    Scrape JDK metadata from various vendors and update metadata files
index     Generate all.json files for vendor directories by aggregating
			individual metadata files
download  Download and compute checksums for metadata files that have missing
			checksum values
clean     Clean up metadata by removing incomplete files and pruning old EA
			releases

Update Command

Usage: jdkdb-scraper update [-hlV] [--from-start] [--no-download] [--no-index]
							[-c=<checksumDir>]
							[--limit-progress=<limitProgress>]
							[-m=<metadataDir>] [--max-failures=<maxFailures>]
							[--skip-ea=<skipEa>] [-t=<maxThreads>]
							[-s=<scraperIds>[,<scraperIds>...]]...

Scrape JDK metadata from various vendors and update metadata files

Options:
-c, --checksum-dir=<checksumDir>
					Directory to store checksum files (default: docs/checksums)
	--from-start   Ignore existing metadata files and scrape all items from
					the start
-h, --help         Show this help message and exit.
-l, --list         List all available scraper IDs and exit
	--limit-progress=<limitProgress>
					Maximum number of metadata items to process per scraper
					before aborting (default: unlimited)
-m, --metadata-dir=<metadataDir>
					Directory to store metadata files (default: docs/metadata)
	--max-failures=<maxFailures>
					Maximum number of allowed failures per scraper before
					aborting that scraper (default: 10)
	--no-download  Skip downloading files and only generate metadata (for
					testing/dry-run)
	--no-index     Skip generating index files (for testing/dry-run)
-s, --scrapers=<scraperIds>[,<scraperIds>...]
					Comma-separated list of scraper IDs to run (if not
					specified, all scrapers run)
	--skip-ea=<skipEa>
					Skip early access (EA) releases older than the specified
					duration (e.g., '6m' for 6 months, '1y' for 1 year)
					(default: 6m)
-t, --threads=<maxThreads>
					Maximum number of parallel scraper threads (default:
					number of processors)
-V, --version      Print version information and exit.

Index Command

Usage: jdkdb-scraper index [-hV] [--allow-incomplete] [-m=<metadataDir>]
						[-v=<vendorNames>[,<vendorNames>...]]...

Generate all.json files for vendor directories by aggregating individual
metadata files

Options:
	--allow-incomplete
					Allow incomplete metadata files (missing checksums) to be
					included
-h, --help         Show this help message and exit.
-m, --metadata-dir=<metadataDir>
					Directory containing metadata files (default:
					docs/metadata)
-v, --vendors=<vendorNames>[,<vendorNames>...]
					Comma-separated list of vendor names to regenerate
					all.json for (if not specified, all vendors are
					processed)
-V, --version      Print version information and exit.

Download Command

Usage: jdkdb-scraper download [-hV] [--stats-only] [-c=<checksumDir>]
							[--limit-progress=<limitProgress>]
							[-m=<metadataDir>] [-t=<maxThreads>]
							[-v=<vendorNames>[,<vendorNames>...]]...

Download and compute checksums for metadata files that have missing checksum
values

Options:
-c, --checksum-dir=<checksumDir>
					Directory to store checksum files (default: docs/checksums)
-h, --help         Show this help message and exit.
	--limit-progress=<limitProgress>
					Maximum number of metadata items to process per scraper
					before aborting (default: unlimited)
-m, --metadata-dir=<metadataDir>
					Directory containing metadata files (default:
					docs/metadata)
	--stats-only   Skip downloading files and only show statistics (for
					testing/dry-run)
-t, --threads=<maxThreads>
					Maximum number of parallel download threads (default:
					number of processors)
-v, --vendors=<vendorNames>[,<vendorNames>...]
					Comma-separated list of vendor names to process (if not
					specified, all vendors are processed)
-V, --version      Print version information and exit.

Clean Command

Usage: jdkdb-scraper clean [-hV] [--dry-run] [--prune-checksums]
						[--remove-invalid] [-c=<checksumDir>]
						[-m=<metadataDir>] [--prune-ea=<pruneEa>]
						[--remove-incomplete=<removeIncomplete>]

Clean up metadata by removing incomplete files and pruning old EA releases

Options:
-c, --checksum-dir=<checksumDir>
					Directory containing checksum files (default:
					docs/checksums)
	--dry-run      Show statistics without actually deleting files
-h, --help         Show this help message and exit.
-m, --metadata-dir=<metadataDir>
					Directory containing metadata files (default:
					docs/metadata)
	--prune-checksums
					Remove orphaned checksum files that don't have a matching
					metadata file
	--prune-ea=<pruneEa>
					Prune EA releases older than specified duration (e.g.,
					30d, 3w, 6m, 1y). Duration format: [number][d|w|m|y]
	--remove-incomplete=<removeIncomplete>
					Remove metadata files with incomplete data. Options:
					checksums (missing checksums), release-info (missing
					release info), all (either missing checksums or release
					info) (default: all)
	--remove-invalid
					Remove metadata files that fail validation
					(MetadataUtils.isValidMetadata)
-V, --version      Print version information and exit.

Development

Requirements

Needs Java 21 or higher

Building

This project uses Gradle for dependency management and building.

# Build the project
./gradlew spotlessApply build

# This creates two jars:
# - jdkdb-scraper-1.0.0-SNAPSHOT.jar (regular jar)
# - jdkdb-scraper-1.0.0-SNAPSHOT-standalone.jar (fat jar with all dependencies)

Running using the standalone JAR

# Show available commands
java -jar build/libs/jdkdb-scraper-1.0.0-SNAPSHOT-standalone.jar --help

# Update: Run all scrapers
java -jar build/libs/jdkdb-scraper-1.0.0-SNAPSHOT-standalone.jar update

Architecture

Core Components

Main: Entry point with Picocli command dispatcher
UpdateCommand: Scrapes JDK metadata from vendors and updates files
IndexCommand: Aggregates individual metadata files into all.json files
DownloadCommand: Downloads JDK files to compute missing checksums
CleanCommand: Cleans up incomplete metadata and prunes old EA releases
ProgressReporter: Central reporting thread that receives and logs progress events from all scrapers
ScraperConfig: Configuration record encapsulating metadata directory, checksum directory, logger, etc
BaseScraper: Abstract base class for all scrapers with common functionality (downloading, hashing, metadata saving)
GitHubReleaseScraper: Specialized base class for scrapers that fetch releases from GitHub
AdoptiumMarketplaceScraper: Specialized base class for scrapers using Adoptium Marketplace API
JavaNetBaseScraper: Specialized base class for scrapers that fetch from java.net archives
ScraperFactory: Factory class that uses ServiceLoader to dynamically discover and instantiate scrapers
Scraper.Discovery: Service provider interface for scraper registration via Java ServiceLoader
DownloadManager: Interface for downloading JDK files (with default and no-op implementations)

Vendor Scrapers

The project includes 35 vendor scrapers, supporting all major JDK distributions:

Scraper IDs and Vendors

Scraper ID	Vendor	Notes
`adoptopenjdk`	AdoptOpenJDK	Legacy
`bisheng`	Bisheng	Huawei
`corretto`	Amazon Corretto
`debian`	Debian
`dragonwell`	Alibaba Dragonwell
`gluon-graalvm`	Gluon GraalVM
`graalvm-ce`	GraalVM CE	Early Access
`graalvm-community`	GraalVM Community
`graalvm-community-ea`	GraalVM Community	Early Access
`graalvm-legacy`	GraalVM	Legacy versions
`ibm`	IBM JDK
`java-se-ri`	Java SE RI	Reference Implementation
`jetbrains`	JetBrains Runtime
`kona`	Tencent Kona
`liberica`	BellSoft Liberica
`liberica-native`	BellSoft Liberica Native	Native Image Kit
`mandrel`	Mandrel	Red Hat's GraalVM
`microsoft`	Microsoft	Microsoft Build of OpenJDK
`openjdk`	OpenJDK
`openjdk-leyden`	OpenJDK Leyden	Project Leyden
`openjdk-loom`	OpenJDK Loom	Project Loom
`openjdk-valhalla`	OpenJDK Valhalla	Project Valhalla
`openlogic`	OpenLogic
`oracle`	Oracle
`oracle-graalvm`	Oracle GraalVM
`oracle-graalvm-ea`	Oracle GraalVM	Early Access
`redhat`	Red Hat
`sapmachine`	SAP SapMachine
`semeru`	IBM Semeru
`semeru-certified`	IBM Semeru Certified
`temurin`	Eclipse Temurin	Eclipse Adoptium
`trava-8`	TravaOpenJDK	Java 8
`trava-11`	TravaOpenJDK	Java 11
`zulu`	Azul Zulu
`zulu-prime`	Azul Zulu Prime

Total: 35 scrapers

Each scraper is registered via Java's ServiceLoader mechanism in META-INF/services.

Adding New Scrapers

Create a new class in src/main/java/dev/jbang/jdkdb/scraper/impl/ extending BaseScraper, GitHubReleaseScraper, AdoptiumMarketplaceScraper, or JavaNetBaseScraper
Implement required abstract methods
Add an inner Discovery class implementing Scraper.Discovery
Register the discovery class in src/main/resources/META-INF/services/dev.jbang.jdkdb.scraper.Scraper$Discovery

Example:

package dev.jbang.jdkdb.scraper.impl;

public class NewVendor extends BaseScraper {
	public NewVendor(ScraperConfig config) {
		super(config);
	}

	@Override
	protected List<JdkMetadata> scrape() throws Exception {
		List<JdkMetadata> allMetadata = new ArrayList<>();

		try {
			// Figure out what items to scrape here...
			var allItems = ...;
			for (item in allItems) {
				try {
					JdkMetadata metadata = scrapeItem(item);
					allMetadata.add(metadata);
				} catch (InterruptedProgressException|TooManyFailuresException e) {
					// We don't want these exceptions to mark the item as failed
					throw e;
				} catch (Exception e) {
					// For recoverable problems we call fail():
					fail(filename, e);
				}
			}
		} catch (InterruptedProgressException ex) {
			// When we receive this exception we stop scraping
			// but we return whatever is already processed
		}

		return allMetadata;
	}

	private JdkMetadata scrapeItem(...) throws Exception {
		// Here we scrape a single item
		JdkMetadata metadata = ...;
		// For each successful item created we call:
		saveMetadataFile(metadata);
		success(filename);
		return metadata;
	}

	// ServiceLoader discovery
	public static class Discovery implements Scraper.Discovery {
		@Override
		public String name() {
			return "new-vendor";
		}

		@Override
		public String vendor() {
			return "New Vendor";
		}

		@Override
		public Scraper create(ScraperConfig config) {
			return new NewVendor(config);
		}
	}
}

Project Structure

src/
├── main/
│   ├── java/
│   │   └── dev/jbang/jdkdb/
│   │       ├── Main.java                          # CLI application entry point
│   │       ├── UpdateCommand.java                 # Update command (scrape metadata)
│   │       ├── IndexCommand.java                  # Index command (generate all.json)
│   │       ├── DownloadCommand.java               # Download command (compute checksums)
│   │       ├── CleanCommand.java                  # Clean command (cleanup metadata)
│   │       ├── model/
│   │       │   └── JdkMetadata.java              # Data model for JDK metadata
│   │       ├── reporting/
│   │       │   ├── ProgressEvent.java            # Progress event types
│   │       │   ├── ProgressReporter.java         # Central reporting thread
│   │       │   └── ProgressReporterLogger.java   # Logger adapter for scrapers
│   │       ├── scraper/
│   │       │   ├── Scraper.java                  # Scraper interface with Discovery SPI
│   │       │   ├── ScraperConfig.java            # Configuration record
│   │       │   ├── ScraperResult.java            # Result wrapper
│   │       │   ├── BaseScraper.java              # Base class for all scrapers
│   │       │   ├── GitHubReleaseScraper.java     # Base for GitHub-based scrapers
│   │       │   ├── AdoptiumMarketplaceScraper.java # Base for Adoptium Marketplace
│   │       │   ├── JavaNetBaseScraper.java       # Base for java.net archives
│   │       │   ├── ScraperFactory.java           # Factory using ServiceLoader
│   │       │   ├── DownloadManager.java          # Interface for downloading JDKs
│   │       │   ├── DefaultDownloadManager.java   # Default implementation
│   │       │   ├── NoOpDownloadManager.java      # No-op for testing
│   │       │   ├── DownloadResult.java           # Download result wrapper
│   │       │   ├── PaginatedIterator.java        # GitHub pagination helper
│   │       │   ├── InterruptedProgressException.java # Exception types
│   │       │   ├── TooManyFailuresException.java
│   │       │   └── impl/                         # Vendor scraper implementations
│   │       │       ├── Temurin.java
│   │       │       ├── Zulu.java
│   │       │       ├── ZuluPrime.java
│   │       │       ├── Liberica.java
│   │       │       ├── LibericaNative.java
│   │       │       ├── Microsoft.java
│   │       │       ├── Corretto.java
│   │       │       ├── SapMachine.java
│   │       │       ├── Semeru.java
│   │       │       ├── SemeruBaseScraper.java
│   │       │       ├── SemeruCertified.java
│   │       │       ├── Oracle.java
│   │       │       ├── OracleGraalVm.java
│   │       │       ├── OracleGraalVmEa.java
│   │       │       ├── Mandrel.java
│   │       │       ├── Dragonwell.java
│   │       │       ├── Kona.java
│   │       │       ├── Bisheng.java
│   │       │       ├── Redhat.java
│   │       │       ├── AdoptOpenJdk.java
│   │       │       ├── Ibm.java
│   │       │       ├── JavaSeReference.java
│   │       │       ├── Jetbrains.java
│   │       │       ├── GraalVmLegacy.java
│   │       │       ├── GraalVmCe.java
│   │       │       ├── GraalVmCommunity.java
│   │       │       ├── GraalVmCommunityEa.java
│   │       │       ├── GluonGraalVm.java
│   │       │       ├── OpenLogic.java
│   │       │       ├── OpenJdk.java
│   │       │       ├── OpenJdkLeyden.java
│   │       │       ├── OpenJdkLoom.java
│   │       │       ├── OpenJdkValhalla.java
│   │       │       ├── Debian.java
│   │       │       ├── TravaBaseScraper.java
│   │       │       ├── TravaJdk8.java
│   │       │       └── TravaJdk11.java
│   │       └── util/
│   │           ├── ArchiveUtils.java             # Archive extraction utilities
│   │           ├── FileUtils.java                # File operations
│   │           ├── HashUtils.java                # Hash computation
│   │           ├── HtmlUtils.java                # HTML parsing utilities
│   │           ├── HttpUtils.java                # HTTP operations
│   │           ├── MetadataUtils.java            # Metadata validation/utilities
│   │           ├── VendorLoggerDiscriminator.java # Logging configuration
│   │           └── VersionComparator.java        # Version comparison
│   └── resources/
│       ├── logback.xml                           # Logging configuration
│       └── META-INF/
│           └── services/
│               └── dev.jbang.jdkdb.scraper.Scraper$Discovery
└── test/
	└── java/
		└── dev/jbang/jdkdb/
			├── scraper/
			│   ├── BaseScraperTest.java
			│   ├── DummyScraperTest.java
			│   ├── ScraperFactoryTest.java
			│   └── ScraperResultTest.java
			└── util/
				├── ArchiveUtilsTest.java
				└── MetadataUtilsTest.java

Dependencies

Jackson: JSON processing (2.16.1)
Java HttpClient (java.net.http): HTTP operations (built-in)
SLF4J/Logback: Logging (SLF4J 2.0.7, Logback 1.4.14)
Picocli: Command-line interface (4.7.5)
Apache Commons Compress: Archive handling (1.27.1)
JUnit 5: Testing (5.10.1)
AssertJ: Fluent assertions for testing (3.27.6)

Output

The scrapers generate structured output in the docs/ directory:

Metadata Files (`docs/metadata/`)

Top-level aggregated indexes:

all.json - All JDK releases across all vendors
ga.json - General Availability (stable) releases only
ea.json - Early Access releases only
latest.json - Latest releases per vendor

Organized by release type (all/, ea/, ga/):

OS-specific files: linux.json, macosx.json, windows.json, aix.json, solaris.json
Architecture-specific subdirectories with further breakdowns

Vendor-specific metadata (vendor/<vendor-name>/):

Individual .json files for each JDK release
all.json file combining all releases for that vendor

Checksum Files (`docs/checksums/`)

Stored in vendor-specific directories: docs/checksums/<vendor-name>/
Contains MD5, SHA1, SHA256, and SHA512 checksum files
Organized to match the corresponding metadata files

Logging

Logs are written to:

Console (STDOUT) - Real-time progress
File (logs/jdkdb-scraper.log) - Detailed execution log

The logging configuration can be customized in src/main/resources/logback.xml.

Requirements

Java 21 or higher
Gradle 8.x (included via wrapper)

License

Same as the original project (see LICENSE file)

Name		Name	Last commit message	Last commit date
Latest commit History 176 Commits
.github		.github
gradle/wrapper		gradle/wrapper
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
build.gradle		build.gradle
gradle.properties		gradle.properties
gradlew		gradlew
gradlew.bat		gradlew.bat
jbang-catalog.json		jbang-catalog.json
openapi.yaml		openapi.yaml

Uh oh!

License

jbangdev/jdkdb-scraper

Folders and files

Latest commit

History

Repository files navigation

jdkdb-scraper - JDK Metadata DB Scraper

Features

Running using JBang

Usage

GitHub Authentication

Typical usage

Command Line Options

Main Command

Update Command

Index Command

Download Command

Clean Command

Development

Requirements

Building

Running using the standalone JAR

Architecture

Core Components

Vendor Scrapers

Scraper IDs and Vendors

Adding New Scrapers

Project Structure

Dependencies

Output

Metadata Files (docs/metadata/)

Checksum Files (docs/checksums/)

Logging

Requirements

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 16

Sponsor this project

Uh oh!

Packages 0

Contributors 3

Uh oh!

Languages

Metadata Files (`docs/metadata/`)

Checksum Files (`docs/checksums/`)

Packages