
tchivs commented Nov 21, 2025

Purpose

This PR refactors CatalogContext to separate Hadoop dependencies, enabling Paimon to work in Hadoop-free environments.

Closes #6654

Background and Motivation

Trino Plugin Development Requirement

This change is essential for developing the Trino-Paimon connector: Trino explicitly disallows connectors with mandatory Hadoop dependencies.

The previous paimon-trino implementation carried such a dependency, which caused deployment problems in Trino environments where Hadoop is not available or desired.

Quote from Trino Policy (trinodb/trino#15921):

Trino connectors should not have a hard dependency on Hadoop. Connectors must work without Hadoop on the classpath.

Problem Statement

Currently, CatalogContext has a hard dependency on Hadoop's Configuration class, causing a NoClassDefFoundError in environments where Hadoop is not on the classpath (see the stack trace in use case 2 below).

Use Cases Affected

1. 🎯 Trino-Paimon Connector (Primary Use Case)

  • Critical Blocker: Trino's connector architecture prohibits mandatory Hadoop dependencies
  • Current implementation violates Trino's design principles
  • Blocks integration with Trino's cloud-native deployment model
  • Reference: trinodb/trino#15921
  • Previous issue: apache/paimon-trino#96

2. Windows Development Environment

When using Flink CDC with a Paimon sink writing to MinIO (S3) on Windows, the application fails with:

Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
    at org.apache.paimon.catalog.CatalogContext.<init>(CatalogContext.java:53)
    at org.apache.paimon.catalog.CatalogContext.create(CatalogContext.java:73)
    at org.apache.paimon.flink.FlinkCatalogFactory.createPaimonCatalog(FlinkCatalogFactory.java:81)
    at org.apache.flink.cdc.connectors.paimon.sink.v2.bucket.BucketAssignOperator.open(BucketAssignOperator.java:103)
    ...
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration

3. Lightweight Deployments

  • Local FileIO usage shouldn't require Hadoop
  • Cloud-native deployments using native S3/OSS clients
  • Embedded environments

Solution

Refactor CatalogContext into a class hierarchy that separates concerns:

New Architecture

  1. CatalogContext - Base class without Hadoop dependency

    • Contains common catalog configuration (options, classloader)
    • Can be used by FileIO and other components in Hadoop-free environments
    • Enables Trino connector compatibility
  2. HadoopAware - Interface for Hadoop functionality

    • Isolates Hadoop-specific methods
    • Only implemented when Hadoop support is needed
    • Components can check for this interface at runtime
  3. CatalogHadoopContext - Hadoop implementation

    • Extends CatalogContext and implements HadoopAware
    • Provides Hadoop Configuration access when available

Architecture Changes

Before (Hard Dependency):

public class CatalogContext {
    private final Configuration hadoopConf; // Always required - BLOCKS TRINO
    
    public CatalogContext(..., Configuration hadoopConf) {
        this.hadoopConf = hadoopConf; // Mandatory Hadoop dependency
    }
}

After (Optional Dependency):

public class CatalogContext {
    // No Hadoop dependency - works in Trino ✅
    protected final Options options;
    protected final ClassLoader classLoader;
}

public interface HadoopAware {
    Configuration hadoopConf();
}

public class CatalogHadoopContext extends CatalogContext implements HadoopAware {
    private final Configuration hadoopConf;
    // Hadoop support when needed (Flink, Spark)
}

Factory Pattern

Factory methods automatically detect whether Hadoop Configuration is needed and return the appropriate type:

// Factory methods in CatalogContext
public static CatalogContext create(Options options) {
    // Returns CatalogContext or CatalogHadoopContext based on needs
}
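A minimal sketch of how this detection could look. This is illustrative only: hadoopIsOnClasspath and CatalogHadoopContext.of are assumed helper names, not the exact code in this PR.

// Sketch: detect Hadoop availability without the base class ever
// referencing org.apache.hadoop.* types directly.
public static CatalogContext create(Options options) {
    if (hadoopIsOnClasspath()) {
        // Only CatalogHadoopContext touches Hadoop classes.
        return CatalogHadoopContext.of(options);
    }
    return new CatalogContext(options);
}

private static boolean hadoopIsOnClasspath() {
    try {
        // initialize=false: probe for the class without running static init.
        Class.forName(
                "org.apache.hadoop.conf.Configuration",
                false,
                CatalogContext.class.getClassLoader());
        return true;
    } catch (ClassNotFoundException e) {
        return false;
    }
}

Keeping every org.apache.hadoop reference inside CatalogHadoopContext is what prevents the JVM from trying to link Hadoop classes when only the base CatalogContext is used.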

Updated Components

  • FileIO implementations: Use CatalogContext instead of requiring Hadoop

    • LocalFileIO: Works without Hadoop ✅
    • HadoopFileIO: Checks for the HadoopAware interface dynamically (see the sketch after this list)
    • ResolvingFileIO: Supports both modes
  • SecurityContext: Gracefully handles absence of Hadoop

  • Catalog factories: Updated to support both contexts

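As referenced above, a sketch of the dynamic check in HadoopFileIO, assuming FileIO's configure(CatalogContext) entry point; the fallback branch is one plausible behavior, not necessarily the PR's exact code.

import org.apache.hadoop.conf.Configuration;
import org.apache.paimon.catalog.CatalogContext;

public class HadoopFileIO implements FileIO {

    private Configuration hadoopConf;

    @Override
    public void configure(CatalogContext context) {
        if (context instanceof HadoopAware) {
            // Hadoop-aware context (Flink, Spark, Hive): reuse its Configuration.
            this.hadoopConf = ((HadoopAware) context).hadoopConf();
        } else {
            // Plain context: fall back to a default Configuration. This branch
            // only runs when Hadoop is actually on the classpath, since
            // HadoopFileIO itself is a Hadoop-dependent class.
            this.hadoopConf = new Configuration();
        }
    }
}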
Changes Summary

  • Core Changes (paimon-common):

    • Modified: CatalogContext.java - Refactored to base class without Hadoop
    • New: CatalogHadoopContext.java (169 lines) - Hadoop-aware extension
    • New: HadoopAware.java (45 lines) - Interface for Hadoop functionality
  • FileIO Updates:

    • FileIOUtils.java: Handle both CatalogContext and HadoopAware
    • ResolvingFileIO.java: Support Hadoop-free initialization
    • HadoopFileIO.java: Check for HadoopAware dynamically
    • LocalFileIO.java: Use CatalogContext only
  • Integration Updates (paimon-core, paimon-hive, paimon-spark):

    • Updated catalog factories and related code

Total: 26 files changed, 571 insertions(+), 80 deletions(-)

Benefits

  1. Enables Trino-Paimon connector - Complies with Trino's no-Hadoop policy
  2. Fixes Windows development issues with Flink CDC + Paimon + MinIO
  3. Reduces dependency footprint for cloud-native deployments
  4. Better architecture - follows the separation-of-concerns principle
  5. Backward compatible - existing code continues to work

Testing

  • ✅ All existing unit tests pass
  • ✅ FileIOTest and ResolvingFileIOTest verified
  • ✅ No behavior changes for existing functionality
  • ✅ LocalFileIO works without Hadoop on classpath
  • ✅ Tested in Hadoop environments (Flink, Spark) - works as before

Affected Modules

  • paimon-common (core classes)
  • paimon-core (catalog implementations)
  • paimon-hive (Hive integration)
  • paimon-spark (Spark integration)

Compatibility

This is a backward-compatible change:

  • ✅ Existing code using CatalogContext continues to work (see the example after this list)
  • ✅ Factory methods automatically return appropriate type
  • ✅ No API breaking changes
  • ✅ Flink and Spark integrations unaffected
  • ✅ Enables future Trino integration
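For instance, both of these call sites work after the refactor. The overloads shown are based on Paimon's existing public API, though the exact signatures should be checked against this PR:

import org.apache.hadoop.conf.Configuration;
import org.apache.paimon.catalog.CatalogContext;
import org.apache.paimon.options.CatalogOptions;
import org.apache.paimon.options.Options;

Options options = new Options();
options.set(CatalogOptions.WAREHOUSE, "s3://my-bucket/warehouse");

// Hadoop-free caller (e.g. Trino): no Hadoop class is ever loaded.
CatalogContext plain = CatalogContext.create(options);

// Existing Hadoop caller (e.g. Flink, Spark): same entry point, but the
// factory now returns the Hadoop-aware subclass under the hood.
CatalogContext withHadoop = CatalogContext.create(options, new Configuration());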


tchivs force-pushed the refactor-catalog-context-interface branch 12 times, most recently from fe514b3 to 83ec87b on November 22, 2025 03:55
tchivs changed the title from "[common] Refactor CatalogContext to use interface segregation pattern" to "[common] Refactor CatalogContext to separate Hadoop dependencies" on Nov 22, 2025
tchivs force-pushed the refactor-catalog-context-interface branch 3 times, most recently from eda02c8 to 92a40fd on November 22, 2025 16:04
- CatalogContext provides basic catalog context for Hadoop-free environments
- CatalogHadoopContext extends CatalogContext for Hadoop integration
- Add HadoopAware interface to isolate Hadoop dependency
- Update FileIO and related classes to use CatalogContext
- Factory methods automatically detect and return appropriate type
- Simplify CatalogContext factory methods to reduce code duplication
tchivs force-pushed the refactor-catalog-context-interface branch from 92a40fd to fa13eb6 on November 23, 2025 22:14

tchivs commented Nov 25, 2025

@JingsongLi PTAL

