
tchivs commented Nov 21, 2025

Purpose

This PR refactors CatalogContext to separate Hadoop dependencies, enabling Paimon to work in Hadoop-free environments.

Closes #6654

Background and Motivation

Trino Plugin Development Requirement

This change is essential for developing the Trino-Paimon connector: Trino explicitly disallows connectors with mandatory Hadoop dependencies.

The previous paimon-trino implementation carried such a dependency, which caused deployment problems in Trino environments where Hadoop is not available or desired.

Quote from Trino Policy (trinodb/trino#15921):

Trino connectors should not have a hard dependency on Hadoop. Connectors must work without Hadoop on the classpath.

Problem Statement

Currently, CatalogContext has a hard dependency on Hadoop's Configuration class, causing a NoClassDefFoundError in environments where Hadoop is not on the classpath (see the stack trace in use case 2 below).

Use Cases Affected

1. 🎯 Trino-Paimon Connector (Primary Use Case)

  • Critical Blocker: Trino's connector architecture prohibits mandatory Hadoop dependencies
  • Current implementation violates Trino's design principles
  • Blocks integration with Trino's cloud-native deployment model
  • Reference: trinodb/trino#15921
  • Previous issue: apache/paimon-trino#96

2. Windows Development Environment

When using Flink CDC with a Paimon sink writing to MinIO (S3) on Windows, the application fails with:

Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
    at org.apache.paimon.catalog.CatalogContext.<init>(CatalogContext.java:53)
    at org.apache.paimon.catalog.CatalogContext.create(CatalogContext.java:73)
    at org.apache.paimon.flink.FlinkCatalogFactory.createPaimonCatalog(FlinkCatalogFactory.java:81)
    at org.apache.flink.cdc.connectors.paimon.sink.v2.bucket.BucketAssignOperator.open(BucketAssignOperator.java:103)
    ...
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration

3. Lightweight Deployments

  • Local FileIO usage shouldn't require Hadoop
  • Cloud-native deployments using native S3/OSS clients
  • Embedded environments

Solution

Refactor CatalogContext into a class hierarchy that separates concerns:

New Architecture

  1. CatalogContext - Base class without Hadoop dependency

    • Contains common catalog configuration (options, classloader)
    • Can be used by FileIO and other components in Hadoop-free environments
    • Enables Trino connector compatibility
  2. HadoopAware - Interface for Hadoop functionality

    • Isolates Hadoop-specific methods
    • Only implemented when Hadoop support is needed
    • Components can check for this interface at runtime
  3. CatalogHadoopContext - Hadoop implementation

    • Extends CatalogContext and implements HadoopAware
    • Provides Hadoop Configuration access when available

Architecture Changes

Before (Hard Dependency):

public class CatalogContext {
    private final Configuration hadoopConf; // Always required - BLOCKS TRINO
    
    public CatalogContext(..., Configuration hadoopConf) {
        this.hadoopConf = hadoopConf; // Mandatory Hadoop dependency
    }
}

After (Optional Dependency):

public class CatalogContext {
    // No Hadoop dependency - works in Trino ✅
    protected final Options options;
    protected final ClassLoader classLoader;
}

public interface HadoopAware {
    Configuration hadoopConf();
}

public class CatalogHadoopContext extends CatalogContext implements HadoopAware {
    private final Configuration hadoopConf;
    // Hadoop support when needed (Flink, Spark)
}

Factory Pattern

Factory methods automatically detect whether Hadoop Configuration is needed and return the appropriate type:

// Factory methods in CatalogContext
public static CatalogContext create(Options options) {
    // Returns CatalogContext or CatalogHadoopContext based on needs
}
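A minimal sketch of how this detection could look. This is illustrative only: hadoopIsOnClasspath and CatalogHadoopContext.of are assumed helper names, not the exact code in this PR.

// Sketch: detect Hadoop availability without the base class ever
// referencing org.apache.hadoop.* types directly.
public static CatalogContext create(Options options) {
    if (hadoopIsOnClasspath()) {
        // Only CatalogHadoopContext touches Hadoop classes.
        return CatalogHadoopContext.of(options);
    }
    return new CatalogContext(options);
}

private static boolean hadoopIsOnClasspath() {
    try {
        // initialize=false: probe for the class without running static init.
        Class.forName(
                "org.apache.hadoop.conf.Configuration",
                false,
                CatalogContext.class.getClassLoader());
        return true;
    } catch (ClassNotFoundException e) {
        return false;
    }
}

Keeping every org.apache.hadoop reference inside CatalogHadoopContext is what prevents the JVM from trying to link Hadoop classes when only the base CatalogContext is used.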

Updated Components

  • FileIO implementations: Use CatalogContext instead of requiring Hadoop

    • LocalFileIO: Works without Hadoop ✅
    • HadoopFileIO: Checks for the HadoopAware interface dynamically (see the sketch after this list)
    • ResolvingFileIO: Supports both modes
  • SecurityContext: Gracefully handles absence of Hadoop

  • Catalog factories: Updated to support both contexts

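As referenced above, a sketch of the dynamic check in HadoopFileIO, assuming FileIO's configure(CatalogContext) entry point; the fallback branch is one plausible behavior, not necessarily the PR's exact code.

import org.apache.hadoop.conf.Configuration;
import org.apache.paimon.catalog.CatalogContext;

public class HadoopFileIO implements FileIO {

    private Configuration hadoopConf;

    @Override
    public void configure(CatalogContext context) {
        if (context instanceof HadoopAware) {
            // Hadoop-aware context (Flink, Spark, Hive): reuse its Configuration.
            this.hadoopConf = ((HadoopAware) context).hadoopConf();
        } else {
            // Plain context: fall back to a default Configuration. This branch
            // only runs when Hadoop is actually on the classpath, since
            // HadoopFileIO itself is a Hadoop-dependent class.
            this.hadoopConf = new Configuration();
        }
    }
}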
Changes Summary

  • Core Changes (paimon-common):

    • Modified: CatalogContext.java - Refactored to base class without Hadoop
    • New: CatalogHadoopContext.java (169 lines) - Hadoop-aware extension
    • New: HadoopAware.java (45 lines) - Interface for Hadoop functionality
  • FileIO Updates:

    • FileIOUtils.java: Handle both CatalogContext and HadoopAware
    • ResolvingFileIO.java: Support Hadoop-free initialization
    • HadoopFileIO.java: Check for HadoopAware dynamically
    • LocalFileIO.java: Use CatalogContext only
  • Integration Updates (paimon-core, paimon-hive, paimon-spark):

    • Updated catalog factories and related code

Total: 26 files changed, 571 insertions(+), 80 deletions(-)

Benefits

  1. Enables Trino-Paimon connector - Complies with Trino's no-Hadoop policy
  2. Fixes Windows development issues with Flink CDC + Paimon + MinIO
  3. Reduces dependency footprint for cloud-native deployments
  4. Better architecture - follows the separation-of-concerns principle
  5. Backward compatible - existing code continues to work

Testing

  • ✅ All existing unit tests pass
  • ✅ FileIOTest and ResolvingFileIOTest verified
  • ✅ No behavior changes for existing functionality
  • ✅ LocalFileIO works without Hadoop on classpath
  • ✅ Tested in Hadoop environments (Flink, Spark) - works as before

Affected Modules

  • paimon-common (core classes)
  • paimon-core (catalog implementations)
  • paimon-hive (Hive integration)
  • paimon-spark (Spark integration)

Compatibility

This is a backward-compatible change:

  • ✅ Existing code using CatalogContext continues to work (see the example after this list)
  • ✅ Factory methods automatically return appropriate type
  • ✅ No API breaking changes
  • ✅ Flink and Spark integrations unaffected
  • ✅ Enables future Trino integration
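For instance, both of these call sites work after the refactor. The overloads shown are based on Paimon's existing public API, though the exact signatures should be checked against this PR:

import org.apache.hadoop.conf.Configuration;
import org.apache.paimon.catalog.CatalogContext;
import org.apache.paimon.options.CatalogOptions;
import org.apache.paimon.options.Options;

Options options = new Options();
options.set(CatalogOptions.WAREHOUSE, "s3://my-bucket/warehouse");

// Hadoop-free caller (e.g. Trino): no Hadoop class is ever loaded.
CatalogContext plain = CatalogContext.create(options);

// Existing Hadoop caller (e.g. Flink, Spark): same entry point, but the
// factory now returns the Hadoop-aware subclass under the hood.
CatalogContext withHadoop = CatalogContext.create(options, new Configuration());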


tchivs force-pushed the refactor-catalog-context-interface branch 12 times, most recently from fe514b3 to 83ec87b on November 22, 2025 03:55
tchivs changed the title from "[common] Refactor CatalogContext to use interface segregation pattern" to "[common] Refactor CatalogContext to separate Hadoop dependencies" on Nov 22, 2025
tchivs force-pushed the refactor-catalog-context-interface branch 3 times, most recently from eda02c8 to 92a40fd on November 22, 2025 16:04
- CatalogContext provides basic catalog context for Hadoop-free environments
- CatalogHadoopContext extends CatalogContext for Hadoop integration
- Add HadoopAware interface to isolate Hadoop dependency
- Update FileIO and related classes to use CatalogContext
- Factory methods automatically detect and return appropriate type
- Simplify CatalogContext factory methods to reduce code duplication
tchivs force-pushed the refactor-catalog-context-interface branch from 92a40fd to fa13eb6 on November 23, 2025 22:14

tchivs commented Nov 25, 2025

@JingsongLi PTAL

