@aaronsteers
Contributor

@aaronsteers aaronsteers commented Oct 27, 2025

Summary

This PR refactors the metadata validation system to use URL-based schema loading instead of vendoring YAML schema files. The changes enable the CDK to fetch the canonical metadata schema directly from the monorepo, eliminating the need to maintain duplicate schema files.

Key changes:

  • Removed 30+ vendored YAML schema files from airbyte_cdk/models/connector_metadata/resources/
  • Removed schema generation script (bin/generate_connector_metadata_files.py)
  • Added get_metadata_schema() function that loads JSON schema from URL or file path
  • Enhanced validate_metadata_file() to use JSON schema validation (via jsonschema library)
  • Added new CLI command: airbyte-cdk metadata validate --file <path> [--schema <url|path>] [--format json|text]
  • Reorganized module structure: connector_metadata.py → connector_metadata/metadata_file.py (maintains backward compatibility via __init__.py)
  • Added comprehensive unit tests (12 tests covering validation scenarios)
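The URL-or-file-path loading described above could look roughly like the sketch below. It is not the PR's actual get_metadata_schema() implementation, which is not shown in this conversation; the function name load_schema, the constant name, and the error handling are assumptions made for illustration. Only the standard library is used.

```python
import json
from pathlib import Path
from typing import Any
from urllib.parse import urlparse
from urllib.request import urlopen

# Assumption: 10 seconds, matching the timeout mentioned in the review checklist.
SCHEMA_TIMEOUT_SECONDS = 10


def load_schema(source: str) -> dict[str, Any]:
    """Load a JSON schema from an http(s) URL or from a local file path.

    Hypothetical helper for illustration; not the CDK's real function.
    """
    if urlparse(source).scheme in ("http", "https"):
        # Remote source: fetch the raw JSON text with a bounded timeout.
        with urlopen(source, timeout=SCHEMA_TIMEOUT_SECONDS) as response:
            text = response.read().decode("utf-8")
    else:
        # Anything else is treated as a local file path.
        text = Path(source).read_text(encoding="utf-8")

    schema = json.loads(text)
    if not isinstance(schema, dict):
        raise ValueError(f"Schema at {source!r} is not a JSON object")
    return schema
```

Treating every non-http(s) value as a path keeps the --schema flag simple: the same string argument works for both the pinned monorepo URL and a local schema.json.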

Schema URL: Currently pinned to commit 61048d88... in the monorepo with a TODO to update to master branch after the associated monorepo PR merges.

Review & Testing Checklist for Human

  • Test CLI with real metadata.yaml files - Unit tests use a minimal mock schema. Verify the CLI works with actual connector metadata files from the monorepo (e.g., airbyte-integrations/connectors/source-*/metadata.yaml)
  • Verify network behavior - Test what happens when the schema URL is unreachable (disconnect network, invalid URL). Confirm timeout handling (10s) and error messages are reasonable
  • Check the pinned schema URL - Manually verify the URL https://raw.githubusercontent.com/airbytehq/airbyte/61048d88.../ConnectorMetadataDefinitionV0.json returns a valid JSON schema
  • Test backward compatibility - Verify existing code that imports from airbyte_cdk.models.connector_metadata still works after the module reorganization
  • Plan TODO follow-up - Ensure there's a plan/issue to update the schema URL to use the master branch once the monorepo PR merges
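For the network-behavior item, one way to exercise the "schema URL is unreachable" path without disconnecting anything is to patch the fetch call. The sketch below is an assumption about how such a check might be written: fetch_schema_text and simulate_unreachable are hypothetical names, the URL is a placeholder, and the error wording is illustrative rather than what the CLI actually prints.

```python
import urllib.request
from unittest import mock
from urllib.error import URLError


def fetch_schema_text(url: str, timeout: float = 10.0) -> str:
    """Hypothetical helper: fetch schema text and report network failures readably."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except (URLError, TimeoutError) as exc:
        # Wrap low-level errors so the caller sees which URL failed and why.
        raise RuntimeError(f"Could not fetch schema from {url}: {exc}") from exc


def simulate_unreachable(url: str) -> str:
    """Return the error message produced when the network call fails."""
    # Patch urlopen so no real network traffic happens during the check.
    with mock.patch(
        "urllib.request.urlopen",
        side_effect=URLError("network unreachable"),
    ):
        try:
            fetch_schema_text(url)
        except RuntimeError as exc:
            return str(exc)
    return "no error raised"


if __name__ == "__main__":
    # Placeholder URL; never contacted because urlopen is patched.
    print(simulate_unreachable("https://example.invalid/schema.json"))
```

A manual test with the network actually disconnected is still worthwhile, since a mock cannot confirm how long the real timeout takes to trigger.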

Test Plan

# Test with a real connector metadata file
airbyte-cdk metadata validate --file /path/to/airbyte/airbyte-integrations/connectors/source-postgres/metadata.yaml

# Test JSON output
airbyte-cdk metadata validate --file metadata.yaml --format json

# Test with custom schema
airbyte-cdk metadata validate --file metadata.yaml --schema /path/to/schema.json

Notes

  • The schema URL is pinned to a specific commit for stability. After the monorepo PR merges, update the TODO comment and switch to the master branch URL
  • Tests pass locally (12/12) and use mocks to avoid network dependencies during testing
  • The jsonschema library is already a dependency (version >=4.17.3,<5.0 in pyproject.toml)
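As a rough illustration of JSON schema validation with the jsonschema library, the sketch below collects every validation error instead of stopping at the first one. The schema here is a small hypothetical stand-in, not the real ConnectorMetadataDefinitionV0.json, and collect_errors is an illustrative name rather than a function from this PR.

```python
from typing import Any

from jsonschema.validators import validator_for

# Hypothetical minimal schema for illustration only; the real schema is far larger.
MINIMAL_SCHEMA: dict[str, Any] = {
    "type": "object",
    "required": ["data"],
    "properties": {
        "data": {
            "type": "object",
            "required": ["name", "dockerRepository"],
            "properties": {
                "name": {"type": "string"},
                "dockerRepository": {"type": "string"},
            },
        }
    },
}


def collect_errors(metadata: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Return one human-readable line per schema violation (empty list if valid)."""
    # Pick the validator class matching the schema's "$schema" keyword, if present.
    validator_cls = validator_for(schema)
    validator = validator_cls(schema)
    errors = sorted(
        validator.iter_errors(metadata),
        key=lambda err: [str(part) for part in err.absolute_path],
    )
    lines = []
    for err in errors:
        location = "/".join(str(part) for part in err.absolute_path) or "<root>"
        lines.append(f"{location}: {err.message}")
    return lines


if __name__ == "__main__":
    incomplete = {"data": {"name": "Example Source"}}
    for line in collect_errors(incomplete, MINIMAL_SCHEMA):
        print(line)
```

Gathering all errors up front suits both output modes of a validate command: a text formatter can print the lines directly, and a JSON formatter can serialize the same list.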

Link to Devin run: https://app.devin.ai/sessions/0078df9742174c169753b2c4c4d247da
Requested by: AJ Steers (@aaronsteers)

- Vendor metadata schema YAML files from monorepo into CDK resources directory
- Include fixes for SecretStore required fields and ConnectorBreakingChanges const usage
- Add validate_metadata_file() function with Pydantic validation
- Add 'airbyte-cdk metadata validate' CLI command with JSON and text output
- Update generation script to use vendored YAML files instead of cloning from GitHub
- Regenerate metadata_schema.json with fixed YAML files
- Reorganize connector_metadata module into package structure

This makes the CDK self-contained for metadata validation and removes dependency
on the monorepo for schema files.

Co-Authored-By: AJ Steers <aj@airbyte.io>
@devin-ai-integration
Contributor

Original prompt from AJ Steers
Received message in Slack channel #ask-devin-ai:

@Devin find the pending Python CDK PR 807 which adds a new JSON schema file for manifest.yaml files. 

Use the JSON schema file from that PR to try to validate our own connectors current files. Start with 10 random certified connectors and report back. We want to see if there are valid reasons they won't/can't pass validation in their current state - either due to problems in the new JSON file or in problems with the files, or problems with the spec that the JSON files are generated from
Thread URL: https://airbytehq-team.slack.com/archives/C08BHPUMEPJ/p1761600622971549?thread_ts=1761600622.971549

@devin-ai-integration
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions github-actions bot added the "enhancement" (New feature or request) label Oct 27, 2025
@github-actions

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@vendor-metadata-yaml-and-add-cli#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch vendor-metadata-yaml-and-add-cli

Helpful Resources

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

@github-actions

github-actions bot commented Oct 27, 2025

PyTest Results (Fast)

3 829 tests  +12   3 817 ✅ +12   6m 13s ⏱️ -18s
    1 suites ± 0      12 💤 ± 0 
    1 files   ± 0       0 ❌ ± 0 

Results for commit f94e803. ± Comparison against base commit 3c2a4f8.

♻️ This comment has been updated with latest results.

@github-actions

github-actions bot commented Oct 27, 2025

PyTest Results (Full)

3 832 tests  +12   3 820 ✅ +12   11m 8s ⏱️ +21s
    1 suites ± 0      12 💤 ± 0 
    1 files   ± 0       0 ❌ ± 0 

Results for commit f94e803. ± Comparison against base commit 3c2a4f8.

♻️ This comment has been updated with latest results.

devin-ai-integration bot and others added 2 commits October 27, 2025 23:56
Mirror the change from the monorepo PR to keep schemas in sync. The previous required fields (name, secretStore) were invalid and pointed to non-existent properties, meaning the generated Pydantic models effectively treated all fields as optional. Making both type and alias explicitly optional maintains backward compatibility with existing behavior.

All 304 connectors with secretStore entries already include both type and alias fields, so this change is purely for backward compatibility.

Co-Authored-By: AJ Steers <aj@airbyte.io>

Replace vendored YAML schema files with URL-based schema loading from the
monorepo. This removes the dependency on vendoring schemas and allows the
CDK to fetch the latest schema from a canonical source.

Changes:
- Remove vendored YAML files from airbyte_cdk/models/connector_metadata/resources/
- Remove bin/generate_connector_metadata_files.py schema generation script
- Add get_metadata_schema() utility function to fetch schema from URL or file path
- Refactor validate_metadata_file() to use JSON schema validation via jsonschema library
- Add --schema flag to CLI for custom schema path/URL (defaults to monorepo URL)
- Add comprehensive unit tests with local schema (no network dependency)
- Add TODO comment to update URL to master branch after associated PR merges

The default schema URL points to a specific commit in the monorepo:
https://raw.githubusercontent.com/airbytehq/airbyte/61048d88732df93c50bd3da490de8d3cc1aa66b0/airbyte-ci/connectors/metadata_service/lib/metadata_service/models/generated/ConnectorMetadataDefinitionV0.json

This will be updated to use the master branch URL after the associated PR merges.

Co-Authored-By: AJ Steers <aj@airbyte.io>
@devin-ai-integration devin-ai-integration bot changed the title from "feat(metadata): vendor YAML schemas and add CLI validation command" to "feat(metadata): add CLI validation with URL-based schema loading" Oct 28, 2025
Add explicit type casts to json.loads() return values to satisfy MyPy's
no-any-return check. The json.loads() function returns Any, but we know
the schema should be a dict[str, Any], so we use cast() to inform the
type checker.

Co-Authored-By: AJ Steers <aj@airbyte.io>
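
The cast() pattern described in this commit message can be sketched as follows. The function name parse_schema is illustrative and not taken from the PR; the point is only that json.loads() is typed as returning Any, so a function annotated with a concrete return type needs an explicit cast to satisfy MyPy's no-any-return check.

```python
import json
from typing import Any, cast


def parse_schema(text: str) -> dict[str, Any]:
    """Parse JSON text that is expected to hold a schema object."""
    # cast() only informs the type checker; it performs no runtime check,
    # so a JSON array or scalar would pass through unchanged.
    return cast(dict[str, Any], json.loads(text))


if __name__ == "__main__":
    print(parse_schema('{"type": "object"}'))
```

Because cast() does nothing at runtime, pairing it with an isinstance check is a reasonable option when the input is not trusted.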