
feat: sdg component #88

Open
beatsmonster wants to merge 3 commits into kubeflow:main from beatsmonster:main

Conversation

@beatsmonster

Description of your changes:

This is a mirrored design document for the SDG KFP component, shared to collect feedback.

Checklist:

Pre-Submission Checklist

Additional Checklist Items for New or Updated Components/Pipelines

  • metadata.yaml includes fresh lastVerified timestamp
  • All required files are present and complete
  • OWNERS file lists appropriate maintainers
  • README provides clear documentation with usage examples
  • Component follows snake_case naming convention
  • No security vulnerabilities in dependencies
  • Containerfile included if using a custom base image

@google-oss-prow google-oss-prow bot requested a review from HumairAK January 29, 2026 14:26
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mprahl for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment


Pull request overview

Adds an SDG Hub–backed Kubeflow Pipelines (KFP) component plus supporting documentation, examples, and tests, along with test fixtures for validating transform-only and LLM flows.

Changes:

  • Introduce sdg_hub KFP component implementation with flow selection, optional model configuration, metrics, and optional PVC export.
  • Add component documentation (README + architecture design) and example pipelines/local runner script.
  • Add unit + local/integration-style tests and SDG Hub flow/prompt test fixtures; update packaging config to include new component package.

Reviewed changes

Copilot reviewed 15 out of 17 changed files in this pull request and generated no comments.

Show a summary per file
File Description
components/sdg/ARCHITECTURE.md Architecture/design documentation for the SDG component.
components/sdg/sdg_hub/component.py New KFP component implementation wrapping SDG Hub SDK.
components/sdg/sdg_hub/README.md Component usage documentation and local dev/testing instructions.
components/sdg/sdg_hub/example_pipelines.py Example KFP pipelines demonstrating component usage patterns.
components/sdg/sdg_hub/run_local.py Local execution script for running an LLM test flow and printing outputs.
components/sdg/sdg_hub/metadata.yaml Component metadata (deps/tags/verification timestamp).
components/sdg/sdg_hub/OWNERS Ownership/maintainer declarations for the component.
components/sdg/sdg_hub/tests/test_component_unit.py Unit tests with SDG Hub SDK mocked to validate component logic.
components/sdg/sdg_hub/tests/test_component_local.py Local runner/integration tests using real flows (LLM test gated by env var).
test_data/sdg_hub/* Sample input + minimal transform/LLM flows and prompt fixture for tests.
pyproject.toml Adds deps for tests and registers new component packages for distribution.
Comments suppressed due to low confidence (9)

components/sdg/sdg_hub/component.py:13

  • The component imports and uses pandas, but packages_to_install only includes sdg-hub. Unless sdg-hub guarantees pandas as a dependency, this can fail at runtime when KFP installs the component dependencies. Add pandas to packages_to_install (or remove the direct dependency on pandas).
@dsl.component(
    packages_to_install=["sdg-hub"],
)

components/sdg/sdg_hub/run_local.py:29

  • run_local.py executes its logic at import time (top-level with tempfile.TemporaryDirectory()...). This makes it unsafe to import and can cause unexpected side effects in tooling. Wrap execution in an if __name__ == "__main__": guard (and optionally move logic into a main() function).
with tempfile.TemporaryDirectory() as tmp_dir:
    output_artifact = Artifact(os.path.join(tmp_dir, "output.jsonl"))
    output_metrics = Artifact(os.path.join(tmp_dir, "metrics.json"))

    print(f"Input:   {INPUT_PATH}")
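A minimal sketch of the suggested restructuring (the `main()` body here is illustrative placeholder logic, not the actual run_local.py contents):

```python
import os
import tempfile


def main() -> str:
    """Placeholder for run_local.py's top-level logic, moved out of module scope."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Stand-in for the Artifact outputs created in run_local.py.
        output_path = os.path.join(tmp_dir, "output.jsonl")
        print(f"Output: {output_path}")
        return output_path


if __name__ == "__main__":
    main()
```

With the guard in place, importing the module (e.g. from tests or tooling) no longer triggers execution.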

components/sdg/sdg_hub/README.md:44

  • The README import paths use from components.sdg.sdg_hub import sdg, but this repo is packaged as kfp_components (see pyproject.toml package-dir mapping). These examples likely won’t work for users who pip install kfp-components; consider switching examples to from kfp_components.components.sdg.sdg_hub import sdg (or whatever the intended public import path is).
```python
from components.sdg.sdg_hub import sdg
from kfp import dsl
from kfp_kubernetes import mount_pvc, use_secret_as_env
```

components/sdg/sdg_hub/example_pipelines.py:75

  • The secret key name in the example (OPENAI_APIKEY) conflicts with the generic secret key naming described elsewhere in this PR (e.g., api_key/api_base in ARCHITECTURE.md). To avoid confusing users, align the examples/documentation on one secret schema (either update this mapping to api_key or clarify that any key name works as long as it’s mapped to LLM_API_KEY).
    use_secret_as_env(
        task=sdg_task,
        secret_name="llm-credentials",
        secret_key_to_env={"OPENAI_APIKEY": "LLM_API_KEY"},
    )

pyproject.toml:38

  • pyproject.toml adds new test dependencies (sdg-hub, pandas) but the committed uv.lock does not appear to include them. If this repo relies on uv.lock for reproducible environments/CI, it should be regenerated/updated to match the new dependency set.
    "semver",
    "pip",
    "sdg-hub",
    "pandas",
]

components/sdg/ARCHITECTURE.md:5

  • PR description says this is a mirror design doc for feedback, but the PR also adds a full component implementation, tests, and test data. Please update the PR description/title or split the PR so reviewers understand whether they’re reviewing design docs only vs. production code changes.
# SDG Hub KFP Component Architecture Design

## Document Information

| Field | Value |

components/sdg/sdg_hub/component.py:67

  • logging.basicConfig(level=getattr(logging, log_level.upper())) will raise AttributeError if an invalid log_level is provided (e.g., typo), causing the component to crash before doing any work. Consider validating log_level against allowed values and/or using a safe fallback (e.g., default to INFO) when the attribute is missing.
    logging.basicConfig(
        level=getattr(logging, log_level.upper()),
        format="%(asctime)s - %(levelname)s - %(message)s",
    )
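One possible safe-fallback shape (`resolve_log_level` is a hypothetical helper name, not code from this PR):

```python
import logging


def resolve_log_level(log_level: str) -> int:
    """Map a user-supplied level name to a logging constant, defaulting to INFO."""
    # getattr with a default avoids the AttributeError crash; unknown names
    # (or attributes that are not level ints) fall back to INFO.
    level = getattr(logging, str(log_level).upper(), None)
    return level if isinstance(level, int) else logging.INFO
```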

components/sdg/sdg_hub/README.md:20

  • The Parameters table uses double leading/trailing pipes (e.g., || Parameter | ... |), which does not render as a standard Markdown table in many renderers. Use a normal Markdown table row format with single | delimiters.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `input_artifact` | `dsl.Input[dsl.Dataset]` | `None` | KFP Dataset artifact from upstream component |
| `input_pvc_path` | `str` | `""` | Path to JSONL input file on a mounted PVC |
| `flow_id` | `str` | `""` | Built-in flow ID from the SDG Hub registry |

components/sdg/ARCHITECTURE.md:1496

  • This section states packages_to_install was rejected in favor of a pre-baked image, but the actual component implementation in this PR uses @dsl.component(packages_to_install=[...]). Either update the design doc to match the chosen implementation, or switch the component to a base_image approach as described here to avoid runtime installs.
**Why pre-baked (not runtime install):**

| Alternative | Why Not Selected |
|-------------|------------------|
| **`packages_to_install`** | 30-60s pip install overhead per run; network dependency; version resolution risk |


@beatsmonster beatsmonster force-pushed the main branch 3 times, most recently from 0cde74f to b9120fd Compare February 25, 2026 15:24
@beatsmonster beatsmonster changed the title from "feat: sdg component design" to "feat: sdg component" Feb 25, 2026
beatsmonster and others added 2 commits February 26, 2026 11:30
Signed-off-by: Yi Zheng <237498169+beatsmonster@users.noreply.github.com>
Add a Kubeflow Pipelines component that wraps the SDG Hub SDK for synthetic
data generation within KFP pipelines.

Component features:
- Input from KFP artifact (Input[Dataset]) or PVC path
- Built-in flow selection via flow_id or custom YAML
- LLM model configuration with K8s secret injection
- Checkpointing for resumable long-running jobs
- Optional PVC export with timestamped directory structure
- KFP metrics output (input_rows, output_rows, execution_time)

Testing:
- 26 unit tests with mocked SDG Hub SDK
- Integration test with real transform flow (no LLM)
- E2E test with OpenAI gpt-4o-mini (skipped when LLM_API_KEY unset)

Also includes example pipelines, local runner script, and test data
with transform-only and LLM flow definitions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Yi Zheng <237498169+beatsmonster@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 17 changed files in this pull request and generated 11 comments.

Comments suppressed due to low confidence (2)

components/sdg/sdg_hub/README.md:45

  • The README examples import sdg from components.sdg.sdg_hub, but this repo packages components under kfp_components.components (via tool.setuptools.package-dir). Use the installable import path in examples (e.g., from kfp_components.components.sdg.sdg_hub import sdg) so copy/paste works for users.
from components.sdg.sdg_hub import sdg
from kfp import dsl
from kfp_kubernetes import mount_pvc, use_secret_as_env

components/sdg/sdg_hub/component.py:66

  • getattr(logging, log_level.upper()) will raise AttributeError for invalid log_level values, which turns a user input error into an unexpected crash. Validate log_level (or provide a default like logging.INFO) and raise a clear ValueError listing allowed levels.
    logging.basicConfig(
        level=getattr(logging, log_level.upper()),
        format="%(asctime)s - %(levelname)s - %(message)s",


Comment on lines +9 to +13
dataset_requirements:
  required_columns:
  - document
  - domain
  description: Requires document and domain columns.

Copilot AI Feb 26, 2026


required_columns list items are not indented under required_columns: (they're aligned with the key), which changes the YAML structure and likely breaks dataset validation. Indent the - document / - domain entries under required_columns:.
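For reference, a correctly nested version of this fragment might look like the following (the surrounding indentation level is an assumption):

```yaml
dataset_requirements:
  required_columns:
    - document
    - domain
  description: Requires document and domain columns.
```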

Comment on lines +23 to +28
block_name: concat_fields
input_cols:
- document
- domain
output_cols:
- combined

Copilot AI Feb 26, 2026


input_cols / output_cols are intended to be lists, but the - document, - domain, and - combined entries are not indented under their respective keys. As written, this will parse incorrectly (or fail) and can prevent the flow from loading.

version: 1.0.0
author: SDG Hub Team
tags:
- testing

Copilot AI Feb 26, 2026


Same YAML indentation issue as in the transform flow: the tags list item is not indented under metadata.tags, so it will not be part of the metadata section. Please indent list items under tags:.

Suggested change: indent the list item under `tags:` (e.g. `  - testing` instead of `- testing`).

Comment on lines +11 to +12
- document
- domain

Copilot AI Feb 26, 2026


required_columns list items are not indented under required_columns: here either, which changes the YAML structure and can cause flow dataset validation to behave incorrectly. Indent the list items under the key.

Suggested change: indent the list items under `required_columns:` (e.g. `  - document` / `  - domain`).

from kfp import dsl
from kfp_kubernetes import mount_pvc, use_secret_as_env

from .component import sdg

Copilot AI Feb 26, 2026


This module is loaded directly from its file path by scripts/validate_examples, so the relative import from .component import sdg will fail (no package context). Use an absolute import that works both in-repo and when installed (e.g., from kfp_components.components.sdg.sdg_hub import sdg).

Suggested change:

Before: from .component import sdg
After:  from kfp_components.components.sdg.sdg_hub import sdg

Comment on lines +21 to +34
      output_cols: question_prompt
      prompt_config_path: prompts/generate_question.yaml
  - block_type: LLMChatBlock
    block_config:
      block_name: generate_question
      input_cols: question_prompt
      output_cols: raw_question
      max_tokens: 256
      temperature: 0.7
      async_mode: false
  - block_type: LLMResponseExtractorBlock
    block_config:
      block_name: extract_question
      input_cols: raw_question

Copilot AI Feb 26, 2026


input_cols is written as a list but the - document entry is not indented under input_cols:. Also output_cols is a scalar (question_prompt) whereas other examples in this PR use a list (e.g., output_cols: ["messages"]). Please align these with the expected SDG Hub flow schema to avoid Flow.from_yaml() validation failures.

Suggested change (indentation reconstructed):

Before:
      output_cols: question_prompt
      prompt_config_path: prompts/generate_question.yaml
  - block_type: LLMChatBlock
    block_config:
      block_name: generate_question
      input_cols: question_prompt
      output_cols: raw_question
      max_tokens: 256
      temperature: 0.7
      async_mode: false
  - block_type: LLMResponseExtractorBlock
    block_config:
      block_name: extract_question
      input_cols: raw_question

After:
      output_cols:
        - question_prompt
      prompt_config_path: prompts/generate_question.yaml
  - block_type: LLMChatBlock
    block_config:
      block_name: generate_question
      input_cols:
        - question_prompt
      output_cols:
        - raw_question
      max_tokens: 256
      temperature: 0.7
      async_mode: false
  - block_type: LLMResponseExtractorBlock
    block_config:
      block_name: extract_question
      input_cols:
        - raw_question

Comment on lines +15 to +16
from kfp_kubernetes import mount_pvc, use_secret_as_env


Copilot AI Feb 26, 2026


kfp_kubernetes is imported at module scope, but the repo's CI environment (uv sync --extra test) does not currently install kfp-kubernetes. This will break example pipeline compilation in scripts/validate_examples. Either add kfp-kubernetes to the appropriate dependency group or make the import optional (and document the requirement).

Suggested change:

Before:
from kfp_kubernetes import mount_pvc, use_secret_as_env

After:
try:
    from kfp_kubernetes import mount_pvc, use_secret_as_env
except ImportError:  # pragma: no cover
    def mount_pvc(*args, **kwargs):
        """No-op fallback for PVC mounting.

        The `kfp-kubernetes` package is not installed. PVC mounting will be
        skipped in the compiled pipeline. To enable PVC mounting support,
        install `kfp-kubernetes`.
        """
        # Preserve existing call patterns by returning the first argument,
        # which is expected to be the task.
        return args[0] if args else None

    def use_secret_as_env(*args, **kwargs):
        """No-op fallback for injecting Kubernetes secrets as environment
        variables.

        The `kfp-kubernetes` package is not installed. Secret injection will
        be skipped in the compiled pipeline. To enable this support, install
        `kfp-kubernetes`.
        """
        return args[0] if args else None



@dsl.component(
packages_to_install=["sdg-hub"],

Copilot AI Feb 26, 2026


sdg.python_func imports and uses pandas, but the component only installs sdg-hub via packages_to_install. Unless sdg-hub guarantees pandas as a dependency, this can fail at runtime. Consider adding pandas to packages_to_install (or otherwise ensuring it is present in the component execution environment).

Suggested change:

Before: packages_to_install=["sdg-hub"],
After:  packages_to_install=["sdg-hub", "pandas"],

Comment on lines +226 to +229
flow_name = flow_id if flow_id else "custom"
timestamp = time.strftime("%Y%m%d_%H%M%S")
export_dir = os.path.join(export_path, flow_name, timestamp)


Copilot AI Feb 26, 2026


flow_name is derived from the user-provided flow_id and then joined into export_dir. A flow_id containing path separators (e.g., ../) can escape export_path and write files outside the intended directory. Sanitize flow_name (e.g., allowlist characters or use os.path.basename plus replacement) before building the export path.
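A sketch of one possible sanitizer (`safe_flow_name` is a hypothetical helper; the character allowlist is an assumption, not part of this PR):

```python
import os
import re


def safe_flow_name(flow_id: str) -> str:
    """Collapse a user-supplied flow_id to a single safe path component."""
    # Drop directory components (normalizing backslashes first), then
    # replace anything outside a conservative allowlist.
    name = os.path.basename(flow_id.replace("\\", "/"))
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    # Guard against empty or dot-only results such as "" or "..".
    return name if name.strip(".") else "custom"
```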

version: 1.0.0
author: SDG Hub Team
tags:
- testing

Copilot AI Feb 26, 2026


YAML indentation is invalid here: the list item for tags is at the same indentation level as metadata, so it will not be parsed as metadata.tags (and may break SDG Hub flow loading). Indent list items under tags: (and keep them within metadata).

Suggested change: indent the list item under `tags:` (e.g. `  - testing`) so it stays within `metadata`.

... by applying a monkey patch to the KFP executor, because the KFP local SDK does not support Input[Dataset] locally.

Signed-off-by: Yi Zheng <237498169+beatsmonster@users.noreply.github.com>
@google-oss-prow

The following users are mentioned in OWNERS file(s) but are untrusted for the following reasons. One way to make the user trusted is to add them as members of the kubeflow org. You can then trigger verification by writing /verify-owners in a comment.

  • beatsmonster
    • User is not a member of the org. Satisfy at least one of these conditions to make the user trusted.
    • components/sdg/sdg_hub/OWNERS
  • shivchander
    • User is not a member of the org. Satisfy at least one of these conditions to make the user trusted.
    • components/sdg/sdg_hub/OWNERS
  • eshwarprasadS
    • User is not a member of the org. Satisfy at least one of these conditions to make the user trusted.
    • components/sdg/sdg_hub/OWNERS
  • ashtarkb
    • User is not a member of the org. Satisfy at least one of these conditions to make the user trusted.
    • components/sdg/sdg_hub/OWNERS

api_key = os.environ.get("LLM_API_KEY", "")
api_base = os.environ.get("LLM_API_BASE", "")

model_kwargs = {


The default values temperature=0.7 and max_tokens=2048 here may always override the YAML values, if I understand this correctly. I like that these are defaults, but their presence may unintentionally override values that are already set in the YAML configs (the YAML definition of blocks can carry temperature and max-token values, which would be ignored now).

I think we should respect the YAML settings over the component defaults here (although I really like that the parameters allow overriding at the component level).

One fix I was able to think of is using meaningless placeholder defaults like -1:

temperature: float = -1.0,  # -1 -> use flow default
max_tokens: int = -1,       # -1 -> use flow default

Then we only include them in the kwargs when they are explicitly set:

model_kwargs = {}
if temperature >= 0:
    model_kwargs["temperature"] = temperature
if max_tokens > 0:
    model_kwargs["max_tokens"] = max_tokens

There are probably other elegant ways as well; this was just one that came to mind.
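Assembled into one runnable helper, the sentinel-default idea from this thread looks roughly like this (`build_model_kwargs` is an illustrative name, not the component's actual structure):

```python
def build_model_kwargs(temperature: float = -1.0, max_tokens: int = -1) -> dict:
    """Only forward temperature/max_tokens when the caller set them explicitly,
    so values configured in the flow YAML are not silently overridden."""
    model_kwargs = {}
    if temperature >= 0:
        model_kwargs["temperature"] = temperature
    if max_tokens > 0:
        model_kwargs["max_tokens"] = max_tokens
    return model_kwargs
```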

    ]
}
with open(output_metrics.path, "w") as f:
    json.dump(metrics_data, f)


I was curious whether this is the most appropriate pattern for logging metrics in KFP. I don't have full context, but referring back to ARCHITECTURE.md, the metrics logging there uses this format:

metrics.log_metric("total_input_rows", 1000)
metrics.log_metric("total_output_rows", 850)
metrics.log_metric("execution_time_seconds", 3456.7)
metrics.log_metric("successful_blocks", 8)
metrics.log_metric("failed_blocks", 0)

In the current implementation, a hand-crafted JSON dict is written out to output_metrics.path. If this is the correct pattern, maybe we can amend the architecture doc; if the architecture pattern is more appropriate, could we adopt the same?



@dsl.component(
packages_to_install=["sdg-hub"],


Should our library install here be version pinned? I am not sure of the convention for KFP components, but I assume at least a constrained range could be useful to ensure reproducibility of runs.

@dsl.component(
packages_to_install=["sdg-hub"],
)
def sdg(


I definitely understand the motivation behind not porting every parameter available to the flow.generate() api (https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/blob/6d773011cfffdd00e4d12365f8891c553e15bd7b/src/sdg_hub/core/flow/base.py#L114)

but I believe runtime_params, which facilitates per-block overrides, is a high-value setting that could be included here. What do you think?

Just from a practical perspective, I have noticed that users of sdg_hub often use runtime_params to manipulate block-level settings per run.
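A hedged sketch of how runtime_params could be exposed through the component surface (the parameter name `runtime_params_json` and the JSON-string transport are assumptions, not the PR's actual API):

```python
import json


def parse_runtime_params(runtime_params_json: str) -> dict:
    """Decode a KFP-friendly JSON string into the per-block override dict
    that would be forwarded to flow.generate(runtime_params=...)."""
    if not runtime_params_json:
        return {}
    params = json.loads(runtime_params_json)
    if not isinstance(params, dict):
        raise ValueError("runtime_params must decode to a JSON object")
    return params
```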

@@ -0,0 +1,244 @@
"""Example KFP pipelines demonstrating the SDG Hub component.
Contributor


We don't recommend adding pipelines to the components package. Can you either move this to the pipelines package, or add it as custom content in the README if you don't want to commit it but still want to show users how to create pipelines out of it?

@@ -0,0 +1,103 @@
"""Run the SDG component locally via KFP LocalRunner.
Contributor


I think with the new change where a shared folder is allowed, you can move these utilities to that directory. Or you can use/enhance test_component_local to perform the functionality provided in this file as well.

@@ -0,0 +1,1800 @@
# SDG Hub KFP Component Architecture Design
Contributor


This can also go under the custom content in README

"semver",
"pip",
"setuptools",
"sdg-hub",
Contributor


This seems to have been an accidental check-in? You probably added it for your local testing and didn't mean to commit it.
