WIP: (feat) Add meta synthetic data kit as an inline provider #2311

alinaryan · 2025-05-29T20:28:43Z

What does this PR do?

Adds a comprehensive test suite for the synthetic data kit provider implementation, including both unit and integration tests. The tests validate the provider's functionality, configuration handling, and error cases according to Llama Stack's testing guidelines.

Test Plan

Unit Tests (tests/unit/providers/inline/synthetic_data_generation/test_synthetic_data_kit.py):
```
pytest tests/unit/providers/inline/synthetic_data_generation/test_synthetic_data_kit.py -v
```
Verifies:
- Configuration initialization and validation
- Environment variable handling via sample_run_config()
- Basic synthetic data generation
- Filtering functionality
- Custom model specification
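
As a rough illustration of the configuration checks listed above, here is a minimal sketch of how such unit tests could look. The `SyntheticDataKitConfig` class, its fields, the environment variable names, and the placeholder format are all hypothetical stand-ins for this example; they are not taken from the PR's code.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class SyntheticDataKitConfig:
    """Hypothetical stand-in for the provider config (not the PR's actual class)."""

    # Assumed fields: where the vLLM server listens and which model to query.
    api_base: str = "http://localhost:8000/v1"
    model: str = "meta-llama/Llama-3.2-3B-Instruct"

    def __post_init__(self) -> None:
        # Validation step: reject anything that is not an HTTP(S) URL.
        if not self.api_base.startswith(("http://", "https://")):
            raise ValueError(f"api_base must be an HTTP(S) URL, got {self.api_base!r}")

    @classmethod
    def sample_run_config(cls, **kwargs: Any) -> dict[str, Any]:
        # Illustrative env placeholders; variable names and syntax are assumptions.
        return {
            "api_base": "${env.VLLM_URL:http://localhost:8000/v1}",
            "model": "${env.INFERENCE_MODEL:meta-llama/Llama-3.2-3B-Instruct}",
        }


def test_defaults_are_valid() -> None:
    config = SyntheticDataKitConfig()
    assert config.api_base.startswith("http://")
    assert config.model == "meta-llama/Llama-3.2-3B-Instruct"


def test_invalid_api_base_is_rejected() -> None:
    try:
        SyntheticDataKitConfig(api_base="localhost:8000")
    except ValueError:
        return
    raise AssertionError("expected ValueError for a non-HTTP api_base")


def test_sample_run_config_uses_env_placeholders() -> None:
    sample = SyntheticDataKitConfig.sample_run_config()
    assert set(sample) == {"api_base", "model"}
    assert all(value.startswith("${env.") for value in sample.values())
```

The test functions use plain assertions, so pytest can collect them as written; the real suite may structure these checks differently.
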

Integration Tests (tests/integration/providers/inline/synthetic_data_generation/test_synthetic_data_kit_integration.py):
```
# Start vLLM server on port 8000 first
python -m vllm.entrypoints.api_server --model meta-llama/Llama-3.2-3B-Instruct --port 8000

# Then run integration tests
pytest tests/integration/providers/inline/synthetic_data_generation/test_synthetic_data_kit_integration.py -v
```
Verifies:
- End-to-end provider functionality with LlamaStackAsLibraryClient
- Error handling for invalid inputs
- Environment configuration integration
- Response format and content validation

Prerequisites:
- vLLM server running locally on port 8000
- Access to the meta-llama/Llama-3.2-3B-Instruct model
- Python environment with test dependencies installed
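
Because the integration tests depend on a server listening on port 8000, it can help to detect a missing server up front rather than fail mid-run. The helper below is a sketch only: the function name is made up for this example, and it checks only that a TCP connection can be opened, not that vLLM is the process listening or that the model has loaded.

```python
import socket


def vllm_server_reachable(host: str = "localhost", port: int = 8000, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (refused, timeout, DNS failure) on error.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

One way to use it would be to decorate the integration tests with `@pytest.mark.skipif(not vllm_server_reachable(), reason="vLLM server not running on port 8000")`, so they are reported as skipped instead of failing with connection errors.
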

This establishes the API contract and prepares for provider integration in a future commit.

Signed-off-by: Alina Ryan <aliryan@redhat.com>

…_generation API

The synthetic_data_kit provider integration enables high-quality synthetic dataset generation for fine-tuning LLMs. This commit sets up the initial provider registration and fixes provider resolution to handle type casting and imports correctly, so that the provider integrates with llama-stack's provider system. Implementation of the actual provider functionality will follow in a subsequent commit.

Signed-off-by: Alina Ryan <aliryan@redhat.com>

These tests follow Llama Stack's provider testing guidelines to validate:
- Configuration handling and environment variables work as expected
- Provider implementation behaves correctly in both unit and integration scenarios
- Error cases are properly handled
- Integration with Llama Stack's client SDK functions properly

Signed-off-by: Alina Ryan <aliryan@redhat.com>

alinaryan requested review from ashwinb, yanxi0830, hardikjshah, raghotham, ehhuang, terrytangyuan, leseb and bbrowning as code owners May 29, 2025 20:28

facebook-github-bot added the CLA Signed label May 29, 2025

alinaryan marked this pull request as draft May 29, 2025 20:29

alinaryan changed the title ~~WIP: (feat) Add meta synthetic data kit~~ WIP: (feat) Add meta synthetic data kit as an inline provider May 29, 2025

alinaryan added 3 commits May 30, 2025 12:14

feat: add synthetic_data_generation API scaffolding (no provider)

e867501

alinaryan force-pushed the add-meta-synthetic-data-kit branch from 23803e9 to cc03093 Compare May 30, 2025 16:14

leseb added the new-in-tree-provider label Jul 3, 2025

leseb mentioned this pull request Jul 3, 2025

feat: Add synthetic-data-kit for file_search doc conversion #2484

Draft
