fix: Remove pkl support for reading datasets #233
Conversation
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Pull request overview
This PR removes pickle (.pkl) dataset loading support to reduce the risk of arbitrary code execution, migrating tests, docs, and templates to use JSONL instead and adding a JSONL test dataset artifact.
Changes:
- Replace `.pkl` references with `.jsonl` across unit/integration tests and documentation.
- Remove `.pkl` loader support from `DatasetFormat` / the dataset loading implementation.
- Add a JSONL dummy dataset and update the OpenOrca predefined dataset flow to cache as JSONL.
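For context on why JSONL is the safer target format: it is one JSON object per line, so loading it never executes code. A minimal sketch of reading such a file (the `read_jsonl` helper here is illustrative, not the project's actual loader):

```python
import json
from pathlib import Path


def read_jsonl(path: Path) -> list[dict]:
    """Parse a JSON Lines file into a list of records (one dict per non-empty line)."""
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Example: write and read back a tiny two-row dataset.
sample = Path("dummy.jsonl")
sample.write_text('{"prompt": "hello"}\n{"prompt": "world"}\n', encoding="utf-8")
records = read_jsonl(sample)
```

Unlike `pickle.load`, a malformed or malicious line here can at worst raise `json.JSONDecodeError`.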
Reviewed changes
Copilot reviewed 27 out of 31 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/dataset_manager/test_data_loader.py | Updates dataset loader unit tests from pickle reader fixtures to JSONL. |
| tests/unit/config/test_yaml_loader.py | Updates YAML loader tests to reference .jsonl dataset paths. |
| tests/unit/config/test_schema.py | Updates schema tests to use .jsonl dataset paths. |
| tests/unit/commands/test_util_commands.py | Updates CLI validation tests to use .jsonl dataset paths. |
| tests/unit/commands/test_benchmark.py | Updates benchmark tests to use .jsonl dataset strings/paths. |
| tests/integration/test_server_roundtrip.py | Switches integration fixture usage from pickle dataset path to JSONL dataset path. |
| tests/integration/test_end_to_end_oracle.py | Switches end-to-end oracle tests to use JSONL dataset path fixture. |
| tests/integration/commands/test_benchmark_command.py | Updates integration benchmark command tests to use JSONL dataset fixture. |
| tests/datasets/Readme.md | Updates test dataset documentation to JSONL naming/format. |
| tests/datasets/dummy_1k.jsonl | Adds a JSONL dummy dataset for local CLI/testing. |
| tests/conftest.py | Renames dataset fixtures from pickle to JSONL and updates oracle server fixture wiring. |
| src/inference_endpoint/dataset_manager/README.md | Removes .pkl from the documented supported formats list. |
| src/inference_endpoint/dataset_manager/predefined/open_orca/__init__.py | Converts upstream OpenOrca pickle artifact into a local JSONL cache. |
| src/inference_endpoint/dataset_manager/factory.py | Updates factory documentation to remove pickle support mention. |
| src/inference_endpoint/dataset_manager/dataset.py | Removes pickle loader and related enum entries; updates docs/comments accordingly. |
| src/inference_endpoint/config/utils.py | Updates dataset string parsing examples away from .pkl. |
| src/inference_endpoint/config/templates/submission_template.yaml | Updates template dataset paths and formats to JSONL. |
| src/inference_endpoint/config/templates/offline_template.yaml | Updates offline template dataset path/format to JSONL. |
| src/inference_endpoint/config/templates/eval_template.yaml | Updates eval template dataset path/format to JSONL. |
| src/inference_endpoint/config/templates/concurrency_template.yaml | Updates concurrency template dataset path/format to JSONL. |
| scripts/create_dummy_dataset.py | Updates dummy dataset generator to output JSONL instead of pickle. |
| README.md | Updates CLI usage examples to reference the JSONL dummy dataset. |
| examples/02_ServerBenchmarking/README.md | Updates example command to use JSONL dummy dataset. |
| docs/LOCAL_TESTING.md | Updates local testing docs to use JSONL datasets and revised supported formats text. |
| docs/DEVELOPMENT.md | Updates development docs to reference JSONL test datasets. |
| docs/CLI_QUICK_REFERENCE.md | Updates CLI reference examples and defaults to JSONL datasets. |
| docs/CLI_DESIGN.md | Updates CLI design examples to reference JSONL datasets. |
| AGENTS.md | Updates architecture/testing docs to reflect JSONL-based test data and supported formats. |
Comments suppressed due to low confidence (2)
docs/CLI_QUICK_REFERENCE.md:59
- The CLI example uses `format=jsonl`, but the code path that casts `Dataset.format` into `DatasetFormat` expects enum values like `.jsonl` (leading dot) or `huggingface`. As written, this example will raise a `ValueError` at runtime when creating the loader; update the docs (and templates/tests) to use `.jsonl`, or omit `format` and rely on extension auto-detection.
```bash
--dataset data.jsonl # simple path
--dataset acc:eval.jsonl # accuracy dataset
--dataset data.csv,samples=500,parser.prompt=article # with options
--dataset perf:data.jsonl,format=jsonl,parser.prompt=text # explicit format + remap
```
**src/inference_endpoint/dataset_manager/factory.py:38**
* The factory docstring lists supported formats, but it omits formats that are actually implemented (`csv` and `json` loaders exist in `dataset.py`). Please update this list so it stays aligned with the real loader registry and doesn't mislead users.
```python
class DataLoaderFactory:
    """Factory for creating dataset loaders based on format.

    Supports:
    - parquet: Parquet format (ParquetReader)
    - jsonl: JSON Lines format (JsonlReader)
    - hf: HuggingFace datasets (HFDataLoader)
    """
```
---
Code Review
This pull request migrates the project's dataset format from pickle (.pkl) to JSONL (.jsonl). Key changes include removing pickle-related loaders from the DatasetManager, updating the dummy dataset generation script, and comprehensively revising documentation, examples, and test suites to use the new format. Review feedback identifies several documentation inconsistencies, such as the continued listing of the removed npy format, outdated docstring references for JSON structures, and incorrect class names in the DataLoaderFactory documentation.
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Pull request overview
Copilot reviewed 27 out of 31 changed files in this pull request and generated 3 comments.
/gemini review
Code Review
This pull request replaces the use of Pickle-based datasets with JSONL across the entire project, including documentation, configuration templates, and test suites. The PickleListLoader has been removed from the dataset manager, and a conversion utility was implemented for the OpenOrca dataset to migrate existing pickle artifacts to JSONL. Feedback suggests addressing a security vulnerability in the OpenOrca conversion logic, where pd.read_pickle is still used, by implementing integrity verification such as checksums for the downloaded files to prevent arbitrary code execution.
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Pull request overview
Copilot reviewed 27 out of 31 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
tests/integration/test_server_roundtrip.py:33
- The test docstring still says it is testing the "PickleReader", but the fixture and loader now use a JSONL dataset path. Please update the docstring to avoid confusion while debugging failures.
"""
Test the PickleReader by performing a roundtrip request through a mock HTTP Oracle server.
Pull request overview
Copilot reviewed 27 out of 31 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (2)
tests/integration/test_server_roundtrip.py:33
- The test docstring still says "PickleReader" even though this test now uses the JSONL dataset fixture. Please update the wording to avoid confusion about which formats are supported and what this test is validating.
"""
Test the PickleReader by performing a roundtrip request through a mock HTTP Oracle server.
src/inference_endpoint/dataset_manager/dataset.py:57
- Pickle support was removed from DatasetFormat, but there’s no unit test asserting that loading a .pkl dataset now fails with a clear error. Adding a regression test (e.g., Dataset.load_from_file('x.pkl') raising ValueError) would help ensure pickle inputs are consistently rejected.
```python
class DatasetFormat(Enum):
    """Enum defining possible supported formats for accuracy datasets to be saved. The value of the enum
    defines the file extension to be saved as.
    """

    CSV = ".csv"
    """Comma-separated values file with a column header."""

    PARQUET = ".parquet"
    """Apache Parquet file."""

    JSON = ".json"
    """JSON file containing a list of records (dictionaries), where keys are column names."""

    JSONL = ".jsonl"
    """JSON Lines file. Each line is a JSON object where the keys are the column names. It is assumed that
    every row has the same keys."""

    HF = "huggingface"
    """HuggingFace dataset."""
```
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
viraatc left a comment
Review Council — Multi-AI Code Review
Reviewed by: Claude (Codex unavailable due to sandbox restrictions) | Depth: thorough
Found 1 new issue across 1 file (25 existing inline comments were already on this PR — all other findings were already covered):
- 0 critical/high
- 0 medium
- 1 low
| # | File | Line | Severity | Category | Summary |
|---|---|---|---|---|---|
| 1 | tests/integration/test_server_roundtrip.py | 32 | low | design | Stale "PickleReader" docstring not updated to match fixture rename |
Additional note: Testing gap
After removing DatasetFormat.PICKLE, .PANDAS_DF, and .NUMPY_ARRAY from the enum, there is no negative test verifying that these formats are now rejected (e.g., that DatasetFormat(".pkl") raises ValueError). Consider adding a parametrized test to confirm removed formats are properly rejected — this guards against accidental re-introduction.
What was already well-covered by existing reviewers
The 25 existing comments from Copilot, Gemini, and nv-alicheng thoroughly covered:
- Double SHA-256 verification in
open_orca/__init__.py(Copilot, nv-alicheng) - Security implications of residual
pd.read_pickle()usage (Copilot, Gemini) - Atomic write pattern explanation (nv-alicheng, author)
- Factory docstring accuracy (Gemini, Copilot)
- Variable naming with format-specific prefixes (nv-alicheng)
Overall this is a clean, well-structured PR that correctly removes pickle support from the dataset pipeline while preserving a carefully guarded conversion path for the upstream OpenOrca artifact.
viraatc left a comment
Review Council — Supplementary Findings
4 additional issues found by the Claude agent (completing the thorough review):
- 0 critical
- 3 medium
- 1 low
Note: `open_orca/__init__.py:142` has a dead `except subprocess.CalledProcessError` clause (unreachable because `check=False`), but the line is outside the diff so it cannot be posted inline.
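The unreachable-except point can be demonstrated directly: with `check=False`, `subprocess.run` reports failure only via the return code and never raises `CalledProcessError`. The command below is an illustrative stand-in, not the project's actual download call:

```python
import subprocess
import sys

# With check=False, a non-zero exit status does NOT raise CalledProcessError;
# the failure is only visible on the returned object's returncode attribute.
result = subprocess.run(
    [sys.executable, "-c", "import sys; sys.exit(3)"],
    check=False,
)
assert result.returncode == 3  # reaching this line proves no exception was raised
```

An `except subprocess.CalledProcessError` around such a call is therefore dead code; either pass `check=True` or branch on `result.returncode`.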
Review Council — Codex Addendum
Codex review completed (run separately due to sandbox restrictions). Codex found 2 issues:
Cross-reference with Claude findings
Updated severity summary (all reviewers combined)
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Pull request overview
Copilot reviewed 27 out of 31 changed files in this pull request and generated no new comments.
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Pull request overview
Copilot reviewed 27 out of 31 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (2)
src/inference_endpoint/dataset_manager/predefined/open_orca/__init__.py:138
- This code downloads a shell script from GitHub and (later) executes it. Even with a pinned commit hash, executing remote code at runtime is a high-risk supply-chain vector; consider vendoring the script, or verifying its integrity (e.g., hardcoded SHA-256) before execution, or replacing it with a native downloader implementation.
```python
COMMIT_HASH = "27da4421877f2831eeb615b43ee5098c4b70be7e"
downloader_url = f"https://raw.githubusercontent.com/mlcommons/r2-downloader/{COMMIT_HASH}/mlc-r2-downloader.sh"
download_dir = cache_dir
script_path = cache_dir / "mlc-r2-downloader.sh"
r = requests.get(downloader_url, timeout=30)
r.raise_for_status()
script_path.write_bytes(r.content)
script_path.chmod(0o755)
```
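A hedged sketch of the integrity check the reviewers suggest: pin an expected SHA-256 digest alongside the commit hash and refuse to write or execute the script when the digest does not match. The `EXPECTED_SHA256` constant below is a placeholder, not the real hash of the downloader script:

```python
import hashlib

# Placeholder value; in real code this would be the known digest of the
# pinned mlc-r2-downloader.sh revision.
EXPECTED_SHA256 = "0" * 64


def verify_sha256(content: bytes, expected_hex: str) -> None:
    """Raise ValueError if the content's SHA-256 digest differs from the expected one."""
    actual = hashlib.sha256(content).hexdigest()
    if actual != expected_hex:
        raise ValueError(f"checksum mismatch: expected {expected_hex}, got {actual}")


# Usage in the download path would look roughly like:
#   r = requests.get(downloader_url, timeout=30)
#   r.raise_for_status()
#   verify_sha256(r.content, EXPECTED_SHA256)  # before write_bytes / chmod / execution
```

Vendoring the script into the repository achieves the same goal with less moving machinery, at the cost of tracking upstream updates manually.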
tests/integration/test_server_roundtrip.py:33
- The docstring still says this test validates the “PickleReader”, but the fixture and dataset path were renamed to JSONL. Please update the wording so the test description matches the new dataset reader behavior.
"""
Test the PickleReader by performing a roundtrip request through a mock HTTP Oracle server.
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Pull request overview
Copilot reviewed 29 out of 33 changed files in this pull request and generated 1 comment.
What does this PR do?
Pickle support can enable arbitrary code execution, hence we would like to discourage it.
Note that open-orca only exists in pickle format from MLC, so we convert it internally to JSONL.
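The one-time conversion idea can be sketched with the standard library; per the review threads, the real code uses `pd.read_pickle` on the trusted upstream artifact, and the record shape below is illustrative only:

```python
import json
import pickle
from pathlib import Path


def convert_pickle_to_jsonl(pkl_path: Path, jsonl_path: Path) -> None:
    """One-time conversion: unpickle a trusted list of records, then cache as JSONL.

    Only ever call this on the verified upstream artifact; all subsequent
    reads go through the safe JSONL loader instead of pickle.
    """
    records = pickle.loads(pkl_path.read_bytes())
    with jsonl_path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


# Demo with an illustrative record shape.
src, dst = Path("orca.pkl"), Path("orca.jsonl")
src.write_bytes(pickle.dumps([{"question": "2+2?", "answer": "4"}]))
convert_pickle_to_jsonl(src, dst)
```

After the cache exists, the pickle artifact never needs to be touched again, which is what confines the `pd.read_pickle` risk to this single guarded path.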
Type of change
Related issues
closes #184
Testing
Checklist