
fix: Remove pkl support for reading datasets #233

Merged
arekay-nv merged 11 commits into main from arekay/remove_pickle_support
Apr 1, 2026

Conversation

@arekay-nv
Collaborator

What does this PR do?

Pickle support can enable arbitrary code execution, so we want to discourage it.
Note that OpenOrca is only available from MLC in pickle format, so we convert it internally to JSONL.
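Per the review discussion, the actual conversion uses `pd.read_pickle` on a verified artifact; the sketch below is a hypothetical stdlib-only illustration of the same idea (function and file names are made up for the demo):

```python
import json
import pickle
import tempfile
from pathlib import Path

def convert_pickle_to_jsonl(pkl_path: Path, jsonl_path: Path) -> None:
    """Rewrite a trusted pickle containing a list of record dicts as JSONL."""
    # pickle.load can execute arbitrary code, so this is only acceptable for
    # artifacts whose integrity has been verified (e.g. via a pinned checksum).
    with pkl_path.open("rb") as f:
        records = pickle.load(f)
    with jsonl_path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

# Demo round-trip on a throwaway file
tmp = Path(tempfile.mkdtemp())
src, dst = tmp / "open_orca.pkl", tmp / "open_orca.jsonl"
src.write_bytes(pickle.dumps([{"prompt": "hello"}, {"prompt": "world"}]))
convert_pickle_to_jsonl(src, dst)
```

After conversion, downstream loaders only ever touch the JSONL file, so the pickle code path stays confined to the one guarded cache-population step.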

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

closes #184

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
@arekay-nv arekay-nv requested a review from a team as a code owner April 1, 2026 02:08
Copilot AI review requested due to automatic review settings April 1, 2026 02:08
@github-actions

github-actions bot commented Apr 1, 2026

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅


Copilot AI left a comment


Pull request overview

This PR removes pickle (.pkl) dataset loading support to reduce the risk of arbitrary code execution, migrating tests, docs, and templates to use JSONL instead and adding a JSONL test dataset artifact.

Changes:

  • Replace .pkl references with .jsonl across unit/integration tests and documentation.
  • Remove .pkl loader support from DatasetFormat / dataset loading implementation.
  • Add a JSONL dummy dataset and update the OpenOrca predefined dataset flow to cache as JSONL.

Reviewed changes

Copilot reviewed 27 out of 31 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| tests/unit/dataset_manager/test_data_loader.py | Updates dataset loader unit tests from pickle reader fixtures to JSONL. |
| tests/unit/config/test_yaml_loader.py | Updates YAML loader tests to reference .jsonl dataset paths. |
| tests/unit/config/test_schema.py | Updates schema tests to use .jsonl dataset paths. |
| tests/unit/commands/test_util_commands.py | Updates CLI validation tests to use .jsonl dataset paths. |
| tests/unit/commands/test_benchmark.py | Updates benchmark tests to use .jsonl dataset strings/paths. |
| tests/integration/test_server_roundtrip.py | Switches integration fixture usage from pickle dataset path to JSONL dataset path. |
| tests/integration/test_end_to_end_oracle.py | Switches end-to-end oracle tests to use JSONL dataset path fixture. |
| tests/integration/commands/test_benchmark_command.py | Updates integration benchmark command tests to use JSONL dataset fixture. |
| tests/datasets/Readme.md | Updates test dataset documentation to JSONL naming/format. |
| tests/datasets/dummy_1k.jsonl | Adds a JSONL dummy dataset for local CLI/testing. |
| tests/conftest.py | Renames dataset fixtures from pickle to JSONL and updates oracle server fixture wiring. |
| src/inference_endpoint/dataset_manager/README.md | Removes .pkl from the documented supported formats list. |
| src/inference_endpoint/dataset_manager/predefined/open_orca/__init__.py | Converts upstream OpenOrca pickle artifact into a local JSONL cache. |
| src/inference_endpoint/dataset_manager/factory.py | Updates factory documentation to remove pickle support mention. |
| src/inference_endpoint/dataset_manager/dataset.py | Removes pickle loader and related enum entries; updates docs/comments accordingly. |
| src/inference_endpoint/config/utils.py | Updates dataset string parsing examples away from .pkl. |
| src/inference_endpoint/config/templates/submission_template.yaml | Updates template dataset paths and formats to JSONL. |
| src/inference_endpoint/config/templates/offline_template.yaml | Updates offline template dataset path/format to JSONL. |
| src/inference_endpoint/config/templates/eval_template.yaml | Updates eval template dataset path/format to JSONL. |
| src/inference_endpoint/config/templates/concurrency_template.yaml | Updates concurrency template dataset path/format to JSONL. |
| scripts/create_dummy_dataset.py | Updates dummy dataset generator to output JSONL instead of pickle. |
| README.md | Updates CLI usage examples to reference the JSONL dummy dataset. |
| examples/02_ServerBenchmarking/README.md | Updates example command to use JSONL dummy dataset. |
| docs/LOCAL_TESTING.md | Updates local testing docs to use JSONL datasets and revised supported formats text. |
| docs/DEVELOPMENT.md | Updates development docs to reference JSONL test datasets. |
| docs/CLI_QUICK_REFERENCE.md | Updates CLI reference examples and defaults to JSONL datasets. |
| docs/CLI_DESIGN.md | Updates CLI design examples to reference JSONL datasets. |
| AGENTS.md | Updates architecture/testing docs to reflect JSONL-based test data and supported formats. |
Comments suppressed due to low confidence (2)

docs/CLI_QUICK_REFERENCE.md:59

  • The CLI example uses format=jsonl, but the code path that casts Dataset.format into DatasetFormat expects enum values like .jsonl (leading dot) or huggingface. As written, this example will raise a ValueError at runtime when creating the loader; update the docs (and templates/tests) to use .jsonl or omit format and rely on extension auto-detection.
```bash
--dataset data.jsonl                                         # simple path
--dataset acc:eval.jsonl                                     # accuracy dataset
--dataset data.csv,samples=500,parser.prompt=article         # with options
--dataset perf:data.jsonl,format=jsonl,parser.prompt=text    # explicit format + remap
```

src/inference_endpoint/dataset_manager/factory.py:38

  • The factory docstring lists supported formats, but it omits formats that are actually implemented (`csv` and `json` loaders exist in `dataset.py`). Please update this list so it stays aligned with the real loader registry and doesn't mislead users.

```python
class DataLoaderFactory:
    """Factory for creating dataset loaders based on format.

    Supports:
    - parquet: Parquet format (ParquetReader)
    - jsonl: JSON Lines format (JsonlReader)
    - hf: HuggingFace datasets (HFDataLoader)
    """
```



---



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request migrates the project's dataset format from pickle (.pkl) to JSONL (.jsonl). Key changes include removing pickle-related loaders from the DatasetManager, updating the dummy dataset generation script, and comprehensively revising documentation, examples, and test suites to use the new format. Review feedback identifies several documentation inconsistencies, such as the continued listing of the removed npy format, outdated docstring references for JSON structures, and incorrect class names in the DataLoaderFactory documentation.

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 27 out of 31 changed files in this pull request and generated 3 comments.



Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
@arekay-nv
Collaborator Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request replaces the use of Pickle-based datasets with JSONL across the entire project, including documentation, configuration templates, and test suites. The PickleListLoader has been removed from the dataset manager, and a conversion utility was implemented for the OpenOrca dataset to migrate existing pickle artifacts to JSONL. Feedback suggests addressing a security vulnerability in the OpenOrca conversion logic, where pd.read_pickle is still used, by implementing integrity verification such as checksums for the downloaded files to prevent arbitrary code execution.

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 1, 2026 17:32

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 27 out of 31 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

tests/integration/test_server_roundtrip.py:33

  • The test docstring still says it is testing the "PickleReader", but the fixture and loader now use a JSONL dataset path. Please update the docstring to avoid confusion while debugging failures.
    """
    Test the PickleReader by performing a roundtrip request through a mock HTTP Oracle server.



Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 27 out of 31 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (2)

tests/integration/test_server_roundtrip.py:33

  • The test docstring still says "PickleReader" even though this test now uses the JSONL dataset fixture. Please update the wording to avoid confusion about which formats are supported and what this test is validating.
    """
    Test the PickleReader by performing a roundtrip request through a mock HTTP Oracle server.

src/inference_endpoint/dataset_manager/dataset.py:57

  • Pickle support was removed from DatasetFormat, but there’s no unit test asserting that loading a .pkl dataset now fails with a clear error. Adding a regression test (e.g., Dataset.load_from_file('x.pkl') raising ValueError) would help ensure pickle inputs are consistently rejected.
```python
class DatasetFormat(Enum):
    """Enum defining possible supported formats for accuracy datasets to be saved. The value of the enum
    defines the file extension to be saved as.
    """

    CSV = ".csv"
    """Comma-separated values file with a column header."""

    PARQUET = ".parquet"
    """Apache Parquet file."""

    JSON = ".json"
    """JSON file containing a list of records (dictionaries), where keys are column names."""

    JSONL = ".jsonl"
    """JSON Lines file. Each line is a JSON object where the keys are the column names. It is assumed that
    every row has the same keys."""

    HF = "huggingface"
    """HuggingFace dataset."""
```


Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Collaborator

@viraatc viraatc left a comment


Review Council — Multi-AI Code Review

Reviewed by: Claude (Codex unavailable due to sandbox restrictions) | Depth: thorough

Found 1 new issue across 1 file (25 existing inline comments were already on this PR — all other findings were already covered):

  • 0 critical/high
  • 0 medium
  • 1 low
| # | File | Line | Severity | Category | Summary |
| --- | --- | --- | --- | --- | --- |
| 1 | tests/integration/test_server_roundtrip.py | 32 | low | design | Stale "PickleReader" docstring not updated to match fixture rename |

Additional note: Testing gap

After removing DatasetFormat.PICKLE, .PANDAS_DF, and .NUMPY_ARRAY from the enum, there is no negative test verifying that these formats are now rejected (e.g., that DatasetFormat(".pkl") raises ValueError). Consider adding a parametrized test to confirm removed formats are properly rejected — this guards against accidental re-introduction.

What was already well-covered by existing reviewers

The 25 existing comments from Copilot, Gemini, and nv-alicheng thoroughly covered:

  • Double SHA-256 verification in open_orca/__init__.py (Copilot, nv-alicheng)
  • Security implications of residual pd.read_pickle() usage (Copilot, Gemini)
  • Atomic write pattern explanation (nv-alicheng, author)
  • Factory docstring accuracy (Gemini, Copilot)
  • Variable naming with format-specific prefixes (nv-alicheng)

Overall this is a clean, well-structured PR that correctly removes pickle support from the dataset pipeline while preserving a carefully guarded conversion path for the upstream OpenOrca artifact.

Collaborator

@viraatc viraatc left a comment


Review Council — Supplementary Findings

4 additional issues found by the Claude agent (completing the thorough review):

  • 0 critical
  • 3 medium
  • 1 low

Note: open_orca/__init__.py:142 has a dead except subprocess.CalledProcessError clause (unreachable because check=False), but the line is outside the diff so it cannot be posted inline.

@viraatc
Collaborator

viraatc commented Apr 1, 2026

Review Council — Codex Addendum

Codex review completed (run separately due to sandbox restrictions). Codex found 2 issues:

| # | Priority | File | Lines | Summary |
| --- | --- | --- | --- | --- |
| 1 | P1 | dataset.py | 49-56 | Removing DatasetFormat.PICKLE breaks all existing .pkl dataset configs from main |
| 2 | P2 | open_orca/__init__.py | 104-112 | force=True not honored when a cached pickle exists |

Cross-reference with Claude findings

  • P2 (force=True) — already flagged by Claude in the supplementary review on line 108. Both reviewers independently identified this as a bug.
  • P1 (breaking change) — This is the intended behavior of the PR (removing pickle support). However, both Codex and Claude noted the error message at dataset.py:126 could be more actionable (suggest JSONL conversion instead of just "Unsupported file extension: .pkl").
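An actionable version of that error could look like the sketch below (the function name and extension set are hypothetical; the real check lives around dataset.py:126):

```python
SUPPORTED_EXTENSIONS = {".csv", ".parquet", ".json", ".jsonl"}

def check_extension(ext: str) -> str:
    """Validate a dataset file extension, adding a migration hint for pickle."""
    if ext in SUPPORTED_EXTENSIONS:
        return ext
    msg = f"Unsupported file extension: {ext}."
    if ext == ".pkl":
        msg += (" Pickle datasets are no longer supported because unpickling can"
                " execute arbitrary code; convert the file to JSONL first.")
    raise ValueError(msg)
```

The extra sentence costs nothing on the happy path but tells a user holding an old `.pkl` config exactly what to do next.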

Updated severity summary (all reviewers combined)

| Severity | Count | Issues |
| --- | --- | --- |
| Medium | 3 | force=True bug (Both), temp file cleanup (Claude), format dots in docs (Claude) |
| Low | 3 | Stale "PickleReader" docstring (Claude), format: vs format= docs (Claude), actionable .pkl error msg (Both) |

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 1, 2026 21:24

Copilot AI left a comment


Pull request overview

Copilot reviewed 27 out of 31 changed files in this pull request and generated no new comments.



Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 1, 2026 22:13

Copilot AI left a comment


Pull request overview

Copilot reviewed 27 out of 31 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

src/inference_endpoint/dataset_manager/predefined/open_orca/__init__.py:138

  • This code downloads a shell script from GitHub and (later) executes it. Even with a pinned commit hash, executing remote code at runtime is a high-risk supply-chain vector; consider vendoring the script, or verifying its integrity (e.g., hardcoded SHA-256) before execution, or replacing it with a native downloader implementation.
        COMMIT_HASH = "27da4421877f2831eeb615b43ee5098c4b70be7e"
        downloader_url = f"https://raw.githubusercontent.com/mlcommons/r2-downloader/{COMMIT_HASH}/mlc-r2-downloader.sh"
        download_dir = cache_dir
        script_path = cache_dir / "mlc-r2-downloader.sh"
        r = requests.get(downloader_url, timeout=30)
        r.raise_for_status()
        script_path.write_bytes(r.content)
        script_path.chmod(0o755)
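The integrity check suggested above could be implemented by pinning a known-good digest and verifying it before the chmod/execute step. A minimal stdlib sketch (the pinned digest here is derived from demo bytes for illustration, not the real script):

```python
import hashlib

def verify_sha256(content: bytes, expected_hex: str) -> None:
    """Raise if the payload's SHA-256 digest does not match the pinned value."""
    actual = hashlib.sha256(content).hexdigest()
    if actual != expected_hex:
        raise RuntimeError(f"Checksum mismatch: expected {expected_hex}, got {actual}")

# In real use, PINNED_SHA256 would be hardcoded from a trusted, one-time measurement
# of the script at the pinned commit; here it is computed from demo bytes.
payload = b"#!/bin/sh\necho demo\n"
PINNED_SHA256 = hashlib.sha256(payload).hexdigest()
verify_sha256(payload, PINNED_SHA256)  # matching digest: no exception
```

Calling `verify_sha256(r.content, PINNED_SHA256)` before `script_path.write_bytes(...)` would turn a tampered download into a hard failure instead of executed code.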

tests/integration/test_server_roundtrip.py:33

  • The docstring still says this test validates the “PickleReader”, but the fixture and dataset path were renamed to JSONL. Please update the wording so the test description matches the new dataset reader behavior.
    """
    Test the PickleReader by performing a roundtrip request through a mock HTTP Oracle server.



Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 1, 2026 22:27

Copilot AI left a comment


Pull request overview

Copilot reviewed 29 out of 33 changed files in this pull request and generated 1 comment.



@arekay-nv arekay-nv merged commit c00652d into main Apr 1, 2026
10 checks passed
@arekay-nv arekay-nv deleted the arekay/remove_pickle_support branch April 1, 2026 22:34
@github-actions github-actions bot locked and limited conversation to collaborators Apr 1, 2026

Development

Successfully merging this pull request may close these issues.

feat: Remove pickle support

4 participants