Skip to content

Conversation

@nikita-savelyevv
Copy link
Contributor

@nikita-savelyevv nikita-savelyevv commented Dec 17, 2025

What does this PR do?

Sequence length during quantization calibration data collection can now be provided as a string like "wikitext2:seq_len=128". The motivation is that sometimes it makes sense to adjust default sequence length value, and this way it will be possible to configure it in those particular cases. For example for configs inside _DEFAULT_4BIT_WQ_CONFIGS.

Based on PR by Copilot: nikita-savelyevv#4

Ticket: CVS-176623

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Copilot AI and others added 2 commits December 17, 2025 14:58
- Add dataset_kwargs attribute and parsing logic to OVQuantizationConfigBase
- Parse "dataset:seq_len=value" syntax with validation
- Rename seqlen parameter to seq_len in _prepare_causal_lm_calibration_data
- Thread dataset_kwargs through all calibration helpers
- Add comprehensive unit and integration tests

Co-authored-by: nikita-savelyevv <23343961+nikita-savelyevv@users.noreply.github.com>

Apply black and ruff formatting to changes

Co-authored-by: nikita-savelyevv <23343961+nikita-savelyevv@users.noreply.github.com>

Revert unrelated formatting change to minimize diff

Co-authored-by: nikita-savelyevv <23343961+nikita-savelyevv@users.noreply.github.com>

Address PR feedback: merge tests, update docstrings, use explicit seq_len parameter

- Merge TestDatasetIntegration into TestDatasetParsing and move to test_quantization.py
- Remove standalone test files (test_dataset_parsing.py, test_dataset_integration.py)
- Update dataset docstrings in all OVQuantizationConfigBase child classes
- Update CLI --dataset help text in openvino.py
- Update dataset documentation in export.mdx
- Change from **dataset_kwargs unpacking to explicit seq_len= parameter
- Remove unnecessary comment about integrating seq_len

Co-authored-by: nikita-savelyevv <23343961+nikita-savelyevv@users.noreply.github.com>

Add test cases for new dataset format with seq_len option

- Add test case to test_exporters_cli.py using wikitext2:seq_len=64
- Add test case to test_quantization.py using c4:seq_len=64

Co-authored-by: nikita-savelyevv <23343961+nikita-savelyevv@users.noreply.github.com>

Address final PR feedback: remove empty line, adjust test samples, remove redundant tests

- Remove empty line between seq_len assignment and tokenizer initialization
- Change test to use num_samples=1 instead of 100 for faster execution
- Add num_samples=1 to test configuration in test_quantization.py
- Remove redundant integration test methods (causal_lm, gsm8k, text_to_text, text_encoder)
- Remove redundant list_dataset_backward_compatibility test

Co-authored-by: nikita-savelyevv <23343961+nikita-savelyevv@users.noreply.github.com>

Fix serialization and seq_len parameter passing issues

- Handle dataset_kwargs deserialization for backward compatibility by extracting it from kwargs
- Only pass seq_len to helper functions when it exists in dataset_kwargs to avoid overriding defaults with None
- Use conditional kwargs construction for all calibration helpers

Co-authored-by: nikita-savelyevv <23343961+nikita-savelyevv@users.noreply.github.com>
@nikita-savelyevv nikita-savelyevv changed the title [OpenVINO] Introduce extended dataset options like 'wikitext:seq_len=128' [OpenVINO] Introduce extended quantization dataset options like 'wikitext:seq_len=128' Dec 17, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@nikita-savelyevv nikita-savelyevv changed the title [OpenVINO] Introduce extended quantization dataset options like 'wikitext:seq_len=128' [OpenVINO] Introduce extended quantization dataset options like 'wikitext2:seq_len=128' Dec 17, 2025
@nikita-savelyevv nikita-savelyevv marked this pull request as ready for review December 18, 2025 13:11
Copy link
Collaborator

@echarlaix echarlaix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the addition!

Copy link
Collaborator

@echarlaix echarlaix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the great PR @nikita-savelyevv

@echarlaix echarlaix merged commit 1b30257 into main Jan 5, 2026
54 of 56 checks passed
@echarlaix echarlaix deleted the ns/seq-len-dataset-option branch January 5, 2026 18:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants