Description
Instantiating DoclingPredictionProvider with do_visualization=False as follows:

docling_provider = DoclingPredictionProvider(
    do_visualization=False, ignore_missing_predictions=False
)

will result in an error when creating the prediction dataset:
poetry run pytest -v -s tests/test_tables_docling.py
================================================================================================================= test session starts ==================================================================================================================
platform darwin -- Python 3.11.7, pytest-7.4.4, pluggy-1.5.0 -- /Users/wai25/.pyenv/versions/3.11.7/envs/quality/bin/python
cachedir: .pytest_cache
rootdir: /Users/wai25/git/docling-eval
plugins: anyio-4.9.0, dependency-0.6.0, xdist-3.6.1
collected 1 item
Processing FinTabNet dataset: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 87.08it/s]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1805.55ba/s]
Generating test split: 4 examples [00:00, 1349.52 examples/s]
Creating predictions: 100%|███████████████████████████████████████████████████████████████| 4/4 [00:08<00:00, 2.13s/it]
FAILED
======================================================================================================================= FAILURES =======================================================================================================================
______________________________________________________________________________________________________________ test_run_fintabnet_builder ______________________________________________________________________________________________________________
    def test_run_fintabnet_builder():
        target_path = Path(f"./scratch/{BenchMarkNames.FINTABNET.value}_docling/")
        docling_provider = DoclingPredictionProvider(
            do_visualization=False, ignore_missing_predictions=False
        )
        dataset = FintabNetDatasetBuilder(
            target=target_path / "gt_dataset",
            begin_index=1,
            end_index=5,
        )
        dataset.save_to_disk()  # does all the job of iterating the dataset, making GT+prediction records, and saving them in shards as parquet.
>       docling_provider.create_prediction_dataset(
            name=dataset.name,
            gt_dataset_dir=target_path / "gt_dataset",
            target_dataset_dir=target_path / "eval_dataset",
        )
tests/test_tables_docling.py:42:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
docling_eval/prediction_providers/base_prediction_provider.py:397: in create_prediction_dataset
    save_shard_to_disk(
docling_eval/utils/utils.py:436: in save_shard_to_disk
    batch.to_parquet(output_file)
../../.pyenv/versions/3.11.7/envs/quality/lib/python3.11/site-packages/datasets/arrow_dataset.py:5103: in to_parquet
    ).write()
../../.pyenv/versions/3.11.7/envs/quality/lib/python3.11/site-packages/datasets/io/parquet.py:93: in write
    written = self._write(file_obj=buffer, batch_size=batch_size, **self.parquet_writer_kwargs)
../../.pyenv/versions/3.11.7/envs/quality/lib/python3.11/site-packages/datasets/io/parquet.py:107: in _write
    writer = pq.ParquetWriter(file_obj, schema=schema, **parquet_writer_kwargs)
../../.pyenv/versions/3.11.7/envs/quality/lib/python3.11/site-packages/pyarrow/parquet/core.py:1021: in __init__
    self.writer = _parquet.ParquetWriter(
pyarrow/_parquet.pyx:2219: in pyarrow._parquet.ParquetWriter.__cinit__
    ???
pyarrow/error.pxi:155: in pyarrow.lib.pyarrow_internal_check_status
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E pyarrow.lib.ArrowNotImplementedError: Cannot write struct type 'pipeline_options' with no child field to Parquet. Consider adding a dummy child field.
pyarrow/error.pxi:92: ArrowNotImplementedError
pyarrow == 19.0.1
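
For reference, the failure seems to come from pyarrow itself, which cannot write a struct column with no child fields to Parquet (here the pipeline_options column, presumably left empty when do_visualization=False). A minimal sketch that reproduces the same ArrowNotImplementedError outside of docling-eval (the column name is reused purely for illustration):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Assumed root cause, shown independently of docling-eval:
# Parquet cannot store a struct column that has no child fields.
empty_struct_column = pa.array([{}], type=pa.struct([]))
table = pa.table({"pipeline_options": empty_struct_column})

# Raises pyarrow.lib.ArrowNotImplementedError:
# "Cannot write struct type 'pipeline_options' with no child field to Parquet.
#  Consider adding a dummy child field."
pq.write_table(table, "repro.parquet")
```

If that is indeed the cause, either populating pipeline_options with at least one field or dropping the column when it is empty should avoid the error, in line with the message's suggestion about adding a dummy child field.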