
[Bug] Cannot write struct type 'pipeline_options' with no child field to Parquet #107


Description

@wai25

Instantiating DoclingPredictionProvider with do_visualization=False as follows:

docling_provider = DoclingPredictionProvider(
    do_visualization=False, ignore_missing_predictions=False
)

will result in an error when creating the prediction dataset:

poetry run pytest -v  -s tests/test_tables_docling.py
================================================================================================================= test session starts ==================================================================================================================
platform darwin -- Python 3.11.7, pytest-7.4.4, pluggy-1.5.0 -- /Users/wai25/.pyenv/versions/3.11.7/envs/quality/bin/python
cachedir: .pytest_cache
rootdir: /Users/wai25/git/docling-eval
plugins: anyio-4.9.0, dependency-0.6.0, xdist-3.6.1
collected 1 item

Processing FinTabNet dataset: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 87.08it/s]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1805.55ba/s]
Generating test split: 4 examples [00:00, 1349.52 examples/s]
Creating predictions: 100%|███████████████████████████████████████████████████████████████| 4/4 [00:08<00:00,  2.13s/it]
FAILED

======================================================================================================================= FAILURES =======================================================================================================================
______________________________________________________________________________________________________________ test_run_fintabnet_builder ______________________________________________________________________________________________________________

    def test_run_fintabnet_builder():
        target_path = Path(f"./scratch/{BenchMarkNames.FINTABNET.value}_docling/")
        docling_provider = DoclingPredictionProvider(
            do_visualization=False, ignore_missing_predictions=False
        )

        dataset = FintabNetDatasetBuilder(
            target=target_path / "gt_dataset",
            begin_index=1,
            end_index=5,
        )

        dataset.save_to_disk()  # does all the job of iterating the dataset, making GT+prediction records, and saving them in shards as parquet.

>       docling_provider.create_prediction_dataset(
            name=dataset.name,
            gt_dataset_dir=target_path / "gt_dataset",
            target_dataset_dir=target_path / "eval_dataset",
        )

tests/test_tables_docling.py:42:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
docling_eval/prediction_providers/base_prediction_provider.py:397: in create_prediction_dataset
    save_shard_to_disk(
docling_eval/utils/utils.py:436: in save_shard_to_disk
    batch.to_parquet(output_file)
../../.pyenv/versions/3.11.7/envs/quality/lib/python3.11/site-packages/datasets/arrow_dataset.py:5103: in to_parquet
    ).write()
../../.pyenv/versions/3.11.7/envs/quality/lib/python3.11/site-packages/datasets/io/parquet.py:93: in write
    written = self._write(file_obj=buffer, batch_size=batch_size, **self.parquet_writer_kwargs)
../../.pyenv/versions/3.11.7/envs/quality/lib/python3.11/site-packages/datasets/io/parquet.py:107: in _write
    writer = pq.ParquetWriter(file_obj, schema=schema, **parquet_writer_kwargs)
../../.pyenv/versions/3.11.7/envs/quality/lib/python3.11/site-packages/pyarrow/parquet/core.py:1021: in __init__
    self.writer = _parquet.ParquetWriter(
pyarrow/_parquet.pyx:2219: in pyarrow._parquet.ParquetWriter.__cinit__
    ???
pyarrow/error.pxi:155: in pyarrow.lib.pyarrow_internal_check_status
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   pyarrow.lib.ArrowNotImplementedError: Cannot write struct type 'pipeline_options' with no child field to Parquet. Consider adding a dummy child field.

pyarrow/error.pxi:92: ArrowNotImplementedError 

Environment: pyarrow == 19.0.1
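For context, the failure seems reproducible with pyarrow alone, independent of docling-eval. The sketch below is my own minimal reproduction (the table construction and the JSON-string workaround are assumptions, not code taken from docling-eval): it builds a one-row table whose pipeline_options column is a struct type with zero child fields, hits the same ArrowNotImplementedError on write, and then shows one possible workaround of serializing that column to JSON strings.

import json

import pyarrow as pa
import pyarrow.parquet as pq

# Build a one-row table with a struct column that has no child fields
# (assumed to mirror what the dataset schema ends up with for 'pipeline_options').
table = pa.table({"pipeline_options": pa.array([{}], type=pa.struct([]))})

try:
    pq.write_table(table, "repro.parquet")
except pa.lib.ArrowNotImplementedError as err:
    # ArrowNotImplementedError: Cannot write struct type 'pipeline_options'
    # with no child field to Parquet. Consider adding a dummy child field.
    print(err)

# Possible workaround (a sketch, not the project's fix): replace the empty struct
# column with its JSON-serialized form so Parquet only sees a string column.
idx = table.schema.get_field_index("pipeline_options")
json_col = pa.array([json.dumps(v) for v in table.column("pipeline_options").to_pylist()])
table = table.set_column(idx, "pipeline_options", json_col)
pq.write_table(table, "repro.parquet")  # writes successfully

The error message itself suggests adding a dummy child field to the struct; either approach avoids the zero-child struct that the Parquet writer rejects.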
