This repository has been archived by the owner on Mar 1, 2024. It is now read-only.
Feature Description
Accept args and kwargs in load_data() on the Unstructured base loader and forward them when calling partition() or partition_via_api().
This would make it possible to control the (far too many) kwargs exposed by the partition library.
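A minimal sketch of the requested forwarding, under stated assumptions: `UnstructuredReaderSketch` and `fake_partition` are hypothetical stand-ins, not the actual loader or the real unstructured `partition()`, and only keyword arguments are forwarded to keep the example short.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional


def fake_partition(filename: str, **kwargs: Any) -> List[str]:
    """Hypothetical stand-in for partition(); it only echoes what it received."""
    options = ", ".join(f"{key}={value!r}" for key, value in sorted(kwargs.items()))
    return [f"{filename} [{options}]"]


class UnstructuredReaderSketch:
    """Hypothetical loader, not the reader shipped in this repository."""

    def __init__(self, partition_fn: Callable[..., List[str]] = fake_partition) -> None:
        self._partition = partition_fn

    def load_data(
        self,
        file: Path,
        extra_info: Optional[Dict[str, Any]] = None,
        **partition_kwargs: Any,
    ) -> List[Dict[str, Any]]:
        # Forward every extra keyword argument untouched to the partition call.
        elements = self._partition(filename=str(file), **partition_kwargs)
        return [{"text": "\n\n".join(elements), "extra_info": extra_info or {}}]


reader = UnstructuredReaderSketch()
docs = reader.load_data(Path("report.pdf"), strategy="hi_res", infer_table_structure=True)
print(docs[0]["text"])
```

With forwarding in place, options such as strategy or infer_table_structure reach the partition call without subclassing the loader.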
Reason
Over the last week, I tried to take advantage of the many features partition offers through this loader. A few examples:
For .docx, I intended to use include_page_breaks, which defaults to True in their docx.py but to False in their "auto" partition method, which is the one the loader calls.
For .pdf, I intended to use features such as infer_table_structure or strategy (set to hi_res). I also intended to use infer_table_structure for .pptx.
Because I cannot control the kwargs passed to partition, I cannot tune data extraction the way I intend, and I am forced to subclass and override behavior for a very simple change.
Value of Feature
As explained above, users would be able to take advantage of the many functionalities unstructured offers, such as infer_table_structure, strategy, and include_page_breaks, simply by passing args and kwargs through to partition() or partition_via_api().
IgnacioPascale changed the title from "[Feature Request]:" to "[Feature Request]: Pass args and kwargs when calling partition or partition_via_api on Unstructured loader" on Feb 13, 2024