-
Notifications
You must be signed in to change notification settings - Fork 816
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add --fields-include to unstructured-ingest #376
Conversation
it's a little hard to review with the changes from the --metadata_include/exclude branch mixed in...but it doesn't look like there are any tests added to cover the new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropped a note on iteration and should probably add tests. Also, there are changes from the metadata_include/exclude PR mixed in here, so I can't approve that's fixed. Could either wait until that merges and rebase or make these changes against main.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested with
https://github.com/Unstructured-IO/unstructured/blob/main/examples/ingest/wikipedia/ingest.sh
adding
--fields-include text,type \
Confirmed that elements in json outputs only had text
and type
.
Also confirmed that the outputs are equivalent to prior behavior if --fields-include is not specified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please bump version and update CHANGELOG to reflect the change happening under 0.5.7-devX
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
Currently, unstructured-ingest writes all top-level fields in an element with the exception of “coordinates.” This is a reasonable default. Update the command to accept a “--fields-include” in main.py so that the default is “element_id,text,type,metadata”.