Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add --fields-include to unstructured-ingest #376

Merged
merged 52 commits into from
Mar 22, 2023
Merged

Conversation

natygyoon
Copy link
Contributor

Currently, unstructured-ingest writes all top-level fields in an element with the exception of “coordinates.” This is a reasonable default. Update the command to accept a “--fields-include” in main.py so that the default is “element_id,text,type,metadata”.

CHANGELOG.md Outdated Show resolved Hide resolved
@ryannikolaidis
Copy link
Contributor

it's a little hard to review with the changes from the --metadata_include/exclude branch mixed in...but it doesn't look like there are any tests added to cover the new fields_include functionality, right? worth some coverage?

Copy link
Contributor

@ryannikolaidis ryannikolaidis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped a note on iteration and should probably add tests. Also, there are changes from the metadata_include/exclude PR mixed in here, so I can't approve that's fixed. Could either wait until that merges and rebase or make these changes against main.

Copy link
Contributor

@cragwolfe cragwolfe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested with

https://github.com/Unstructured-IO/unstructured/blob/main/examples/ingest/wikipedia/ingest.sh

adding

--fields-include text,type \

Confirmed that elements in json outputs only had text and type.

Also confirmed that the outputs are equivalent to prior behavior if --fields-include is not specified.

Copy link
Contributor

@cragwolfe cragwolfe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please bump version and update CHANGELOG to reflect the change happening under 0.5.7-devX

CHANGELOG.md Outdated Show resolved Hide resolved
Copy link
Contributor

@cragwolfe cragwolfe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@ryannikolaidis ryannikolaidis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@natygyoon natygyoon enabled auto-merge (squash) March 22, 2023 14:01
@natygyoon natygyoon merged commit 66a0369 into main Mar 22, 2023
@natygyoon natygyoon deleted the feat/fields-include branch March 22, 2023 14:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants