feat: add local file system connector for unstructured-ingest #399

Merged
natygyoon merged 33 commits into main from feat/local-connector
Mar 29, 2023

Conversation

natygyoon

The following parameters have been added to unstructured-ingest to support bulk processing a directory of files in the local file system.

  • --local-input-path
  • --local-recursive
  • --local-file-glob
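
As a rough illustration of how these three options could interact, here is a minimal sketch of local file discovery. It is not the connector's actual code: the function name `list_local_files` and its parameters are hypothetical, and matching the comma-separated glob list with `fnmatch` against file names is an assumption about the intended semantics.

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import List, Optional


def list_local_files(
    input_path: str,
    recursive: bool = False,
    file_glob: Optional[str] = None,
) -> List[Path]:
    """Hypothetical helper: collect files under input_path.

    recursive  -> mirrors the idea behind --local-recursive
    file_glob  -> comma-separated patterns, e.g. '*.html,*.txt'
    """
    root = Path(input_path)
    # Split the comma-separated pattern list, ignoring empty entries.
    patterns = (
        [p.strip() for p in file_glob.split(",") if p.strip()] if file_glob else []
    )
    # rglob descends into subdirectories; glob stays at the top level.
    candidates = root.rglob("*") if recursive else root.glob("*")
    files = [p for p in candidates if p.is_file()]
    if patterns:
        # Keep a file if its name matches at least one pattern.
        files = [p for p in files if any(fnmatch(p.name, pat) for pat in patterns)]
    return sorted(files)
```

With no glob given, every regular file is returned; with `recursive=False`, files in subdirectories are skipped.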

@natygyoon natygyoon requested a review from cragwolfe March 27, 2023 16:48
@cragwolfe

Seems pretty close. Some comments:

  • Please remove test_unstructured_ingest/expected-structured-output/local-ingest-output/.gitkeep. The connector should create this directory itself, if it is not already present, when it writes files.

  • --local-recursive does not appear to be working from the command line.

  • It should be possible to process a single file rather than a directory. For example:

PYTHONPATH=. ./unstructured/ingest/main.py \
  --metadata-exclude filename \
  --local-input-path example-docs/fake-html.html \
  --structured-output-dir local-ingest-output2 \
  --verbose \
  --reprocess
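
The two behaviors requested above (accepting a single file as input, and creating the output directory on demand) might be handled along these lines. This is only a sketch under assumptions: `resolve_inputs` and `write_output` are hypothetical names, and the `.json` output naming is illustrative, not taken from the connector.

```python
import os
from pathlib import Path
from typing import List


def resolve_inputs(input_path: str) -> List[Path]:
    """Hypothetical helper: accept either a single file or a directory."""
    path = Path(input_path)
    if path.is_file():
        # A single file is treated as a one-element batch.
        return [path]
    if path.is_dir():
        return sorted(p for p in path.iterdir() if p.is_file())
    raise FileNotFoundError(f"No such file or directory: {input_path}")


def write_output(output_dir: str, source: Path, payload: str) -> Path:
    """Hypothetical helper: write a result, creating output_dir if needed."""
    # exist_ok=True makes a committed .gitkeep placeholder unnecessary.
    os.makedirs(output_dir, exist_ok=True)
    target = Path(output_dir) / f"{source.name}.json"
    target.write_text(payload, encoding="utf-8")
    return target
```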

@natygyoon natygyoon enabled auto-merge (squash) March 29, 2023 15:07

@cragwolfe cragwolfe left a comment


LGTM

default=None,
help="A comma-separated list of file globs to limit which types of local files are accepted,"
" e.g. '*.html,*.txt'",
)

Let's move the --local options above --remote-url.

@natygyoon natygyoon merged commit 7f6e094 into main Mar 29, 2023
@natygyoon natygyoon deleted the feat/local-connector branch March 29, 2023 22:53