feat: add --fields-include to unstructured-ingest #376

Merged
merged 52 commits into from
Mar 22, 2023
Changes from 24 commits
Commits
3ef6265
added metadata in/exclude params
natygyoon Mar 15, 2023
a084a5c
updated process_file
natygyoon Mar 15, 2023
499280d
existing tests
natygyoon Mar 15, 2023
3b8d369
remove default behavior
natygyoon Mar 15, 2023
e8319f5
changelog and ci
natygyoon Mar 16, 2023
2d01c7a
line length
natygyoon Mar 16, 2023
d6ab499
import
natygyoon Mar 16, 2023
ef788f6
import
natygyoon Mar 16, 2023
297ff5f
import sorted
natygyoon Mar 16, 2023
780f7ff
import
natygyoon Mar 16, 2023
252654b
type
natygyoon Mar 16, 2023
59d7a00
line length
natygyoon Mar 16, 2023
b5393c3
version sync
natygyoon Mar 16, 2023
ef07f40
main
natygyoon Mar 16, 2023
1283a84
ci
natygyoon Mar 16, 2023
97d04e6
json
natygyoon Mar 16, 2023
dd0319a
dict
natygyoon Mar 16, 2023
bdd4a1e
type ignore
natygyoon Mar 16, 2023
79333a5
lint
natygyoon Mar 16, 2023
fc1db61
unit tests for process_file
natygyoon Mar 17, 2023
4f92b68
lint
natygyoon Mar 17, 2023
203e1b2
added --fields-include
natygyoon Mar 17, 2023
53b7aea
lint
natygyoon Mar 17, 2023
e35a04f
line length
natygyoon Mar 17, 2023
b5d7b3e
fix version
natygyoon Mar 17, 2023
3905681
type changed to Optional(str)
natygyoon Mar 20, 2023
c690284
ci
natygyoon Mar 20, 2023
5fab790
line length
natygyoon Mar 20, 2023
46b7550
merge `feat/metadata` into `feat/fields-include`
natygyoon Mar 20, 2023
c0f2ad0
merge conflict
natygyoon Mar 20, 2023
dbf7774
merge conflict
natygyoon Mar 20, 2023
ed5869d
line length
natygyoon Mar 20, 2023
45c9b86
type check
natygyoon Mar 20, 2023
84b831a
line length
natygyoon Mar 20, 2023
c0dbcbe
default
natygyoon Mar 20, 2023
a18d91d
subclass type
natygyoon Mar 20, 2023
7a131b7
fixed dict iter error
natygyoon Mar 20, 2023
054fdc4
Merge branch 'main' into feat/fields-include
natygyoon Mar 20, 2023
bc758bc
Merge branch 'main' into feat/fields-include
natygyoon Mar 21, 2023
6c42cf4
code refactor
natygyoon Mar 21, 2023
f1f90d2
added unit tests for fields_include
natygyoon Mar 21, 2023
245266b
version sync
natygyoon Mar 21, 2023
73f0f6d
Merge branch 'main' into feat/fields-include
natygyoon Mar 21, 2023
509457b
remove custom class
natygyoon Mar 21, 2023
33faf50
unit tests
natygyoon Mar 21, 2023
ec46090
changelog
natygyoon Mar 21, 2023
4e7da4f
version bump
natygyoon Mar 22, 2023
d98c8c2
nit
natygyoon Mar 22, 2023
7fd4ad5
Merge branch 'main' into feat/fields-include
natygyoon Mar 22, 2023
3b46996
remove duplicate
natygyoon Mar 22, 2023
840a1ae
nit
natygyoon Mar 22, 2023
f29bf65
Merge branch 'main' into feat/fields-include
natygyoon Mar 22, 2023
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -1,9 +1,11 @@
-## 0.5.5-dev0
+## 0.5.5-dev2

 ### Enhancements

 ### Features

+* Add `--fields-include` parameter to `unstructured-ingest`
+* Add `--metadata-include` and `--metadata-exclude` parameters to `unstructured-ingest`
 * Add `clean_non_ascii_chars` to remove non-ascii characters from unicode string

 ### Fixes
1 change: 1 addition & 0 deletions test_unstructured_ingest/test-ingest-azure.sh
@@ -4,6 +4,7 @@ SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)
 cd "$SCRIPT_DIR"/.. || exit 1

 PYTHONPATH=. ./unstructured/ingest/main.py \
+--metadata-exclude filename \
 --remote-url abfs://container1/ \
 --azure-account-name azureunstructured1 \
 --structured-output-dir azure-ingest-output \
1 change: 1 addition & 0 deletions test_unstructured_ingest/test-ingest-biomed-api.sh
@@ -10,6 +10,7 @@ if [[ "$(find test_unstructured_ingest/expected-structured-output/biomed-ingest-
 fi

 PYTHONPATH=. ./unstructured/ingest/main.py \
+--metadata-exclude filename \
 --biomed-api-from "2019-01-02" \
 --biomed-api-until "2019-01-02+00:03:10" \
 --structured-output-dir biomed-ingest-output-api \
1 change: 1 addition & 0 deletions test_unstructured_ingest/test-ingest-biomed-path.sh
@@ -10,6 +10,7 @@ if [[ "$(find test_unstructured_ingest/expected-structured-output/biomed-ingest-
 fi

 PYTHONPATH=. ./unstructured/ingest/main.py \
+--metadata-exclude filename \
 --biomed-path "oa_pdf/07/07/sbaa031.073.PMC7234218.pdf" \
 --structured-output-dir biomed-ingest-output-path \
 --num-processes 2 \
7 changes: 6 additions & 1 deletion test_unstructured_ingest/test-ingest-github.sh
@@ -12,7 +12,12 @@ if [[ "$CI" == "true" ]]; then
 fi


-PYTHONPATH=. ./unstructured/ingest/main.py --github-url dcneiner/Downloadify --git-file-glob '*.html,*.txt' --structured-output-dir github-downloadify-output --verbose
+PYTHONPATH=. ./unstructured/ingest/main.py \
+--metadata-exclude filename \
+--github-url dcneiner/Downloadify \
+--git-file-glob '*.html,*.txt' \
+--structured-output-dir github-downloadify-output \
+--verbose

 if ! diff -ru test_unstructured_ingest/expected-structured-output/github-downloadify github-downloadify-output ; then
 echo
1 change: 1 addition & 0 deletions test_unstructured_ingest/test-ingest-gitlab.sh
@@ -4,6 +4,7 @@ SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
 cd "$SCRIPT_DIR"/.. || exit 1

 PYTHONPATH=. ./unstructured/ingest/main.py \
+--metadata-exclude filename \
 --gitlab-url https://gitlab.com/gitlab-com/content-sites/docsy-gitlab \
 --git-file-glob '*.md,*.txt' \
 --structured-output-dir gitlab-ingest-output \
6 changes: 5 additions & 1 deletion test_unstructured_ingest/test-ingest-s3.sh
@@ -9,7 +9,11 @@ if [[ "$(find test_unstructured_ingest/expected-structured-output/s3-small-batch
 exit 1
 fi

-PYTHONPATH=. ./unstructured/ingest/main.py --s3-url s3://utic-dev-tech-fixtures/small-pdf-set/ --s3-anonymous --structured-output-dir s3-small-batch-output
+PYTHONPATH=. ./unstructured/ingest/main.py \
+--metadata-exclude filename \
+--s3-url s3://utic-dev-tech-fixtures/small-pdf-set/ \
+--s3-anonymous \
+--structured-output-dir s3-small-batch-output

 if ! diff -ru test_unstructured_ingest/expected-structured-output/s3-small-batch s3-small-batch-output ; then
 echo
1 change: 1 addition & 0 deletions test_unstructured_ingest/test-ingest-wikipedia.sh
@@ -4,6 +4,7 @@ SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
 cd "$SCRIPT_DIR"/.. || exit 1

 PYTHONPATH=. ./unstructured/ingest/main.py \
+--metadata-exclude filename \
 --wikipedia-page-title "Open Source Software" \
 --structured-output-dir wikipedia-ingest-output \
 --num-processes 2 \
78 changes: 78 additions & 0 deletions test_unstructured_ingest/test_interfaces.py
@@ -0,0 +1,78 @@
+import os
+import pathlib
+
+import pytest
+
+from unstructured.ingest.connector.git import GitIngestDoc, SimpleGitConfig
+
+DIRECTORY = pathlib.Path(__file__).parent.resolve()
+EXAMPLE_DOCS_DIRECTORY = os.path.join(DIRECTORY, "..", "example-docs")
+
+test_files = [
+    "layout-parser-paper-fast.jpg",
+    "layout-parser-paper-fast.pdf",
+]
+
+
+@pytest.mark.parametrize("filename", test_files)
+def test_process_file_include_filename(filename: str):
+    ingest_doc = GitIngestDoc(
+        path=filename,
+        config=SimpleGitConfig(
+            download_dir=EXAMPLE_DOCS_DIRECTORY,
+            metadata_include="filename",
+        ),
+    )
+    isd_elems = ingest_doc.process_file()
+
+    for elem in isd_elems:
+        for k in elem["metadata"]:
+            assert k == "filename"
+
+
+@pytest.mark.parametrize("filename", test_files)
+def test_process_file_include_filename_pagenum(filename: str):
+    ingest_doc = GitIngestDoc(
+        path=filename,
+        config=SimpleGitConfig(
+            download_dir=EXAMPLE_DOCS_DIRECTORY,
+            metadata_include="filename,page_number",
+        ),
+    )
+    isd_elems = ingest_doc.process_file()
+
+    for elem in isd_elems:
+        for k in elem["metadata"]:
+            assert k in ["filename", "page_number"]
+
+
+@pytest.mark.parametrize("filename", test_files)
+def test_process_file_exclude_filename(filename: str):
+    ingest_doc = GitIngestDoc(
+        path=filename,
+        config=SimpleGitConfig(
+            download_dir=EXAMPLE_DOCS_DIRECTORY,
+            metadata_exclude="filename",
+        ),
+    )
+    isd_elems = ingest_doc.process_file()
+
+    for elem in isd_elems:
+        for k in elem["metadata"]:
+            assert k != "filename"
+
+
+@pytest.mark.parametrize("filename", test_files)
+def test_process_file_exclude_filename_pagenum(filename: str):
+    ingest_doc = GitIngestDoc(
+        path=filename,
+        config=SimpleGitConfig(
+            download_dir=EXAMPLE_DOCS_DIRECTORY,
+            metadata_exclude="filename,page_number",
+        ),
+    )
+    isd_elems = ingest_doc.process_file()
+
+    for elem in isd_elems:
+        for k in elem["metadata"]:
+            assert k not in ["filename", "page_number"]
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.5.5-dev0" # pragma: no cover
+__version__ = "0.5.5-dev2" # pragma: no cover
3 changes: 3 additions & 0 deletions unstructured/ingest/connector/biomed.py
@@ -48,6 +48,9 @@ class SimpleBiomedConfig(BaseConnectorConfig):
     output_dir: str
     re_download: bool = False
     preserve_downloads: bool = False
+    metadata_include: str = ""
+    metadata_exclude: str = ""
+    fields_include: str = ""

     def _validate_date_args(self, date):
         date_formats = ["%Y-%m-%d", "%Y-%m-%d+%H:%M:%S"]
3 changes: 3 additions & 0 deletions unstructured/ingest/connector/fsspec.py
@@ -29,6 +29,9 @@ class SimpleFsspecConfig(BaseConnectorConfig):
     output_dir: str
     preserve_downloads: bool = False
     re_download: bool = False
+    metadata_include: str = ""
+    metadata_exclude: str = ""
+    fields_include: str = ""

     # fsspec specific options
     access_kwargs: dict = field(default_factory=dict)
3 changes: 3 additions & 0 deletions unstructured/ingest/connector/git.py
@@ -26,6 +26,9 @@ class SimpleGitConfig(BaseConnectorConfig):
     output_dir: str
     preserve_downloads: bool = False
     re_download: bool = False
+    metadata_include: str = ""
+    metadata_exclude: str = ""
+    fields_include: str = ""

     repo_path: str = field(init=False, repr=False)

3 changes: 3 additions & 0 deletions unstructured/ingest/connector/google_drive.py
@@ -77,6 +77,9 @@ class SimpleGoogleDriveConfig(BaseConnectorConfig):
     output_dir: str
     re_download: bool = False
     preserve_downloads: bool = False
+    metadata_include: str = ""
+    metadata_exclude: str = ""
+    fields_include: str = ""

     recursive: bool = False

3 changes: 3 additions & 0 deletions unstructured/ingest/connector/reddit.py
@@ -31,6 +31,9 @@ class SimpleRedditConfig(BaseConnectorConfig):
     output_dir: str
     preserve_downloads: bool = False
     re_download: bool = False
+    metadata_include: str = ""
+    metadata_exclude: str = ""
+    fields_include: str = ""

     def __post_init__(self):
         if self.num_posts <= 0:
3 changes: 3 additions & 0 deletions unstructured/ingest/connector/wikipedia.py
@@ -26,6 +26,9 @@ class SimpleWikipediaConfig(BaseConnectorConfig):
     output_dir: str
     preserve_downloads: bool = False
     re_download: bool = False
+    metadata_include: str = ""
+    metadata_exclude: str = ""
+    fields_include: str = ""


 @dataclass
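Each connector config above defaults `metadata_include`, `metadata_exclude`, and `fields_include` to an empty string, and `process_file` splits these values on commas. One detail worth keeping in mind is that splitting an empty string yields a list holding one empty name rather than an empty list. The sketch below illustrates this with a hypothetical helper, `parse_csv_option`, which is not part of this PR and is shown only as one possible way to normalize such options.

```python
from typing import List


def parse_csv_option(value: str) -> List[str]:
    """Split a comma-separated option into names, dropping blanks and whitespace.

    Hypothetical helper for illustration; it does not exist in the PR.
    """
    return [name.strip() for name in value.split(",") if name.strip()]


print("".split(","))                              # [''] : one empty name, not an empty list
print(parse_csv_option(""))                       # []
print(parse_csv_option("filename, page_number"))  # ['filename', 'page_number']
```

With a helper like this, an unset option becomes an empty list, so "no names given" is easy to distinguish from "filter down to these names".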
22 changes: 20 additions & 2 deletions unstructured/ingest/interfaces.py
@@ -47,6 +47,9 @@ class BaseConnectorConfig(ABC):
     # where to write structured data outputs
     output_dir: str
     re_download: bool = False
+    metadata_include: str = ""
+    metadata_exclude: str = ""
+    fields_include: str = ""


 class BaseIngestDoc(ABC):
@@ -58,6 +61,8 @@ class BaseIngestDoc(ABC):
     Crucially, it is not responsible for the actual processing of the raw document.
     """

+    config: BaseConnectorConfig
+
     @property
     @abstractmethod
     def filename(self):
@@ -94,8 +99,21 @@ def process_file(self):
         self.isd_elems_no_filename = []
         for elem in isd_elems:
             # type: ignore
-            elem["metadata"].pop("filename", None)  # type: ignore[attr-defined]
-            elem.pop("coordinates")  # type: ignore[attr-defined]
+            if self.config.metadata_exclude:
+                ex_list = self.config.metadata_exclude.split(",")
+                for ex in ex_list:
+                    elem["metadata"].pop(ex, None)  # type: ignore[attr-defined]
+            elif self.config.metadata_include:
+                in_list = self.config.metadata_include.split(",")
+                for k in list(elem["metadata"]):  # copy keys: pop() resizes the dict
+                    if k not in in_list:
+                        elem["metadata"].pop(k, None)  # type: ignore[attr-defined]
+
+            in_list = self.config.fields_include.split(",")
+            for k in list(elem):  # copy keys: pop() resizes the dict
+                if self.config.fields_include and k not in in_list:  # "" keeps all (assumed)
+                    elem.pop(k, None)  # type: ignore[attr-defined]
+
             self.isd_elems_no_filename.append(elem)

         return self.isd_elems_no_filename
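
The filtering that `process_file` applies to each element can be sketched as a standalone function on plain dicts. This is a minimal illustration, not the PR's implementation: the name `filter_element` is hypothetical, it returns a filtered copy instead of mutating the element in place, and it assumes that an empty `fields_include` keeps every top-level field. As in the diff, `metadata_exclude` takes precedence over `metadata_include` when both are set.

```python
from typing import Any, Dict


def filter_element(
    elem: Dict[str, Any],
    metadata_include: str = "",
    metadata_exclude: str = "",
    fields_include: str = "",
) -> Dict[str, Any]:
    """Return a filtered copy of one element dict; the input is left untouched."""
    result = dict(elem)

    if "metadata" in result:
        metadata = dict(result["metadata"])
        if metadata_exclude:
            # Exclusion wins when both options are set, mirroring the if/elif order.
            for name in metadata_exclude.split(","):
                metadata.pop(name, None)
        elif metadata_include:
            keep = set(metadata_include.split(","))
            metadata = {k: v for k, v in metadata.items() if k in keep}
        result["metadata"] = metadata

    if fields_include:
        # Assumption: an empty fields_include keeps every top-level field.
        keep = set(fields_include.split(","))
        result = {k: v for k, v in result.items() if k in keep}
    return result


element = {
    "type": "Title",
    "text": "LayoutParser",
    "coordinates": [[0, 0], [1, 1]],
    "metadata": {"filename": "paper.pdf", "page_number": 1},
}

print(filter_element(element, metadata_exclude="filename"))
print(filter_element(element, metadata_include="filename", fields_include="text,metadata"))
# second call prints {'text': 'LayoutParser', 'metadata': {'filename': 'paper.pdf'}}
```

Building new dicts avoids the `RuntimeError: dictionary changed size during iteration` that Python raises when keys are popped from a dict while a loop is iterating over it.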