- Implement staging brick for Argilla. Converts lists of
Text
elements toargilla
dataset classes. - Removing the local PDF parsing code and any dependencies and tests.
- Reorganizes the staging bricks in the unstructured.partition module
- Allow entities to be passed into the Datasaur staging brick
- Added HTML escapes to the
replace_unicode_quotes
brick - Fix bad responses in partition_pdf to raise ValueError
- Adds
partition_html
for partitioning HTML documents.
- Small change to how _read is placed within the inheritance structure since it doesn't really apply to pdf
- Add partitioning brick for calling the document image analysis API
- Update python requirement to >=3.7
- Add alternative way of importing
Final
to support google colab
- Add cleaning bricks for removing prefixes and postfixes
- Add cleaning bricks for extracting text before and after a pattern
- Add staging brick for Datasaur
- Added brick to convert an ISD dictionary to a list of elements
- Update
PDFDocument
to use thefrom_file
method - Added staging brick for CSV format for ISD (Initial Structured Data) format.
- Added staging brick for separating text into attention window size chunks for
transformers
. - Added staging brick for LabelBox.
- Added ability to upload LabelStudio predictions
- Added utility function for JSONL reading and writing
- Added staging brick for CSV format for Prodigy
- Added staging brick for Prodigy
- Added ability to upload LabelStudio annotations
- Added text_field and id_field to stage_for_label_studio signature
- Initial release of unstructured