Doc binary object extraction #304

ruebot · 2019-01-31T19:50:35Z

Using the image extraction process as a basis, our next set of binary object extractions will be documents. This issue is meant to focus specifically on doc, docx, and odt.

There may be some tweaks to this depending on the outcome of #298.
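As a rough illustration of the scope described above, here is a minimal sketch of selecting doc, docx, and odt records by MIME type. It is not the aut implementation: the function name `select_word_processor_records` and the `(url, mime_type, payload)` record shape are hypothetical, and only the three MIME type strings are standard registrations for these formats.

```python
# Hypothetical sketch, not the aut API. The MIME types below are the
# standard registrations for the three formats named in this issue.
WORD_PROCESSOR_MIME_TYPES = {
    "application/msword": "doc",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
    "application/vnd.oasis.opendocument.text": "odt",
}


def select_word_processor_records(records):
    """Yield (url, extension, payload) for word processor records.

    `records` is assumed to be an iterable of (url, mime_type, payload)
    tuples; this shape is illustrative only.
    """
    for url, mime_type, payload in records:
        # Look up the file extension; None means the type is out of scope.
        extension = WORD_PROCESSOR_MIME_TYPES.get(mime_type)
        if extension is not None:
            yield url, extension, payload
```

Keeping the MIME types in a single mapping means other document families (spreadsheets, presentations) could be handled by swapping in a different table rather than changing the filter logic.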


- Add WordProcessor DF and binary extraction
- Add Spreadsheets DF and binary extraction
- Add Presentation Program DF and binary extraction
- Add tests for new DF and binary extractions
- Add test fixture for new DF and binary extractions
- Resolves #303
- Resolves #304
- Resolves #305
- Back out 39831c2 (We _might_ not have to do this)

@jrwiebe

- Add Word Processor DF and binary extraction
- Add Spreadsheets DF and binary extraction
- Add Presentation Program DF and binary extraction
- Add Text files DF and binary extraction
- Add tests for new DF and binary extractions
- Add test fixtures for new DF and binary extractions
- Resolves #303
- Resolves #304
- Resolves #305
- Use aut-resources repo to distribute our shaded tika-parsers 1.22
- Close TikaInputStream
- Add RDD filters on MimeTypeTika values
- Add CodeCov configuration yaml
- Includes work by @jrwiebe, see #346 for all commits before squash

- Address #190
- Address #259
- Address #302
- Address #303
- Address #304
- Address #305
- Address #306
- Address #307

* Add binary extraction DataFrames to PySpark.
  - Address #190
  - Address #259
  - Address #302
  - Address #303
  - Address #304
  - Address #305
  - Address #306
  - Address #307
  - Resolves #350
  - Update README

ruebot added enhancement Scala feature DataFrames labels Jan 31, 2019

jrwiebe mentioned this issue Aug 2, 2019

Spreadsheet binary object extraction #303

Closed

ruebot self-assigned this Aug 14, 2019

ruebot mentioned this issue Aug 15, 2019

Add office document binary extraction. #346

Merged

ianmilligan1 closed this as completed in #346 Aug 16, 2019

ruebot added a commit that referenced this issue Aug 20, 2019

Add binary extraction DataFrames to PySpark.

1176fd5

- Address #190
- Address #259
- Address #302
- Address #303
- Address #304
- Address #305
- Address #306
- Address #307

ruebot mentioned this issue Aug 20, 2019

Add binary extraction DataFrames to PySpark. #350

Merged
