Doc binary object extraction #304

Closed
ruebot opened this issue Jan 31, 2019 · 0 comments · Fixed by #346

Comments

ruebot (Member) commented Jan 31, 2019

Using the image extraction process as a basis, our next set of binary object extractions will be documents. This issue is meant to focus specifically on doc, docx, and odt.

There may be some tweaks to this depending on the outcome of #298.
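
For illustration only, here is a rough sketch of what selecting doc, docx, and odt records by MIME type could look like. It is not the toolkit's API: `Record`, `ExtractedDocument`, and `extract_word_processor_docs` are hypothetical names, and the sketch trusts a MIME type string already attached to each record instead of running any detection itself. The three MIME type strings are the standard registered ones for these formats.

```python
import hashlib
from typing import Iterable, List, NamedTuple

# Standard MIME types for the three formats named in this issue,
# mapped to the file extension an extracted object would be saved with.
WORD_PROCESSOR_MIME_TYPES = {
    "application/msword": "doc",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
    "application/vnd.oasis.opendocument.text": "odt",
}


class Record(NamedTuple):
    """Hypothetical stand-in for one web archive record."""
    url: str
    mime_type: str
    content: bytes


class ExtractedDocument(NamedTuple):
    """Hypothetical output row: where the bytes came from and what they are."""
    url: str
    extension: str
    md5: str
    content: bytes


def extract_word_processor_docs(records: Iterable[Record]) -> List[ExtractedDocument]:
    """Keep only records whose MIME type is doc, docx, or odt."""
    extracted = []
    for record in records:
        extension = WORD_PROCESSOR_MIME_TYPES.get(record.mime_type)
        if extension is None:
            continue  # not a word processor document, skip it
        digest = hashlib.md5(record.content).hexdigest()
        extracted.append(ExtractedDocument(record.url, extension, digest, record.content))
    return extracted
```

The checksum is included only so that identical documents fetched from different URLs can be spotted; whether the real extraction carries one is an assumption here.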

ruebot self-assigned this Aug 14, 2019
ruebot added a commit that referenced this issue Aug 15, 2019
- Add WordProcessor DF and binary extraction
- Add Spreadsheets DF and binary extraction
- Add Presentation Program DF and binary extraction
- Add tests for new DF and binary extractions
- Add test fixture for new DF and binary extractions
- Resolves #303
- Resolves #304
- Resolves #305
- Back out 39831c2 (We _might_ not have to do this)
ianmilligan1 pushed a commit that referenced this issue Aug 16, 2019
- Add Word Processor DF and binary extraction
- Add Spreadsheets DF and binary extraction
- Add Presentation Program DF and binary extraction
- Add Text files DF and binary extraction
- Add tests for new DF and binary extractions
- Add test fixtures for new DF and binary extractions
- Resolves #303
- Resolves #304
- Resolves #305
- Use aut-resources repo to distribute our shaded tika-parsers 1.22
- Close TikaInputStream
- Add RDD filters on MimeTypeTika values
- Add CodeCov configuration yaml
- Includes work by @jrwiebe, see #346 for all commits before squash
ruebot added a commit that referenced this issue Aug 20, 2019
- Address #190
- Address #259
- Address #302
- Address #303
- Address #304
- Address #305
- Address #306
- Address #307
ianmilligan1 pushed a commit that referenced this issue Aug 21, 2019
* Add binary extraction DataFrames to PySpark.
- Address #190
- Address #259
- Address #302
- Address #303
- Address #304
- Address #305
- Address #306
- Address #307
- Resolves #350 
- Update README