Update PlainTextExtractor to just extract text #452

Closed
ruebot opened this issue Apr 21, 2020 · 1 comment · Fixed by #453

Comments

@ruebot
Member

ruebot commented Apr 21, 2020

Currently there is a fair bit of overlap between the PlainTextExtractor and WebPagesExtractor. Really, the only difference between them now is the name of the content/text column, and that WebPagesExtractor has some additional columns.

I propose that PlainTextExtractor moves to something that is more in the spirit of its name. It should run RemoveHTMLDF, RemoveHTTPHeaderDF, a DataFrame version of ExtractBoilerpipeTextRDD, and output a single column (csv or parquet), or possibly a single text file.
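
To make the proposed shape concrete, here is a rough sketch in plain Python of "strip the HTTP header, strip the HTML, emit a single column". It is not the toolkit's code: `remove_http_header`, `remove_html`, and `extract_plain_text` are hypothetical stand-ins for RemoveHTTPHeaderDF and RemoveHTMLDF, the `content` column name is an assumption, and the boilerplate-removal step is left out entirely.

```python
import csv
import io
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def remove_http_header(record):
    # Hypothetical stand-in: if the record starts with an HTTP status line,
    # drop everything up to and including the first blank line.
    if record.startswith("HTTP/"):
        _, sep, body = record.partition("\r\n\r\n")
        if sep:
            return body
    return record


def remove_html(markup):
    # Hypothetical stand-in: keep only text nodes and collapse whitespace.
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join(" ".join(collector.parts).split())


def extract_plain_text(records):
    # Write one row per record into a single-column CSV
    # (the "content" column name is assumed).
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["content"])
    for record in records:
        writer.writerow([remove_html(remove_http_header(record))])
    return out.getvalue()
```

Calling `extract_plain_text` on a list of raw response strings returns CSV text with one header row and one text-only row per record, which is roughly the single-column output described above.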

@ianmilligan1
Member

Yes, that's a great idea @ruebot - I think that's more in the spirit of its name, and you could imagine using it in a pipeline through to text analysis better than WebPagesExtractor.

ruebot added a commit that referenced this issue Apr 21, 2020
- Resolves #452
- PlainTextExtractor runs RemoveHTML and ExtractBoilerplate on `content`
- Update test
ianmilligan1 pushed a commit that referenced this issue Apr 22, 2020
- Resolves #452
- PlainTextExtractor runs ExtractBoilerplate on `content`
- Update test