Update PlainTextExtractor to just extract text #452

ruebot · 2020-04-21T21:33:08Z

Currently there is a fair bit of overlap between the PlainTextExtractor and WebPagesExtractor. Really, the only differences between them now are the name of the content/text column and the additional columns that WebPagesExtractor outputs.

I propose that PlainTextExtractor moves to something that is more in the spirit of its name. It should run RemoveHTMLDF, RemoveHTTPHeaderDF, a DataFrame version of ExtractBoilerpipeTextRDD, and output a single column (csv or parquet), or possibly a single text file.
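To make the proposed flow concrete, here is a minimal sketch in plain Python rather than the toolkit's own Scala/DataFrame code. All function names (`remove_http_header`, `remove_html`, `extract_plain_text`, `to_single_column_csv`) and the `content` output column name are hypothetical stand-ins, not the toolkit's API. The boilerplate-removal step is deliberately left out, since it depends on an external library.

```python
import csv
import io
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    _SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces, then collapse runs of whitespace.
        # Simplification: this can split a word that spans inline tags.
        return " ".join(" ".join(self._chunks).split())


def remove_http_header(content):
    """Hypothetical stand-in: drop the HTTP header block, if one is present."""
    if not content.startswith("HTTP/"):
        return content
    for sep in ("\r\n\r\n", "\n\n"):
        _head, found, body = content.partition(sep)
        if found:
            return body
    return content  # no header/body separator found; leave unchanged


def remove_html(content):
    """Hypothetical stand-in: strip markup and keep only visible text."""
    parser = _TextCollector()
    parser.feed(content)
    parser.close()
    return parser.text()


def extract_plain_text(records):
    """Apply header removal, then HTML removal, to each raw record."""
    return [remove_html(remove_http_header(r)) for r in records]


def to_single_column_csv(texts, column="content"):
    """Serialize the extracted text as a one-column CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([column])
    for t in texts:
        writer.writerow([t])
    return buf.getvalue()
```

The point of the sketch is the shape of the output: each record is reduced to one text value, so the result fits in a single column and can be handed directly to downstream text analysis.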

ianmilligan1 · 2020-04-21T21:40:13Z

Yes, that's a great idea @ruebot - I think that's more in the spirit of its name, and you could imagine it fitting into a pipeline through to text analysis better than WebPagesExtractor does.

ruebot added enhancement Scala DataFrames App labels Apr 21, 2020

ruebot self-assigned this Apr 21, 2020

ruebot added a commit that referenced this issue Apr 21, 2020

Update PlainTextExtractor to output a single column; text.

5ef2920

- Resolves #452
- PlainTextExtractor runs RemoveHTML, and ExtractBoilerplate on `content`
- Update test

ruebot mentioned this issue Apr 21, 2020

Update PlainTextExtractor to output a single column; text. #453

Merged

ianmilligan1 closed this as completed in #453 Apr 22, 2020

ianmilligan1 pushed a commit that referenced this issue Apr 22, 2020

Update PlainTextExtractor to output a single column; text. (#453)

e91d01f

- Resolves #452
- PlainTextExtractor runs ExtractBoilerplate on `content`
- Update test
