Currently there is a fair bit of overlap between the `PlainTextExtractor` and `WebPagesExtractor`. Really, the only difference between them now is the name of the content/text column, and `WebPagesExtractor` has some additional columns.
I propose that `PlainTextExtractor` moves to something that is more in the spirit of its name. It should run `RemoveHTMLDF`, `RemoveHTTPHeaderDF`, and a DataFrame version of `ExtractBoilerpipeTextRDD`, and output a single column (CSV or Parquet), or possibly a single text file.
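To make the proposed shape concrete, here is a rough, standard-library-only Python sketch of the pipeline: strip the HTTP header, strip the HTML, and emit one text column. Everything in it is an assumption for illustration. `remove_http_header`, `remove_html`, `plain_text_rows`, and `to_single_column_csv` are hypothetical stand-ins rather than the toolkit's `RemoveHTTPHeaderDF` and `RemoveHTMLDF`, the column name `content` is a placeholder, and the boilerplate-removal step is left out because it has no standard-library equivalent.

```python
import csv
import io
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def remove_http_header(content):
    """Hypothetical stand-in: drop everything up to the first blank line
    when the record starts with an HTTP status line."""
    if content.startswith("HTTP/"):
        _head, sep, body = content.partition("\r\n\r\n")
        return body if sep else content
    return content


def remove_html(content):
    """Hypothetical stand-in: keep only text nodes, with whitespace collapsed."""
    collector = _TextCollector()
    collector.feed(content)
    collector.close()
    return " ".join(" ".join(collector.parts).split())


def plain_text_rows(records):
    """Apply both clean-up steps to each raw record."""
    return [remove_html(remove_http_header(r)) for r in records]


def to_single_column_csv(rows, column="content"):
    """Serialize the cleaned text as a CSV with a single column."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([column])
    for row in rows:
        writer.writerow([row])
    return buf.getvalue()
```

The point of the sketch is only the output shape: one row per record, one column of cleaned text, which is what would make the extractor easy to feed straight into text analysis.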
Yes, that's a great idea @ruebot. I think that's more in the spirit of its name, and you could imagine it fitting into a pipeline through to text analysis better than WebPagesExtractor does.