extension: better extractors #8562

Merged
merged 3 commits into from
Nov 13, 2024

Conversation

spolu
Contributor

@spolu spolu commented Nov 8, 2024

Description

Fixes: https://github.com/dust-tt/tasks/issues/1619

  • First pass at an extractor that preserves some of the page's structure without being too verbose.
  • Partial support for Google Docs (tested; it still works on most pages).

The generic

Risk

N/A

Deploy Plan

N/A

@spolu spolu changed the title from "extension: smart-ish extractors" to "extension: better extractors" Nov 13, 2024
@spolu spolu marked this pull request as ready for review November 13, 2024 12:57
@spolu spolu requested a review from PopDaph November 13, 2024 12:57
Contributor

@PopDaph PopDaph left a comment


Tested on a couple of websites, works great!

@spolu spolu merged commit 97231b8 into main Nov 13, 2024
3 checks passed
@spolu spolu deleted the spolu-extension_extractors branch November 13, 2024 16:29
Contributor Author

spolu commented Nov 13, 2024

It's a bit more verbose than innerText, but it captures more information (for example, the content of input and textarea elements, which is a must-have for some use cases) and more structure.
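To make the trade-off above concrete, here is a minimal sketch of a structure-preserving extractor. It is not the code from this PR: the `SketchNode` shape, the tag lists, and the output format are all assumptions chosen so the example runs outside a browser. In an extension, a real DOM element would be adapted to this shape.

```typescript
// Hypothetical minimal node shape (an assumption, not the PR's types).
interface SketchNode {
  tag: string; // lowercase tag name, or "#text" for text nodes
  text?: string; // text content, for "#text" nodes
  value?: string; // current value, for input / textarea
  children?: SketchNode[];
}

// Illustrative tag lists; the real extractor may treat tags differently.
const SKIPPED_TAGS: ReadonlyArray<string> = ["script", "style", "noscript"];
const BLOCK_TAGS: ReadonlyArray<string> = ["div", "p", "section", "li", "h1", "h2", "h3"];

// Returns one output line per piece of content, indented by block nesting.
function extract(node: SketchNode, indent: string = ""): string[] {
  if (SKIPPED_TAGS.indexOf(node.tag) !== -1) {
    return [];
  }
  if (node.tag === "#text") {
    // Collapse whitespace runs so the output stays compact.
    const text = (node.text || "").replace(/\s+/g, " ").trim();
    return text ? [indent + text] : [];
  }
  if (node.tag === "input" || node.tag === "textarea") {
    // Form values are included explicitly; they are not child text nodes.
    return node.value ? [`${indent}[${node.tag}] ${node.value}`] : [];
  }
  const isBlock = BLOCK_TAGS.indexOf(node.tag) !== -1;
  const childIndent = isBlock ? indent + "  " : indent;
  let lines: string[] = [];
  for (const child of node.children || []) {
    lines = lines.concat(extract(child, childIndent));
  }
  // Only emit a block marker when the block has content, to limit verbosity.
  if (isBlock && lines.length > 0) {
    return [`${indent}<${node.tag}>`].concat(lines);
  }
  return lines;
}
```

Empty blocks and skipped tags produce no output, which is one way to keep the result from growing much beyond plain text while still recording nesting and form values.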

The verbosity can be compensated for on well-known websites with CSS selectors (e.g. selected GitHub pages, LinkedIn pages, ...).
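One way the per-site selector idea could look is sketched below. The `SITE_SELECTORS` table, the `selectorForUrl` helper, and the selectors themselves are hypothetical placeholders, not what the extension actually ships.

```typescript
// Hypothetical per-site overrides: hostname -> CSS selector for the main content.
// The selectors are placeholders for illustration only.
const SITE_SELECTORS: { [hostname: string]: string } = {
  "github.com": "main",
  "www.linkedin.com": "main",
};

// Returns the selector to scope extraction to, or null when the site has
// no override and the whole page should be extracted.
function selectorForUrl(url: string): string | null {
  const hostname = new URL(url).hostname;
  // Own-property check so inherited keys such as "constructor" never match.
  if (Object.prototype.hasOwnProperty.call(SITE_SELECTORS, hostname)) {
    return SITE_SELECTORS[hostname];
  }
  return null;
}
```

In a content script, the result could be used to pick the extraction root, for instance `document.querySelector(selector)` when a selector is returned and `document.body` otherwise.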

2 participants