Replace bs4 with selectolax (lexbor backend) #248
Draft
I typically don't do PRs, so I'll keep it short...
Why remove bs4 and replace it with some new library?
selectolax is my daily driver for work and various projects. It feels much faster than bs4, even with lxml.
It's not just a feeling, either: based on this benchmark [10 seconds (bs4 lxml) vs ~2.5 seconds (selectolax)], it is roughly 4x faster.
I needed HTML to Markdown and the rest of the pipe was already selectolax; didn't feel like round-tripping through bs4 just to be polite.
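For anyone who wants to sanity-check a speed claim like the one above on their own input, a minimal timing harness could look like the sketch below. It is not the linked benchmark: `compare_converters` and `speedup` are hypothetical helper names, and the converter callables are whatever bs4-based and selectolax-based functions you pass in.

```python
import timeit
from functools import partial
from typing import Callable, Dict


def compare_converters(
    converters: Dict[str, Callable[[str], str]],
    html: str,
    repeat: int = 5,
    number: int = 100,
) -> Dict[str, float]:
    """Return the best time in seconds for `number` calls of each converter."""
    results = {}
    for name, convert in converters.items():
        # Take the minimum over several runs to reduce noise from the machine.
        timings = timeit.repeat(partial(convert, html), repeat=repeat, number=number)
        results[name] = min(timings)
    return results


def speedup(old_seconds: float, new_seconds: float) -> float:
    """How many times faster the new implementation is than the old one."""
    return old_seconds / new_seconds


# The figures quoted above work out to a 4x speedup.
print(speedup(10.0, 2.5))
```

Usage would be something like `compare_converters({"bs4": old_md, "selectolax": new_md}, html)`, where `old_md` and `new_md` are placeholders for the two converter functions being compared.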
Changes
ruff format
Battletested?
Kinda. Threw some wiki-shaped HTML at it and the results look fine. I didn't do any exact regression testing.
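If someone wants to close that gap, one rough way to do exact regression testing is to run the same HTML samples through the old and new converters and diff the Markdown. The sketch below assumes two callables are available; `diff_outputs`, `old_md`, and `new_md` are hypothetical names, not part of this PR.

```python
import difflib
from typing import Callable, Iterable, List


def diff_outputs(
    old_md: Callable[[str], str],
    new_md: Callable[[str], str],
    samples: Iterable[str],
) -> List[str]:
    """Return unified-diff lines for every sample where the two outputs differ."""
    report: List[str] = []
    for i, html in enumerate(samples):
        old_out, new_out = old_md(html), new_md(html)
        if old_out != new_out:
            report.extend(
                difflib.unified_diff(
                    old_out.splitlines(),
                    new_out.splitlines(),
                    fromfile=f"bs4[{i}]",
                    tofile=f"selectolax[{i}]",
                    lineterm="",
                )
            )
    return report
```

An empty list means the two converters agree on every sample; anything else is a ready-made list of differences to review.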
Version & Status
As I'm pulling out bs4, I've also bumped the version to 2.0.0 and left the PR in draft until someone braver signs off.
Is this related to Scrapling?
No. This is independent of the issue that was posted. It was on my hard drive before that.
Tests...
There's a bunch of failing tests at the moment. I've listed some notable ones:
assert md('<![CDATA[foobar]]>') == 'foobar' fails. Within selectolax, the CDATA comment is completely ignored / not iterated. (By design?)