Ristellise commented Nov 17, 2025

I typically don't do PRs, so I'll keep it short...

Why remove bs4 and replace it with a new library?

selectolax is my daily driver for work and various projects. It feels much faster than bs4, even with lxml.

It isn't just a feeling, either: based on this benchmark [10 seconds (bs4 + lxml) vs ~2.5 seconds (selectolax)], it is roughly 4x faster.

I needed HTML to Markdown and the rest of the pipe was already selectolax; didn't feel like round-tripping through bs4 just to be polite.

Changes

  • Replace the bs4 backend with selectolax (plus helper functions for things selectolax lacks)
  • ruff format

Battletested?

Kinda. Threw some wiki-shaped HTML at it and the results look fine. I didn't do any exact regression testing.

Version & Status

As I'm pulling out bs4, I've also bumped it to 2.0.0 and left it in draft until someone braver signs off.

Is this related to Scrapling?

No. This is independent of the issue that was posted. It was on my hard drive before that.

Tests...

There's a bunch of failing tests at the moment. I've listed some notable ones:

  1. assert md('<![CDATA[foobar]]>') == 'foobar' fails. Within selectolax, the CDATA section is completely ignored and never iterated over. (By design?)
  2. selectolax / lexbor appears to strip some whitespace, which causes mismatches in tests that check exact spacing.
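
For the CDATA case, one possible workaround (not part of this PR, just a sketch) would be to unwrap CDATA sections into escaped text before the HTML reaches the parser, so the content survives regardless of how the parser treats CDATA. The helper name `unwrap_cdata` is hypothetical, and whether this matches the behavior the existing test expects in every case is an assumption:

```python
import html
import re

# Matches a CDATA section and captures its raw contents
# (non-greedy, and allowed to span newlines).
_CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def unwrap_cdata(markup: str) -> str:
    """Replace each CDATA section with its HTML-escaped contents.

    Hypothetical pre-processing step: the result would be passed to the
    HTML parser instead of the original markup.
    """
    return _CDATA_RE.sub(
        lambda m: html.escape(m.group(1), quote=False),
        markup,
    )


print(unwrap_cdata("<![CDATA[foobar]]>"))
```

Escaping the contents keeps characters such as `<` and `&` from being reinterpreted as markup once the CDATA wrapper is gone.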

Ristellise commented Nov 17, 2025

I've fixed most of the tests. For now, the failing tests are:

FAILED tests/test_advanced.py::test_chomp - AssertionError: assert ' ' == '  '
FAILED tests/test_advanced.py::test_special_tags - AssertionError: assert '' == 'foobar'
FAILED tests/test_args.py::test_strip_document - AssertionError: assert '\n\nHello\n\n' == 'Hello'
FAILED tests/test_args.py::test_strip_pre - AssertionError: assert '\n\n```\n  Hello\n```\n\n' == '```\n  Hello\n```'
FAILED tests/test_basic.py::test_whitespace - AssertionError: assert 'a b c ' == ' a b c '
FAILED tests/test_conversions.py::test_br - AssertionError: assert 'foo  \nbar' == ' foo bar |'
FAILED tests/test_conversions.py::test_spaces - AssertionError: assert '\n\n1. x\n2. y\n\n' == '\n\n1. x\n2. y\n'
FAILED tests/test_escaping.py::test_misc - AssertionError: assert '1\\. x' == ' 1\\. x'
FAILED tests/test_lists.py::test_ol - AssertionError: assert '\n\n1. a\n- b\n' == '\n\n1. a\n2. b\n'
FAILED tests/test_tables.py::test_table - AssertionError: assert '\n\n| Firstn...on | 94 |\n\n' == '\n\n| Firstn...on | 94 |\n\n'
FAILED tests/test_tables.py::test_table_infer_header - AssertionError: assert '\n\n| Firstn...on | 94 |\n\n' == '\n\n| Firstn...on | 94 |\n\n'
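
Most of these are whitespace mismatches, and for the two table tests the truncated output makes both sides look identical. A small helper like the one below (hypothetical, not part of the test suite) can point at the first differing position when comparing the converter output with the expected string by hand:

```python
def first_difference(actual: str, expected: str) -> str:
    """Describe the first position at which two strings differ."""
    # Walk both strings in lockstep and report the first mismatching character.
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return f"index {i}: actual {a!r} != expected {e!r}"
    # No mismatch in the shared prefix, so any difference is in the length.
    if len(actual) != len(expected):
        i = min(len(actual), len(expected))
        return f"index {i}: lengths differ ({len(actual)} vs {len(expected)})"
    return "strings are identical"


# Values taken from the test_whitespace failure above.
print(first_difference("a b c ", " a b c "))
```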

I'll continue looking at the tests later.
