
Regression on parsing invalid URLs #2382

@kamil-certat

Description


As a continuation of #2377, we have a regression on parsing invalid URLs. Previously, urllib was much more liberal in processing URLs; now it rejects many more cases.

We use it to sanitize URLs, and html_parser is an example of a bot whose tests rely on the liberal behavior:

EXAMPLE_EVENT2['source.url'] = "http://[D] lingvaworld.ru/media/system/css/messg.jpg"

def test_event_without_split(self):
    self.sysconfig = {"columns": ["time.source", "source.url", "malware.hash.md5",
                                  "source.ip", "__IGNORE__"],
                      "skip_head": True,
                      "default_url_protocol": "http://",
                      "type": "malware-distribution"}
    self.run_bot()
    self.assertMessageEqual(0, EXAMPLE_EVENT2)

In patched Python versions (e.g. 3.11.4), this URL is rejected. We need to either decide against allowing such URLs, or redesign our sanitization.
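For illustration, the difference can be reproduced with urllib directly. This is a minimal sketch; the exact error message and the affected versions depend on the Python patch level:

from urllib.parse import urlsplit

url = "http://[D] lingvaworld.ru/media/system/css/messg.jpg"

# Unpatched Python accepts the URL and returns the bracketed text as part
# of the netloc. Patched versions (e.g. 3.11.4) validate bracketed hosts
# and raise ValueError, because "D" is neither an IPv6 address nor an
# IPvFuture literal.
try:
    parts = urlsplit(url)
    print("accepted:", parts.netloc, parts.path)
except ValueError as exc:
    print("rejected:", exc)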

Temporarily, the test is skipped to unblock other work.
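One possible direction for the redesign, sketched under the assumption that the sanitation should degrade gracefully instead of letting the ValueError propagate. The function sanitize_url below is hypothetical and not part of the current code base:

from typing import Optional
from urllib.parse import urlsplit, urlunsplit

def sanitize_url(value: str, default_protocol: str = "http://") -> Optional[str]:
    # Hypothetical sanitizer: return a normalized URL, or None if the value
    # cannot be parsed by the (possibly patched) urllib.
    if "://" not in value:
        value = default_protocol + value
    try:
        parts = urlsplit(value)
    except ValueError:
        # Patched urllib rejects e.g. bracketed hosts that are not valid
        # IPv6/IPvFuture literals; decide here whether to drop the value,
        # keep the raw string, or report a harmonization error instead.
        return None
    return urlunsplit(parts)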


Labels

bug, component: bots, component: core, good first issue, help wanted
