Skip to content

gh-136063: fix various quadratic worst-time complexities in _header_value_parser.py [WIP] #134947

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

picnixz
Copy link
Member

@picnixz picnixz commented May 30, 2025

This is a work in progress (I need to go now) but I'll continue tomorrow. I want to add tests, and some other places are still not fixed because I didn't find a straightforward fix.

@picnixz picnixz changed the title gh-134873: fix various quadratic worst-time complexity in _header_value_parser.py [WIP] gh-134873: fix various quadratic worst-time complexities in _header_value_parser.py [WIP] May 30, 2025
@picnixz picnixz added type-security A security issue needs backport to 3.9 only security fixes needs backport to 3.10 only security fixes needs backport to 3.11 only security fixes needs backport to 3.12 only security fixes needs backport to 3.13 bugs and security fixes needs backport to 3.14 bugs and security fixes labels May 30, 2025
@picnixz
Copy link
Member Author

picnixz commented May 31, 2025

Urgh, so there is still a quadratic complexity that I need to think about, it's when we alternate if-branches. For instance in get_phrase:

    try:
        token, value = get_word(value)
        phrase.append(token)
    except errors.HeaderParseError:
        phrase.defects.append(errors.InvalidHeaderDefect(
            "phrase does not start with word"))
    while value and value[0] not in PHRASE_ENDS:
        if value[0]=='.':
            phrase.append(DOT)
            phrase.defects.append(errors.ObsoleteHeaderDefect(
                "period in 'phrase'"))
            value = value[1:]
        else:
            try:
                token, value = get_word(value)
            except errors.HeaderParseError:
                if value[0] in CFWS_LEADER:
                    token, value = get_cfws(value)
                    phrase.defects.append(errors.ObsoleteHeaderDefect(
                        "comment found without atom"))
                else:
                    raise
            phrase.append(token)
    return phrase, value

The

if value[0]=='.':
    phrase.append(DOT)
    phrase.defects.append(errors.ObsoleteHeaderDefect(
        "period in 'phrase'"))
    value = value[1:]

is quadratic if we're processing multiple times .. However, if I have something like 'a + '.a' * 100, then the if branch still requires a copy every two iterations, whatever I put inside. Even if the length reduces at each iteration, it doesn't sufficiently reduce. I'll need to think a bit more.

One idea would have been to use a deque to prevent a copy when stripping parts, but then this requires to reconstruct a deque everytime.

Maybe switch to a stateful parser? That way, we shouldn't have high complexity and we should be fine. But this requires a complete rewrite of this module.

@picnixz
Copy link
Member Author

picnixz commented May 31, 2025

Ok, this patch still fixes some cases but not all. Cases where two branches alternate would still be subject to O(n²) complexities (unless we avoid the copy in .lstrip() or in [1:], it's not possible to avoid this with .lstrip() or slices since they still need O(n) to copy the rest of the string).

The advantage of .lstrip() over slices is essentially when the if branch is executed more than once before going to the else branch (namely, we can batch-process some characters). For instance, "a" + "." * 50000 is parsed using lstrip() in O(n) instead of O(n²). However, "a" + ".a" * 50000 is still subject to O(n²) parsing.

@picnixz
Copy link
Member Author

picnixz commented May 31, 2025

@serhiy-storchaka I'm a bit stuck here. I don't really have a better idea than to rewrite the module where we would use a deque to hold the current value. That way, I can call .popleft() to drop prefixed chars. Unfortunately, this also means that cannot really return a string anymore as the module is used and signatures are actually in _typeshed: https://github.com/python/typeshed/blob/main/stdlib/email/_header_value_parser.pyi.

So, to ensure backward compatibility, I think I'll need a new module... I can't think of another solution instead of entirely rewriting the logic so that we don't have un-necessary slice so your help would be appreciated, TiA!


I thought about holding the current "index" where the parser stopped but again, I don't think it'll be sufficient as I'll still need to make slices at some point to extract some values to hold (OTOH, using a deque allows me to move some characters to elsewhere without having to copy the string twice, though I'll still need a ''.join() on the part that is being stored).

def get_something(value):
	storage = Storage()
	head = []
	while cond(value):
	    head.append(value.popleft())
	storage.value = ''.join(head)
	return storage, value

@serhiy-storchaka serhiy-storchaka self-requested a review June 1, 2025 06:06
johnzhou721

This comment was marked as resolved.

@serhiy-storchaka serhiy-storchaka changed the title gh-134873: fix various quadratic worst-time complexities in _header_value_parser.py [WIP] gh-136063: fix various quadratic worst-time complexities in _header_value_parser.py [WIP] Jun 28, 2025
@serhiy-storchaka
Copy link
Member

This approach doesn't work, because the code has inherently quadratic complexity. Every function splits the input string on two parts -- the parsed part and the rest. This operation (creating a new still unparsed string) has linear complexity. Repeated, the total complexity is quadratic.

The only way to avoid quadratic complexity is to change interface of all functions. To rewrite the code in a style similar to the re parser or HTMLParser. I'll take care of this when I'm done with HTMLParser.

@picnixz
Copy link
Member Author

picnixz commented Jun 28, 2025

The only way to avoid quadratic complexity is to change interface of all functions. To rewrite the code in a style similar to the re parser or HTMLParser. I'll take care of this when I'm done with HTMLParser.

Yeah that's what I feared and thus I didn't start this work until you confirmed it. Thanks for taking care of this as it would take me too much time.

@picnixz picnixz closed this Jun 28, 2025
@picnixz picnixz deleted the fix/email/dos-134873 branch June 28, 2025 12:02
@picnixz picnixz removed type-security A security issue needs backport to 3.9 only security fixes needs backport to 3.10 only security fixes needs backport to 3.11 only security fixes needs backport to 3.12 only security fixes needs backport to 3.13 bugs and security fixes labels Jun 28, 2025
@picnixz picnixz removed the needs backport to 3.14 bugs and security fixes label Jun 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants