
Speed up parsing notation LLSD #1


Merged: 11 commits into secondlife:main, Nov 17, 2022

Conversation

@SaladDais (Contributor) commented Sep 28, 2022

Thanks for splitting out this code!

As part of working with the LEAP API, I need to parse very large notation LLSD payloads containing a large number of strings. Parsing those payloads is currently heavily bottlenecked by the code handling delimited strings.

For example, invoking the LLWindowListener::getPaths() method over LEAP returns a 5.5 MB blob of LLSD (assuming you're logged in with a relatively bare inventory), and that payload takes 2.5s to parse with llsd.parse(). Since latency is important when dealing with LEAP, I looked at how ujson implements its delimited string handling to see how llsd's string handling logic could be improved.

I ended up:

  • Removing use of self._getc() and reading from self._buffer directly. This was the most significant improvement: since we need to read a byte at a time, creating single-byte byte strings is inefficient. Not going through a potentially overridden self._getc() shouldn't be an issue, since LLSDBinaryParser has done manual manipulation of self._index for a long time now. Basically, self._getc() must operate on self._buffer and self._index for the parsers to work correctly, so bypassing the abstraction in this particular case shouldn't break any existing uses.
  • Removing self references inside the loop to avoid the attribute lookup cost
  • Slightly reordering the logic to do fewer comparisons per iteration in the common case of non-special characters
  • Using a pre-allocated decode buffer for all string parses, only switching to an expanded, dynamically-allocated buffer when the string's size exceeds that of the shared buffer (see the sketch below this list). This avoids the costly dynamic allocations associated with doing my_str.append(char) in a hot loop.
  • Precalculating as much as possible to avoid function calls in the hot loop (the value of ord('\\'), etc.)
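
Taken together, here's a rough, hypothetical sketch of what the hot loop can look like with those changes combined (names like _parse_string_delim and the abbreviated escape table are illustrative stand-ins, not the exact code in this PR):

# Simplified, hypothetical sketch of the optimized delimited-string loop.
# The real parser handles the full escape table, \xNN sequences, and parse
# errors; this only shows the buffer/lookup strategy described above.
_DECODE_BUFF_ALLOC_SIZE = 1024
_BACKSLASH_ORD = ord("\\")
_SIMPLE_ESCAPES = {ord("n"): ord("\n"), ord("t"): ord("\t")}  # abbreviated

def _parse_string_delim(buffer, index, delim):
    """Parse a delimited string from `buffer` starting at `index`."""
    # Locals only inside the loop: no self._getc() calls, no repeated
    # attribute lookups, and a pre-allocated scratch buffer for the output.
    decode_buff = bytearray(_DECODE_BUFF_ALLOC_SIZE)
    insert_idx = 0
    delim_ord = ord(delim)

    while True:
        cc = buffer[index]
        index += 1
        if cc == _BACKSLASH_ORD:
            # Consume the escaped byte immediately rather than re-checking an
            # "in escape" flag on every iteration.
            cc = _SIMPLE_ESCAPES.get(buffer[index], buffer[index])
            index += 1
        elif cc == delim_ord:
            break
        if insert_idx >= len(decode_buff):
            # The string outgrew the shared buffer; fall back to a larger one.
            decode_buff.extend(b"\x00" * _DECODE_BUFF_ALLOC_SIZE)
        decode_buff[insert_idx] = cc
        insert_idx += 1

    return decode_buff[:insert_idx].decode("utf-8"), index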

Given a script like

import llsd
with open("large_getPaths_resp.llsd", "rb") as f:
    llsd.parse(f.read())

the script now takes ~0.7s to complete rather than the 2.5s it previously took.

Even bigger improvements are possible, but they would either change how invalid escapes like \p are handled or would require native code.

Checking if we're currently within an escape sequence every single
loop iteration is wasteful.
This avoids an unnecessary alloc in the common case of strings
under 1024 chars.
@SaladDais (Contributor Author):

A slightly more typical example: parsing a 5.5 MB notation LLSD serialized inventory file went from 2.7s to 1.8s, mostly due to the map keys parsing more quickly.

@SaladDais (Contributor Author) commented Sep 28, 2022

Looks like there are some Py2.7 failures; I'll check in a local pipenv.

@nat-goodspeed (Contributor):

Tangential, but of interest to me: what if LEAP used binary LLSD format instead of notation format? Would that be faster, or a wash? (If it's not yet faster, could LLSD binary parsing be optimized further than notation parsing?)

Even without a viewer that emits that, you could test by converting your large_getPaths_resp.llsd or other test files to binary LLSD format.
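
For anyone wanting to try that comparison, one quick way to produce a binary-format copy of a notation test file is to round-trip it through python-llsd (the file names here are placeholders; llsd.parse and llsd.format_binary are the public functions involved):

import llsd

# Round-trip the notation payload into binary LLSD for a timing comparison.
with open("large_getPaths_resp.llsd", "rb") as f:
    data = llsd.parse(f.read())

with open("large_getPaths_resp.binary.llsd", "wb") as f:
    f.write(llsd.format_binary(data))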

@SaladDais (Contributor Author):

> what if LEAP used binary LLSD format instead of notation format? Would that be faster, or a wash?

Oh absolutely faster. A binary LLSD serialized version of the large_getPaths_resp.llsd takes about 0.1s to parse on my PC. I think in this specific case that's mostly due to the fact that the delimited string form isn't really used in binary LLSD even though it's technically allowed there.

Even if you keep using notation form but make the viewer's formatter use the s(len)"val" length-prefixed string form, python-llsd can parse that in 0.15s. Still, notation isn't ideal for potentially large and frequent payloads due to all the complex branching on individual bytes required (parsing numbers is still complicated), so binary LEAP could be very helpful perf-wise.
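
To make the two forms concrete, here's a tiny hypothetical payload showing the same map written with both notation string forms (both should parse to the same value):

import llsd

# 'some string' is the delimited form (scanned byte-by-byte for escapes);
# s(11)"some string" is the length-prefixed form (read with a single slice).
delimited = b"{'key':'some string'}"
prefixed = b"{'key':s(11)\"some string\"}"
assert llsd.parse_notation(delimited) == llsd.parse_notation(prefixed)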

`bytes` on PY2 are really `str`s, so `val[n]` returns a character
rather than an integer. Wrapping bytes values on PY2 helps us
preserve semantics without requiring branching on Python version
or version-specific lambdas to peek from `self._buffer`.
@SaladDais (Contributor Author) commented Sep 28, 2022

That should fix the PY2 issues; I forgot that bytes_val[n] returns something different there. I considered instead creating helper lambdas for peek()ing inside _parsed_delimited_string() based on the Python version, but that code path is so hot that doing so added 300ms to a 1.2s parse time. The wrapper I added shouldn't have any effect other than a slightly useless initial copy under PY2.
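
For context, a minimal sketch of the kind of wrapper being described (illustrative only; the helper name is made up, not the actual code in this PR):

import sys

def _to_int_indexable(buffer):
    # On PY2, bytes is str and buffer[n] yields a 1-char str; copying into a
    # bytearray gives PY3-style integer indexing at the cost of one copy.
    if sys.version_info[0] == 2:
        return bytearray(buffer)
    return buffer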

# Any additions may now overflow the decoding buffer, make
# a new expanded buffer containing the existing contents.
decode_buff = bytearray(decode_buff)
decode_buff.extend(b"\x00" * _DECODE_BUFF_ALLOC_SIZE)
@nat-goodspeed (Contributor):

Would it be faster still to do something like:

self._decode_buffs = [bytearray(_DECODE_BUFF_ALLOC_SIZE)]
# ...
buff_idx = 0
decode_buff = self._decode_buffs[buff_idx]
# ...
try:
    decode_buff[insert_idx] = cc
except IndexError:
    buff_idx += 1
    try:
        decode_buff = self._decode_buffs[buff_idx]
    except IndexError:
        decode_buff = bytearray(_DECODE_BUFF_ALLOC_SIZE)
        self._decode_buffs.append(decode_buff)
    insert_idx = 0
    decode_buff[insert_idx] = cc
insert_idx += 1
# ...
try:
    return b''.join(self._decode_buffs[:-1] + [decode_buff[:insert_idx]]).decode('utf-8')
except UnicodeDecodeError as exc:
    # ...

@nat-goodspeed (Contributor):

I guess it depends on how many oversized strings you expect in the input stream. The code you submitted leaves you with a single larger buffer, prepared to handle many oversized strings without further allocations. My suggestion above should expand faster the first time, but requires consolidating multiple bytearrays for each oversized string. Plus which it's admittedly more complex.

I still suggest the code you wrote would be improved by catching IndexError on decode_buff assignment, though, rather than testing insert_idx as shown. Also I suspect there's a bug in your test.

Say you're working on the second oversized string in the same input stream, so len(self._decode_buff) == 2*_DECODE_BUFF_ALLOC_SIZE. You just inserted the byte at (_DECODE_BUFF_ALLOC_SIZE - 1) and incremented insert_idx. Now, although insert_idx == len(decode_buff)/2, insert_idx % _DECODE_BUFF_ALLOC_SIZE in fact equals 0, so you extend decode_buff -- resulting in _DECODE_BUFF_ALLOC_SIZE bytes of garbage in the buffer.

@nat-goodspeed (Contributor):

> The code you submitted leaves you with a single larger buffer, prepared to handle many oversized strings without further allocations.

Whoops, not true: your expanded decode_buff isn't stored in self._decode_buff, so subsequent oversized strings would in fact require a new expanded decode_buff.

@SaladDais (Contributor Author):

> Whoops, not true: your expanded decode_buff isn't stored in self._decode_buff, so subsequent oversized strings would in fact require a new expanded decode_buff.

Yep, my reasoning there was that I sometimes use long-lived LLSD*Parser objects, and I was concerned about keeping a large scratch buffer around just because I'd previously parsed a large string.

> except IndexError:

That is a good point! I forgot that in Python it's much less expensive to handle the (potential) IndexError than to check whether we're about to overflow; the internal IndexError check is priced in and you don't really get to opt out anyway. Doing an opportunistic assignment and doing my buffer copy in the except IndexError handler shaves off about 100ms! I think this strategy ends up being easier to read as well (I didn't feel great about the %).

The buffer copy itself isn't terribly expensive relative to juggling + concatenating multiple buffers though. Buffer concatenation was a tiny bit slower for payloads containing strings mostly under 1024 bytes.
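
For reference, the shape of that opportunistic-assignment pattern (a hypothetical excerpt of the loop body; decode_buff, insert_idx, cc, and _DECODE_BUFF_ALLOC_SIZE follow the names used earlier in this thread, not the exact merged code):

try:
    # Optimistically write; the common case never overflows.
    decode_buff[insert_idx] = cc
except IndexError:
    # Rare path: copy into a larger scratch buffer, then retry the write.
    decode_buff = bytearray(decode_buff)
    decode_buff.extend(b"\x00" * _DECODE_BUFF_ALLOC_SIZE)
    decode_buff[insert_idx] = cc
insert_idx += 1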

@bennettgoble added the enhancement label on Sep 28, 2022
@bennettgoble (Member) commented Oct 27, 2022

Hello, @SaladDais. Thanks for this improvement! Could you pull the latest changes from main in order to pick up our newly added CLA behavior?

@github-actions bot commented Oct 27, 2022

CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅

@SaladDais (Contributor Author):

I have read the CLA Document and I hereby sign the CLA

@SaladDais (Contributor Author):

recheck

@bennettgoble (Member):

@nat-goodspeed are there any unresolved threads here?

@nat-goodspeed (Contributor) left a review:

The CLA was the only blocker. Good to go!

@bennettgoble merged commit eea1b9a into secondlife:main on Nov 17, 2022
@github-actions bot locked and limited conversation to collaborators on Nov 17, 2022
@bennettgoble (Member):

This has shipped in v1.1.0. 🎊
