SL-18330: Refactor notation parsing to manage a lookahead char. #6

nat-goodspeed · 2023-03-16T17:21:51Z

Instead of peeking ahead by a character and then (most of the time) rereading it simply to discard it, go ahead and read the next character, passing it into the various LLSDNotationParser._parse_mumble() functions. This eliminates most LLSDNotationParser use of LLSDBaseParser._peek(), save for _get_re().

This eliminates the need for LLSDNotationParser._skip_then(), whose only job was to discard the next character and return the specified value.

Break out LLSDNotationParser._parse_true() and _parse_false() in anticipation of potential reimplementation without _get_re().

Instead of peeking ahead by a character and then (most of the time) rereading it simply to discard it, go ahead and read the next character, passing it into the various LLSDNotationParser._parse_mumble() functions. This eliminates most LLSDNotationParser use of LLSDBaseParser._peek(), save for _get_re(). This eliminates the need for LLSDNotationParser._skip_then(), whose only job was to discard the next character and return the specified value. Break out LLSDNotationParser._parse_true() and _parse_false() in anticipation of potential reimplementation without _get_re().

nat-goodspeed · 2023-03-16T17:23:10Z

This PR is still a draft because I believe we could realize further notation parsing performance gains by eliminating some (all?) uses of LLSDNotationParser._get_re().

nat-goodspeed · 2023-03-16T17:24:51Z

Sigh, more infrastructure failures: codecov failing its upload. This is starting to make GitHub checks look seriously flaky.

bennettgoble · 2023-03-16T17:33:20Z

Sigh, more infrastructure failures: codecov failing its upload. This is starting to make GitHub checks look seriously flaky.

@nat-goodspeed, the only check that is having problems is specifically codecov which depends on an external service. Take a look at #7 --if this doesn't improve upload reliability we can see about pulling codecov.

codecov-commenter · 2023-03-16T18:21:14Z

Codecov Report

Merging #6 (9249fce) into main (0adae95) will increase coverage by 0.38%.
The diff coverage is 97.29%.

@@            Coverage Diff             @@
##             main       #6      +/-   ##
==========================================
+ Coverage   89.90%   90.28%   +0.38%     
==========================================
  Files           6        6              
  Lines         822      844      +22     
==========================================
+ Hits          739      762      +23     
+ Misses         83       82       -1

Impacted Files	Coverage Δ
llsd/serde_xml.py	`96.85% <80.00%> (-0.72%)`	⬇️
llsd/serde_notation.py	`91.11% <97.80%> (+1.42%)`	⬆️
llsd/base.py	`87.45% <100.00%> (+0.04%)`	⬆️
llsd/serde_binary.py	`94.28% <100.00%> (-0.05%)`	⬇️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

To further reduce LLSDNotationParser's reliance on _get_re(), change _parse_integer() to read individual characters until it sees a delimiter -- instead of using _peek(). Make LLSDBaseParser._getc() support full=False, allowing a short read when explicitly specified. This is useful for when an LLSD int ends at EOF. Add LLSDBaseParser._putback(), a convenience wrapper around seek().

and hence without using LLSDBaseParser._peek().

LLSDBinaryParser._parse_array() does not actually depend on peek() to check for close bracket ']': it has an explicit element count.

The only _peek() call was in _error(). Once _error() reads a character, we no longer care about the read position of _stream.

Remove LLSDBaseParser._peek(). All parsers now use only _getc(), with occasional _putback(), which uses seek(). Remove LLSDNotationParser._get_re(), the last consumer of _peek(). Remove serde_notation._real_regex, the last usage of _get_re(). Recast _parse_real(), the last caller of _get_re(), to scan digits et al. like a lexer. Break out _collect_sign() and _collect_digits() from _parse_integer(); these are called multiple times in new _parse_real() logic. Break out parameterized _parse_bool() from _parse_true() and _parse_false(), eliminating redundant implementations. Further break out _expect() from _parse_bool(); this is also called by new _parse_real() logic. Make LLSDBaseParser._putback() accept the bytes string it's supposed to put back, but only for its length: sometimes we call _putback() when we've reached EOF, in which case the passed string is empty. This avoids having to replicate 'if cc: self._putback()' logic everywhere. Refine LLSDBaseParser._reset() wrapper logic to wrap an existing stream in io.BufferedReader only if it's not already seekable(). That includes an incoming bytes object: wrap it only in BytesIO, which is seekable().

nat-goodspeed · 2023-03-17T19:38:07Z

Once we've iterated on this one, I believe it's time for a new release.

brad-linden · 2023-03-17T19:48:19Z

llsd/serde_binary.py

+    # Python 3
+    xrange = range
+
+


/me shakes fist at python 2

Agreed ... but llsd is perhaps the most ubiquitous of Linden's Python packages, and it's certain that there are still Python 2 consumers. I don't want any obstacles of the form: "We'd have to update to Python 3 before we could switch to the newer python-llsd implementation."

yeah not trying to derail things too much, I just wish we could pretend to be more like python3, something like

try: range = xrange # pretend to be python 3 except NameError: # hooray we are! pass

and then we could use proper range() everywhere below

Okay: 9249fce

brad-linden · 2023-03-17T19:54:14Z

llsd/serde_notation.py

+                # skip decimal point and collect fractional digits
+                cc = self._collect_digits(self._getc(full=False), fdigits)
+                digits.extend(fdigits)
+            # Fun fact: (cc in b'eE') is True even when cc is b''!


monty-linden

A few questions and points. No blockers, I think.

monty-linden · 2023-03-18T18:10:03Z

llsd/base.py


    def _error(self, message, offset=0):
        oldpos = self._stream.tell()
        # 'offset' is relative to current pos
        self._stream.seek(offset, io.SEEK_CUR)
        raise LLSDParseError("%s at byte %d: %r" %
-                             (message, oldpos+offset, self._peek(1, full=False)))
+                             (message, oldpos+offset, self._getc(1, full=False)))


New version leaves the position one byte advanced relative to original. Do we care?

I don't. This is in the LLSDBaseParser._error() function, whose whole purpose is to raise LLSDParseError. Any consumer relying on the read position of a stream passed to llsd.parse() after LLSDParseError needs to be fixed.

monty-linden · 2023-03-18T18:53:33Z

llsd/serde_binary.py

-            count += 1
-            cc = self._peek()
-        if cc != b']':
+        if self._getc() != b']':


Similar. New error condition walks over offending char, old condition left it at position to be read again. Nobody may care. (This is something that happens at several points so I'm not going to mention it anymore.)

monty-linden · 2023-03-18T19:00:12Z

llsd/serde_notation.py

        """
        This is the basic public interface for parsing.

-        :param buffer: the notation string to parse.
+        :param baseparser: LLSDBaseParser or subclass holding data to parse.


This can be base parser or subclass, bytes, seekable(), or wrapable?

Aha, thanks. d31981d

monty-linden · 2023-03-18T19:17:51Z

llsd/serde_notation.py

-        return float(self._get_re("real", _real_regex))
+        # recognize:
+        # [+-]?inf
+        # [+-]?nan


Yes: https://github.com/secondlife/python-llsd/blob/main/llsd/serde_notation.py#L11

I don't seriously expect to find +nan or -nan in existing serialized LLSD, but since the previous implementation accepted them and it would be impossible to prove no such instances exist...

The description of the parameter to parse didn't enumerate all acceptable forms.

Apparently, wrapping an incoming bytes object in io.BytesIO and passing it to xml.etree.ElementTree.parse() is slower than passing the bytes object directly to fromstring(). Detect the BytesIO case and call fromstring() with contents.

brad-linden

👍

nat-goodspeed requested review from bennettgoble and brad-linden March 16, 2023 17:21

Merge branch 'main' into sl-18330-perf

b1abf5e

nat-goodspeed added 9 commits March 16, 2023 15:37

Merge branch 'main' into sl-18330-perf

d707921

Merge branch 'main' into sl-18330-perf

358aa11

SL-18330: Merge branch 'main' of python-llsd into sl-18330-perf

02b27f0

SL-18330: Parse notation true/false without using _get_re()

16ca726

and hence without using LLSDBaseParser._peek().

Merge branch 'main' into sl-18330-perf

f06b761

SL-18330: Eliminate LLSDBaseParser._peek() from serde_binary.py.

228103e

LLSDBinaryParser._parse_array() does not actually depend on peek() to check for close bracket ']': it has an explicit element count.

SL-18330: Eliminate _peek() call from LLSDBaseParser.

6591194

The only _peek() call was in _error(). Once _error() reads a character, we no longer care about the read position of _stream.

nat-goodspeed marked this pull request as ready for review March 17, 2023 19:32

brad-linden approved these changes Mar 17, 2023

View reviewed changes

monty-linden approved these changes Mar 18, 2023

View reviewed changes

nat-goodspeed added 4 commits March 20, 2023 15:03

SL-18330: Update LLSD{Binary,Notation}Parser.parse() docstrings.

d31981d

The description of the parameter to parse didn't enumerate all acceptable forms.

SL-18330: Merge branch 'main' of python-llsd into sl-18330-perf

b83295a

SL-18330: Unify Py 2 and 3 on 'range()' not 'xrange()'

9249fce

brad-linden approved these changes Mar 21, 2023

View reviewed changes

bennettgoble approved these changes Mar 22, 2023

View reviewed changes

nat-goodspeed merged commit 0931046 into main Mar 22, 2023

nat-goodspeed deleted the sl-18330-perf branch March 22, 2023 20:36

github-actions bot locked and limited conversation to collaborators Mar 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SL-18330: Refactor notation parsing to manage a lookahead char. #6

SL-18330: Refactor notation parsing to manage a lookahead char. #6

nat-goodspeed commented Mar 16, 2023

nat-goodspeed commented Mar 16, 2023

nat-goodspeed commented Mar 16, 2023

bennettgoble commented Mar 16, 2023 •

edited

Loading

codecov-commenter commented Mar 16, 2023 •

edited by codecov bot

Loading

nat-goodspeed commented Mar 17, 2023

brad-linden Mar 17, 2023

nat-goodspeed Mar 20, 2023

brad-linden Mar 21, 2023

nat-goodspeed Mar 21, 2023

brad-linden Mar 17, 2023

monty-linden left a comment

monty-linden Mar 18, 2023

nat-goodspeed Mar 20, 2023

monty-linden Mar 18, 2023

monty-linden Mar 18, 2023

nat-goodspeed Mar 20, 2023

monty-linden Mar 18, 2023

nat-goodspeed Mar 20, 2023

brad-linden left a comment

SL-18330: Refactor notation parsing to manage a lookahead char. #6

SL-18330: Refactor notation parsing to manage a lookahead char. #6

Conversation

nat-goodspeed commented Mar 16, 2023

nat-goodspeed commented Mar 16, 2023

nat-goodspeed commented Mar 16, 2023

bennettgoble commented Mar 16, 2023 • edited Loading

codecov-commenter commented Mar 16, 2023 • edited by codecov bot Loading

Codecov Report

nat-goodspeed commented Mar 17, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

monty-linden left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

brad-linden left a comment

Choose a reason for hiding this comment

bennettgoble commented Mar 16, 2023 •

edited

Loading

codecov-commenter commented Mar 16, 2023 •

edited by codecov bot

Loading