Skip to content

SL-18330: Refactor notation parsing to manage a lookahead char. #6

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Mar 22, 2023

Conversation

nat-goodspeed
Copy link
Contributor

Instead of peeking ahead by a character and then (most of the time) rereading it simply to discard it, go ahead and read the next character, passing it into the various LLSDNotationParser._parse_mumble() functions. This eliminates most LLSDNotationParser use of LLSDBaseParser._peek(), save for _get_re().

This eliminates the need for LLSDNotationParser._skip_then(), whose only job was to discard the next character and return the specified value.

Break out LLSDNotationParser._parse_true() and _parse_false() in anticipation of potential reimplementation without _get_re().

Instead of peeking ahead by a character and then (most of the time) rereading
it simply to discard it, go ahead and read the next character, passing it into
the various LLSDNotationParser._parse_mumble() functions. This eliminates most
LLSDNotationParser use of LLSDBaseParser._peek(), save for _get_re().

This eliminates the need for LLSDNotationParser._skip_then(), whose only job
was to discard the next character and return the specified value.

Break out LLSDNotationParser._parse_true() and _parse_false() in anticipation
of potential reimplementation without _get_re().
@nat-goodspeed
Copy link
Contributor Author

This PR is still a draft because I believe we could realize further notation parsing performance gains by eliminating some (all?) uses of LLSDNotationParser._get_re().

@nat-goodspeed
Copy link
Contributor Author

Sigh, more infrastructure failures: codecov failing its upload. This is starting to make GitHub checks look seriously flaky.

@bennettgoble
Copy link
Member

bennettgoble commented Mar 16, 2023

Sigh, more infrastructure failures: codecov failing its upload. This is starting to make GitHub checks look seriously flaky.

@nat-goodspeed, the only check that is having problems is specifically codecov which depends on an external service. Take a look at #7 --if this doesn't improve upload reliability we can see about pulling codecov.

@codecov-commenter
Copy link

codecov-commenter commented Mar 16, 2023

Codecov Report

Merging #6 (9249fce) into main (0adae95) will increase coverage by 0.38%.
The diff coverage is 97.29%.

@@            Coverage Diff             @@
##             main       #6      +/-   ##
==========================================
+ Coverage   89.90%   90.28%   +0.38%     
==========================================
  Files           6        6              
  Lines         822      844      +22     
==========================================
+ Hits          739      762      +23     
+ Misses         83       82       -1     
Impacted Files Coverage Δ
llsd/serde_xml.py 96.85% <80.00%> (-0.72%) ⬇️
llsd/serde_notation.py 91.11% <97.80%> (+1.42%) ⬆️
llsd/base.py 87.45% <100.00%> (+0.04%) ⬆️
llsd/serde_binary.py 94.28% <100.00%> (-0.05%) ⬇️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

To further reduce LLSDNotationParser's reliance on _get_re(), change
_parse_integer() to read individual characters until it sees a delimiter --
instead of using _peek().

Make LLSDBaseParser._getc() support full=False, allowing a short read when
explicitly specified. This is useful for when an LLSD int ends at EOF.

Add LLSDBaseParser._putback(), a convenience wrapper around seek().
and hence without using LLSDBaseParser._peek().
LLSDBinaryParser._parse_array() does not actually depend on peek() to check
for close bracket ']': it has an explicit element count.
The only _peek() call was in _error(). Once _error() reads a character, we no
longer care about the read position of _stream.
Remove LLSDBaseParser._peek(). All parsers now use only _getc(), with
occasional _putback(), which uses seek().

Remove LLSDNotationParser._get_re(), the last consumer of _peek().
Remove serde_notation._real_regex, the last usage of _get_re().
Recast _parse_real(), the last caller of _get_re(), to scan digits et al.
like a lexer.

Break out _collect_sign() and _collect_digits() from _parse_integer(); these
are called multiple times in new _parse_real() logic.

Break out parameterized _parse_bool() from _parse_true() and _parse_false(),
eliminating redundant implementations. Further break out _expect() from
_parse_bool(); this is also called by new _parse_real() logic.

Make LLSDBaseParser._putback() accept the bytes string it's supposed to put
back, but only for its length: sometimes we call _putback() when we've reached
EOF, in which case the passed string is empty. This avoids having to replicate
'if cc: self._putback()' logic everywhere.

Refine LLSDBaseParser._reset() wrapper logic to wrap an existing stream in
io.BufferedReader only if it's not already seekable(). That includes an
incoming bytes object: wrap it only in BytesIO, which is seekable().
@nat-goodspeed nat-goodspeed marked this pull request as ready for review March 17, 2023 19:32
@nat-goodspeed
Copy link
Contributor Author

Once we've iterated on this one, I believe it's time for a new release.

# Python 3
xrange = range


Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/me shakes fist at python 2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed ... but llsd is perhaps the most ubiquitous of Linden's Python packages, and it's certain that there are still Python 2 consumers. I don't want any obstacles of the form: "We'd have to update to Python 3 before we could switch to the newer python-llsd implementation."

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah not trying to derail things too much, I just wish we could pretend to be more like python3, something like

try:
    range = xrange  # pretend to be python 3
except NameError:
    # hooray we are!
   pass

and then we could use proper range() everywhere below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay: 9249fce

# skip decimal point and collect fractional digits
cc = self._collect_digits(self._getc(full=False), fdigits)
digits.extend(fdigits)
# Fun fact: (cc in b'eE') is True even when cc is b''!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Copy link

@monty-linden monty-linden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions and points. No blockers, I think.


def _error(self, message, offset=0):
oldpos = self._stream.tell()
# 'offset' is relative to current pos
self._stream.seek(offset, io.SEEK_CUR)
raise LLSDParseError("%s at byte %d: %r" %
(message, oldpos+offset, self._peek(1, full=False)))
(message, oldpos+offset, self._getc(1, full=False)))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New version leaves the position one byte advanced relative to original. Do we care?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't. This is in the LLSDBaseParser._error() function, whose whole purpose is to raise LLSDParseError. Any consumer relying on the read position of a stream passed to llsd.parse() after LLSDParseError needs to be fixed.

count += 1
cc = self._peek()
if cc != b']':
if self._getc() != b']':

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar. New error condition walks over offending char, old condition left it at position to be read again. Nobody may care. (This is something that happens at several points so I'm not going to mention it anymore.)

"""
This is the basic public interface for parsing.

:param buffer: the notation string to parse.
:param baseparser: LLSDBaseParser or subclass holding data to parse.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be base parser or subclass, bytes, seekable(), or wrapable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, thanks. d31981d

return float(self._get_re("real", _real_regex))
# recognize:
# [+-]?inf
# [+-]?nan

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+/- nans?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes: https://github.com/secondlife/python-llsd/blob/main/llsd/serde_notation.py#L11

I don't seriously expect to find +nan or -nan in existing serialized LLSD, but since the previous implementation accepted them and it would be impossible to prove no such instances exist...

The description of the parameter to parse didn't enumerate all acceptable forms.
Apparently, wrapping an incoming bytes object in io.BytesIO and passing it to
xml.etree.ElementTree.parse() is slower than passing the bytes object directly
to fromstring(). Detect the BytesIO case and call fromstring() with contents.
Copy link

@brad-linden brad-linden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@nat-goodspeed nat-goodspeed merged commit 0931046 into main Mar 22, 2023
@nat-goodspeed nat-goodspeed deleted the sl-18330-perf branch March 22, 2023 20:36
@github-actions github-actions bot locked and limited conversation to collaborators Mar 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants