-
Notifications
You must be signed in to change notification settings - Fork 1
SL-18330: Refactor notation parsing to manage a lookahead char. #6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Instead of peeking ahead by a character and then (most of the time) rereading it simply to discard it, go ahead and read the next character, passing it into the various LLSDNotationParser._parse_mumble() functions. This eliminates most LLSDNotationParser use of LLSDBaseParser._peek(), save for _get_re(). This eliminates the need for LLSDNotationParser._skip_then(), whose only job was to discard the next character and return the specified value. Break out LLSDNotationParser._parse_true() and _parse_false() in anticipation of potential reimplementation without _get_re().
This PR is still a draft because I believe we could realize further notation parsing performance gains by eliminating some (all?) uses of |
Sigh, more infrastructure failures: codecov failing its upload. This is starting to make GitHub checks look seriously flaky. |
@nat-goodspeed, the only check that is having problems is specifically codecov which depends on an external service. Take a look at #7 --if this doesn't improve upload reliability we can see about pulling codecov. |
Codecov Report
@@ Coverage Diff @@
## main #6 +/- ##
==========================================
+ Coverage 89.90% 90.28% +0.38%
==========================================
Files 6 6
Lines 822 844 +22
==========================================
+ Hits 739 762 +23
+ Misses 83 82 -1
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
To further reduce LLSDNotationParser's reliance on _get_re(), change _parse_integer() to read individual characters until it sees a delimiter -- instead of using _peek(). Make LLSDBaseParser._getc() support full=False, allowing a short read when explicitly specified. This is useful for when an LLSD int ends at EOF. Add LLSDBaseParser._putback(), a convenience wrapper around seek().
and hence without using LLSDBaseParser._peek().
LLSDBinaryParser._parse_array() does not actually depend on peek() to check for close bracket ']': it has an explicit element count.
The only _peek() call was in _error(). Once _error() reads a character, we no longer care about the read position of _stream.
Remove LLSDBaseParser._peek(). All parsers now use only _getc(), with occasional _putback(), which uses seek(). Remove LLSDNotationParser._get_re(), the last consumer of _peek(). Remove serde_notation._real_regex, the last usage of _get_re(). Recast _parse_real(), the last caller of _get_re(), to scan digits et al. like a lexer. Break out _collect_sign() and _collect_digits() from _parse_integer(); these are called multiple times in new _parse_real() logic. Break out parameterized _parse_bool() from _parse_true() and _parse_false(), eliminating redundant implementations. Further break out _expect() from _parse_bool(); this is also called by new _parse_real() logic. Make LLSDBaseParser._putback() accept the bytes string it's supposed to put back, but only for its length: sometimes we call _putback() when we've reached EOF, in which case the passed string is empty. This avoids having to replicate 'if cc: self._putback()' logic everywhere. Refine LLSDBaseParser._reset() wrapper logic to wrap an existing stream in io.BufferedReader only if it's not already seekable(). That includes an incoming bytes object: wrap it only in BytesIO, which is seekable().
Once we've iterated on this one, I believe it's time for a new release. |
# Python 3 | ||
xrange = range | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/me shakes fist at python 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed ... but llsd
is perhaps the most ubiquitous of Linden's Python packages, and it's certain that there are still Python 2 consumers. I don't want any obstacles of the form: "We'd have to update to Python 3 before we could switch to the newer python-llsd implementation."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah not trying to derail things too much, I just wish we could pretend to be more like python3, something like
try:
range = xrange # pretend to be python 3
except NameError:
# hooray we are!
pass
and then we could use proper range()
everywhere below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay: 9249fce
# skip decimal point and collect fractional digits | ||
cc = self._collect_digits(self._getc(full=False), fdigits) | ||
digits.extend(fdigits) | ||
# Fun fact: (cc in b'eE') is True even when cc is b''! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few questions and points. No blockers, I think.
|
||
def _error(self, message, offset=0): | ||
oldpos = self._stream.tell() | ||
# 'offset' is relative to current pos | ||
self._stream.seek(offset, io.SEEK_CUR) | ||
raise LLSDParseError("%s at byte %d: %r" % | ||
(message, oldpos+offset, self._peek(1, full=False))) | ||
(message, oldpos+offset, self._getc(1, full=False))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New version leaves the position one byte advanced relative to original. Do we care?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't. This is in the LLSDBaseParser._error()
function, whose whole purpose is to raise LLSDParseError
. Any consumer relying on the read position of a stream passed to llsd.parse()
after LLSDParseError
needs to be fixed.
count += 1 | ||
cc = self._peek() | ||
if cc != b']': | ||
if self._getc() != b']': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar. New error condition walks over offending char, old condition left it at position to be read again. Nobody may care. (This is something that happens at several points so I'm not going to mention it anymore.)
llsd/serde_notation.py
Outdated
""" | ||
This is the basic public interface for parsing. | ||
|
||
:param buffer: the notation string to parse. | ||
:param baseparser: LLSDBaseParser or subclass holding data to parse. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be base parser or subclass, bytes, seekable(), or wrapable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, thanks. d31981d
return float(self._get_re("real", _real_regex)) | ||
# recognize: | ||
# [+-]?inf | ||
# [+-]?nan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+/- nans?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes: https://github.com/secondlife/python-llsd/blob/main/llsd/serde_notation.py#L11
I don't seriously expect to find +nan
or -nan
in existing serialized LLSD, but since the previous implementation accepted them and it would be impossible to prove no such instances exist...
The description of the parameter to parse didn't enumerate all acceptable forms.
Apparently, wrapping an incoming bytes object in io.BytesIO and passing it to xml.etree.ElementTree.parse() is slower than passing the bytes object directly to fromstring(). Detect the BytesIO case and call fromstring() with contents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Instead of peeking ahead by a character and then (most of the time) rereading it simply to discard it, go ahead and read the next character, passing it into the various LLSDNotationParser._parse_mumble() functions. This eliminates most LLSDNotationParser use of LLSDBaseParser._peek(), save for _get_re().
This eliminates the need for LLSDNotationParser._skip_then(), whose only job was to discard the next character and return the specified value.
Break out LLSDNotationParser._parse_true() and _parse_false() in anticipation of potential reimplementation without _get_re().