Skip to content

SL-18330: Refactor notation parsing to manage a lookahead char. #6

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Mar 22, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
49 changes: 11 additions & 38 deletions llsd/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -404,13 +404,12 @@ def _reset(self, something):
# Wrap an incoming bytes string into a stream. If the passed bytes
# string is so large that the overhead of copying it into a
# BytesIO is significant, advise caller to pass a stream instead.
# BytesIO has no peek() method, so wrap it in BufferedReader.
self._stream = io.BufferedReader(io.BytesIO(something))
elif hasattr(something, 'peek'):
# 'something' is already a buffered stream, use directly
self._stream = io.BytesIO(something)
elif something.seekable():
# 'something' is already a seekable stream, use directly
self._stream = something
else:
# 'something' isn't buffered, wrap in BufferedReader
# 'something' isn't seekable, wrap in BufferedReader
# (let BufferedReader handle the problem of passing an
# inappropriate object)
self._stream = io.BufferedReader(something)
Expand Down Expand Up @@ -482,49 +481,23 @@ def _next_nonblank(self):
c = self._stream.read(1)
return c

def _getc(self, num=1):
def _getc(self, num=1, full=True):
got = self._stream.read(num)
if len(got) < num:
if full and len(got) < num:
self._error("Trying to read past end of stream")
return got

def _peek(self, num=1, full=True):
# full=True means error if we can't peek ahead num bytes
if num < 0:
# There aren't many ways this can happen. The likeliest is that
# we've just read garbage length bytes from a binary input string.
# We happen to know that lengths are encoded as 4 bytes, so back
# off by 4 bytes to try to point the user at the right spot.
self._error("Invalid length field %d" % num, -4)

# Instead of using self._stream.peek() at all, use read(num) and reset
# the read pointer. BufferedReader.peek() does not promise to return
# the requested length, but does not clarify the conditions under
# which it returns fewer bytes.
# https://docs.python.org/3/library/io.html#io.BufferedReader.peek
# In practice, we've seen it fail with an input file up over 100Kb:
# peek() returns only part of what we requested, but because we also
# passed full=False (see LLSDNotationParser._get_re()), we didn't
# notice and the parse failed looking for a map delimiter halfway
# through a large decimal integer. read(num), on the other hand,
# promises to return num bytes until actual EOF.
oldpos = self._stream.tell()
try:
got = self._stream.read(num)
if full and len(got) < num:
self._error("Trying to peek past end of stream")

return got

finally:
self._stream.seek(oldpos)
def _putback(self, cc):
# if this test fails, it's not a user error, it's a coding error
assert self._stream.tell() >= len(cc)
self._stream.seek(-len(cc), io.SEEK_CUR)

def _error(self, message, offset=0):
oldpos = self._stream.tell()
# 'offset' is relative to current pos
self._stream.seek(offset, io.SEEK_CUR)
raise LLSDParseError("%s at byte %d: %r" %
(message, oldpos+offset, self._peek(1, full=False)))
(message, oldpos+offset, self._getc(1, full=False)))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New version leaves the position one byte advanced relative to original. Do we care?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't. This is in the LLSDBaseParser._error() function, whose whole purpose is to raise LLSDParseError. Any consumer relying on the read position of a stream passed to llsd.parse() after LLSDParseError needs to be fixed.


# map char following escape char to corresponding character
_escaped = {
Expand Down
24 changes: 14 additions & 10 deletions llsd/serde_binary.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,14 @@
_str_to_bytes, binary, is_integer, is_string, uri)


try:
# Python 2: make 'range()' lazy like Python 3
range = xrange
except NameError:
# Python 3: 'range()' is already lazy
pass


Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/me shakes fist at python 2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed ... but llsd is perhaps the most ubiquitous of Linden's Python packages, and it's certain that there are still Python 2 consumers. I don't want any obstacles of the form: "We'd have to update to Python 3 before we could switch to the newer python-llsd implementation."

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah not trying to derail things too much, I just wish we could pretend to be more like python3, something like

try:
    range = xrange  # pretend to be python 3
except NameError:
    # hooray we are!
   pass

and then we could use proper range() everywhere below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay: 9249fce

class LLSDBinaryParser(LLSDBaseParser):
"""
Parse application/llsd+binary to a python object.
Expand Down Expand Up @@ -56,15 +64,16 @@ def __init__(self):
for c, func in _dispatch_dict.items():
self._dispatch[ord(c)] = func

def parse(self, buffer, ignore_binary = False):
def parse(self, something, ignore_binary = False):
"""
This is the basic public interface for parsing.

:param buffer: the binary data to parse in an indexable sequence.
:param something: serialized LLSD to parse: a bytes object, a binary
stream or an LLSDBaseParser subclass.
:param ignore_binary: parser throws away data in llsd binary nodes.
:returns: returns a python object.
"""
self._reset(buffer)
self._reset(something)
self._keep_binary = not ignore_binary
try:
return self._parse()
Expand Down Expand Up @@ -107,15 +116,10 @@ def _parse_array(self):
"Parse a single llsd array"
rv = []
size = struct.unpack("!i", self._getc(4))[0]
count = 0
cc = self._peek()
while (cc != b']') and (count < size):
for count in range(size):
rv.append(self._parse())
count += 1
cc = self._peek()
if cc != b']':
if self._getc() != b']':

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar. New error condition walks over offending char, old condition left it at position to be read again. Nobody may care. (This is something that happens at several points so I'm not going to mention it anymore.)

self._error("invalid array close token")
self._getc()
return rv

def _parse_string(self):
Expand Down
Loading