Skip to content

SL-18330: Support parsing direct from stream; improve C++ compatibility. #4

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Mar 14, 2023

Conversation

nat-goodspeed
Copy link
Contributor

The major feature introduced by this PR is that llsd.parse() can now accept a seekable binary stream instead of requiring the caller to read the stream (e.g. a file) into memory before parsing.

It also improves recognition of MIME-type headers produced by the C++ LLSD library embedded in the Second Life viewer.

In Python 3, the previous logic would always try and fail to import
cElementTree before falling back on the deprecated xml.etree.cElementTree. If
_use_celementree was set False, it would bounce through two explicit
ImportErrors and fail to import elementtree.ElementTree before finally
successfully importing xml.etree.ElementTree.
In addition to existing parse_xml(), parse_notation() and parse_binary(),
introduce parse_xml_noh(), parse_notation_noh() and parse_binary_noh() for
when we know there's no '<? llsd/mimetype ?>' header.

Internally, add match_xml_hdr(), match_notation_hdr(), match_binary_hdr() that
scan forward through the passed bytes object (or stream) when it matches. This
lets parse() recognize and skip the relevant header, calling the corresponding
parse_mumble_noh() function.

match_xml_hdr(), match_notation_hdr() and match_binary_hdr() call matchseq(),
which generalizes header recognition by supporting a binary stream as well as
a bytes object, allowing any amount of whitespace and ignoring case.

Notably, the viewer's LLSDSerialize::serialize() emits "<? LLSD/Binary ?>"
while parse() recognized only "<?llsd/binary?>". Moreover, parse() was
confused by serialize()'s "<? LLSD/XML ?>" and "<? llsd/notation ?>" headers.

Try to support a binary stream, as well as a bytes object, in parse_xml() and
parse_xml_noh(): use matchseq() instead of regex matching, and call
xml.etree.ElementTree.parse() instead of fromstring() when passed a binary
stream.
Refactor LLSDBaseParser and subclasses so that direct references to
self._buffer and self._index are replaced by operations on a new BytesUtil or
StreamUtil instance, each of which implements the utility methods presented by
LLSDBaseParser: _error(), _peek(), _getc().
Now that llsd.parse() supports parsing directly from streams, make its tests
validate that we haven't broken that feature.
BytesType was defined as 'str' in Python 2, 'bytes' in Python 3. But in Python
2, 'bytes' is an alias for 'str': (bytes is str) is True.
Having chosen either BytesUtil or StreamUtil, instead of storing a reference
in LLSDBaseParser._util and indirecting through that on every access, patch
LLSDBaseParser instance with (underscore-prefixed) direct references to the
public BytesUtil / StreamUtil methods. Fix _parse_string_delim() accordingly.

Introduce LLSDBaseParser class attribute force_stream to always pick
StreamUtil implementation, wrapping any passed bytes object in io.BytesIO.

Should we choose to discard BytesUtil entirely, the current tactic of patching
StreamUtil methods into the LLSDBaseParser instance positions us to adopt
those StreamUtil methods directly into LLSDBaseParser.
@nat-goodspeed
Copy link
Contributor Author

This PR is required by the extended self-tests in secondlife/viewer#29 .

@codecov-commenter
Copy link

codecov-commenter commented Dec 7, 2022

Codecov Report

Merging #4 (cbb89ca) into main (10992b7) will increase coverage by 0.78%.
The diff coverage is 93.79%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@            Coverage Diff             @@
##             main       #4      +/-   ##
==========================================
+ Coverage   90.30%   91.09%   +0.78%     
==========================================
  Files           5        6       +1     
  Lines         743      797      +54     
==========================================
+ Hits          671      726      +55     
+ Misses         72       71       -1     
Impacted Files Coverage Δ
llsd/fastest_elementtree.py 58.33% <50.00%> (+8.33%) ⬆️
llsd/base.py 88.67% <98.27%> (+2.14%) ⬆️
llsd/__init__.py 96.77% <100.00%> (ø)
llsd/serde_binary.py 93.61% <100.00%> (-0.79%) ⬇️
llsd/serde_notation.py 90.49% <100.00%> (+0.21%) ⬆️
llsd/serde_xml.py 99.19% <100.00%> (+0.01%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@bennettgoble bennettgoble added the enhancement New feature or request label Mar 1, 2023
llsd/base.py Outdated
return chars


class StreamUtil:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of building wrappers around bytes and streams, could llsd.parse convert all input into a unified class such as BufferedReader which provides the requisite peek implementation? This would make these classes unneeded.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 interested to hear discussion about this as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The distinction between StreamUtil and BytesUtil becomes moot when LLSDBaseParser.force_stream is True, which it is by default: in that case, instead of using BytesUtil at all, _reset() copies the input bytes object into an io.BytesIO and proceeds with StreamUtil.

I was curious whether a BytesUtil wrapper would have any performance benefits over the tactic above. I suspect it only matters if the consumer passes a very large bytes object, in which case both the buffer allocation and the copy itself would incur noticeable overhead. But - in the likely case that the consumer is reading that very large bytes object from a stream, we could address that overhead by passing the stream directly to parse().

Bottom line: we can probably make BytesUtil go away entirely. In that case, yes, we could wrap the input stream (whether passed by the caller or an internal BytesIO) in a BufferedReader and use its peek(). It might still be useful to wrap it in a class that provides StreamUtil.error().

Is that the direction we should go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BytesStream and AnyStream could similarly be elided.

Copy link
Member

@bennettgoble bennettgoble Mar 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see SteamUtil, MatchStream, AnyStream, BytesStream disappeared, they're confusing and, imo, too complicated when we can just take input and coerce it into a standard I/O contract.



def parse_binary_noh(something):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"noh" is cute but its meaning is not obvious. Could the header option be an optional kwarg of the simpler parse_* functions or could the name be made clearer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about parse_binary_nohdr()? Or even _parse_binary_nohdr() since I view this as primarily internal?

I want to be able to pass control directly to this code from llsd.parse() when the generic parser recognizes a particular header, instead of passing and testing a bool.

Copy link

@brad-linden brad-linden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

conditional on resolving the question about BufferedReader usage above, this PR looks good to me

@brad-linden
Copy link

also, per recent developments, I think it is not necessarily worth maintaining python2 and 3.4 fallback code for new enhancement work like this. If one wanted to juice our codecov metrics, it could be done by ripping some of that old stuff out.

llsd/base.py Outdated
# prepend an underscore to each method name.
for attr in dir(util):
if not attr.startswith('_'):
setattr(self, '_' + attr, getattr(util, attr))
Copy link
Member

@bennettgoble bennettgoble Mar 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why...? Let's avoid metaprogramming-for-minor-convenience-reasons if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was actually intended for performance: avoiding an extra hash lookup. There's code in our parsers that uses the Python idiom of capturing certain object attributes as local variables. I don't often do that myself because I'm not often dealing with performance-critical Python code, and I prefer clarity. That said, the lines above are a nod to the theoretical performance benefit of skipping an extra indirection.

But with the local stream wrappers vanishing, the above becomes moot and will likewise disappear.

Instead, wrap any bytes object passed to LLSDBaseParser._reset() in BytesIO so
we can uniformly treat as a stream. Further wrap that (or any passed stream
that doesn't already support peek()) in BufferedReader so we can directly use
peek().

Migrate StreamUtil's getc(), peek() and error() methods into LLSDBaseParser as
the corresponding sunder methods. These not only preserve the existing API for
subclasses, but implement error checks around read() and peek().
Specifically, remove MatchStream, BytesStream and AnyStream, which were used
specifically for matchseq() and starts_with(). Use existing LLSDBaseParser
wrapping instead.

Now that we're treating all input, including bytes objects, as binary streams,
remove PY3SemanticBytes and is_bytes().

Fold free functions base.matchseq() and starts_with() into LLSDBaseParser,
changing signatures: with the instance object managing the underlying stream,
they can simply accept (pattern) and return bool. Add remainder() method to
retrieve the (advanced) underlying stream object.

Also remove serde_binary.match_binary_hdr(), serde_xml.match_xml_hdr() and
serde_notation.match_notation_hdr(). Instead add BINARY_HEADER, XML_HEADER and
NOTATION_HEADER to base.py and call LLSDBaseParser.matchseq() with them as
needed.

Rename parse_{binary,notation,xml}_noh() to ...nohdr(). Make each of them
accept an LLSDBaseParser instance instead of a (bytes object or stream): since
we only ever call parse_mumble_nohdr() after examining the input using an
LLSDBaseParser, pass that instance into the nohdr() method. parse_xml_nohdr()
must further manipulate its input with an LLSDBaseParser, so without this
signature change parse_xml() would extract remainder() from an LLSDBaseParser
and pass it into parse_xml_nohdr(), which would wrap it with another
LLSDBaseParser and finally extract ITS remainder() to pass to
fastest_elementtree.parse().
@nat-goodspeed
Copy link
Contributor Author

Since this PR has become essential for ongoing DRTVWR-558 Puppetry work, absent further objections or requested changes, I'm going to merge this soonish.

@nat-goodspeed nat-goodspeed merged commit db460d6 into main Mar 14, 2023
@nat-goodspeed nat-goodspeed deleted the sl-18330 branch March 14, 2023 14:23
@github-actions github-actions bot locked and limited conversation to collaborators Mar 14, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants