SL-18330: Support parsing direct from stream; improve C++ compatibility. #4

nat-goodspeed · 2022-12-07T16:22:14Z

The major feature introduced by this PR is that llsd.parse() can now accept a seekable binary stream instead of requiring the caller to read the stream (e.g. a file) into memory before parsing.

It also improves recognition of MIME-type headers produced by the C++ LLSD library embedded in the Second Life viewer.

In Python 3, the previous logic would always try and fail to import cElementTree before falling back on the deprecated xml.etree.cElementTree. If _use_celementree was set False, it would bounce through two explicit ImportErrors and fail to import elementtree.ElementTree before finally successfully importing xml.etree.ElementTree.

In addition to existing parse_xml(), parse_notation() and parse_binary(), introduce parse_xml_noh(), parse_notation_noh() and parse_binary_noh() for when we know there's no '<? llsd/mimetype ?>' header. Internally, add match_xml_hdr(), match_notation_hdr(), match_binary_hdr() that scan forward through the passed bytes object (or stream) when it matches. This lets parse() recognize and skip the relevant header, calling the corresponding parse_mumble_noh() function. match_xml_hdr(), match_notation_hdr() and match_binary_hdr() call matchseq(), which generalizes header recognition by supporting a binary stream as well as a bytes object, allowing any amount of whitespace and ignoring case. Notably, the viewer's LLSDSerialize::serialize() emits "<? LLSD/Binary ?>" while parse() recognized only "<?llsd/binary?>". Moreover, parse() was confused by serialize()'s "<? LLSD/XML ?>" and "<? llsd/notation ?>" headers. Try to support a binary stream, as well as a bytes object, in parse_xml() and parse_xml_noh(): use matchseq() instead of regex matching, and call xml.etree.ElementTree.parse() instead of fromstring() when passed a binary stream.

Refactor LLSDBaseParser and subclasses so that direct references to self._buffer and self._index are replaced by operations on a new BytesUtil or StreamUtil instance, each of which implements the utility methods presented by LLSDBaseParser: _error(), _peek(), _getc().

Now that llsd.parse() supports parsing directly from streams, make its tests validate that we haven't broken that feature.

BytesType was defined as 'str' in Python 2, 'bytes' in Python 3. But in Python 2, 'bytes' is an alias for 'str': (bytes is str) is True.

Having chosen either BytesUtil or StreamUtil, instead of storing a reference in LLSDBaseParser._util and indirecting through that on every access, patch LLSDBaseParser instance with (underscore-prefixed) direct references to the public BytesUtil / StreamUtil methods. Fix _parse_string_delim() accordingly. Introduce LLSDBaseParser class attribute force_stream to always pick StreamUtil implementation, wrapping any passed bytes object in io.BytesIO. Should we choose to discard BytesUtil entirely, the current tactic of patching StreamUtil methods into the LLSDBaseParser instance positions us to adopt those StreamUtil methods directly into LLSDBaseParser.

nat-goodspeed · 2022-12-07T16:25:49Z

This PR is required by the extended self-tests in secondlife/viewer#29 .

🙄

codecov-commenter · 2022-12-07T17:12:01Z

Codecov Report

Merging #4 (cbb89ca) into main (10992b7) will increase coverage by 0.78%.
The diff coverage is 93.79%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@            Coverage Diff             @@
##             main       #4      +/-   ##
==========================================
+ Coverage   90.30%   91.09%   +0.78%     
==========================================
  Files           5        6       +1     
  Lines         743      797      +54     
==========================================
+ Hits          671      726      +55     
+ Misses         72       71       -1

Impacted Files	Coverage Δ
llsd/fastest_elementtree.py	`58.33% <50.00%> (+8.33%)`	⬆️
llsd/base.py	`88.67% <98.27%> (+2.14%)`	⬆️
llsd/__init__.py	`96.77% <100.00%> (ø)`
llsd/serde_binary.py	`93.61% <100.00%> (-0.79%)`	⬇️
llsd/serde_notation.py	`90.49% <100.00%> (+0.21%)`	⬆️
llsd/serde_xml.py	`99.19% <100.00%> (+0.01%)`	⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

bennettgoble · 2023-03-01T21:29:36Z

llsd/base.py

+        return chars
+
+
+class StreamUtil:


Instead of building wrappers around bytes and streams, could llsd.parse convert all input into a unified class such as BufferedReader which provides the requisite peek implementation? This would make these classes unneeded.

+1 interested to hear discussion about this as well

The distinction between StreamUtil and BytesUtil becomes moot when LLSDBaseParser.force_stream is True, which it is by default: in that case, instead of using BytesUtil at all, _reset() copies the input bytes object into an io.BytesIO and proceeds with StreamUtil.

I was curious whether a BytesUtil wrapper would have any performance benefits over the tactic above. I suspect it only matters if the consumer passes a very large bytes object, in which case both the buffer allocation and the copy itself would incur noticeable overhead. But - in the likely case that the consumer is reading that very large bytes object from a stream, we could address that overhead by passing the stream directly to parse().

Bottom line: we can probably make BytesUtil go away entirely. In that case, yes, we could wrap the input stream (whether passed by the caller or an internal BytesIO) in a BufferedReader and use its peek(). It might still be useful to wrap it in a class that provides StreamUtil.error().

Is that the direction we should go?

BytesStream and AnyStream could similarly be elided.

I'd like to see SteamUtil, MatchStream, AnyStream, BytesStream disappeared, they're confusing and, imo, too complicated when we can just take input and coerce it into a standard I/O contract.

bennettgoble · 2023-03-01T21:30:59Z

llsd/serde_binary.py


+
+def parse_binary_noh(something):


"noh" is cute but its meaning is not obvious. Could the header option be an optional kwarg of the simpler parse_* functions or could the name be made clearer?

How about parse_binary_nohdr()? Or even _parse_binary_nohdr() since I view this as primarily internal?

I want to be able to pass control directly to this code from llsd.parse() when the generic parser recognizes a particular header, instead of passing and testing a bool.

brad-linden

conditional on resolving the question about BufferedReader usage above, this PR looks good to me

brad-linden · 2023-03-02T19:39:53Z

also, per recent developments, I think it is not necessarily worth maintaining python2 and 3.4 fallback code for new enhancement work like this. If one wanted to juice our codecov metrics, it could be done by ripping some of that old stuff out.

bennettgoble · 2023-03-02T22:11:52Z

llsd/base.py

+        # prepend an underscore to each method name.
+        for attr in dir(util):
+            if not attr.startswith('_'):
+                setattr(self, '_' + attr, getattr(util, attr))


Why...? Let's avoid metaprogramming-for-minor-convenience-reasons if possible.

That was actually intended for performance: avoiding an extra hash lookup. There's code in our parsers that uses the Python idiom of capturing certain object attributes as local variables. I don't often do that myself because I'm not often dealing with performance-critical Python code, and I prefer clarity. That said, the lines above are a nod to the theoretical performance benefit of skipping an extra indirection.

But with the local stream wrappers vanishing, the above becomes moot and will likewise disappear.

Instead, wrap any bytes object passed to LLSDBaseParser._reset() in BytesIO so we can uniformly treat as a stream. Further wrap that (or any passed stream that doesn't already support peek()) in BufferedReader so we can directly use peek(). Migrate StreamUtil's getc(), peek() and error() methods into LLSDBaseParser as the corresponding sunder methods. These not only preserve the existing API for subclasses, but implement error checks around read() and peek().

Specifically, remove MatchStream, BytesStream and AnyStream, which were used specifically for matchseq() and starts_with(). Use existing LLSDBaseParser wrapping instead. Now that we're treating all input, including bytes objects, as binary streams, remove PY3SemanticBytes and is_bytes(). Fold free functions base.matchseq() and starts_with() into LLSDBaseParser, changing signatures: with the instance object managing the underlying stream, they can simply accept (pattern) and return bool. Add remainder() method to retrieve the (advanced) underlying stream object. Also remove serde_binary.match_binary_hdr(), serde_xml.match_xml_hdr() and serde_notation.match_notation_hdr(). Instead add BINARY_HEADER, XML_HEADER and NOTATION_HEADER to base.py and call LLSDBaseParser.matchseq() with them as needed. Rename parse_{binary,notation,xml}_noh() to ...nohdr(). Make each of them accept an LLSDBaseParser instance instead of a (bytes object or stream): since we only ever call parse_mumble_nohdr() after examining the input using an LLSDBaseParser, pass that instance into the nohdr() method. parse_xml_nohdr() must further manipulate its input with an LLSDBaseParser, so without this signature change parse_xml() would extract remainder() from an LLSDBaseParser and pass it into parse_xml_nohdr(), which would wrap it with another LLSDBaseParser and finally extract ITS remainder() to pass to fastest_elementtree.parse().

nat-goodspeed · 2023-03-13T19:21:40Z

Since this PR has become essential for ongoing DRTVWR-558 Puppetry work, absent further objections or requested changes, I'm going to merge this soonish.

nat-goodspeed added 6 commits November 29, 2022 12:44

SL-18330: Test parsing from io.BytesIO streams as well as bytes.

2bbfc7c

Now that llsd.parse() supports parsing directly from streams, make its tests validate that we haven't broken that feature.

SL-18330: BytesType alias is pointless: just use 'bytes'.

b2f3c06

BytesType was defined as 'str' in Python 2, 'bytes' in Python 3. But in Python 2, 'bytes' is an alias for 'str': (bytes is str) is True.

nat-goodspeed requested review from bennettgoble and brad-linden December 7, 2022 16:22

nat-goodspeed mentioned this pull request Dec 7, 2022

SL-18330: Send binary LLSD to LEAP plugins; extend LLSD compatibility. secondlife/viewer#29

Merged

SL-18330: Python 2 doesn't recognize rb'...' but br'...'

f178592

🙄

bennettgoble added the enhancement New feature or request label Mar 1, 2023

bennettgoble reviewed Mar 1, 2023

View reviewed changes

brad-linden approved these changes Mar 2, 2023

View reviewed changes

bennettgoble reviewed Mar 2, 2023

View reviewed changes

nat-goodspeed added 3 commits March 3, 2023 12:00

SL-18330: Allow for even crazier long ints in fuzz testing.

cbb89ca

nat-goodspeed mentioned this pull request Mar 13, 2023

SL-18330: Use new standalone llsd module instead of llbase.llsd. secondlife/leap#14

Closed

nat-goodspeed merged commit db460d6 into main Mar 14, 2023

nat-goodspeed deleted the sl-18330 branch March 14, 2023 14:23

github-actions bot locked and limited conversation to collaborators Mar 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SL-18330: Support parsing direct from stream; improve C++ compatibility. #4

SL-18330: Support parsing direct from stream; improve C++ compatibility. #4

nat-goodspeed commented Dec 7, 2022

nat-goodspeed commented Dec 7, 2022

codecov-commenter commented Dec 7, 2022 •

edited

Loading

bennettgoble Mar 1, 2023

brad-linden Mar 2, 2023

nat-goodspeed Mar 2, 2023

nat-goodspeed Mar 2, 2023

bennettgoble Mar 2, 2023 •

edited

Loading

bennettgoble Mar 1, 2023

nat-goodspeed Mar 2, 2023

brad-linden left a comment

brad-linden commented Mar 2, 2023

bennettgoble Mar 2, 2023 •

edited

Loading

nat-goodspeed Mar 3, 2023

nat-goodspeed commented Mar 13, 2023



		def parse_binary_noh(something):

SL-18330: Support parsing direct from stream; improve C++ compatibility. #4

SL-18330: Support parsing direct from stream; improve C++ compatibility. #4

Conversation

nat-goodspeed commented Dec 7, 2022

nat-goodspeed commented Dec 7, 2022

codecov-commenter commented Dec 7, 2022 • edited Loading

Codecov Report

bennettgoble Mar 1, 2023

Choose a reason for hiding this comment

brad-linden Mar 2, 2023

Choose a reason for hiding this comment

nat-goodspeed Mar 2, 2023

Choose a reason for hiding this comment

nat-goodspeed Mar 2, 2023

Choose a reason for hiding this comment

bennettgoble Mar 2, 2023 • edited Loading

Choose a reason for hiding this comment

bennettgoble Mar 1, 2023

Choose a reason for hiding this comment

nat-goodspeed Mar 2, 2023

Choose a reason for hiding this comment

brad-linden left a comment

Choose a reason for hiding this comment

brad-linden commented Mar 2, 2023

bennettgoble Mar 2, 2023 • edited Loading

Choose a reason for hiding this comment

nat-goodspeed Mar 3, 2023

Choose a reason for hiding this comment

nat-goodspeed commented Mar 13, 2023

codecov-commenter commented Dec 7, 2022 •

edited

Loading

bennettgoble Mar 2, 2023 •

edited

Loading

bennettgoble Mar 2, 2023 •

edited

Loading