-
Notifications
You must be signed in to change notification settings - Fork 1
SL-18330: Support parsing direct from stream; improve C++ compatibility. #4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
In Python 3, the previous logic would always try and fail to import cElementTree before falling back on the deprecated xml.etree.cElementTree. If _use_celementree was set False, it would bounce through two explicit ImportErrors and fail to import elementtree.ElementTree before finally successfully importing xml.etree.ElementTree.
In addition to existing parse_xml(), parse_notation() and parse_binary(), introduce parse_xml_noh(), parse_notation_noh() and parse_binary_noh() for when we know there's no '<? llsd/mimetype ?>' header. Internally, add match_xml_hdr(), match_notation_hdr(), match_binary_hdr() that scan forward through the passed bytes object (or stream) when it matches. This lets parse() recognize and skip the relevant header, calling the corresponding parse_mumble_noh() function. match_xml_hdr(), match_notation_hdr() and match_binary_hdr() call matchseq(), which generalizes header recognition by supporting a binary stream as well as a bytes object, allowing any amount of whitespace and ignoring case. Notably, the viewer's LLSDSerialize::serialize() emits "<? LLSD/Binary ?>" while parse() recognized only "<?llsd/binary?>". Moreover, parse() was confused by serialize()'s "<? LLSD/XML ?>" and "<? llsd/notation ?>" headers. Try to support a binary stream, as well as a bytes object, in parse_xml() and parse_xml_noh(): use matchseq() instead of regex matching, and call xml.etree.ElementTree.parse() instead of fromstring() when passed a binary stream.
Refactor LLSDBaseParser and subclasses so that direct references to self._buffer and self._index are replaced by operations on a new BytesUtil or StreamUtil instance, each of which implements the utility methods presented by LLSDBaseParser: _error(), _peek(), _getc().
Now that llsd.parse() supports parsing directly from streams, make its tests validate that we haven't broken that feature.
BytesType was defined as 'str' in Python 2, 'bytes' in Python 3. But in Python 2, 'bytes' is an alias for 'str': (bytes is str) is True.
Having chosen either BytesUtil or StreamUtil, instead of storing a reference in LLSDBaseParser._util and indirecting through that on every access, patch LLSDBaseParser instance with (underscore-prefixed) direct references to the public BytesUtil / StreamUtil methods. Fix _parse_string_delim() accordingly. Introduce LLSDBaseParser class attribute force_stream to always pick StreamUtil implementation, wrapping any passed bytes object in io.BytesIO. Should we choose to discard BytesUtil entirely, the current tactic of patching StreamUtil methods into the LLSDBaseParser instance positions us to adopt those StreamUtil methods directly into LLSDBaseParser.
This PR is required by the extended self-tests in secondlife/viewer#29 . |
Codecov Report
📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more @@ Coverage Diff @@
## main #4 +/- ##
==========================================
+ Coverage 90.30% 91.09% +0.78%
==========================================
Files 5 6 +1
Lines 743 797 +54
==========================================
+ Hits 671 726 +55
+ Misses 72 71 -1
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
llsd/base.py
Outdated
return chars | ||
|
||
|
||
class StreamUtil: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of building wrappers around bytes and streams, could llsd.parse
convert all input into a unified class such as BufferedReader
which provides the requisite peek implementation? This would make these classes unneeded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 interested to hear discussion about this as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The distinction between StreamUtil
and BytesUtil
becomes moot when LLSDBaseParser.force_stream
is True
, which it is by default: in that case, instead of using BytesUtil
at all, _reset()
copies the input bytes
object into an io.BytesIO
and proceeds with StreamUtil
.
I was curious whether a BytesUtil
wrapper would have any performance benefits over the tactic above. I suspect it only matters if the consumer passes a very large bytes
object, in which case both the buffer allocation and the copy itself would incur noticeable overhead. But - in the likely case that the consumer is reading that very large bytes
object from a stream, we could address that overhead by passing the stream directly to parse()
.
Bottom line: we can probably make BytesUtil
go away entirely. In that case, yes, we could wrap the input stream (whether passed by the caller or an internal BytesIO
) in a BufferedReader
and use its peek()
. It might still be useful to wrap it in a class that provides StreamUtil.error()
.
Is that the direction we should go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BytesStream
and AnyStream
could similarly be elided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to see SteamUtil, MatchStream, AnyStream, BytesStream disappeared, they're confusing and, imo, too complicated when we can just take input and coerce it into a standard I/O contract.
llsd/serde_binary.py
Outdated
|
||
|
||
def parse_binary_noh(something): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"noh"
is cute but its meaning is not obvious. Could the header option be an optional kwarg of the simpler parse_*
functions or could the name be made clearer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about parse_binary_nohdr()
? Or even _parse_binary_nohdr()
since I view this as primarily internal?
I want to be able to pass control directly to this code from llsd.parse()
when the generic parser recognizes a particular header, instead of passing and testing a bool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
conditional on resolving the question about BufferedReader usage above, this PR looks good to me
also, per recent developments, I think it is not necessarily worth maintaining python2 and 3.4 fallback code for new enhancement work like this. If one wanted to juice our codecov metrics, it could be done by ripping some of that old stuff out. |
llsd/base.py
Outdated
# prepend an underscore to each method name. | ||
for attr in dir(util): | ||
if not attr.startswith('_'): | ||
setattr(self, '_' + attr, getattr(util, attr)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why...? Let's avoid metaprogramming-for-minor-convenience-reasons if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was actually intended for performance: avoiding an extra hash lookup. There's code in our parsers that uses the Python idiom of capturing certain object attributes as local variables. I don't often do that myself because I'm not often dealing with performance-critical Python code, and I prefer clarity. That said, the lines above are a nod to the theoretical performance benefit of skipping an extra indirection.
But with the local stream wrappers vanishing, the above becomes moot and will likewise disappear.
Instead, wrap any bytes object passed to LLSDBaseParser._reset() in BytesIO so we can uniformly treat as a stream. Further wrap that (or any passed stream that doesn't already support peek()) in BufferedReader so we can directly use peek(). Migrate StreamUtil's getc(), peek() and error() methods into LLSDBaseParser as the corresponding sunder methods. These not only preserve the existing API for subclasses, but implement error checks around read() and peek().
Specifically, remove MatchStream, BytesStream and AnyStream, which were used specifically for matchseq() and starts_with(). Use existing LLSDBaseParser wrapping instead. Now that we're treating all input, including bytes objects, as binary streams, remove PY3SemanticBytes and is_bytes(). Fold free functions base.matchseq() and starts_with() into LLSDBaseParser, changing signatures: with the instance object managing the underlying stream, they can simply accept (pattern) and return bool. Add remainder() method to retrieve the (advanced) underlying stream object. Also remove serde_binary.match_binary_hdr(), serde_xml.match_xml_hdr() and serde_notation.match_notation_hdr(). Instead add BINARY_HEADER, XML_HEADER and NOTATION_HEADER to base.py and call LLSDBaseParser.matchseq() with them as needed. Rename parse_{binary,notation,xml}_noh() to ...nohdr(). Make each of them accept an LLSDBaseParser instance instead of a (bytes object or stream): since we only ever call parse_mumble_nohdr() after examining the input using an LLSDBaseParser, pass that instance into the nohdr() method. parse_xml_nohdr() must further manipulate its input with an LLSDBaseParser, so without this signature change parse_xml() would extract remainder() from an LLSDBaseParser and pass it into parse_xml_nohdr(), which would wrap it with another LLSDBaseParser and finally extract ITS remainder() to pass to fastest_elementtree.parse().
Since this PR has become essential for ongoing DRTVWR-558 Puppetry work, absent further objections or requested changes, I'm going to merge this soonish. |
The major feature introduced by this PR is that
llsd.parse()
can now accept a seekable binary stream instead of requiring the caller to read the stream (e.g. a file) into memory before parsing.It also improves recognition of MIME-type headers produced by the C++ LLSD library embedded in the Second Life viewer.