SL-19707 Support formatting and parsing llsd objects of arbitrary depths. #14

roxanneskelly · 2023-05-25T23:50:52Z

Change the formatting code to use iteration instead of recursion to build the resultant data. This prevents the stack overflow errors we were getting with depths of over 300 or so.

Also, improve performance in python3 formatting by using the python3 str.translate functionality to do some of the key, string, and uri modification in one step instead of many.

This will allow us to handle much deeper hierachies. Also, reduce the number of function calls required to render values for performance.

github-actions · 2023-05-25T23:51:10Z

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

codecov · 2023-05-25T23:55:03Z

Codecov Report

Merging #14 (1ebc616) into main (a63abbe) will increase coverage by 0.58%.
The diff coverage is 95.14%.

❗ Current head 1ebc616 differs from pull request most recent head 7c18a42. Consider uploading reports for the commit 7c18a42 to get more accurate results

@@            Coverage Diff             @@
##             main      #14      +/-   ##
==========================================
+ Coverage   90.33%   90.91%   +0.58%     
==========================================
  Files           6        6              
  Lines         848      925      +77     
==========================================
+ Hits          766      841      +75     
- Misses         82       84       +2

Impacted Files	Coverage Δ
llsd/serde_xml.py	`93.47% <93.20%> (-3.47%)`	⬇️
llsd/serde_binary.py	`96.34% <96.42%> (+2.05%)`	⬆️
llsd/serde_notation.py	`91.79% <97.33%> (+0.67%)`	⬆️
llsd/base.py	`87.01% <100.00%> (-0.44%)`	⬇️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

roxanneskelly · 2023-05-26T19:04:15Z

I have read the CLA Document and I hereby sign the CLA

nat-goodspeed · 2023-06-01T15:14:29Z

llsd/serde_xml.py

+
+    for x in INVALID_XML_BYTES:
+        XML_ESC_TRANS[x] = None
+
 INVALID_XML_RE = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')


Minor tweak: I'd like to see INVALID_XML_RE remain adjacent to INVALID_XML_BYTES so it's clear to a reader that they describe the same set of bytes.

nat-goodspeed · 2023-06-01T15:20:02Z

llsd/serde_xml.py

+        "Construct a pretty serializer."
+        # Call the super class constructor so that we have the type map
+        super(LLSDXMLFormatter, self).__init__()
+        self.py2 = PY2


Is it faster to look up py2 as an attribute of self than to find PY2 in the globals?

nat-goodspeed · 2023-06-01T15:24:04Z

llsd/serde_xml.py

-        else:
-            return self._elt(b'boolean', b'false')
+            return b'<boolean>true</boolean>'
+        return b'<boolean>false</boolean>'


Is it slower to put an else here? I would expect even a limited optimizer to produce equivalent code, and imo else reads better.

pylint complains if you do it the old way.

nat-goodspeed · 2023-06-01T15:29:16Z

llsd/serde_xml.py

    def _INTEGER(self, v):
-        return self._elt(b'integer', str(v))
+        return b'<integer>' + str(v).encode('utf-8') + b'</integer>'


So encapsulating these operations as an _elt() method really slows things down? Sigh, too bad.

Would it be faster to write b''.join((b'<integer>', str(v).encode('utf-8'), b'</integer>')) ? Because join() can determine the final length up front, that should require only one bytes allocation instead of two.

(But see later suggestion about making these type-specific methods directly write to self.stream, in which case it would be self.stream.writelines((...)))

Yeah, function call overhead was marginally slower.
I've changed this in future versions anyway.

nat-goodspeed · 2023-06-01T15:52:33Z

llsd/serde_xml.py

-        return self._elt(b'string', self.xml_esc(v))
+        if self.py2:    # pragma: no cover
+            return b'<string>' + _str_to_bytes(xml_esc(v)) + b'</string>'
+        return b'<string>' + v.translate(XML_ESC_TRANS).encode('utf-8') + b'</string>'


Instead of testing self.py2 every time we hit a _STRING type, it's worth testing whether it would be faster to have a conditional module-scope function definition: the PY2 implementation unconditionally returns _str_to_bytes(xml_esc(v)), the other returns str(v).translate(XML_ESC_TRANS).encode('utf-8'). That function could also be used for the _URI case and the two _write() methods. (Then the py2 instance attribute could be removed.)

If you've already determined that the runtime if is faster, a comment to that effect would avoid some future maintainer changing it as described.

Doing it this way is slightly more performant, so I'm adding comments.

nat-goodspeed · 2023-06-01T20:39:59Z

llsd/serde_xml.py

+                if not item_type in self.type_map:
+                    raise LLSDSerializationError(
+                        "Cannot serialize unknown type: %s (%s)" % (item_type, item))
+                tfunction = self.type_map[item_type]


Lines 144-147 could better be expressed as a try: / except KeyError: block. That way you only need one dict lookup, instead of two.

nat-goodspeed · 2023-06-01T22:00:33Z

llsd/serde_xml.py

-        v = remove_invalid_xml_bytes(v)
-        return v.replace(b'&',b'&amp;').replace(b'<',b'&lt;').replace(b'>',b'&gt;')
+    def __init__(self, indent_atom = None):
+        "Construct a pretty serializer."


This isn't the pretty version, though.

nat-goodspeed · 2023-06-01T22:06:57Z

llsd/serde_xml.py

+                    if self.py2: # pragma: no cover
+                        self.stream.write(b'<key>' +
+                                          _str_to_bytes(xml_esc(UnicodeType(item))) +
+                                          b'</key>')


Consider instead:

self.stream.writelines((b'<key>', _str_to_bytes(xml_esc(UnicodeType(item))), b'</key>'))

to avoid having to concatenate in memory before writing in sequence.

Fixed in future versions.

nat-goodspeed · 2023-06-01T22:28:15Z

llsd/serde_xml.py

+        NOTE: This is nearly identical to the above _write with the exception
+        that this one includes newlines and indentation.  Doing something clever
+        for the above may decrease performance for the common case, so it's been
+        split out.  We can probably revisit this, though.


Comparing the two _write() methods, I see only self._indent() calls and changes to self._indent_level in this one. It's really tempting to:

give LLSDXMLFormatter an _eol attribute of b'', versus LLSDXMLPrettyFormatter's _eol of b'\n'

give _indent() an int parameter, default 0

introduce an empty LLSDXMLFormatter._indent() method

make LLSDXMLPrettyFormatter._indent() add a negative param to self._indent_level before the writelines() call

make it add a non-negative param to self._indent_level after the writelines() call

call self._indent(1) for _MAP and _ARRAY

call self._indent(-1) on StopIteration

add self._eol to various of the write() calls

and see what the performance hit really is.

nat-goodspeed · 2023-06-01T22:35:25Z

llsd/serde_xml.py

+                    self.stream.write(b'<array>')
+                    iter_stack.append((iter(item), b"array", None))
+                else:
+                    self.stream.write(tfunction(item))


It's tempting to suggest that:

iter_stack become an instance attribute

each of the other type-specific methods actually include the self.stream.write() call

the if ... elif code in lines 149-154 migrate to the _MAP() and _ARRAY methods.

Then lines 149-156 become simply tfunction(item).

done in future versions.

notation format now uses iteration to serialize, instead of recursion.

And a few small perf improvements

Uses iteration instead of recursion.

Also, changed xml to use deque for slight perf increase

Also, a few perf tweaks

remove some unused functions.

roxanneskelly added 4 commits May 24, 2023 23:09

SL-19707 - use iterators instead of recursion to generate XML

324d5af

This will allow us to handle much deeper hierachies. Also, reduce the number of function calls required to render values for performance.

Increase depth test

3117e20

Speed optimizations

bcb4202

SL-19707 - use python3 translate functionality for escaping strings

3a2106e

roxanneskelly added the enhancement New feature or request label May 25, 2023

bennettgoble requested a review from nat-goodspeed May 26, 2023 18:45

Don't do coverage for python2 code as github actions doesn't cover that

62bc530

nat-goodspeed reviewed Jun 1, 2023

View reviewed changes

roxanneskelly added 10 commits June 5, 2023 21:44

SL-19707 - notation format serialization perf and depth improvements.

ca8b9b5

notation format now uses iteration to serialize, instead of recursion.

SL-19707 Slight perf improvements

bba106f

SL-19707 Some small perf optimizations

27da089

SL-19707 - Allow 'unlimited' depth binary formatting of llsd

1573a08

And a few small perf improvements

SL-19707 - allow "infinite" depth when parsing XML LLSD

a73b2ac

Uses iteration instead of recursion.

SL-19707 faster xml parsing parsing error handling.

82d1332

SL-19707 - Binary parsing using iteration not recursion

44ddccc

Also, changed xml to use deque for slight perf increase

SL-19707 Make notation parsing iterative

61923f2

Also, a few perf tweaks

SL-19707 - CR fixes

1ebc616

remove some unused functions.

SL-19707 - perf improvements

7c18a42

roxanneskelly closed this Jul 7, 2023

github-actions bot locked and limited conversation to collaborators Jul 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SL-19707 Support formatting and parsing llsd objects of arbitrary depths. #14

SL-19707 Support formatting and parsing llsd objects of arbitrary depths. #14

roxanneskelly commented May 25, 2023

github-actions bot commented May 25, 2023 •

edited

Loading

codecov bot commented May 25, 2023 •

edited

Loading

roxanneskelly commented May 26, 2023

nat-goodspeed Jun 1, 2023

roxanneskelly Jun 13, 2023

nat-goodspeed Jun 1, 2023

roxanneskelly Jun 13, 2023

nat-goodspeed Jun 1, 2023

roxanneskelly Jun 13, 2023

nat-goodspeed Jun 1, 2023

roxanneskelly Jun 13, 2023

nat-goodspeed Jun 1, 2023

roxanneskelly Jun 13, 2023

nat-goodspeed Jun 1, 2023

roxanneskelly Jun 14, 2023

nat-goodspeed Jun 1, 2023

roxanneskelly Jun 14, 2023

nat-goodspeed Jun 1, 2023

roxanneskelly Jun 13, 2023

nat-goodspeed Jun 1, 2023

nat-goodspeed Jun 1, 2023 •

edited

Loading

roxanneskelly Jun 13, 2023

SL-19707 Support formatting and parsing llsd objects of arbitrary depths. #14

SL-19707 Support formatting and parsing llsd objects of arbitrary depths. #14

Conversation

roxanneskelly commented May 25, 2023

github-actions bot commented May 25, 2023 • edited Loading

codecov bot commented May 25, 2023 • edited Loading

Codecov Report

roxanneskelly commented May 26, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nat-goodspeed Jun 1, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented May 25, 2023 •

edited

Loading

codecov bot commented May 25, 2023 •

edited

Loading

nat-goodspeed Jun 1, 2023 •

edited

Loading