Skip to content

SL-19707 Support formatting and parsing llsd objects of arbitrary depths. #14

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 15 commits into from

Conversation

roxanneskelly
Copy link
Contributor

Change the formatting code to use iteration instead of recursion to build the resultant data. This prevents the stack overflow errors we were getting with depths of over 300 or so.

Also, improve performance in python3 formatting by using the python3 str.translate functionality to do some of the key, string, and uri modification in one step instead of many.

This will allow us to handle much deeper hierachies.

Also, reduce the number of function calls required to render
values for performance.
@roxanneskelly roxanneskelly added the enhancement New feature or request label May 25, 2023
@github-actions
Copy link

github-actions bot commented May 25, 2023

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

@codecov
Copy link

codecov bot commented May 25, 2023

Codecov Report

Merging #14 (1ebc616) into main (a63abbe) will increase coverage by 0.58%.
The diff coverage is 95.14%.

❗ Current head 1ebc616 differs from pull request most recent head 7c18a42. Consider uploading reports for the commit 7c18a42 to get more accurate results

@@            Coverage Diff             @@
##             main      #14      +/-   ##
==========================================
+ Coverage   90.33%   90.91%   +0.58%     
==========================================
  Files           6        6              
  Lines         848      925      +77     
==========================================
+ Hits          766      841      +75     
- Misses         82       84       +2     
Impacted Files Coverage Δ
llsd/serde_xml.py 93.47% <93.20%> (-3.47%) ⬇️
llsd/serde_binary.py 96.34% <96.42%> (+2.05%) ⬆️
llsd/serde_notation.py 91.79% <97.33%> (+0.67%) ⬆️
llsd/base.py 87.01% <100.00%> (-0.44%) ⬇️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@roxanneskelly
Copy link
Contributor Author

I have read the CLA Document and I hereby sign the CLA


for x in INVALID_XML_BYTES:
XML_ESC_TRANS[x] = None

INVALID_XML_RE = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor tweak: I'd like to see INVALID_XML_RE remain adjacent to INVALID_XML_BYTES so it's clear to a reader that they describe the same set of bytes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixing

"Construct a pretty serializer."
# Call the super class constructor so that we have the type map
super(LLSDXMLFormatter, self).__init__()
self.py2 = PY2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it faster to look up py2 as an attribute of self than to find PY2 in the globals?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixing

else:
return self._elt(b'boolean', b'false')
return b'<boolean>true</boolean>'
return b'<boolean>false</boolean>'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it slower to put an else here? I would expect even a limited optimizer to produce equivalent code, and imo else reads better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pylint complains if you do it the old way.

def _INTEGER(self, v):
return self._elt(b'integer', str(v))
return b'<integer>' + str(v).encode('utf-8') + b'</integer>'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So encapsulating these operations as an _elt() method really slows things down? Sigh, too bad.

Would it be faster to write b''.join((b'<integer>', str(v).encode('utf-8'), b'</integer>')) ? Because join() can determine the final length up front, that should require only one bytes allocation instead of two.

(But see later suggestion about making these type-specific methods directly write to self.stream, in which case it would be self.stream.writelines((...)))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, function call overhead was marginally slower.
I've changed this in future versions anyway.

return self._elt(b'string', self.xml_esc(v))
if self.py2: # pragma: no cover
return b'<string>' + _str_to_bytes(xml_esc(v)) + b'</string>'
return b'<string>' + v.translate(XML_ESC_TRANS).encode('utf-8') + b'</string>'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of testing self.py2 every time we hit a _STRING type, it's worth testing whether it would be faster to have a conditional module-scope function definition: the PY2 implementation unconditionally returns _str_to_bytes(xml_esc(v)), the other returns str(v).translate(XML_ESC_TRANS).encode('utf-8'). That function could also be used for the _URI case and the two _write() methods. (Then the py2 instance attribute could be removed.)

If you've already determined that the runtime if is faster, a comment to that effect would avoid some future maintainer changing it as described.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing it this way is slightly more performant, so I'm adding comments.

if not item_type in self.type_map:
raise LLSDSerializationError(
"Cannot serialize unknown type: %s (%s)" % (item_type, item))
tfunction = self.type_map[item_type]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 144-147 could better be expressed as a try: / except KeyError: block. That way you only need one dict lookup, instead of two.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

v = remove_invalid_xml_bytes(v)
return v.replace(b'&',b'&amp;').replace(b'<',b'&lt;').replace(b'>',b'&gt;')
def __init__(self, indent_atom = None):
"Construct a pretty serializer."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't the pretty version, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

if self.py2: # pragma: no cover
self.stream.write(b'<key>' +
_str_to_bytes(xml_esc(UnicodeType(item))) +
b'</key>')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider instead:

self.stream.writelines((b'<key>',
    _str_to_bytes(xml_esc(UnicodeType(item))),
    b'</key>'))

to avoid having to concatenate in memory before writing in sequence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in future versions.

NOTE: This is nearly identical to the above _write with the exception
that this one includes newlines and indentation. Doing something clever
for the above may decrease performance for the common case, so it's been
split out. We can probably revisit this, though.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparing the two _write() methods, I see only self._indent() calls and changes to self._indent_level in this one. It's really tempting to:

  • give LLSDXMLFormatter an _eol attribute of b'', versus LLSDXMLPrettyFormatter's _eol of b'\n'
  • give _indent() an int parameter, default 0
  • introduce an empty LLSDXMLFormatter._indent() method
  • make LLSDXMLPrettyFormatter._indent() add a negative param to self._indent_level before the writelines() call
  • make it add a non-negative param to self._indent_level after the writelines() call
  • call self._indent(1) for _MAP and _ARRAY
  • call self._indent(-1) on StopIteration
  • add self._eol to various of the write() calls

and see what the performance hit really is.

self.stream.write(b'<array>')
iter_stack.append((iter(item), b"array", None))
else:
self.stream.write(tfunction(item))
Copy link
Contributor

@nat-goodspeed nat-goodspeed Jun 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's tempting to suggest that:

  • iter_stack become an instance attribute
  • each of the other type-specific methods actually include the self.stream.write() call
  • the if ... elif code in lines 149-154 migrate to the _MAP() and _ARRAY methods.

Then lines 149-156 become simply tfunction(item).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in future versions.

@github-actions github-actions bot locked and limited conversation to collaborators Jul 7, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants