-
Notifications
You must be signed in to change notification settings - Fork 1
SL-19707 Support formatting and parsing llsd objects of arbitrary depths. #14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This will allow us to handle much deeper hierachies. Also, reduce the number of function calls required to render values for performance.
CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅ |
Codecov Report
@@ Coverage Diff @@
## main #14 +/- ##
==========================================
+ Coverage 90.33% 90.91% +0.58%
==========================================
Files 6 6
Lines 848 925 +77
==========================================
+ Hits 766 841 +75
- Misses 82 84 +2
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
I have read the CLA Document and I hereby sign the CLA |
llsd/serde_xml.py
Outdated
|
||
for x in INVALID_XML_BYTES: | ||
XML_ESC_TRANS[x] = None | ||
|
||
INVALID_XML_RE = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor tweak: I'd like to see INVALID_XML_RE
remain adjacent to INVALID_XML_BYTES
so it's clear to a reader that they describe the same set of bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixing
llsd/serde_xml.py
Outdated
"Construct a pretty serializer." | ||
# Call the super class constructor so that we have the type map | ||
super(LLSDXMLFormatter, self).__init__() | ||
self.py2 = PY2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it faster to look up py2
as an attribute of self
than to find PY2
in the globals?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixing
llsd/serde_xml.py
Outdated
else: | ||
return self._elt(b'boolean', b'false') | ||
return b'<boolean>true</boolean>' | ||
return b'<boolean>false</boolean>' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it slower to put an else
here? I would expect even a limited optimizer to produce equivalent code, and imo else
reads better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pylint complains if you do it the old way.
llsd/serde_xml.py
Outdated
def _INTEGER(self, v): | ||
return self._elt(b'integer', str(v)) | ||
return b'<integer>' + str(v).encode('utf-8') + b'</integer>' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So encapsulating these operations as an _elt()
method really slows things down? Sigh, too bad.
Would it be faster to write b''.join((b'<integer>', str(v).encode('utf-8'), b'</integer>'))
? Because join()
can determine the final length up front, that should require only one bytes
allocation instead of two.
(But see later suggestion about making these type-specific methods directly write to self.stream
, in which case it would be self.stream.writelines((...))
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, function call overhead was marginally slower.
I've changed this in future versions anyway.
llsd/serde_xml.py
Outdated
return self._elt(b'string', self.xml_esc(v)) | ||
if self.py2: # pragma: no cover | ||
return b'<string>' + _str_to_bytes(xml_esc(v)) + b'</string>' | ||
return b'<string>' + v.translate(XML_ESC_TRANS).encode('utf-8') + b'</string>' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of testing self.py2
every time we hit a _STRING
type, it's worth testing whether it would be faster to have a conditional module-scope function definition: the PY2
implementation unconditionally returns _str_to_bytes(xml_esc(v))
, the other returns str(v).translate(XML_ESC_TRANS).encode('utf-8')
. That function could also be used for the _URI
case and the two _write()
methods. (Then the py2
instance attribute could be removed.)
If you've already determined that the runtime if
is faster, a comment to that effect would avoid some future maintainer changing it as described.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing it this way is slightly more performant, so I'm adding comments.
llsd/serde_xml.py
Outdated
if not item_type in self.type_map: | ||
raise LLSDSerializationError( | ||
"Cannot serialize unknown type: %s (%s)" % (item_type, item)) | ||
tfunction = self.type_map[item_type] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lines 144-147 could better be expressed as a try:
/ except KeyError:
block. That way you only need one dict lookup, instead of two.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
llsd/serde_xml.py
Outdated
v = remove_invalid_xml_bytes(v) | ||
return v.replace(b'&',b'&').replace(b'<',b'<').replace(b'>',b'>') | ||
def __init__(self, indent_atom = None): | ||
"Construct a pretty serializer." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't the pretty version, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
llsd/serde_xml.py
Outdated
if self.py2: # pragma: no cover | ||
self.stream.write(b'<key>' + | ||
_str_to_bytes(xml_esc(UnicodeType(item))) + | ||
b'</key>') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider instead:
self.stream.writelines((b'<key>',
_str_to_bytes(xml_esc(UnicodeType(item))),
b'</key>'))
to avoid having to concatenate in memory before writing in sequence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in future versions.
llsd/serde_xml.py
Outdated
NOTE: This is nearly identical to the above _write with the exception | ||
that this one includes newlines and indentation. Doing something clever | ||
for the above may decrease performance for the common case, so it's been | ||
split out. We can probably revisit this, though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comparing the two _write()
methods, I see only self._indent()
calls and changes to self._indent_level
in this one. It's really tempting to:
- give
LLSDXMLFormatter
an_eol
attribute ofb''
, versusLLSDXMLPrettyFormatter
's_eol
ofb'\n'
- give
_indent()
an int parameter, default 0 - introduce an empty
LLSDXMLFormatter._indent()
method - make
LLSDXMLPrettyFormatter._indent()
add a negative param toself._indent_level
before thewritelines()
call - make it add a non-negative param to
self._indent_level
after thewritelines()
call - call
self._indent(1)
for_MAP
and_ARRAY
- call
self._indent(-1)
onStopIteration
- add
self._eol
to various of thewrite()
calls
and see what the performance hit really is.
llsd/serde_xml.py
Outdated
self.stream.write(b'<array>') | ||
iter_stack.append((iter(item), b"array", None)) | ||
else: | ||
self.stream.write(tfunction(item)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's tempting to suggest that:
iter_stack
become an instance attribute- each of the other type-specific methods actually include the
self.stream.write()
call - the
if
...elif
code in lines 149-154 migrate to the_MAP()
and_ARRAY
methods.
Then lines 149-156 become simply tfunction(item)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in future versions.
notation format now uses iteration to serialize, instead of recursion.
And a few small perf improvements
Uses iteration instead of recursion.
Also, changed xml to use deque for slight perf increase
Also, a few perf tweaks
remove some unused functions.
Change the formatting code to use iteration instead of recursion to build the resultant data. This prevents the stack overflow errors we were getting with depths of over 300 or so.
Also, improve performance in python3 formatting by using the python3 str.translate functionality to do some of the key, string, and uri modification in one step instead of many.