-
Notifications
You must be signed in to change notification settings - Fork 560
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DISCUSSION] Reasons to keep URIRef, BlankNode, and Literal as subclasses of str
?
#2866
Comments
Another thing to point out that didn't need to be in the initial writeup: RDFLib's Memory usage could also be reduced if we could be interning the most commonly used strings. |
I have plans to write an update to After big update is in place, I plan to add a new feature in DefinedNamespace for the ability for some popular Namespaces to have their most popular members available as pre-interned instances, So it doesn't need to make a copy of the string, and doesn't even need to make a copy of the And finally, I plan to create a big update to MemoryStore (or a new Store backend) that leverages the Technically none of this is a breaking change. The fact |
I did a quick sanity check, to see if we would lose anything (performance, compatibility, interoperability, memory usage) by moving away from Simply moving from I'm not sure if any potential gains from the ability to use pre-hashed system-interned strings would offset this 10% performance hit here. |
Further testing confirms my earlier findings. The benefit is that Replacing that with the new class equivalent results in 10% slowdown: __hash__ = str.__hash__ # old
#---------
def __hash__(self) -> int: # new
return str.__hash__(self.inner_) I tested with just a minimal hash function on def __hash__(self) -> int:
return 0 And even this non-implementation is 10% slower than rdflib's current method. So that means the slowdown is simply the result of calling back into a python function, opposed to the C function. I also tested other benefits I mentioned in the original post, eg, pre-calculated hashes and using system.intern strings, and these features help, however none of them make up for the 10% overhead incurred by the new Maybe this thread can be closed. |
Good 'ol things are as good as they can be because we used C in 2003... |
Keeping as is due to faster string hashing |
There is a particular design choice in RDFLib that has existed since the original version in 2001.
URIRef, BlankNode, and Literal are subclasses of
Identifier
.Identifer
is a combination class, a subclass of bothstr
and the abstract classNode
.It has been this way for a long time. In the Py2->3 transition days,
Identifier
used to subclasssix.text_type
. And before that it was a subclass of Python 2'sunicode
type, right back until the first tracked commit. (Note, python2unicode
type was renamed tostr
for python3, it is literally the same type. Python2str
becamebytes
in python3).Its common to see very old python libraries using a similar pattern. There was good reason. Subclasses of
str
were treated as "real strings" by the system, that enabled better memory management and provided built-in text features likestartswith()
,lower()
, etc, and comparisons likeeq, gt, lt, gte, lte
that the subclass gets for free and work exactly like a native string.Additionally in Python prior to v2.1 "string interning" worked on subclasses of strings. That means that short constant strings, and strings managed using
sys.intern()
were managed using a global string table, were deduplicated in memory and had shortcuts like pre-calculated hashs. But for Python 2.1+, that was considered a bug and it was fixed so subclasses ofstr
could no longer be interned.One thing to note however, in the Python2 days,
Identifier
wasn't a subclass ofstr
, butunicode
which I don't think had an intern table, so RDFLib never benefited from that.That pattern in RDFLib has been maintained for 21 years without question, because "its always been that way", but I suggest it is hurting RDFLib's usability and performance. The main cause being an enormous amount of needless copying strings that happens when creating URIRef, BlankNode, and Literal instances, and reading them back out.
Examples:
Firstly, when you create a URIRef (or any subclass of Identifier) and pass in a
str
, the__new__
constructor will hand that directly into the internal string constructor (return str.__new__(input)
) however python knows its really part of a subclass of str, so it doesn't use the copy-on-write (CoW) deduplication optimisation, instead it takes a COPY of the string passed in, and saves it.But then there is now no way of getting back that string it copied and saved. If you want to serialize that
URIRef
or treat it like a string you can read from, you need to callstr(myuri)
on it. But that doesn't give you back the original string, or even the first copy it made. It makes a new COPY of the internal string. Even if you try to treat the variable as a real string, callingstr.__str__(myuri)
does the same thing, it makes another new copy. Finally, if you try to tap into the innerstring usingsuper().__str__()
it gives you the same result. There is no way to read the inner contents of a subclass of string without copying it.Python3 can still do intern strings. It is a major memory management and performance speedup. All static constant strings in your code that are 20 characters or less will be saved by the parser as interned strings, and will always refer to the same object.
and for a long string:
You can see each of the different variables and new strings that were created with the interned string are instances of the exact same string in memory.
RDFLib cannot take advantage of this however, because string interning DOES NOT work on subclasses of
str
.The reason is to do with immutability. Python treats all real strings as immutable. Any operation you do on a string will leave the original string untouched and give you back a new string. The guarantee of string immutability is something provided by the stringlib type interface in stdlib. That is why strings can be interned, identical interned strings will return the same object because they are immutable. However Python can't make those same guarantees about immutability on subclasses of strings. A subclass can change its internal contents while identifying as the same instance. That is why python must COPY the source string. Passing a real
str
intoURIRef()
constructor takes a copy of the original string, it can't use an internal reference to the original string because the copy is mutable, it needs to retain the original as immutable. Similarly when reading it back out as astr
, you can't simply read out the internal state as a string itself, because python needs to make an immutable snapshot of the subclass, it does that by taking a COPY.I was half way through making a PR to improve the performance and memory usage of URIRef creation and serialization by taking advantage of
sys.intern()
for frequently used URIRefs, when I started putting all the pieces of this issue together.It looks like RDFLib will need a fundamental shift in architecture of its most used foundation classes if we want to resolve this, so I think it will be a good issue to tackle for v8.0.
The text was updated successfully, but these errors were encountered: