Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

N3 Formatting Performance #2923

Open
mgberg opened this issue Oct 8, 2024 · 5 comments
Open

N3 Formatting Performance #2923

mgberg opened this issue Oct 8, 2024 · 5 comments

Comments

@mgberg
Copy link
Contributor

mgberg commented Oct 8, 2024

I noticed that formatting various terms using the n3 method seemed to be quite slow when calling it a large number of times.

The n3 method uses NamespaceManager.normalizeUri, shown here:

def normalizeUri(self, rdfTerm: str) -> str: # noqa: N802, N803
"""
Takes an RDF Term and 'normalizes' it into a QName (using the
registered prefix) or (unlike compute_qname) the Notation 3
form for URIs: <...URI...>
"""
try:
namespace, name = split_uri(rdfTerm)
if namespace not in self.__strie:
insert_strie(self.__strie, self.__trie, str(namespace))
namespace = URIRef(str(namespace))
except Exception:
if isinstance(rdfTerm, Variable):
return "?%s" % rdfTerm
else:
return "<%s>" % rdfTerm
prefix = self.store.prefix(namespace)
if prefix is None and isinstance(rdfTerm, Variable):
return "?%s" % rdfTerm
elif prefix is None:
return "<%s>" % rdfTerm
else:
qNameParts = self.compute_qname(rdfTerm) # noqa: N806
return ":".join([qNameParts[0], qNameParts[-1]])

Note that the name variable is created but never used in the current implementation.

I'm confused as to why compute_qname is called at all. It seems that you should be able to simply do the following:

-            qNameParts = self.compute_qname(rdfTerm)  # noqa: N806
-            return ":".join([qNameParts[0], qNameParts[-1]])
+            return f"{prefix}:{name}"

I'm seeing about a 2x improvement in performance with this this change (assuming unique URIs), which is not surprising considering compute_qname essentially repeats what has already been done in normalizeUri thus far.

Is there a reason why compute_qname is called that I'm missing? Is there some value to having it there? If not, I'd be glad to submit a PR to make this change.

@nicholascar
Copy link
Member

@mgberg So you are saying the return logic should be just use {prefix}:{name} if you already have a prefix and the final if/else handles cases when there is a None prefix, so no need to re-call compute_qname.

I think the logic should be that if there is no prefix, a QName is computed for the IRI and stored back in the Store so when next the serializer calls this method, it will have a prefix for the IRI and won't need to recompute.

Perhaps then I understand your point here as the compute_qname seems to be in the wrong place?

This is likely 10+ year old code (long before my time on RDFLib) but if you an create a PR to improve this that passes all existing tests, please do!

@ashleysommer
Copy link
Contributor

ashleysommer commented Oct 16, 2024

This is the same issue I was discussing in our meeting a few weeks ago, I came to the same conclusion that there is a needless compute_qname in there, and has been there for a very long time. Its hard to tell what the intentions of the original author were.

I suspect this function was written for a different purpose; it uses N3 notation for URIs, unlike compute_qname, so maybe was written as a quick-and-dirty wrapper around compute_qname to add N3 compatibility, and perhaps only used in a test harness or such thing, then at some point it got used in the N3 (and Turtle) serializer and nobody thought to optimise it.

It also should be noted, for the majority of its life since it started in 2001, RDFLib was used by and maintained by a group of universities and was used mainly for research purposes, so there was never much consideration given to performance. I think you'll find there are many such examples of this in the codebase.

@mgberg
Copy link
Contributor Author

mgberg commented Oct 18, 2024

@ashleysommer a semi-related question: is there a place where performance-related concerns are being tracked? It would be interesting to look at what's there, submit some other issues I've run into if they're not already there, and/or contributing some fixes for known issues.

@ajnelson-nist
Copy link
Contributor

@ashleysommer a semi-related question: is there a place where performance-related concerns are being tracked? It would be interesting to look at what's there, submit some other issues I've run into if they're not already there, and/or contributing some fixes for known issues.

It seems the performance Label has some usage.

performance

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants
@ashleysommer @nicholascar @ajnelson-nist @mgberg and others