RDF terms with object interning support #2972

Open
edmondchuc opened this issue Nov 5, 2024 · 0 comments

I'm interested in this. I've been playing around with the idea of implementing RDF terms with object interning to save memory and avoid copying. This issue is a continuation from #2866.

In every embarrassingly parallel, distributed ETL job where I've used RDFLib, I've seen memory usage grow over time. Object interning may fix this and stop the growth, since memory can be reclaimed once objects are no longer referenced. I think this is also related to the issue described in #740.

The key is to implement RDF terms as immutable data structures. This way, we can safely reuse a reference to the same object whenever the Unicode code point sequence in the term's value is the same.
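
To make the "same code point sequence" criterion concrete, here is a minimal sketch. `Term` is a hypothetical stand-in written only for this illustration (it is not an RDFLib class), and it leaves out locking for brevity. Two strings that render identically but use different code points end up as two distinct interned objects.

```python
from weakref import WeakValueDictionary


class Term:
    """Hypothetical immutable term, interned by its exact string value."""

    __slots__ = ("_value", "__weakref__")
    _cache: "WeakValueDictionary[str, Term]" = WeakValueDictionary()

    def __new__(cls, value: str) -> "Term":
        cached = cls._cache.get(value)
        if cached is not None:
            return cached
        instance = super().__new__(cls)
        # Bypass the blocking __setattr__ below for one-time initialisation.
        object.__setattr__(instance, "_value", value)
        cls._cache[value] = instance
        return instance

    @property
    def value(self) -> str:
        return self._value

    def __setattr__(self, name: str, val: object) -> None:
        raise AttributeError("Term is immutable")


composed = "caf\u00e9"     # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent (U+0301)

a = Term(composed)
b = Term("caf\u00e9")
c = Term(decomposed)

same_sequence_shares_object = a is b        # identical code points
different_sequence_is_distinct = a is not c  # looks the same, differs in code points
```

Because the object cannot be mutated after creation, handing out the same reference to every caller is safe: no caller can change the value under another caller's feet.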

Below is an example Blank Node implementation that uses object interning and guards access to the weak-reference cache with a lock for thread safety. Because the cache holds only weak references, memory should be freed once the objects are no longer in use.

import threading
from dataclasses import dataclass, field
from typing import Any, Self, final
from uuid import uuid4
from weakref import WeakValueDictionary


class InternedBlankNode:
    _intern_cache: WeakValueDictionary[str, "Self"] = WeakValueDictionary()
    _lock = threading.Lock()

    __slots__ = ("__weakref__",)

    def __new__(cls, value: str | None = None) -> Self:
        if value is None:
            value = str(uuid4()).replace("-", "0")

        with cls._lock:
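            # Use .get() rather than "in" followed by indexing: the entry could
            # be garbage collected between the two steps and raise KeyError.
            cached = cls._intern_cache.get(value)
            if cached is not None:
                return cached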

            instance = super().__new__(cls)
            object.__setattr__(instance, "value", value)
            cls._intern_cache[value] = instance
            return instance


@final
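# init=False: __new__ already sets ``value``. A generated __init__ would run
# after __new__ and, for BlankNode(), overwrite ``value`` with a second UUID
# that no longer matches the key stored in the intern cache.
@dataclass(frozen=True, slots=True, init=False)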
class BlankNode(InternedBlankNode):
    """
    An RDF blank node representing an anonymous resource.

    Specification: https://www.w3.org/TR/rdf12-concepts/#section-blank-nodes

    This implementation uses object interning to ensure that blank nodes
    with the same identifier reference the same object instance, optimizing
    memory usage. The class is marked final to ensure the
    :py:meth:`InternedBlankNode.__new__` implementation cannot be overridden.

    :param value:
        A blank node identifier. If :py:obj:`None` is provided, an identifier
        will be generated.
    """

    value: str = field(default_factory=lambda: str(uuid4()).replace("-", "0"))

    def __str__(self) -> str:
        return f"_:{self.value}"

    def __reduce__(self) -> str | tuple[Any, ...]:
        return self.__class__, (self.value,)


__all__ = ["BlankNode"]
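
To check the claim that memory is released once terms are no longer referenced, a small self-contained sketch along these lines might help. `Node` and `cache_size` are hypothetical names used only here; the class is a stripped-down version of the pattern above, without the lock or the dataclass machinery.

```python
import gc
from weakref import WeakValueDictionary


class Node:
    """Hypothetical minimal interned term, used only for this demonstration."""

    __slots__ = ("value", "__weakref__")
    _cache: "WeakValueDictionary[str, Node]" = WeakValueDictionary()

    def __new__(cls, value: str) -> "Node":
        cached = cls._cache.get(value)
        if cached is None:
            cached = super().__new__(cls)
            cached.value = value
            cls._cache[value] = cached
        return cached


def cache_size() -> int:
    """Number of live entries in the intern cache after a collection pass."""
    gc.collect()
    return len(Node._cache)


# Hold strong references to 1000 distinct nodes, then drop them all.
nodes = [Node(f"b{i}") for i in range(1000)]
size_while_referenced = cache_size()

del nodes
size_after_release = cache_size()

print(size_while_referenced, size_after_release)
```

On CPython I would expect the cache to hold all 1000 entries while `nodes` is alive and to be empty after the `del`, since the dictionary only keeps weak references to its values. Other interpreters may need the explicit `gc.collect()` before the entries disappear.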

And tests:

import pickle

import pytest

from rdf_core.terms import BlankNode


def test_blank_node():
    bnode1 = BlankNode("123")
    bnode2 = BlankNode("123")
    bnode3 = BlankNode("222")

    assert bnode1.value == bnode2.value
    assert bnode1.value != bnode3.value
    assert bnode1 == bnode2
    assert bnode1 != bnode3
    assert bnode1 is bnode2
    assert bnode1 is not bnode3
    assert hash(bnode1) == hash(bnode2)

    bnode4 = BlankNode()
    assert len(bnode4.value) > 0


def test_blank_node_repr_str():
    bnode1 = BlankNode("123")
    assert repr(bnode1) == "BlankNode(value='123')"
    assert str(bnode1) == "_:123"


def test_blank_node_immutability():
    bnode1 = BlankNode("123")
    with pytest.raises(AttributeError):
        bnode1.value = "222"


def test_blank_node_pickling():
    bnode1 = BlankNode("123")
    pickled = pickle.dumps(bnode1)
    unpickled = pickle.loads(pickled)
    assert bnode1 is unpickled
    assert bnode1 == unpickled
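
The tests above do not exercise the lock. A possible concurrency test is sketched below, under the assumption that the interning pattern is the same as in the implementation above. `InternedNode` is a hypothetical stand-in so the snippet runs without the `rdf_core` package.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from weakref import WeakValueDictionary


class InternedNode:
    """Hypothetical stand-in that mirrors the locking pattern in this issue."""

    __slots__ = ("value", "__weakref__")
    _cache: "WeakValueDictionary[str, InternedNode]" = WeakValueDictionary()
    _lock = threading.Lock()

    def __new__(cls, value: str) -> "InternedNode":
        # Lookup and insertion happen under one lock so two threads cannot
        # both miss the cache and create separate objects for the same value.
        with cls._lock:
            cached = cls._cache.get(value)
            if cached is None:
                cached = super().__new__(cls)
                cached.value = value
                cls._cache[value] = cached
            return cached


def test_interning_is_thread_safe() -> None:
    workers = 8
    # The timeout turns a stuck barrier into an error instead of a hang.
    barrier = threading.Barrier(workers, timeout=5)

    def make(_: int) -> InternedNode:
        barrier.wait()  # release all workers together to maximise contention
        return InternedNode("shared")

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(make, range(workers)))

    first = results[0]
    assert all(node is first for node in results)
```

A test like this cannot prove the absence of races, but it should fail fairly reliably if the lock is removed and the cache check and insertion are separated by other work.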