Representing Hierarchies in the Catalog DB #684

genematx · 2024-03-08T15:53:17Z

We had a very interesting conversation with @danielballan yesterday about the structure of the catalog.db database, specifically the representation of the hierarchy of nodes.

The rows of the nodes table store metadata of assets that often (always?) are in a hierarchical relationship with each other (e.g. a tree of folders). Currently, this structure is encoded by storing lists of ancestors of each node. Since several nodes may have same uids (e.g. names of subfolders in separate paths of the tree), the lists of ancestors serve as a second part of the composite key to ensure uniqueness (please see attached). While having lists of ancestors readily available may improve the query performance, it also raises the question of the efficiency of data storage and the complexity of the key.

A possible solution could be:

Ensure that uid column uniquely identifies each node in the table.
Use these uids to capture the tree structure in an external linked "closure" table. The idea is to keep track of every "ancestor-descendant" pair (not just "parent-child") along with the corresponding distance (depth) between them. Assuming that each row is it's own descendant and ancestor at depth 0, this leads to fast rules for data queries and updates without the need to keep lists of references.

For small (shallow) hierarchies the befits of using closure tables could be minimal, but it should be quite noticeable with larger datasets and would help us to future-proof the database schema. Please share your thoughts and comments!
@danielballan, @dylanmcreynolds, @Kezzsim

The text was updated successfully, but these errors were encountered:

danielballan · 2024-03-08T17:05:40Z

The closure table looks like very elegant pattern that applies well to our use case. I like that triggers can be used to keep the closure table in sync with the "main" table (in our case, the nodes table) without the application getting directly involved.

I think we can use the id as a unique ID for each node.

tiled/tiled/catalog/orm.py

Lines 62 to 63 in 63b583b

    
           # This id is internal, never exposed to the client. 
        
           id = Column(Integer, primary_key=True, index=True, autoincrement=True)

The key column, which is sometimes a BlueskyRun <uid> but is in general is any string, is akin to name in the example in that article. The key / name can repeat; we just need a unique constraint on (key, parent). (Currently we have a unique constraint on (key, ancestors).)

I think makes sense to sketch this at a level we can benchmark it against realistic data at a realistic scale.

genematx added the SQL label Mar 8, 2024

danielballan mentioned this issue Aug 14, 2024

ENH: add alembic migration #777

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Representing Hierarchies in the Catalog DB #684

Representing Hierarchies in the Catalog DB #684

genematx commented Mar 8, 2024

danielballan commented Mar 8, 2024

Representing Hierarchies in the Catalog DB #684

Representing Hierarchies in the Catalog DB #684

Comments

genematx commented Mar 8, 2024

danielballan commented Mar 8, 2024