hNhM(highly Normalized hybrid Model) – data modeling methodology that enables iterative and Agile-driven modifications to your Data Warehouse (DWH). The methodology is based on Anchor Modeling and Data Vault.
The idea is to represent each Entity using 3 types of tables:
Hub
stores Business and Surrogate keysAttribute
stores the history of changes (e.g. Name, Age, Gender...)Group
is used to reduce the number of underlying tables by grouping multiple Attributes in one Group. The best practice is to define a Group by the same data source.
The purpose of hnhm
is to:
- Define Entities, Links and Flows declaratively using Python
- Describe them "logically"
- The "physical" layer is managed by
hnhm
- Apply the changes to a DWH iteratively
hnhm
detects changes in the entities and creates new Attributes, Groups, or Links
- Load data from the Staging layer using Flows in the correct order
- Ignore, Update, or keep the history using
SCD2
- Ignore, Update, or keep the history using
Implementation of this package is based on report "How we implemented our data storage model — highly Normalized hybrid Model" by Evgeny Ermakov and Nikolai Grebenshchikov. 1) Yandex Report, habr.com. 2) SmartData Conference, youtube.com
You can find the full code in the dwh
folder.
Install the library
pip install hnhm
Create the dwh
directory
mkdir dwh
Spin up the Postgres Database
# dwh/docker-compose.yml
version: "3.9"
volumes:
postgres_data: { }
services:
postgres:
image: postgres:15
volumes:
- "postgres_data:/var/lib/postgresql/data"
ports:
- "5433:5432"
environment:
POSTGRES_DB: hnhm
POSTGRES_USER: hnhm
POSTGRES_PASSWORD: 123
# $ docker compose up -d
Create the __hnhm__.py
file with the following contents:
# dwh/__hnhm__.py
from hnhm import PostgresPsycopgSql, HnHm, HnhmRegistry, FileState
sql = PostgresPsycopgSql(
database="hnhm",
user="hnhm",
password="123",
port=5433
)
registry = HnhmRegistry(
hnhm=HnHm(
sql=sql,
state=FileState("state.json"),
),
)
PostgresPsycopgSql
generates and executes SQLHnhmRegistry
stores thehnhm
object, entities and linksHnHm
implements core logic and manages the stateFileState
stores the state of your DWH in a file
Apply the changes to your DWH:
$ hnhm apply dwh
Importing 'registry' object from the module: 'dwh.__hnhm__'.
Your DWH is up to date.
Our DWH is up-to-date because we haven't added any entities and links yet.
Let's add the new Entity. Add the dwh/user.py
file with the following contents:
# dwh/user.py
from hnhm import (
Layout,
LayoutType,
String,
ChangeType,
HnhmEntity,
)
class User(HnhmEntity):
"""User data."""
__layout__ = Layout(name="user", type=LayoutType.HNHM)
user_id = String("User ID.", change_type=ChangeType.IGNORE)
__keys__ = [user_id]
Add the User
entity to the registry:
# dwh/__hnhm__.py
from hnhm import PostgresPsycopgSql, HnHm, HnhmRegistry, FileState
+ from dwh.user import User
sql = PostgresPsycopgSql(
database="hnhm",
user="hnhm",
password="123",
port=5433
)
registry = HnhmRegistry(
+ entities=[User()],
hnhm=HnHm(
sql=sql,
state=FileState("state.json"),
),
)
Apply the changes to your DWH:
$ hnhm apply dwh
Importing 'registry' object from the module: 'dwh.__hnhm__'.
Plan:
+ entity 'HNHM.user'
+ view 'user'
+ hub 'user'
Apply migrations? [y/N]: y
Applied!
We added the new entity User
to our DWH
-- View on top the DDS tables
select * from entity__user;
-- Hub
select * from hub__user;
Let's add an Attribute and a Group to our Entity. Edit the dwh/user.py
file:
# dwh/user.py
from hnhm import (
Layout,
LayoutType,
String,
ChangeType,
HnhmEntity,
+ Integer
)
class User(HnhmEntity):
"""User data."""
__layout__ = Layout(name="user", type=LayoutType.HNHM)
user_id = String("User ID.", change_type=ChangeType.IGNORE)
+ age = Integer("Age.", change_type=ChangeType.UPDATE)
+ first_name = String("First name.", change_type=ChangeType.NEW, group="name")
+ last_name = String("Last name.", change_type=ChangeType.NEW, group="name")
__keys__ = [user_id]
Apply the changes to your DWH:
$ hnhm apply dwh
Importing 'registry' object from the module: 'dwh.__hnhm__'.
Plan:
[u] entity 'HNHM.user'
+ attribute 'age'
+ group 'name'
|attribute 'first_name'
|attribute 'last_name'
[u] view 'user'
Apply migrations? [y/N]: y
Applied!
Take a look at newly created tables
-- View on top of the DDS tables was updated
select * from entity__user;
-- Attribute 'age'
select * from attr__user__age;
-- Group 'name'
select * from group__user__name;
The physical result:
view: entity__user
┌────────────────────────────────────────────────────────────────┐
│┌───────────────────┐ ┌────────────────┐ ┌─────────────────┐│
│ group__user__name │ │ hub__user │ │ attr__user__age │
│ │ │ │ │ │
│ + user_sk (FK) ├──►│ + user_sk (PK) │◄──┤ + user_sk (FK) │
│ + first_name │ │ + user_id_bk │ │ + age │
│ + last_name │ │ + valid_from │ │ + valid_from │
│ + valid_from │ │ + _hash │ │ + _hash │
│ + valid_to │ │ + _source │ │ + _source │
│ + _hash │ │ + _loaded_at │ │ + _loaded_at │
│ + _source │ └────────────────┘ └─────────────────┘
│ + _loaded_at │
└───────────────────┘
- Entity: business entity (User, Review, Order, Booking)
- Link: the relationship between Entities (UserOrder, UserBooking)
- Flow: helps to load data from the stage layer to Entities and Links
- Hub: hub table contains Entity's Business Keys and Surrogate Key(MD5 hash of concatenated business keys)
- Attribute: attribute table contains FK to Entity's surrogate key, history of attribute changes, and the
valid_from
column - Group: group table contains FK to Entity's surrogate key, history of changes to group attributes, and the
valid_from
column - Link: link table contains FKs to Entities surrogate keys. Historicity by
SCD2
IGNORE
: insert the latest new data, ignore updatesUPDATE
: insert the latest new data, updateNEW
: full history usingSCD2
. Adds thevalid_to
column
- Postgres