Skip to content

Database design discussion

Mike King edited this page Jul 27, 2022 · 3 revisions

Intro

We are going to talk about relational vs document databases here. I am admit my bias toward the latter. There is no correct answer, both databases work very well in practice, and they have different pros and cons.

Relational databases

A relational database is a (most commonly digital) database based on the relational model of data, as proposed by E. F. Codd in 1970.

The purpose of the relational database was to enable easy reporting on data. By organising information in rows and columns, it became simple to dump them out in tabular form, and to perform calculations on the values in the data. Most people are not aware of it, but the adoption of a rigid structure is what can make relational database design complicated.

SQL was an invention that make querying and manipulating of data in the database almost human readable. To this day it remains hard to beat.

Document databases

A document-oriented database, or document store, is a computer program and data storage system designed for storing, retrieving and managing document-oriented information, also known as semi-structured data. They are often referred to as NOSQL databases, just to make it clear they

Document databases, or object stores allow data to be represented in a more flexible way, like a word document or a spreadsheet. They may contain tabular data, but they can also contain other data such as sentences, diagrams and images that are to store in a relational database.

An example

Let's start with a "simple" example of a user's profile record. In the SQL definition it might look like this

CREATE TABLE PROFILE (
    id: INTEGER,
    first_name: VARCHAR(20),
    last_name: VARCHAR(20),
)

Simple enough, although you might notice the character limits on the fields, that are an early warning of structural limits in our data. It is possible to use a string that is not limited in length, but the relational database engine likes to pack its data tightly, and this goes against that technique.

Let's add some more information, like a phone number:

CREATE TABLE PROFILE (
    id: INTEGER,
    first_name: VARCHAR(20),
    last_name: VARCHAR(20),
    mobile: VARCHAR(12),
    home_phone: VARCHAR(12),
)

Again, simple enough, but what if a user has 2 mobile numbers?

CREATE TABLE PROFILE (
    id: INTEGER,
    first_name: VARCHAR(20),
    last_name: VARCHAR(20),
    mobile: VARCHAR(12),
    mobile2: VARCHAR(12),  -- A little clumsy
    home_phone: VARCHAR(12),
)

This approach can be acceptable, if the user's profile information isn't core to the application, we can live with that.

But what if we want to allow more flexibility with phone numbers? This is when we start creating additional tables. Let's create a PHONE table now:

CREATE TABLE PHONE (
    id: INTEGER,
    phone: VARCHAR(12),
    phone_type_id: INTEGER,
    profile_id: INTEGER,
)

Wait, what are all these id fields? These are foreign keys, basically references to the _id of the table it is referring to.

So now, we started by creating a PHONE table, but we also created a PHONE_TYPE table, which tells us what type of phone it is, like mobile, work, home, etc

This is all very well, but already we have a whole bunch of ceremony around a simple thing like a phone number.

Document database example

Let's look at how we can do it in a document database. We are going to use an npm module called simple-schema to show how the data is structured.

We'll approach it in reverse, starting with the PHONE collection. Note the use of collection instead of table. They are kind of the same thing, but it's a good idea to use different names for them.

So the PHONE collection looks like this

const PhoneSchema = new SimpleSchema({
  _id: randomId, // More about this later, it's a 17 character alphanumeric random string
  phone: String, // No fixed width fields here
  phone_type: {
    type: String,
    allowedValues: ["mobile", "work", "home", "emergency"],
  },
});

Ok, so what happened here? Where are the xxxx_id fields, don't we need them?

What we did was use the allowedValues construct, which eliminates the need for a separate collection.

So let's see how we use it in the PROFILE collection...

const ProfileSchema = new SimpleSchema({
 _id: randomId,
 first_name: String,
 last_name: String,
 phones: Array,     // An array of phone numbers
 phones.$: {
    type: PhoneSchema  // This is the structure for the array elements
 }
})

So the ProfileSchema includes an array of phone numbers and types. We defined an additional schema for PHONES, but we didn't create an additional collection. Phone numbers are inside the user's profile, which seems logical. No foreign keys, constraints and table joins to implement.

Achievements

Now it would have been entirely possible for us to implement 3 collections to mimic the "relational way", but that would have been a mistake. What we have done instead is economise on collections, and data retrieval very simple. We can do the same thing for other personal attributes, address is a good example, as we have multiple addresses, postal and shipping, home and work, etc.

We have put a little complexity in our schema (actually you can argue that the sub-schema is cleaner than multiple tables), and our data is stored in a more logical fashion.

Validation options

The SQL schema's strongest rules are around rigid constraints, such as a number must be in a certain range (no if's or but's), or a phone number is required, and it must conform to a certain format. These rules are a bit inflexible, and are often relaxed in favour of application level validation, which provides some leeway in the real world.

Let's take an example of a VEHICLE table. The vehicle record structure defines which fields are optional and which are required.

The optional/required setting on a field is kind of black and white, a field like the VIN number will be required, because a vehicle must have one to be registered.

In real life this rule is too rigid: the VIN number may have been removed, so we may not be able to fill it in. Typically this field is then made optional, because there are some cases where it may not be known. This isn't fully satisfactory, because VIN numbers are like the primary key for a vehicle, and should be there at some stage.

Validation functions in Simple Schema

Document schemas are more flexible than relational schemas, because you can make a field optional/required based on the value of another field in the same record.

So in this case, VIN number will be optional unless a registration number is present, or better still, the VIN number is only required when the status of the vehicle is "registered".

This is a much better result than the relational schema, because it reflects a real world scenario: it can be missing, but before we allow it to be "registered", the VIN number must be present.

Next level

I started pretty simply, but now I want to accelerate what we are doing, and go deeper.

The document database gives us some advantages, such as

Feature Value
Database is not structured We have the freedom to store anything, as long as it can be serialised.
Fewer tables This means better performance, less coding to build complex SQL joins across tables.
Schema is in the code This gives us amazing flexibility, as the schema is "soft". We can either store the schema in code, or even in the database. We can also manipulate the schema, and adapt it based on the circumstances.
Functions can be saved Javascript functions behave like data, and can be stored, and invoked at a later time (like to do validation). So we can store validation functions along with our schema

In JavaScript, functions are first-class objects, because they can have properties and methods just like any other object. What distinguishes them from other objects is that functions can be called. In brief, they are Function objects.

Single collection database

As we have the flexibility to use and manipulate schemas in the code, and our storage mechanism doesn't dictate anything to us, we can overlay all of our collections into a single collection.

This is a bit mind boggling at first, but there are reasons to do with the way that the document database works, that can counter some of the advantages of a relational datbase.

Let's look at a use case: Let's assume that we have 4 tables

JOB JOB_ITEM (ie a piece of work) ORDER ORDER_ITEM

Let's further assume that everything is related to a JOB, so each of these tables will include a JOB_ID as a foreign key reference.

Document databases don't do joins, so fetching all the data for a JOB will mean 4 queries, one on each table. This isn't very efficient, and if we have to then reach out to other related tables, fetching the data can become slow, especially with large datasets.

Let's go with our single collection database model. We just need add something like a RECORD_TYPE to each record to tell us what kind of record it is. We can fetch all of the related records with a single query for JOB_ID. This allows us to slice and dice the data as we please, we can spin it out to a spreadsheet, aggregate it for a dashboard, of display it on screen.

As long as we define the right indexes in the database, it will perform very well.

Convinced?

Are you convinced, or do you have doubts?

What's next?

If the idea seems sound, let's move on to implementation.

Stage 1 - structure

  1. Define our common collection and indexes
  2. Define schemas for our data
  3. Attach schemas to common collection
  4. Database update code needs to reference the correct schema

Stage 2 - dynamic schemas

We probably need to have some discussion around this. Let's make the following assumptions

Assumption Details
Base schema Define a base schema, eg for Assets. This will be common for all assets. This base schema is not editable. There will be multiple base schemas.
Extension schema Allow an admin user to extend a base schema - they will need a UI where they can edit the schema. Version 1 might just be the ability to edit some JSON
Display data in a list User can view a list of data items. The list view is customisable to allow selection of columns/sort order/filters etc. A prototype of this exists already

|

Clone this wiki locally