
Notes and questions on logs from reading #5

Open
Cleop opened this issue Sep 28, 2018 · 1 comment
Cleop commented Sep 28, 2018

As someone who is new to logs I could follow the example. However, discussing the subject with others brought up key terms and concepts that were new to me. I think it would be useful to include some of this context in the readme for those who may stumble upon this repo without knowing what a log is first.

What is a log?

AKA write-ahead log, commit log, transaction log. In this repo 'log' will not refer to 'application logging', the kind of logging you might see for error messages.

A log is one of the simplest possible storage abstractions: an append-only, totally-ordered sequence of records, ordered by time. Logs are usually visualised horizontally, from left to right.

They're not all that different from a file or a table: if we consider a file as an array of bytes and a table as an array of records, then a log can be thought of as a kind of table whose records are sorted by time.
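To make that concrete, here is a minimal sketch of a log as an append-only, time-ordered structure (plain illustrative Python; the example in this repo uses Elixir/Postgres):

```python
import time

# The log: a sequence we only ever append to.
log = []

def append(record):
    """Add a record to the end of the log; records are never
    updated or deleted, only appended."""
    log.append({"ts": time.time(), **record})

append({"event": "user_created", "name": "Alice"})
append({"event": "user_renamed", "name": "Alicia"})

# Like a table sorted by time: timestamps are non-decreasing.
timestamps = [r["ts"] for r in log]
assert timestamps == sorted(timestamps)
```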

Logs are event driven: they continuously record what happened and when. Because records are stored in the order the changes occurred, you can revert to any given point in time by finding it in the records. This can be done in near real time, which makes logs ideal for analytics. Logs are also helpful in the event of crashes or errors: because they record the state of the data at all times, data can easily be restored. Keeping an immutable log of the history of your data means the data stays clean and is never lost or changed. The log is appended to by publishers of data and read / acted upon by subscribers, but the records themselves are never mutated.

Keywords

Time series database: a database system optimised for handling time series data (arrays of numbers indexed by time). They handle queries for historical data / time zones better than relational databases.

Data integration: making all the data an organisation has available in all its services and systems.

Log compaction: techniques for tidying up a log by deleting data that is no longer needed.
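A minimal sketch of one common compaction strategy (illustrative Python, not how any particular system implements it): keep only the latest record per key and drop the rest.

```python
def compact(log):
    """Return only the most recent record for each key.
    Because the log is ordered by time, a later record
    for the same key simply replaces the earlier one."""
    latest = {}
    for record in log:
        latest[record["key"]] = record
    return list(latest.values())

log = [
    {"key": "thor", "address": "Asgard"},
    {"key": "loki", "address": "Jotunheim"},
    {"key": "thor", "address": "177A Bleecker Street"},
]
compacted = compact(log)
# Thor's older "Asgard" record is the "no longer needed" data.
assert len(compacted) == 2
```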

Questions

  • Are we performing physical (the data itself) or logical (the command or calculation which results in the data) logging?
  • Are all logs time series databases?
  • Does a log contain all of the fields of data that would be captured in a relational DB schema? Does this lead to there being a lot of empty rows because some columns aren't applicable to everything?
  • Are all logs append only?
  • What is a record in the context of a log? Is it the equivalent of a row in a table? Does it have column titles so all records have the same field titles?
  • Do all logs run horizontally?
  • Is a timestamp value mandatory in a record in a log? Do they act as the unique identifier for records or does a numeric id?
  • When are horizontal scaling partitions useful? I didn't understand from the article in what context they'd be used... Would you have a separate log per user, with the partition made for each user ID?
  • The article referred a lot to 'distributed data systems' - if a log is a move away from them, what kind of system would you call a log system? Does it have a name? Is it an integrated system?

These notes and questions came from reading:
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

@Cleop added the question ("Further information is requested") label on Sep 28, 2018
nelsonic commented Sep 30, 2018

Hi @Cleop, thank you for opening this issue and summarising your knowledge quest! 🎉

Stoked that you found the Example created by @Danwhy clear and followed it on your localhost. 🥇

Answers to your Questions (above)

  • "Physical vs Logical?" >> Logical.
    We are storing all "CRUD" operations on the data as new rows in a Postgres Database table
    and then (much later in the lifecycle of the product) computing a "View" https://en.wikipedia.org/wiki/View_(SQL) on that data to optimise SELECT queries.

    This is way too much detail (complexity) for a beginner to worry (or even know/think) about,
    but the curious can read:

  • All logs are time series by definition: if a record is stored with a timestamp, it's a time series.
    Perhaps a useful clarification/distinction from the Wikipedia article https://en.wikipedia.org/wiki/Time_series_database (which is kinda useless/confusing) ... 😞
    Any data can be stored as time series and most ("big") data is!

    A "beginner-friendly" (+funny!) intro to "Big Data": https://www.bbc.co.uk/programmes/b0b9wbf8
    In the Apps we're building, we're using a UUID as the Primary Key (PK) to avoid write conflicts.
    Even with Erlang's timestamp precision being microseconds, anything more than ~50 DB writes per second would result in an unacceptably high chance of PK collision. 💔
    Timestamps in Elixir:
    https://michal.muskala.eu/2017/02/02/unix-timestamps-in-elixir-1-4.html
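A sketch of the UUID-as-primary-key idea (Python's `uuid.uuid4` standing in here for whatever the Elixir app actually uses): the timestamp is still recorded, but only as data, never as the identifier.

```python
import time
import uuid

def new_record(data):
    """Each record gets a random UUID as its Primary Key;
    `inserted_at` is stored alongside, but is not the identifier."""
    return {"id": str(uuid.uuid4()), "inserted_at": time.time(), **data}

a = new_record({"name": "Alice"})
b = new_record({"name": "Bob"})

# Two writes in the same microsecond would collide on timestamp,
# but their UUIDs are (for all practical purposes) unique.
assert a["id"] != b["id"]
```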

  • "Are all logs append only?" >> Yes, all logs worth knowing/using are append-only.
    If logs are anything other than append-only they are a source of chaos/confusion. 😿

    Good "further reading" on this topic:

  • "What is a record in the context of a log? Is it the equivalent of a row in a table?" >> Yes.
    All records are rows. All rows have column headings. Most fields beyond the essential ones are optional.
    For example, in an Address Book there's no point having an empty row
    (i.e. if all fields were optional, blank records would be "OK", which is obviously undesirable),
    so the minimum acceptable data for a new Address Book entry is the person's Name;
    everything else is optional and can be added on subsequent writes/updates.

  • "Do all logs run horizontally?" >> Don't confuse yourself with the direction of "travel".
    All time is linear and uni-directional. If in any doubt, logs run vertically, top-to-bottom, like reading a "page".

    Think about how you like to scroll a screen/phone using your keyboard, mouse or finger:
    How often do you scroll horizontally? (almost never. it's a UX anti-pattern)
    By contrast scrolling vertically is a natural UX.
    There might be the occasional timeline in a history book that is horizontal for aesthetic reasons. But it almost always means they have "truncated" the data points to fit on the page.
    Almost no computer systems are horizontal-scrolling.
    Note: horizontal scrolling is not the same as "swiping" on a mobile; the go-to UX for all hipster apps: https://uxplanet.org/horizontal-scrolling-in-mobile-643c81901af3

  • "Is a timestamp value mandatory in a record in a log?" >> Yes, by definition.
    And thankfully, Phoenix/Ecto already stores timestamps: inserted_at. 🎉

    • Do they act as the unique identifier for records or does a numeric id? >> No.

    Timestamps should never be used as the unique identifier in any serious system.
    The Primary Key (PK) should be Universally Unique (hence using a UUID), i.e. virtually zero chance of conflict.
    Using a timestamp would almost always guarantee conflicts where writes per second
    are greater than 1k. In a modest "Chat" application with 100k concurrent users,
    1k writes/sec is "normal" (each person sends one message every couple of minutes...).
    See: http://www.internetlivestats.com
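The collision claim can be sanity-checked with the birthday problem: microsecond precision gives 10^6 distinct timestamp values per second, so the chance of at least one collision among n writes in a second is roughly 1 − e^(−n(n−1)/2N). A back-of-envelope sketch (my numbers, not from the article):

```python
import math

SLOTS = 1_000_000   # microsecond precision: 1e6 timestamps per second
WRITES = 50         # writes arriving within one second

# Birthday-problem approximation: P(collision) ≈ 1 - exp(-n(n-1) / 2N)
p_second = 1 - math.exp(-WRITES * (WRITES - 1) / (2 * SLOTS))
p_hour = 1 - (1 - p_second) ** 3600   # sustained for an hour

# ≈ 0.12% chance per second, but ≈ 99% chance within an hour:
print(f"per second: {p_second:.4%}, per hour: {p_hour:.1%}")
```

So even at a modest 50 writes/second, a timestamp Primary Key is practically guaranteed to collide within hours of normal operation.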

  • "horizontal scaling partitions useful?" >> 😄 (What will you do when you win the lottery?)
    (It's good that you are keen! But you are over-thinking this... come back to this question in 2020!)
    Thinking about Horizontal Scaling is way beyond the "scope" of this example/tutorial.
    To be clear: we are using an Append-only Log for reliability, accountability and record rollback-ability, not "scalability" (i.e. "premature optimisation"). Scalability is a distant future benefit, not a focus.

    If you (or anyone else) are personally interested in Scalability you should read: https://stackoverflow.com/questions/11707879/difference-between-scaling-horizontally-and-vertically-for-databases
    Using an Append-only Log with UUIDs as PKs is all the "ground work" we need
    to ensure that anything we build is prepared to scale both Vertically and Horizontally. ✅ 🚀
    When any of our apps reaches 10k writes/sec we will be insanely "successful". 🦄 🎉
    An AWS RDS db.m4.16xlarge instance has 256GB of RAM and can handle 10GB of "throughput". It's been benchmarked at 200k writes/second ... if we ever need to use one of these instances, we'll all be sipping coconut water on the shore of the @dwyl island! 🌴☀️🍸 🏄‍♀️ 🐢 🐟 ⛵️

  • 'distributed data systems' - "if a log is a move away from them" ...? >> Huh...? 😕
    Logs are the basis for all distributed/fault-tolerant systems. That's all we need to know/say.

    Please avoid adding more terms/complexity to the readme than are strictly necessary;
    anything more than "append-only log" will automatically "confuse" people who are new to this.
    If you are curious about understanding this in way more detail, read: https://raft.github.io
    and watch @substack's talk on "What you can build with a log":
    https://youtu.be/RPFjN1N148U
    For the purposes of all developers using Phoenix, just trust PostgreSQL to handle your data.
    You're in "good company": see "Who is using Postgres?" learn-postgresql#31
    And if our app is lucky enough to be "successful", look into CitusDB

There is no need to confuse beginners with the term "write-ahead log" https://en.wikipedia.org/wiki/Write-ahead_logging ("TMI"); it's just an append-only log.
Hopefully people will understand the words: "append", "only" and "log" ... 💭
(if not, they probably aren't "ready" for this example/tutorial ...)

In general/practice, log compaction is rarely (if ever) needed by most companies/products these days.
Data storage costs are so cheap (see detail below) that every App developer can easily afford to store all data generated by their App - provided the app does not store arbitrary data for no reason.
In most companies, data is the "life blood" of the strategic decision making, the entire field of "BI" https://en.wikipedia.org/wiki/Business_intelligence relies on having as much data as possible.

Aside: Never Over-write or Delete Data it is your Most Valuable Asset!

I feel "qualified" to assert that using Append-only Logs for everything in a company
is the single best technology decision that can be made. Logs are the foundation the rest of the building stands on.
I helped set up "BI" at Groupon; no data was ever deleted or compacted, only periodically "warehoused" to save on storage cost, but it remained accessible via Hadoop.
Per employee, we were the single biggest source of additional revenue, cost savings and value creation in the company. A "department" of 8 people generated $20M of annual value for the company by spotting trends in markets, products & buying behaviour. 📈
Hedge funds (like Bridgewater or Citadel) use data analysis to multiply cash.
It's a fascinating world/activity that generates no (NET) value to society
but makes a handful of people phenomenally wealthy!
https://en.wikipedia.org/wiki/The_rich_get_richer_and_the_poor_get_poorer

If you're ever interested in learning how some Mathematicians use their skills to print money,
read: https://en.wikipedia.org/wiki/The_Quants (I think we still have a copy in the @dwyl library...)
Or if you're short on time or prefer to watch the dramatised version, The "Big Short" is good. 📺
If you haven't seen it, add it to your list: https://www.imdb.com/title/tt1596363
Or [Spoiler Alert] just watch the ending: https://youtu.be/Bu2wNKlVRzE
The most insightful part of the film is in the end credits. Michael Burry is investing in Water:
If you (or anyone else) ever have "spare capital", do the same: invest in water (recovery/treatment).
(Not to profit from gouging the poor! Water treatment startups/ideas/tech are "win-win".)

tl;dr

Some Data gets Updated, Most Never Does

Many developers have an "observation bias" toward data changing (mutability): because we often change the data we interact with, we assume that mutability is the "norm", but it really is not.
In fact, most data gets created/written once and never gets updated.
Append-only logs are (unsurprisingly) common in virtually every area of computing.

Even when data appears to be mutable from the User's perspective, it is often stored in an immutable log: the underlying store never changes, but the UI appears to mutate data because it only displays the latest version.

In many popular web frameworks, preserving "history" of a record (or piece of content)
is often done with a "history table" which stores copies of the "previous" versions of data whenever a record is updated.
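A sketch of that "history table" pattern (illustrative Python, not any particular framework's implementation): the main table stays mutable, but every update first copies the old row into a history table.

```python
records = {}   # the "main" table: one mutable row per id
history = []   # the "history" table: previous versions of rows

def update(record_id, new_data):
    """Before over-writing a row, archive the old version.
    The main table shows only the latest state; the
    history table preserves everything that came before."""
    if record_id in records:
        history.append(dict(records[record_id]))
    records[record_id] = new_data

update("thor", {"address": "Asgard"})
update("thor", {"address": "177A Bleecker Street"})

assert records["thor"]["address"] == "177A Bleecker Street"
assert history == [{"address": "Asgard"}]
```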

Examples in Every Industry

Examples of use-cases where Append-only Logs are useful were included
in the original "Why? What? How?" issue: #1

To those examples we can add:

  • Banking/Financial transactions are all append-only (write-once) ledgers. If they were not, all accounting would be chaos and the world economy would collapse. When the "available balance" of an account is required, it is calculated from the list/log of transactions.
    (a summary of the data in an account may be cached in a "view" but it is never mutated)

    You may recall at the End of your F&C course you were applying to work for a company and they asked you to do a coding challenge/exercise that involved computing the values of a bank account.
    We paired on solving that challenge using a Log (debit and credit ledger). Data was never mutated. The account balance was always calculated "just in time" by "traversing" the Log.

  • Healthcare: a patient's medical data gets captured/recorded once as a "snapshot" in time. The doctor or ECG machine does not go back and "update" the value of the patient's heart rate or electrophysiologic pattern; a new value is sampled at each time interval.
  • Analytics is all append-only logs: (time) series of events streamed from the device to a server, saved in a time-series data store, and streamed (or "replayed") to a visualisation dashboard.
    • Events in Analytics systems are often aggregated (using "views") into charts/graphs. These "views" are "temporary tables" which store the aggregated or computed data but do not touch the underlying log/stream.
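The banking example above (a balance that is never stored, only computed "just in time" by traversing the ledger) can be sketched like this; an illustrative Python sketch, not the actual exercise solution:

```python
def balance(ledger):
    """The balance is computed 'just in time' by traversing
    the append-only log of transactions; nothing is mutated."""
    return sum(t["amount"] for t in ledger)

ledger = [
    {"type": "credit", "amount": 100},   # deposit
    {"type": "debit",  "amount": -30},   # withdrawal
    {"type": "credit", "amount": 50},    # deposit
]

assert balance(ledger) == 120
# The balance at any earlier point in time is simply a
# traversal of a prefix of the log:
assert balance(ledger[:2]) == 70
```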

The fact is that append-only logs, while not often mentioned by name, are the norm.
Think of any application that stores data, and you can see how it either benefits from
(or is totally dependent on) data immutability.

Everything is always an event.
Everything in the universe happens once. Time is linear and uni-directional.
Even in the "multiverse hypothesis", time is still uni-directional; no theory in quantum mechanics allows us to go back in time.
Mutating data is "time travelling": it updates data that was created in the past and destroys the history/timeline. In a distributed system, mutating data means unpredictability/unreliability.

Address Book Example

In the example given in the README.md of an Address Book,
we illustrate how a person can "update" their address when they move home;
however (in the example) their previous address does not get over-written:
we simply insert their new address into the database
and treat the new address as the "current" version.
That way the "Address Book" has a complete record of the history.
This is a good use of an append-only log
because address data will almost always change over the lifetime of a person,
but knowing previous addresses is useful (and in some cases essential).
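A sketch of the Address Book behaviour described above (illustrative Python; the repo's example does this with Elixir/Postgres): an "update" is an insert, and the current address is simply the latest record.

```python
addresses = []   # the append-only log of address records

def move_home(person, new_address, ts):
    """'Updating' an address inserts a new record;
    the previous address stays in the log untouched."""
    addresses.append({"person": person, "address": new_address, "ts": ts})

def current_address(person):
    """The 'current' version is the latest record for that person."""
    records = [a for a in addresses if a["person"] == person]
    return max(records, key=lambda a: a["ts"])["address"]

move_home("Thor", "The Hall, Valhalla, Asgard", 1)
move_home("Thor", "177A Bleecker Street, New York", 2)

assert current_address("Thor") == "177A Bleecker Street, New York"
assert len(addresses) == 2   # the full history is preserved
```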

If Thor decides to leave his parents' house (planet),
his "old" address, for reference, is:

Thor
The Hall, Valhalla, Asgard
AS1 3DG
0800123123

and crashes on his buddy Stephen's couch until he gets his own place in NY,
his new (temporary) address will be:

Thor Odinson c/o Dr. Strange
177A Bleecker Street,
New York, NY 10012, USA


Thanks @mathiasbynens for this handy character/byte counter: https://mothereff.in/byte-counter 🥇

The average person in the United States is expected to move 11.4 times in their lifetime:
https://fivethirtyeight.com/features/how-many-times-the-average-person-moves

Historical Context of Mutable Data: It Was Too Expensive to Store Everything

To understand why many Web Application frameworks (still) over-write data when an update is made,
we need to look back to when data storage was expensive. At the dawn of digital computing,
storing data was prohibitively expensive: in 1956, storing a Megabyte of data cost $9,200,
i.e. $9.2 Million per Gigabyte. IBM charged $850 a month (rent) for 3.75 Megabytes of storage.
With a Megabyte of storage, and assuming that the average (home) address requires 50 bytes, we could store 20K addresses. That is barely enough for the addresses of a single town.
When the cost of storage is that high, keeping all previous versions is unfeasible.
This is also why early mobile phones had a limited contact list (100 contacts in most cases).

Today we can store an incredible amount of data on a MicroSD card the size of a postage stamp:
Data storage (cost) is no longer a limiting factor, so we can afford to design all applications to be immutable.

For complete history, see: http://www.computerhistory.org/timeline/memory-storage
For relative prices of data storage see: https://jcmit.net/diskprice.htm
For a good intro to Bits and Bytes, see: https://web.stanford.edu/class/cs101/bits-bytes.html
