docs: Improving architecture docs (datahub-project#2241)
shirshanka authored Mar 16, 2021
1 parent c015cf7 commit f8b88c5
Showing 11 changed files with 301 additions and 196 deletions.
2 changes: 1 addition & 1 deletion datahub-web-react/README.md
@@ -1,4 +1,4 @@
# DataHub React App (Incubating)
# DataHub React App

## About
This module contains a React version of the DataHub UI, which is currently under incubation. Notice that this
13 changes: 11 additions & 2 deletions docs-website/generateDocsDir.ts
@@ -55,6 +55,7 @@ function list_markdown_files(): string[] {
/^metadata-ingestion-examples\//,
/^docs\/rfc\/templates\/000-template\.md$/,
/^docs\/docker\/README\.md/, // This one is just a pointer to another file.
/^docs\/README\.md/, // This one is just a pointer to the hosted docs site.
];

const markdown_files = all_markdown_files.filter((filepath) => {
@@ -75,7 +76,6 @@ function get_id(filepath: string): string {

const hardcoded_slugs = {
"README.md": "/",
"docs/README.md": "docs/overview",
};

function get_slug(filepath: string): string {
@@ -100,13 +100,19 @@ const hardcoded_titles = {
"docs/demo.md": "Demo",
};

const hardcoded_descriptions = {
// Only applied if title is also overridden.
"README.md":
"DataHub is a data discovery application built on an extensible metadata platform that helps you tame the complexity of diverse data ecosystems.",
};

// FIXME: Eventually, we'd like to fix all of the broken links within these files.
const allowed_broken_links = [
"docs/architecture/metadata-serving.md",
"docs/developers.md",
"docs/how/customize-elasticsearch-query-template.md",
"docs/how/graph-onboarding.md",
"docs/how/search-onboarding.md",
"docs/how/build-metadata-service.md",
];

function markdown_guess_title(
@@ -120,6 +126,9 @@ function markdown_guess_title(
let title: string;
if (filepath in hardcoded_titles) {
title = hardcoded_titles[filepath];
if (filepath in hardcoded_descriptions) {
contents.data.description = hardcoded_descriptions[filepath];
}
} else {
// Find first h1 header and use it as the title.
const headers = contents.content.match(/^# (.+)$/gm);
7 changes: 4 additions & 3 deletions docs-website/sidebars.js
@@ -42,12 +42,12 @@ module.exports = {
// TODO "docs/how/data-source-onboarding",
],
Architecture: [
// "docs/README",
"docs/architecture/architecture",
"docs/architecture/metadata-ingestion",
"docs/what/gma",
//"docs/what/gma",
"docs/architecture/metadata-serving",
"docs/what/gms",
//"docs/what/gms",
"datahub-web-react/README",
],
// },
// developerGuideSidebar: {
@@ -69,6 +69,7 @@ module.exports = {
"docs/what/graph",
"docs/what/search-index",
"docs/how/add-new-aspect",
"docs/how/build-metadata-service",
"docs/how/customize-elasticsearch-query-template",
"docs/how/entity-onboarding",
"docs/how/graph-onboarding",
29 changes: 1 addition & 28 deletions docs/README.md
@@ -1,28 +1 @@
# Introduction

DataHub is LinkedIn's generalized metadata search & discovery tool. To learn more about DataHub, check out our
[LinkedIn blog post](https://engineering.linkedin.com/blog/2019/data-hub) and [Strata presentation](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019). You should also visit [DataHub Architecture](architecture/architecture.md) to get a better understanding of how DataHub is implemented and [DataHub Onboarding Guide](how/entity-onboarding.md) to understand how to extend DataHub for your own use case.

In general, DataHub has two types of users in mind. One group has metadata and uses the tools provided by DataHub to ingest it into DataHub; the other uses DataHub to discover the metadata available within it. DataHub provides an intuitive UI, full-text search capability, and graph relationship presentation to make metadata discovery and understanding much easier.

The following sequence diagram highlights DataHub's key features and how the two types of users - metadata ingestion engineers and metadata discovery users - can take full advantage of DataHub.

![datahub-sequence-diagram](imgs/datahub-sequence-diagram.png)
1. It starts with ingesting your metadata into DataHub. We provide a [collection of sample Python scripts](https://github.com/linkedin/datahub/tree/master/metadata-ingestion) for you. These scripts work with popular relational databases, extract metadata from the data source, and publish it in Avro format to the MetadataChangeEvent (MCE) Kafka topic.
2. A MetadataChangeEvent (MCE) processor consumes messages from that Kafka topic, applies the necessary transformations, and sends the result to the Generalized Metadata Service (GMS), which persists the metadata to a relational database of your choice. Currently we support MySQL, PostgreSQL and MariaDB.
3. GMS also checks the received metadata to find out whether there is a previous version. If so, it publishes the difference to Kafka's MetadataAuditEvent (MAE) topic.
4. An MAE processor consumes MetadataAuditEvent messages from Kafka and persists them to Neo4j & Elasticsearch (ES).
5. The DataHub frontend talks to GMS's RESTful metadata APIs. Metadata discovery users can browse and search metadata, and view details such as ownership, lineage and custom tags.


## Documentation
* [DataHub Developer's Guide](developers.md)
* [DataHub Architecture](architecture/architecture.md)
* [DataHub Onboarding Guide](how/entity-onboarding.md)
* [Docker Images](../docker)
* [Frontend](../datahub-frontend)
* [Web App](../datahub-web)
* [Generalized Metadata Service](../gms)
* [Metadata Ingestion](../metadata-ingestion)
* [Metadata Processing Jobs](../metadata-jobs)
* [The RFC Process](rfc.md)
DataHub's project documentation is hosted at [datahubproject.io](https://datahubproject.io/docs)
42 changes: 35 additions & 7 deletions docs/architecture/architecture.md
@@ -1,11 +1,39 @@
# DataHub Architecture Overview
![datahub-architecture](../imgs/datahub-architecture.svg)

## Generalized Metadata Architecture (GMA)
Refer to [GMA](../what/gma.md).
We highly recommend that you read the excellent [metadata architectures blog post] that describes the three generations of metadata architectures, and goes into a
lot of detail around the motivations and evolution of the DataHub architecture in comparison with other data discovery solutions and catalogs.

## Metadata Serving
Refer to [metadata-serving](metadata-serving.md).
The figure below describes the high-level architecture of DataHub, a third-generation metadata platform.

## Metadata Ingestion
Refer to [metadata-ingestion](metadata-ingestion.md).
![datahub-architecture](../imgs/datahub-architecture.png)

## The Components
The DataHub deployables are split into three components:

### Ingestion
This component controls how metadata is integrated with DataHub. Read [datahub-ingestion] to learn more.

### Serving
This component is responsible for storing and querying the metadata within DataHub. Read [datahub-serving] to learn more.

### Frontend
This is the user-facing application that powers search and discovery over the metadata graph. Read [react-frontend] to learn more.


## Architecture Highlights
There are three main highlights of DataHub's architecture.

### Schema-first approach to Metadata Modeling
DataHub's metadata model is described using a [serialization-agnostic language](https://linkedin.github.io/rest.li/pdl_schema). Both [REST](../../gms) and [GraphQL APIs](../../datahub-web-react/src/graphql) are supported. In addition, DataHub supports an [AVRO-based API](../../metadata-events) over Kafka to communicate metadata changes and subscribe to them. Our [roadmap](../roadmap.md) includes a milestone to support no-code metadata model edits very soon, which will allow for even more ease of use, while retaining all the benefits of a typed API. Read about metadata modeling at [metadata modeling].
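
As a concrete illustration of the typed API, the following sketch builds a metadata change using the Python classes generated from these models. It is a minimal sketch, assuming the `datahub` Python package from [metadata-ingestion](../../metadata-ingestion) and its generated `schema_classes` module; exact class and field names may differ between versions.

```python
# Minimal sketch: constructing a typed MetadataChangeEvent (MCE) from the
# Python classes generated from DataHub's schema-first models.
# Assumes the `datahub` package; names may vary by version.
from datahub.metadata.schema_classes import (
    DatasetPropertiesClass,
    DatasetSnapshotClass,
    MetadataChangeEventClass,
)

dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"

mce = MetadataChangeEventClass(
    proposedSnapshot=DatasetSnapshotClass(
        urn=dataset_urn,
        aspects=[
            # Each aspect is a typed record generated from the underlying model.
            DatasetPropertiesClass(description="Daily count of users created."),
        ],
    )
)
```
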
### Stream-based Real-time Metadata Platform
DataHub's metadata infrastructure is stream-oriented, which allows changes in metadata to be communicated and reflected within the platform in seconds. You can also subscribe to changes happening in DataHub's metadata, allowing you to build real-time metadata-driven systems. For example, you can build an access-control system that observes a previously world-readable dataset gaining a new schema field containing PII and locks that dataset down pending an access-control review.
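
A hypothetical downstream system could watch the audit stream directly. The sketch below is a minimal illustration, assuming a local Kafka broker and the default `MetadataAuditEvent_v4` topic name; messages on this topic are Avro-encoded, so a real consumer would also deserialize them against the schema registry.

```python
# Minimal sketch: subscribing to DataHub's metadata audit stream.
# Assumes a local broker and the default MetadataAuditEvent_v4 topic name;
# values are Avro-encoded and need schema-registry deserialization in practice.
from confluent_kafka import Consumer

consumer = Consumer(
    {
        "bootstrap.servers": "localhost:9092",
        "group.id": "pii-access-review-demo",
        "auto.offset.reset": "latest",
    }
)
consumer.subscribe(["MetadataAuditEvent_v4"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        # React to the metadata change, e.g. flag datasets whose updated
        # schema aspect now contains PII fields for an access review.
        print(f"metadata change observed: {len(msg.value())} bytes")
finally:
    consumer.close()
```
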
### Federated Metadata Serving
DataHub comes with a single [metadata service (gms)](../../gms) as part of the open source repository. However, it also supports federated metadata services which can be owned and operated by different teams –– in fact, that is how LinkedIn runs DataHub internally. The federated services communicate with the central search index and graph using Kafka, to support global search and discovery while still enabling decoupled ownership of metadata. This kind of architecture is well suited to companies that are implementing [data mesh](https://martinfowler.com/articles/data-monolith-to-mesh.html).


[metadata modeling]: ../how/metadata-modelling.md
[PDL]: https://linkedin.github.io/rest.li/pdl_schema
[metadata architectures blog post]: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained
[datahub-serving]: metadata-serving.md
[datahub-ingestion]: metadata-ingestion.md
[react-frontend]: ../../datahub-web-react/README.md
74 changes: 18 additions & 56 deletions docs/architecture/metadata-ingestion.md
@@ -1,70 +1,32 @@
# Metadata Ingestion Architecture

## MCE Consumer Job
DataHub supports an extremely flexible ingestion architecture that allows for push, pull, asynchronous and synchronous models.
The figure below describes all the options possible for connecting your favorite system to DataHub.
![Ingestion Architecture](../imgs/ingestion-architecture.png)

Metadata providers communicate changes in metadata by emitting [MCE]s, which are consumed by a Kafka Streams job, [mce-consumer-job]. The [Python ingestion framework](../../metadata-ingestion/README.md) makes it easy to emit these MCEs.
The MCE consumer job converts the AVRO-based MCE into the equivalent [Pegasus Data Template] and saves it into the database by calling a special GMS ingest API.
## MCE: The Centerpiece

## MAE Consumer Job
The centerpiece of ingestion is the [Metadata Change Event (MCE)], which represents a metadata change being communicated by an upstream system.
MCE-s can be sent over Kafka, for highly scalable async publishing from source systems. They can also be sent directly to the HTTP endpoint exposed by the DataHub service tier to get synchronous success / failure responses.

All the emitted [MAE] will be consumed by a Kafka Streams job, [mae-consumer-job], which updates the [graph] and [search index] accordingly.
The job itself is entity-agnostic and will execute corresponding graph & search index builders, which will be invoked by the job when a specific metadata aspect is changed.
The builder should instruct the job how to update the graph and search index based on the metadata change.
The builder can optionally use [Remote DAO] to fetch additional metadata from other sources to help compute the final update.
## Pull-based Integration

To ensure that metadata changes are processed in the correct chronological order,
MAEs are keyed by the entity [URN] — meaning all MAEs for a particular entity will be processed sequentially by a single Kafka streams thread.
DataHub ships with a Python based [metadata-ingestion system](../../metadata-ingestion/README.md) that can connect to different sources to pull metadata from them. This metadata is then pushed via Kafka or HTTP to the DataHub storage tier. Metadata ingestion pipelines can be [orchestrated by Airflow](../../metadata-ingestion/examples/airflow) to set up scheduled ingestion easily. If you don't find a source already supported, it is very easy to [write your own](../../metadata-ingestion/README.md#contributing).
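
As an illustration, a scheduled pull can be wired up with very little code. The sketch below is a minimal example assuming the `datahub` package's `Pipeline` API with a MySQL source and a REST sink; the exact source and sink option names are illustrative, so consult the metadata-ingestion README for your version.

```python
# Minimal sketch: running a pull-based ingestion pipeline programmatically.
# Assumes the `datahub` package's Pipeline API; source/sink option names
# are illustrative - check the metadata-ingestion docs for your version.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",
            "config": {
                "host_port": "localhost:3306",
                "username": "datahub",
                "password": "datahub",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```

The same configuration can live in a YAML recipe and be scheduled from Airflow, as described in the metadata-ingestion docs.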

## Search and Graph Index Builders
## Push-based Integration

As described in [Metadata Modelling] section, [Entity], [Relationship], and [Search Document] models do not directly encode the logic of how each field should be derived from metadata.
Instead, this logic should be provided in the form of a graph or search index builder.
As long as you can emit a [Metadata Change Event (MCE)] event to Kafka or make a REST call over HTTP, you can integrate any system with DataHub. For convenience, DataHub also provides simple [Python emitters] for you to integrate into your systems to emit metadata changes (MCE-s) at the point of origin.
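
A push-based producer can stay very small. The sketch below is a minimal example assuming the REST emitter from the [Python emitters] mentioned above, reusing the kind of typed MCE shown earlier; the metadata-ingestion README also describes a Kafka-based path for asynchronous publishing.

```python
# Minimal sketch: pushing a metadata change (MCE) from the point of origin.
# Assumes the REST emitter from DataHub's Python emitters; see the
# metadata-ingestion README for the Kafka-based async alternative.
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetPropertiesClass,
    DatasetSnapshotClass,
    MetadataChangeEventClass,
)

emitter = DatahubRestEmitter("http://localhost:8080")  # datahub-gms endpoint

mce = MetadataChangeEventClass(
    proposedSnapshot=DatasetSnapshotClass(
        urn="urn:li:dataset:(urn:li:dataPlatform:kafka,PageViewEvent,PROD)",
        aspects=[DatasetPropertiesClass(description="Registered by the producing service.")],
    )
)
emitter.emit_mce(mce)  # synchronous HTTP call with an immediate success / failure response
```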

The builders register the metadata [aspect]s of their interest against [MAE Consumer Job](#mae-consumer-job) and will be invoked whenever a MAE involving the corresponding aspect is received.
If the MAE itself doesn’t contain all the metadata needed, builders can use Remote DAO to fetch from GMS directly.
## Internal Components

```java
public abstract class BaseIndexBuilder<DOCUMENT extends RecordTemplate> {
### Applying MCE-s to DataHub Service Tier (mce-consumer)

BaseIndexBuilder(@Nonnull List<Class<? extends RecordTemplate>> snapshotsInterested);
DataHub comes with a Kafka Streams-based job, [mce-consumer-job], which consumes the MCE-s, converts them into the [equivalent Pegasus format] and sends them to the DataHub Service Tier (datahub-gms) using the `/ingest` endpoint.

@Nullable
public abstract List<DOCUMENT> getDocumentsToUpdate(@Nonnull RecordTemplate snapshot);

@Nonnull
public abstract Class<DOCUMENT> getDocumentType();
}
```

```java
public interface GraphBuilder<SNAPSHOT extends RecordTemplate> {
GraphUpdates build(SNAPSHOT snapshot);

@Value
class GraphUpdates {
List<? extends RecordTemplate> entities;
List<RelationshipUpdates> relationshipUpdates;
}

@Value
class RelationshipUpdates {
List<? extends RecordTemplate> relationships;
BaseGraphWriterDAO.RemovalOption preUpdateOperation;
}
}
```

[MCE]: ../what/mxe.md#metadata-change-event-mce
[Metadata Change Event (MCE)]: ../what/mxe.md#metadata-change-event-mce
[Metadata Audit Event (MAE)]: ../what/mxe.md#metadata-audit-event-mae
[MAE]: ../what/mxe.md#metadata-audit-event-mae
[Pegasus Data Template]: https://linkedin.github.io/rest.li/how_data_is_represented_in_memory#the-data-template-layer
[graph]: ../what/graph.md
[search index]: ../what/search-index.md
[equivalent Pegasus format]: https://linkedin.github.io/rest.li/how_data_is_represented_in_memory#the-data-template-layer
[mce-consumer-job]: ../../metadata-jobs/mce-consumer-job
[mae-consumer-job]: ../../metadata-jobs/mae-consumer-job
[Remote DAO]: ../architecture/metadata-serving.md#remote-dao
[URN]: ../what/urn.md
[Metadata Modelling]: ../how/metadata-modelling.md
[Entity]: ../what/entity.md
[Relationship]: ../what/relationship.md
[Search Document]: ../what/search-document.md
[Aspect]: ../what/aspect.md
[Python emitters]: ../../metadata-ingestion/README.md#using-as-a-library

