notes
ImTaliesin committed Sep 6, 2024
1 parent fbd2726 commit a9086c4
Showing 6 changed files with 181 additions and 12 deletions.
2 changes: 1 addition & 1 deletion .obsidian/graph.json
@@ -25,6 +25,6 @@
"repelStrength": 12.2837370242215,
"linkStrength": 1,
"linkDistance": 30,
"scale": 0.6145026783518123,
"scale": 0.6145026783518126,
"close": true
}
31 changes: 22 additions & 9 deletions .obsidian/workspace.json
@@ -13,7 +13,19 @@
"state": {
"type": "markdown",
"state": {
"file": "Data Warehouse.md",
"file": "Data Engineering Notes.md",
"mode": "source",
"source": false
}
}
},
{
"id": "e0bb5ed8c2fbc3c7",
"type": "leaf",
"state": {
"type": "markdown",
"state": {
"file": "Data Factory.md",
"mode": "source",
"source": false
}
@@ -31,7 +43,8 @@
}
}
}
]
],
"currentTab": 1
}
],
"direction": "vertical"
@@ -99,7 +112,7 @@
"state": {
"type": "backlink",
"state": {
"file": "Data Warehouse.md",
"file": "Data Factory.md",
"collapseAll": false,
"extraContext": false,
"sortOrder": "alphabetical",
@@ -116,7 +129,7 @@
"state": {
"type": "outgoing-link",
"state": {
"file": "Data Warehouse.md",
"file": "Data Factory.md",
"linksCollapsed": false,
"unlinkedCollapsed": true
}
@@ -139,7 +152,7 @@
"state": {
"type": "outline",
"state": {
"file": "Data Warehouse.md"
"file": "Data Factory.md"
}
}
},
@@ -182,11 +195,13 @@
"table-editor-obsidian:Advanced Tables Toolbar": false
}
},
"active": "71c4ca83aff589db",
"active": "e0bb5ed8c2fbc3c7",
"lastOpenFiles": [
"Data Factory.md",
"Pasted image 20240906143943.png",
"Data Engineering Notes.md",
"Pasted image 20240904201734.png",
"Data Warehouse.md",
"Pasted image 20240904201734.png",
"SQL.md",
"Pasted image 20240903232241.png",
"Delta Lake.md",
@@ -209,7 +224,6 @@
"Pasted image 20240207140952.png",
"Pasted image 20240207123809.png",
"Pasted image 20240207123108.png",
"Pasted image 20240206161629.png",
"OAuth.md",
"NextUI.md",
"nexst.md",
@@ -218,7 +232,6 @@
"body-parser.md",
"Typescript React code snippets.md",
"Props.md",
"React JS.md",
"Untitled 2.canvas",
"Untitled 1.canvas"
]
62 changes: 61 additions & 1 deletion Data Engineering Notes.md
@@ -1,11 +1,71 @@
[[Database]] [[SQL]] [[Data Lake]] [[Apache Spark]] [[Delta Lake]] [[Data Warehouse]]
[[Database]] [[SQL]] [[Data Lake]] [[Apache Spark]] [[Delta Lake]] [[Data Warehouse]] [[Data Factory]]
## Vocab:
### Relational Table, data lake: data stored in files, data warehouse, data lakehouse, data pipeline, key vaults implementation with pipeline

- _ACID_ transactions: _atomicity_ (transactions complete as a single unit of work), _consistency_ (transactions leave the database in a consistent state), _isolation_ (in-process transactions can't interfere with one another), and _durability_ (when a transaction completes, the changes it made are persisted). Delta Lake tables can be used as both _sinks_ (destinations) and _sources_ for streaming data.
- **Standard formats and interoperability**. The underlying data for Delta Lake tables is stored in Parquet format, which is commonly used in data lake ingestion pipelines.
------------------------------------------------------------------
![[Pasted image 20240904201734.png]]
## Distribution Strategies in MPP Systems like Azure Synapse

### Overview

In Massively Parallel Processing (MPP) systems, data is distributed across multiple nodes to enable parallel processing and improve query performance. The way data is distributed can significantly impact query speed and resource utilization.

### Key Distribution Strategies

#### 1. Hash Distribution

- **Concept**: Data is distributed across nodes based on a hash of one or more columns.
- **Best for**: Large fact tables and large dimension tables.
- **Advantages**:
    - Minimizes data movement during joins if the join key is the same as the distribution key.
    - Enables efficient aggregations on the distribution key.
- **Considerations**:
    - Choose a column with high cardinality and even distribution.
    - Commonly used columns in JOIN, GROUP BY, and DISTINCT queries are good candidates.
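
As a minimal sketch of what hash distribution looks like in a Synapse dedicated SQL pool, the table below is distributed on a join key. The table and column names (`dbo.FactSales`, `CustomerKey`, and so on) are hypothetical and only illustrate the syntax.

```sql
-- Hypothetical fact table; all names and columns are illustrative only.
CREATE TABLE dbo.FactSales
(
    SalesOrderID  INT            NOT NULL,
    CustomerKey   INT            NOT NULL,  -- high-cardinality key used in joins
    ProductKey    INT            NOT NULL,
    OrderDateKey  INT            NOT NULL,
    SalesAmount   DECIMAL(18, 2) NOT NULL
)
WITH
(
    -- Rows with the same CustomerKey hash to the same distribution,
    -- so joins and aggregations on CustomerKey need less data movement.
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
);
```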

#### 2. Round Robin Distribution

- **Concept**: Data is distributed evenly across all nodes in a round-robin fashion.
- **Best for**: Staging tables, temporary tables, or when there's no clear joining key.
- **Advantages**:
    - Ensures even data distribution.
    - Simple to implement.
- **Considerations**:
    - Can lead to more data movement during queries, especially for joins.
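
A staging table is the typical case for round-robin distribution. The sketch below assumes a hypothetical `dbo.StageSales` table and stores it as a heap, which is a common choice for data that is loaded once and then transformed.

```sql
-- Hypothetical staging table; names and columns are illustrative only.
CREATE TABLE dbo.StageSales
(
    SalesOrderID  INT            NOT NULL,
    CustomerID    NVARCHAR(20)   NOT NULL,
    ProductID     NVARCHAR(20)   NOT NULL,
    OrderDate     DATE           NOT NULL,
    SalesAmount   DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,  -- rows are spread evenly, with no distribution key
    HEAP                         -- no index to maintain during the load
);
```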

#### 3. Replicated Distribution

- **Concept**: A full copy of the table is stored on each compute node.
- **Best for**: Small dimension tables (typically less than 2 GB).
- **Advantages**:
    - Eliminates data movement for joins.
    - Improves query performance for small, frequently used lookup tables.
- **Considerations**:
    - Increases storage requirements.
    - Not suitable for large tables or tables with frequent updates.

### Choosing the Right Strategy

1. **For large fact tables**: Use Hash distribution on a key that's frequently used in joins and aggregations.
2. **For large dimension tables**: Use Hash distribution if they're often joined with fact tables on a specific key.
3. **For small dimension tables**: Consider Replicated distribution for faster joins.
4. **For staging or temporary data**: Round Robin is often suitable.

### Impact on Query Performance

- **Collocated joins**: When tables are distributed on the join key, it minimizes data movement.
- **Data skew**: Uneven distribution can lead to performance bottlenecks.
- **Query patterns**: The optimal distribution strategy depends on your most common and critical queries.
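
One way to check for data skew in a dedicated SQL pool is to look at how many rows each distribution holds. The sketch below assumes a hypothetical hash-distributed table named `dbo.FactSales`.

```sql
-- Reports row counts and space used per distribution for one table.
-- dbo.FactSales is a hypothetical table name.
DBCC PDW_SHOWSPACEUSED('dbo.FactSales');
```

A large spread in row counts between distributions suggests that the distribution column is skewed.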

### Best Practices

1. Analyze your workload to understand common join and aggregation patterns.
2. Monitor for data skew and adjust distribution as needed.
3. Consider using different distribution strategies for dev/test vs. production environments.
4. Regularly review and optimize your distribution choices as your data and query patterns evolve.
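
Adjusting a table's distribution usually means rebuilding it. A minimal sketch using CTAS and `RENAME OBJECT` follows; it assumes a hypothetical `dbo.FactSales` table with a `ProductKey` column that has turned out to be a better distribution key.

```sql
-- Rebuild the table with a different distribution key.
-- dbo.FactSales and ProductKey are hypothetical names.
CREATE TABLE dbo.FactSales_ByProduct
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM dbo.FactSales;

-- Swap the tables: keep the old copy until the new one is verified.
RENAME OBJECT dbo.FactSales TO FactSales_Old;
RENAME OBJECT dbo.FactSales_ByProduct TO FactSales;
```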
--------------------------------------------------------------
## How Azure combines everything
1. Source Systems: These are the original data sources, such as transactional databases, IoT devices, or application logs.
2. Azure Data Lake Storage:
30 changes: 30 additions & 0 deletions Data Factory.md
@@ -0,0 +1,30 @@
A fully managed, serverless data integration solution for ingesting, preparing and transforming all of your data at scale.

Data Factory is not meant for data migration, data streaming, complex data transformations (Databricks/Spark), or data storage.
![[Pasted image 20240906143943.png]]

We will use Azure Data Factory for all data integration and orchestration. It will run transformations with HDInsight and Azure Databricks. Three transformation technologies are in use: data flows, HDInsight, and Databricks.

[[Dataflow]] gives you a code-free transformation tool, which makes it easy to develop and maintain transformations.
- Use this for simple and medium-complexity transformations, as it can't express complex ones.
[[HDInsight]] lets you write code in a [[SQL]]-like language called [[Hive]] and a scripting language called [[Pig]].
[[Databriks]] requires you to write code in [[Python]] or [[SparkSQL]].

Also used are [[Azure Blob Storage]], [[Data Lake]], [[Synapse Analytics]], [[Power BI]], and [[Azure Databases]].

## Useful Links & Resources
#### Lecture: Project Overview
ECDC Website for Covid-19 Data - [https://www.ecdc.europa.eu/en/covid-19/data](https://www.ecdc.europa.eu/en/covid-19/data)

Eurostat Website for Population Data - [https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/tps00010.tsv.gz](https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/tps00010.tsv.gz)
#### Lecture: Azure Storage Solutions
Introduction to Azure Storage services - [https://docs.microsoft.com/en-us/azure/storage/common/storage-introduction](https://docs.microsoft.com/en-us/azure/storage/common/storage-introduction)

Azure SQL Database - [https://docs.microsoft.com/en-us/azure/azure-sql/database/sql-database-paas-overview](https://docs.microsoft.com/en-us/azure/azure-sql/database/sql-database-paas-overview)

Azure Synapse Analytics - [https://docs.microsoft.com/en-us/azure/synapse-analytics/overview-what-is](https://docs.microsoft.com/en-us/azure/synapse-analytics/overview-what-is)

Azure Cosmos DB - [https://docs.microsoft.com/en-us/azure/cosmos-db/introduction](https://docs.microsoft.com/en-us/azure/cosmos-db/introduction)

Azure Data Lake Storage Gen2 - [https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction)
## 2
68 changes: 67 additions & 1 deletion Data Warehouse.md
@@ -630,4 +630,70 @@ The `APPROX_COUNT_DISTINCT` function uses a _HyperLogLog_ algorithm to retri
|2021|304633|
|...|...|

The counts are less accurate, but still sufficient for an approximate comparison of yearly sales. With a large volume of data, the query using the `APPROX_COUNT_DISTINCT` function completes more quickly, and the reduced accuracy may be an acceptable trade-off during basic data exploration.
The counts are less accurate, but still sufficient for an approximate comparison of yearly sales. With a large volume of data, the query using the `APPROX_COUNT_DISTINCT` function completes more quickly, and the reduced accuracy may be an acceptable trade-off during basic data exploration.

## Load dimension tables with CREATE TABLE AS (CTAS)
## Using a CREATE TABLE AS (CTAS) statement

One of the simplest ways to load data into a new dimension table is to use a `CREATE TABLE AS` (_CTAS_) statement. This statement creates a new table based on the results of a SELECT statement.

```
CREATE TABLE dbo.DimProduct
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
)
AS
-- ROW_NUMBER() generates the surrogate key, because CTAS can't use IDENTITY
SELECT ROW_NUMBER() OVER(ORDER BY ProdID) AS ProdKey,
    ProdID AS ProdAltKey,
    ProductName,
    ProductCategory,
    Color,
    Size,
    ListPrice,
    Discontinued
FROM dbo.StageProduct;
```
You can't use `IDENTITY` to generate a unique integer value for the surrogate key when using a CTAS statement, so this example uses the `ROW_NUMBER` function to generate an incrementing row number for each row in the results, ordered by the **ProdID** business key in the staged data.

You can also load a combination of new and updated data into a dimension table by using a CREATE TABLE AS (CTAS) statement to create a new table that UNIONs the existing rows from the dimension table with the new and updated records from the staging table. After creating the new table, you can delete or rename the current dimension table, and rename the new table to replace it.

```
CREATE TABLE dbo.DimProductUpsert
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
)
AS
-- New or updated rows
SELECT stg.ProductID AS ProductBusinessKey,
    stg.ProductName,
    stg.ProductCategory,
    stg.Color,
    stg.Size,
    stg.ListPrice,
    stg.Discontinued
FROM dbo.StageProduct AS stg
UNION ALL
-- Existing rows that have no match in the staging table
SELECT dim.ProductBusinessKey,
    dim.ProductName,
    dim.ProductCategory,
    dim.Color,
    dim.Size,
    dim.ListPrice,
    dim.Discontinued
FROM dbo.DimProduct AS dim
WHERE NOT EXISTS
    ( SELECT *
      FROM dbo.StageProduct AS stg
      WHERE stg.ProductID = dim.ProductBusinessKey
    );

-- Swap the tables: archive the current dimension, then promote the new one
RENAME OBJECT dbo.DimProduct TO DimProductArchive;
RENAME OBJECT dbo.DimProductUpsert TO DimProduct;
```

While this technique is effective in merging new and existing dimension data, lack of support for IDENTITY columns means that it's difficult to generate a surrogate key.
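
One possible workaround, sketched below, is to carry existing surrogate keys forward and number only the new rows, starting after the current maximum key. This is an illustration under stated assumptions, not part of the original example: it assumes `dbo.DimProduct` already has an integer `ProductKey` column (hypothetical), and it lists only a few columns for brevity.

```sql
-- Sketch only: ProductKey is a hypothetical surrogate key column on dbo.DimProduct.
CREATE TABLE dbo.DimProductUpsert
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
)
AS
-- Updated rows: keep the existing surrogate key, take attributes from staging
SELECT dim.ProductKey,
    stg.ProductID AS ProductBusinessKey,
    stg.ProductName,
    stg.ListPrice
FROM dbo.StageProduct AS stg
JOIN dbo.DimProduct AS dim
    ON dim.ProductBusinessKey = stg.ProductID
UNION ALL
-- New rows: continue numbering after the current maximum key
SELECT CAST(mx.MaxKey + ROW_NUMBER() OVER(ORDER BY stg.ProductID) AS INT) AS ProductKey,
    stg.ProductID AS ProductBusinessKey,
    stg.ProductName,
    stg.ListPrice
FROM dbo.StageProduct AS stg
CROSS JOIN ( SELECT ISNULL(MAX(ProductKey), 0) AS MaxKey
             FROM dbo.DimProduct ) AS mx
WHERE NOT EXISTS
    ( SELECT *
      FROM dbo.DimProduct AS dim
      WHERE dim.ProductBusinessKey = stg.ProductID
    )
UNION ALL
-- Unchanged rows: copy as-is
SELECT dim.ProductKey,
    dim.ProductBusinessKey,
    dim.ProductName,
    dim.ListPrice
FROM dbo.DimProduct AS dim
WHERE NOT EXISTS
    ( SELECT *
      FROM dbo.StageProduct AS stg
      WHERE stg.ProductID = dim.ProductBusinessKey
    );
```

The three branches are mutually exclusive, so each business key appears once in the new table, and keys already referenced by fact tables stay stable.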
Binary file added Pasted image 20240906143943.png