notes

ImTaliesin · Sep 6, 2024 · a9086c4
1 parent fbd2726
commit a9086c4
Showing 6 changed files with 181 additions and 12 deletions.
diff --git a/.obsidian/graph.json b/.obsidian/graph.json
@@ -25,6 +25,6 @@
   "repelStrength": 12.2837370242215,
   "linkStrength": 1,
   "linkDistance": 30,
-  "scale": 0.6145026783518123,
+  "scale": 0.6145026783518126,
   "close": true
 }
diff --git a/.obsidian/workspace.json b/.obsidian/workspace.json
@@ -13,7 +13,19 @@
             "state": {
               "type": "markdown",
               "state": {
-                "file": "Data Warehouse.md",
+                "file": "Data Engineering Notes.md",
+                "mode": "source",
+                "source": false
+              }
+            }
+          },
+          {
+            "id": "e0bb5ed8c2fbc3c7",
+            "type": "leaf",
+            "state": {
+              "type": "markdown",
+              "state": {
+                "file": "Data Factory.md",
                 "mode": "source",
                 "source": false
               }
@@ -31,7 +43,8 @@
               }
             }
           }
-        ]
+        ],
+        "currentTab": 1
       }
     ],
     "direction": "vertical"
@@ -99,7 +112,7 @@
             "state": {
               "type": "backlink",
               "state": {
-                "file": "Data Warehouse.md",
+                "file": "Data Factory.md",
                 "collapseAll": false,
                 "extraContext": false,
                 "sortOrder": "alphabetical",
@@ -116,7 +129,7 @@
             "state": {
               "type": "outgoing-link",
               "state": {
-                "file": "Data Warehouse.md",
+                "file": "Data Factory.md",
                 "linksCollapsed": false,
                 "unlinkedCollapsed": true
               }
@@ -139,7 +152,7 @@
             "state": {
               "type": "outline",
               "state": {
-                "file": "Data Warehouse.md"
+                "file": "Data Factory.md"
               }
             }
           },
@@ -182,11 +195,13 @@
       "table-editor-obsidian:Advanced Tables Toolbar": false
     }
   },
-  "active": "71c4ca83aff589db",
+  "active": "e0bb5ed8c2fbc3c7",
   "lastOpenFiles": [
+    "Data Factory.md",
+    "Pasted image 20240906143943.png",
     "Data Engineering Notes.md",
-    "Pasted image 20240904201734.png",
     "Data Warehouse.md",
+    "Pasted image 20240904201734.png",
     "SQL.md",
     "Pasted image 20240903232241.png",
     "Delta Lake.md",
@@ -209,7 +224,6 @@
     "Pasted image 20240207140952.png",
     "Pasted image 20240207123809.png",
     "Pasted image 20240207123108.png",
-    "Pasted image 20240206161629.png",
     "OAuth.md",
     "NextUI.md",
     "nexst.md",
@@ -218,7 +232,6 @@
     "body-parser.md",
     "Typescript React code snippets.md",
     "Props.md",
-    "React JS.md",
     "Untitled 2.canvas",
     "Untitled 1.canvas"
   ]

diff --git a/Data Engineering Notes.md b/Data Engineering Notes.md
@@ -1,11 +1,71 @@
-[[Database]] [[SQL]] [[Data Lake]] [[Apache Spark]] [[Delta Lake]] [[Data Warehouse]]
+[[Database]] [[SQL]] [[Data Lake]] [[Apache Spark]] [[Delta Lake]] [[Data Warehouse]] [[Data Factory]]
 ## Vocab:
 ### Relational Table, data lake: data stored in files, data warehouse,  data lakehouse, data pipeline, key vaults implementation with pipeline
 
 -  _ACID_: _atomicity_ (transactions complete as a single unit of work), _consistency_ (transactions leave the database in a consistent state), _isolation_ (in-process transactions can't interfere with one another), and _durability_ (when a transaction completes, the changes it made are persisted). Delta Lake tables can be used as both _sinks_ (destinations) and _sources_ for streaming data.
 - **Standard formats and interoperability**. The underlying data for Delta Lake tables is stored in Parquet format, which is commonly used in data lake ingestion pipelines. 
 ------------------------------------------------------------------
 ![[Pasted image 20240904201734.png]]
+## Distribution Strategies in MPP Systems like Azure Synapse
+
+### Overview
+
+In Massively Parallel Processing (MPP) systems, data is distributed across multiple nodes to enable parallel processing and improve query performance. The way data is distributed can significantly impact query speed and resource utilization.
+
+### Key Distribution Strategies
+
+#### 1. Hash Distribution
+
+- **Concept**: Data is distributed across nodes based on a hash of one or more columns.
+- **Best for**: Large fact tables and large dimension tables.
+- **Advantages**:
+    - Minimizes data movement during joins if the join key is the same as the distribution key.
+    - Enables efficient aggregations on the distribution key.
+- **Considerations**:
+    - Choose a column with high cardinality and even distribution.
+    - Commonly used columns in JOIN, GROUP BY, and DISTINCT queries are good candidates.
+
+#### 2. Round Robin Distribution
+
+- **Concept**: Data is distributed evenly across all nodes in a round-robin fashion.
+- **Best for**: Staging tables, temporary tables, or when there's no clear joining key.
+- **Advantages**:
+    - Ensures even data distribution.
+    - Simple to implement.
+- **Considerations**:
+    - Can lead to more data movement during queries, especially for joins.
+
+#### 3. Replicated Distribution
+
+- **Concept**: A full copy of the table is stored on each compute node.
+- **Best for**: Small dimension tables (typically less than 2 GB).
+- **Advantages**:
+    - Eliminates data movement for joins.
+    - Improves query performance for small, frequently used lookup tables.
+- **Considerations**:
+    - Increases storage requirements.
+    - Not suitable for large tables or tables with frequent updates.
+
+### Choosing the Right Strategy
+
+1. **For large fact tables**: Use Hash distribution on a key that's frequently used in joins and aggregations.
+2. **For large dimension tables**: Use Hash distribution if they're often joined with fact tables on a specific key.
+3. **For small dimension tables**: Consider Replicated distribution for faster joins.
+4. **For staging or temporary data**: Round Robin is often suitable.
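+
+The choices above map onto the `DISTRIBUTION` option of `CREATE TABLE` in a Synapse dedicated SQL pool. A minimal sketch, assuming hypothetical tables (`FactSales`, `StageSales`, `DimStore` and their columns are illustrative only, not taken from a real schema):
+
+```sql
+-- Large fact table: hash-distribute on a high-cardinality join key.
+CREATE TABLE dbo.FactSales
+(
+    SalesOrderNumber NVARCHAR(20) NOT NULL,
+    CustomerKey INT NOT NULL,
+    StoreKey INT NOT NULL,
+    SalesAmount DECIMAL(19, 4) NOT NULL
+)
+WITH
+(
+    DISTRIBUTION = HASH(CustomerKey),
+    CLUSTERED COLUMNSTORE INDEX
+);
+
+-- Staging table: round robin spreads rows evenly; a heap keeps loads simple.
+CREATE TABLE dbo.StageSales
+(
+    SalesOrderNumber NVARCHAR(20) NOT NULL,
+    CustomerKey INT NOT NULL,
+    StoreKey INT NOT NULL,
+    SalesAmount DECIMAL(19, 4) NOT NULL
+)
+WITH
+(
+    DISTRIBUTION = ROUND_ROBIN,
+    HEAP
+);
+
+-- Small dimension table: keep a full copy on every compute node.
+CREATE TABLE dbo.DimStore
+(
+    StoreKey INT NOT NULL,
+    StoreName NVARCHAR(50) NOT NULL
+)
+WITH
+(
+    DISTRIBUTION = REPLICATE,
+    CLUSTERED COLUMNSTORE INDEX
+);
+```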
+
+### Impact on Query Performance
+
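+- **Collocated joins**: When both tables are hash-distributed on the join key, matching rows already sit in the same distribution, so the join needs little or no data movement.
+- **Data skew**: Uneven distribution can lead to performance bottlenecks, because a query finishes only when its most heavily loaded distribution does.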
+- **Query patterns**: The optimal distribution strategy depends on your most common and critical queries.
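+
+One way to check whether a join is collocated is to look at the query plan. A sketch, assuming two hypothetical tables that are both hash-distributed on `OrderKey` with the same data type (the table and column names are made up for illustration):
+
+```sql
+-- Both tables share the same distribution column and type.
+CREATE TABLE dbo.FactOrderHeader
+(
+    OrderKey BIGINT NOT NULL,
+    CustomerKey INT NOT NULL
+)
+WITH (DISTRIBUTION = HASH(OrderKey), CLUSTERED COLUMNSTORE INDEX);
+
+CREATE TABLE dbo.FactOrderLine
+(
+    OrderKey BIGINT NOT NULL,
+    LineAmount DECIMAL(19, 4) NOT NULL
+)
+WITH (DISTRIBUTION = HASH(OrderKey), CLUSTERED COLUMNSTORE INDEX);
+
+-- EXPLAIN returns the distributed plan as XML without running the query.
+-- For a collocated join, the plan is expected to contain no
+-- SHUFFLE_MOVE or BROADCAST_MOVE step for the join itself.
+EXPLAIN
+SELECT h.OrderKey, SUM(l.LineAmount) AS OrderTotal
+FROM dbo.FactOrderHeader AS h
+JOIN dbo.FactOrderLine AS l
+    ON l.OrderKey = h.OrderKey
+GROUP BY h.OrderKey;
+```
+
+If the join columns differed in data type, or one table used a different distribution column, a movement step would likely appear in the plan.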
+
+### Best Practices
+
+1. Analyze your workload to understand common join and aggregation patterns.
+2. Monitor for data skew and adjust distribution as needed.
+3. Consider using different distribution strategies for dev/test vs. production environments.
+4. Regularly review and optimize your distribution choices as your data and query patterns evolve.
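+
+Monitoring for skew and adjusting the distribution can be sketched as follows. This assumes a hypothetical `dbo.FactSales` table that is currently skewed and a `ProductKey` column that would spread rows more evenly; both names are illustrative. A distribution column cannot be altered in place, so the usual route is to rebuild the table with CTAS and swap the names:
+
+```sql
+-- Returns one row per distribution; compare the ROWS values to spot skew.
+DBCC PDW_SHOWSPACEUSED('dbo.FactSales');
+
+-- Rebuild the table on a different (assumed better) distribution column.
+CREATE TABLE dbo.FactSales_Redistributed
+WITH
+(
+    DISTRIBUTION = HASH(ProductKey),
+    CLUSTERED COLUMNSTORE INDEX
+)
+AS
+SELECT *
+FROM dbo.FactSales;
+
+-- Swap the rebuilt table in; keep the old one until the result is verified.
+RENAME OBJECT dbo.FactSales TO FactSales_Old;
+RENAME OBJECT dbo.FactSales_Redistributed TO FactSales;
+```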
+--------------------------------------------------------------
 ## How Azure combines everything
 1. Source Systems: These are the original data sources, such as transactional databases, IoT devices, or application logs.
 2. Azure Data Lake Storage:

diff --git a/Data Factory.md b/Data Factory.md
@@ -0,0 +1,30 @@
+A fully managed, serverless data integration solution for ingesting, preparing and transforming all of your data at scale.
+
+Data Factory is not meant for data migration, data streaming, complex data transformations (Databricks/Spark), or data storage.
+![[Pasted image 20240906143943.png]]
+
+We will use Azure Data Factory for all data integration and orchestration. It will run transformations with HDInsight and Azure Databricks. Three transformation technologies are used: Data Flow, HDInsight, and Databricks.
+
+[[Dataflow]] gives you a code-free transformation tool, which makes it easy to develop and maintain transformations.
+- Use this for transformations of simple to medium complexity, as it lacks the ability to express complex ones.
+[[HDInsight]] gives you the ability to write code in a [[SQL]]-like language called [[Hive]] and a scripting language called [[Pig]].
+[[Databriks]] requires you to write code in [[Python]] or [[SparkSQL]].
+
+Also used are [[Azure Blob Storage]], [[Data Lake]], [[Synapse Analytics]], [[Power BI]], and [[Azure Databases]].
+
+## Useful Links & Resources
+#### Lecture: Project Overview
+ECDC Website for Covid-19 Data - [https://www.ecdc.europa.eu/en/covid-19/data](https://www.ecdc.europa.eu/en/covid-19/data)
+
+Euro Stat Website for Population Data - [https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/tps00010.tsv.gz](https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/tps00010.tsv.gz)
+#### Lecture: Azure Storage Solutions
+Introduction to Azure Storage services - [https://docs.microsoft.com/en-us/azure/storage/common/storage-introduction](https://docs.microsoft.com/en-us/azure/storage/common/storage-introduction)
+
+Azure SQL Database - [https://docs.microsoft.com/en-us/azure/azure-sql/database/sql-database-paas-overview](https://docs.microsoft.com/en-us/azure/azure-sql/database/sql-database-paas-overview)
+
+Azure Synapse Analytics - [https://docs.microsoft.com/en-us/azure/synapse-analytics/overview-what-is](https://docs.microsoft.com/en-us/azure/synapse-analytics/overview-what-is)
+
+Azure Cosmos DB - [https://docs.microsoft.com/en-us/azure/cosmos-db/introduction](https://docs.microsoft.com/en-us/azure/cosmos-db/introduction)
+
+Azure Data Lake Storage Gen2 - [https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction)
+## 2
diff --git a/Data Warehouse.md b/Data Warehouse.md
@@ -630,4 +630,70 @@ The `APPROX_COUNT_DISTINCT` function uses a _HyperLogLog_ algorithm to retri
 |2021|304633|
 |...|...|
 
-The counts are less accurate, but still sufficient for an approximate comparison of yearly sales. With a large volume of data, the query using the `APPROX_COUNT_DISTINCT` function completes more quickly, and the reduced accuracy may be an acceptable trade-off during basic data exploration.
+The counts are less accurate, but still sufficient for an approximate comparison of yearly sales. With a large volume of data, the query using the `APPROX_COUNT_DISTINCT` function completes more quickly, and the reduced accuracy may be an acceptable trade-off during basic data exploration.
+
+## Load dimension tables with CREATE TABLE AS (CTAS)
+## Using a CREATE TABLE AS (CTAS) statement
+
+One of the simplest ways to load data into a new dimension table is to use a `CREATE TABLE AS` (_CTAS_) expression. This statement creates a new table based on the results of a SELECT statement.
+
+```
+CREATE TABLE dbo.DimProduct
+WITH
+(
+    DISTRIBUTION = REPLICATE,
+    CLUSTERED COLUMNSTORE INDEX
+)
+AS
+SELECT ROW_NUMBER() OVER(ORDER BY ProdID) AS ProdKey,
+    ProdID as ProdAltKey,
+    ProductName,
+    ProductCategory,
+    Color,
+    Size,
+    ListPrice,
+    Discontinued
+FROM dbo.StageProduct;
+```
+You can't use `IDENTITY` to generate a unique integer value for the surrogate key when using a CTAS statement, so this example uses the `ROW_NUMBER` function to generate an incrementing row number for each row in the results, ordered by the **ProdID** business key in the staged data.
+
+You can also load a combination of new and updated data into a dimension table by using a CREATE TABLE AS (CTAS) statement to create a new table that UNIONs the existing rows from the dimension table with the new and updated records from the staging table. After creating the new table, you can delete or rename the current dimension table, and rename the new table to replace it.
+
+```
+CREATE TABLE dbo.DimProductUpsert
+WITH
+(
+    DISTRIBUTION = REPLICATE,
+    CLUSTERED COLUMNSTORE INDEX
+)
+AS
+-- New or updated rows
+SELECT  stg.ProductID AS ProductBusinessKey,
+        stg.ProductName,
+        stg.ProductCategory,
+        stg.Color,
+        stg.Size,
+        stg.ListPrice,
+        stg.Discontinued
+FROM    dbo.StageProduct AS stg
+UNION ALL  
+-- Existing rows
+SELECT  dim.ProductBusinessKey,
+        dim.ProductName,
+        dim.ProductCategory,
+        dim.Color,
+        dim.Size,
+        dim.ListPrice,
+        dim.Discontinued
+FROM    dbo.DimProduct AS dim
+WHERE NOT EXISTS
+(   SELECT  *
+    FROM dbo.StageProduct AS stg
+    WHERE stg.ProductID = dim.ProductBusinessKey
+);
+
+RENAME OBJECT dbo.DimProduct TO DimProductArchive;
+RENAME OBJECT dbo.DimProductUpsert TO DimProduct;
+```
+
+While this technique is effective in merging new and existing dimension data, lack of support for IDENTITY columns means that it's difficult to generate a surrogate key.
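+
+One common way around the surrogate key limitation is to create the table explicitly with an `IDENTITY` column and load it with `INSERT ... SELECT` instead of CTAS. A minimal sketch, assuming a hypothetical `dbo.DimProductKeyed` table and guessed column types (match them to the real staged data):
+
+```sql
+-- Hypothetical table: the IDENTITY column supplies the surrogate key.
+CREATE TABLE dbo.DimProductKeyed
+(
+    ProductKey INT IDENTITY(1, 1) NOT NULL,
+    ProductBusinessKey NVARCHAR(25) NOT NULL,
+    ProductName NVARCHAR(100) NOT NULL,
+    ProductCategory NVARCHAR(50) NULL,
+    Color NVARCHAR(20) NULL,
+    Size NVARCHAR(10) NULL,
+    ListPrice DECIMAL(19, 4) NULL,
+    Discontinued BIT NULL
+)
+WITH
+(
+    DISTRIBUTION = REPLICATE,
+    CLUSTERED COLUMNSTORE INDEX
+);
+
+-- Insert only staged products that are not in the dimension yet.
+-- The IDENTITY column is left out of the column list on purpose.
+INSERT INTO dbo.DimProductKeyed
+    (ProductBusinessKey, ProductName, ProductCategory,
+     Color, Size, ListPrice, Discontinued)
+SELECT  stg.ProductID,
+        stg.ProductName,
+        stg.ProductCategory,
+        stg.Color,
+        stg.Size,
+        stg.ListPrice,
+        stg.Discontinued
+FROM    dbo.StageProduct AS stg
+WHERE NOT EXISTS
+(   SELECT  *
+    FROM dbo.DimProductKeyed AS dim
+    WHERE dim.ProductBusinessKey = stg.ProductID
+);
+```
+
+In a dedicated SQL pool the generated `IDENTITY` values are unique but not guaranteed to be sequential or gap-free, so they should only be relied on as keys, not as a row ordering.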
diff --git a/Pasted image 20240906143943.png b/Pasted image 20240906143943.png