Fixed documentations
Akira Kakkar committed Dec 22, 2020
1 parent acfe5f9 commit cf3c533
Showing 16 changed files with 137 additions and 117 deletions.
4 changes: 4 additions & 0 deletions .github/workflows/documentation.yml
@@ -6,6 +6,8 @@ on:
pull_request:
paths:
- .github/workflows/documentation.yml
- .markdownlinkcheck.json
- .markdownlint.yml

jobs:
markdown-link-check:
@@ -16,6 +18,7 @@ jobs:
with:
use-quiet-mode: 'yes'
use-verbose-mode: 'yes'
config-file: '.markdownlinkcheck.json'
file-extension: '.md'
markdown-lint:
runs-on: ubuntu-latest
@@ -24,3 +27,4 @@ jobs:
- uses: docker://avtodev/markdown-lint:v1
with:
args: '**/*.md'
config: '.markdownlint.yml'
7 changes: 7 additions & 0 deletions .markdownlinkcheck.json
@@ -0,0 +1,7 @@
{
"ignorePatterns": [
{
"pattern": "^https://data.melbourne.vic.gov.au/resource/"
}
]
}
1 change: 1 addition & 0 deletions .markdownlint.yml
@@ -0,0 +1 @@
MD013: false
32 changes: 15 additions & 17 deletions README.md
@@ -10,11 +10,12 @@ products:
description: "Code samples showcasing how to apply DevOps concepts to the Modern Data Warehouse Architecture leveraging different Azure Data Technologies."
---

## DataOps for the Modern Data Warehouse
# DataOps for the Modern Data Warehouse

This repository contains numerous code samples and artifacts on how to apply DevOps principles to data pipelines built according to the [Modern Data Warehouse (MDW)](https://azure.microsoft.com/en-au/solutions/architecture/modern-data-warehouse/) architectural pattern on [Microsoft Azure](https://azure.microsoft.com/en-au/).
This repository contains numerous code samples and artifacts on how to apply DevOps principles to data pipelines built according to the [Modern Data Warehouse (MDW)](https://azure.microsoft.com/en-au/solutions/architecture/modern-data-warehouse/) architectural pattern on [Microsoft Azure](https://azure.microsoft.com/en-au/).

The samples either focus on a single Azure service or showcase an end-to-end data pipeline solution built according to the MDW pattern. Each sample contains code and artifacts relating to:

- Build and Release Pipelines (CI/CD)
- Testing
- Observability / Monitoring
@@ -23,30 +24,27 @@ The samples are either focused on a single azure service or showcases an end to


### End to End samples
- [**Parking Sensor Solution**](e2e_samples/parking_sensors/) - This sample solution demonstrates an end-to-end data pipeline following the MDW architecture, along with a corresponding CI/CD process.
- This solution was presented at [NDC Sydney 2019](https://ndcsydney.com/). See [here](https://www.youtube.com/watch?v=Xs1-OU5cmsw) for the presentation, which includes a detailed walkthrough of the solution.
- **Data Pipeline Architecture**
![Architecture](docs/images/architecture.PNG?raw=true "Architecture")
- **Build and Release Process**
![CI/CD](docs/images/CI_CD_process.PNG?raw=true "CI/CD")

- [**Parking Sensor Solution**](e2e_samples/parking_sensors/) - This sample solution demonstrates an end-to-end data pipeline following the MDW architecture, along with a corresponding CI/CD process.
- This solution was presented at [NDC Sydney 2019](https://ndcsydney.com/). See [here](https://www.youtube.com/watch?v=Xs1-OU5cmsw) for the presentation, which includes a detailed walkthrough of the solution.
- **Data Pipeline Architecture**: ![Architecture](docs/images/architecture.PNG?raw=true "Architecture")
- **Build and Release Process**: ![CI/CD](docs/images/CI_CD_process.PNG?raw=true "CI/CD")

### Single Technology Samples
- [Azure SQL](single_tech_samples/azuresql/)

Coming soon..
- [Data Factory](single_tech_samples/datafactory/)
- [Azure Databricks](single_tech_samples/databricks/)
- [Stream Analytics](single_tech_samples/streamanalytics/)
- [Azure Synapse (formerly SQLDW)](single_tech_samples/synapseanalytics/)
- [CosmosDB](single_tech_samples/cosmosdb/)

- [Azure SQL](single_tech_samples/azuresql/)
Coming soon..
- [Data Factory](single_tech_samples/datafactory/)
- [Azure Databricks](single_tech_samples/databricks/)
- [Stream Analytics](single_tech_samples/streamanalytics/)
- [Azure Synapse (formerly SQLDW)](single_tech_samples/synapseanalytics/)
- [CosmosDB](single_tech_samples/cosmosdb/)

## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
the rights to use your contribution. For details, visit [https://cla.opensource.microsoft.com](https://cla.opensource.microsoft.com).

When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
22 changes: 11 additions & 11 deletions SECURITY.md
@@ -1,12 +1,12 @@
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.3 BLOCK -->

## Security
# Security

Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).

If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)), please report it to us as described below.

## Reporting Security Issues
# Reporting Security Issues

**Please do not report security vulnerabilities through public GitHub issues.**

@@ -18,23 +18,23 @@ You should receive a response within 24 hours. If for some reason you do not, pl

Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:

* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the manifestation of the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration required to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if possible)
* Impact of the issue, including how an attacker might exploit the issue
* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the manifestation of the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration required to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if possible)
* Impact of the issue, including how an attacker might exploit the issue

This information will help us triage your report more quickly.

If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty) page for more details about our active programs.

## Preferred Languages
# Preferred Languages

We prefer all communications to be in English.

## Policy
# Policy

Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd).

22 changes: 9 additions & 13 deletions e2e_samples/parking_sensors/README.md
@@ -1,5 +1,3 @@
[![Build Status](https://dev.azure.com/devlacepub/DataDevOps/_apis/build/status/ddo_transform-ci-artifacts?branchName=master)](https://dev.azure.com/devlacepub/DataDevOps/_build/latest?definitionId=3&branchName=master)

# DataOps - Parking Sensor Demo <!-- omit in toc -->

This sample demonstrates how DevOps principles can be applied to an end-to-end data pipeline solution built according to the [Modern Data Warehouse (MDW)](https://azure.microsoft.com/en-au/solutions/architecture/modern-data-warehouse/) pattern.
@@ -98,20 +96,20 @@ For a detailed walk-through of the solution and key concepts, watch the followin

The following summarizes key learnings and best practices demonstrated by this sample solution:

### 1. Use Data Tiering in your Data Lake.
### 1. Use Data Tiering in your Data Lake

- Generally, you want to divide your data lake into three major areas which contain your Bronze, Silver and Gold datasets.
1. *Bronze* - This is a landing area for your raw datasets with no to minimal data transformations applied; these datasets are therefore optimized for writes / ingestion. Treat them as an immutable, append-only store.
2. *Silver* - These are cleansed, semi-processed datasets. They conform to a known schema and predefined data invariants, and might have further data augmentation applied. These are typically used by Data Scientists.
3. *Gold* - These are highly processed, highly read-optimized datasets, primarily for consumption by business users. Typically, these are structured in your standard Fact and Dimension tables.
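
A minimal, illustrative PySpark sketch of this tiering follows. It is not taken from this solution's code: the schema, column names, and the use of Delta format are assumptions made for the example, while the `/data/lnd`, `/data/interim` and `/data/dw` paths mirror the lake layout described later in this document.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: raw landing data, ingested as-is and treated as an append-only store.
raw_df = spark.read.json("/data/lnd/parking_bays/2020/12/22/")  # hypothetical path

# Silver: enforce a known schema and basic invariants, then deduplicate.
silver_df = (
    raw_df.select(
        F.col("bay_id").cast("int"),  # hypothetical columns
        F.col("status").cast("string"),
        F.to_timestamp("last_updated").alias("last_updated"),
    )
    .dropDuplicates(["bay_id", "last_updated"])
)
silver_df.write.format("delta").mode("overwrite").save("/data/interim/parking_bay")

# Gold: read-optimized aggregate for consumption by business users.
gold_df = silver_df.groupBy("bay_id").agg(F.max("last_updated").alias("latest_reading"))
gold_df.write.format("delta").mode("overwrite").save("/data/dw/fact_parking_latest")
```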

### 2. Validate data early in your pipeline.
### 2. Validate data early in your pipeline

- Add data validation between the Bronze and Silver datasets. By validating early in your pipeline, you can ensure all succeeding datasets conform to a specific schema and known data invariants. This also can potentially prevent data pipeline failures in cases of unexpected changes to the input data.
- Data that does not pass this validation stage can be rerouted to a Malformed Record store for diagnostic purposes.
- It may be tempting to add validation prior to landing in the Bronze area of your data lake. This is generally not recommended. Bronze datasets are there to ensure you have as close a copy of the source system data as possible. This can be used to replay the data pipeline for both testing (e.g. testing data validation logic) and data recovery purposes (e.g. data corruption is introduced due to a bug in the data transformation code and thus the pipeline needs to be replayed).
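
As a hedged sketch of this validation step (again, not the repo's actual logic; the invariants, column names and quarantine path are assumed), records failing a simple check are rerouted before anything is promoted to Silver:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
bronze_df = spark.read.format("delta").load("/data/lnd/parking_bay_raw")  # hypothetical path

# Known data invariants for the Silver dataset (assumed for this sketch).
is_valid = F.col("bay_id").isNotNull() & F.col("status").isin("Present", "Unoccupied")

valid_df = bronze_df.filter(is_valid)
malformed_df = bronze_df.filter(~is_valid)

# Reroute malformed records to a quarantine store for diagnostics,
# and only promote validated records to the Silver area.
malformed_df.write.format("delta").mode("append").save("/data/malformed/parking_bay")
valid_df.write.format("delta").mode("overwrite").save("/data/interim/parking_bay")
```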

### 3. Make your data pipelines replayable and idempotent.
### 3. Make your data pipelines replayable and idempotent

- Silver and Gold datasets can get corrupted due to a number of reasons, such as unintended bugs, unexpected input data changes, and more. By making data pipelines replayable and idempotent, you can recover from this state by deploying a code fix and replaying the data pipelines.
- Idempotency also ensures data duplication is mitigated when replaying your data pipelines.
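
One common way to achieve this, shown here only as a sketch and not as this solution's exact mechanism, is to overwrite a deterministic partition for the run's load date rather than blindly appending, so replaying the same run replaces data instead of duplicating it (the `load_date` column and paths are assumptions):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

load_date = "2020-12-22"  # run parameter passed in by the orchestrator (assumed)

daily_df = (
    spark.read.format("delta").load("/data/lnd/parking_bay_raw")
    .filter(F.col("load_date") == load_date)
)

# Overwrite only this run's partition; re-running the pipeline for the same
# load_date replaces that partition rather than appending duplicate rows.
(
    daily_df.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", f"load_date = '{load_date}'")
    .save("/data/interim/parking_bay")
)
```
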
@@ -126,7 +124,7 @@ The following summarizes key learnings and best practices demonstrated by this s
- This means including all artifacts needed to build the data pipeline from scratch in source control. This includes infrastructure-as-code artifacts, database objects (schema definitions, functions, stored procedures, etc), reference/application data, data pipeline definitions, and data validation and transformation logic.
- There should also be a safe, repeatable process to move changes through dev, test and finally production.

### 6. Secure and centralize configuration.
### 6. Secure and centralize configuration

- Maintain a central, secure location for sensitive configuration, such as database connection strings, that can be accessed by the appropriate services within the specific environment.
- An example of this is securing secrets in KeyVault per environment, then having the relevant services query KeyVault for the configuration.
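
For example, a sketch of a service reading its connection string from Key Vault at runtime, assuming the `azure-identity` and `azure-keyvault-secrets` Python packages (the vault and secret names below are made up):

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Each environment (dev / stg / prod) points at its own Key Vault instance.
vault_url = "https://mdwdo-park-dev-kv.vault.azure.net"  # hypothetical vault name
client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())

# The secret name is an assumption for this sketch.
sql_connection_string = client.get_secret("sqldwConnectionString").value
```
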
@@ -192,7 +190,7 @@ More resources:

More resources:

- [pytest-adf](https://aka.ms/pytest-adf) - Pytest helper plugin for integration testing Azure Data Factory
- [pytest-adf](https://github.com/devlace/pytest-adf) - Pytest helper plugin for integration testing Azure Data Factory
- [nutter testing framework](https://github.com/microsoft/nutter) - Testing framework for Databricks notebooks.
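
The sketch below shows roughly what an Azure Data Factory integration test can look like when written directly against the `azure-mgmt-datafactory` SDK; it does not show pytest-adf's own fixtures, and the resource group, factory and pipeline names are placeholders:

```python
import time

import pytest
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

RESOURCE_GROUP = "mdwdo-park-dev-rg"   # placeholder
FACTORY_NAME = "mdwdo-park-dev-adf"    # placeholder
SUBSCRIPTION_ID = "<subscription-id>"  # placeholder


@pytest.fixture(scope="module")
def adf_client():
    return DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)


def test_ingest_pipeline_succeeds(adf_client):
    # Trigger a pipeline run and poll until it reaches a terminal state.
    run = adf_client.pipelines.create_run(RESOURCE_GROUP, FACTORY_NAME, "P_Ingest_ParkingData")
    while True:
        pipeline_run = adf_client.pipeline_runs.get(RESOURCE_GROUP, FACTORY_NAME, run.run_id)
        if pipeline_run.status not in ("Queued", "InProgress"):
            break
        time.sleep(30)
    assert pipeline_run.status == "Succeeded"
```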

### Observability / Monitoring
@@ -242,7 +240,6 @@ More resources:
### Setup and Deployment

**IMPORTANT NOTE:** As with all Azure deployments, this will **incur associated costs**. Remember to tear down all related resources after use to avoid unnecessary costs. See [here](#deployed-resources) for a list of deployed resources.

*NOTE: This deployment was tested using WSL 2 (Ubuntu 18.04) and Debian GNU/Linux 9.9 (stretch)*

1. **Initial Setup**
@@ -261,7 +258,7 @@ More resources:
- **RESOURCE_GROUP_LOCATION** - Azure location to deploy resources. *Default*: `westus`.
- **AZURE_SUBSCRIPTION_ID** - Azure subscription id to use to deploy resources. *Default*: default azure subscription. To see your default, run `az account list`.
- **RESOURCE_GROUP_NAME_PREFIX** - name of the resource group. This will automatically be appended with the environment name. For example: `RESOURCE_GROUP_NAME_PREFIX-dev-rg`. *Default*: mdwdo-park-${DEPLOYMENT_ID}.
- **DEPLOYMENT_ID** - string appended to all resource names. This is to ensure uniqueness of azure resource names. *Default*: random five character string.
- **DEPLOYMENT_ID** - string appended to all resource names. This is to ensure uniqueness of azure resource names. *Default*: random five character string.
- **AZDO_PIPELINES_BRANCH_NAME** - git branch where Azure DevOps pipelines definitions are retrieved from. *Default*: master.
- **AZURESQL_SERVER_PASSWORD** - Password of the SQL Server instance. *Default*: semi-random string.

@@ -287,7 +284,7 @@ More resources:
- Root folder: **/e2e_samples/parking_sensors/adf**
- Import Existing Data Factory resource to repository: **Selected**
- Branch to import resource into: **Use Collaboration**
1. When prompted to select a working branch, select **master**
5. When prompted to select a working branch, select **master**

**IMPORTANT NOTE:** Only the **DEV** Data Factory should be setup with Git integration. Do **NOT** setup git integration in the STG and PROD Data Factories.

@@ -356,15 +353,14 @@ Notes:

ADLS Gen2 is structured as the following:

------------

---------------------
datalake <- filesystem
/sys/databricks/libs <- contains all libs, jars, wheels needed for processing
/data
/lnd <- Bronze - landing folder where all data files are ingested into.
/interim <- Silver - interim (cleansed) tables
/dw <- Gold - final tables
------------
---------------------

### Known Issues, Limitations and Workarounds

21 changes: 14 additions & 7 deletions e2e_samples/parking_sensors/devops/README.md
@@ -1,23 +1,28 @@

# DevOps Pipelines

## Required Variables:
## Required Variables

The following variables need to be set in your Release Pipelines.

### Solution Specific Variables:
### Solution Specific Variables

- devAdfName - Name of the ADF instance which has Git integration enabled.

### Environment Specific Variables:
### Environment Specific Variables

These variables typically change across environments and are best stored in environment-scoped Variable Groups.

- **azureLocation** - location of deployed resources. e.g. "Australia East"
- **rgName** - Target Resource Group of the deployment
- **databricksNotebookPath** - Databricks workspace path where notebooks will be uploaded. (e.g. /myworkspace/notebooks)
- **databricksDbfsLibPath** - Databricks DBFS path where Python whl files will be uploaded. (e.g. dbfs:/mnt/datalake/sys/databricks/libs)
- **adfName** - Target Azure Data Factory of the deployment
- **apiBaseUrl** - Base API URL (e.g. https://data.melbourne.vic.gov.au/resource/)
- **apiBaseUrl** - Base API URL (e.g. [https://data.melbourne.vic.gov.au/resource/](https://data.melbourne.vic.gov.au/resource/))

#### Secure Variables

These are best stored within KeyVault, then [exposed via a Variable Group](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/variable-groups?view=azure-devops&tabs=yaml#link-secrets-from-an-azure-key-vault).

- **sqlsrvrName** - Target SQL server name
- **sqlsrvUsername** - Target SQL server username
- **sqlsrvrPassword** - Target SQL server password
@@ -30,11 +35,13 @@ These are best stored within KeyVault, then [exposed via a Variable Group](https
- **spAdfId** - Service Principal Id used to run Data Factory Integration tests
- **spAdfPass** - Service Principal password used to run Data Factory Integration tests
- **spAdfTenantId** - Service Principal tenant Id used to run Data Factory Integration tests
- **subscriptionId** - Azure Subscription Id
- **subscriptionId** - Azure Subscription Id

## Required Service Connections

The following are [service connections](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/service-endpoints?view=azure-devops&tabs=yaml) that need to be in place.

- github_service_connection - used to check out a pipeline resource (adf_publish).
- azure_service_connection_dev - used in the release pipeline to deploy to the **dev** Azure environment.
- azure_service_connection_stg - used in the release pipeline to deploy to the **stg** Azure environment.
- azure_service_connection_prod - used in the release pipeline to deploy to the **prod** Azure environment.
- azure_service_connection_prod - used in the release pipeline to deploy to the **prod** Azure environment.
43 changes: 22 additions & 21 deletions e2e_samples/parking_sensors/docs/CI_CD.md
@@ -1,26 +1,27 @@
### Build Pipelines
# Build Pipelines

1. **Build - Quality Assurance**
- Purpose: Ensure code quality and integrity
- Trigger: Pull Request to Master
- Steps:
1. Build Python packages
2. Run unit tests
3. Code Coverage
4. Linting
2. **Build - Artifacts**
- Purpose: To produce necessary artifacts for Release
- Trigger: Commit to Master
- Steps:
1. Build and create Python Wheel
2. Publish artifacts:
- Python Wheel
- Databricks Notebooks and cluster configuration
- Data Factory pipeline definitions
- IaC - ARM templates, Bash scripts
- 3rd party library dependencies (JARs, etc)
- Purpose: Ensure code quality and integrity
- Trigger: Pull Request to Master
- Steps:
1. Build Python packages
2. Run unit tests
3. Code Coverage
4. Linting

1. **Build - Artifacts**
- Purpose: To produce necessary artifacts for Release
- Trigger: Commit to Master
- Steps:
1. Build and create Python Wheel
2. Publish artifacts:
- Python Wheel
- Databricks Notebooks and cluster configuration
- Data Factory pipeline definitions
- IaC - ARM templates, Bash scripts
- 3rd party library dependencies (JARs, etc)

### Release Pipelines
## Release Pipelines

Currently, there is one multi-stage release pipeline with the following stages. Each stage deploys to a different environment.

@@ -56,4 +57,4 @@ Currently, there is one multi-stage release pipeline with the following stages.
- Upload Jars, Python wheels to DBFS
- Install libraries on cluster
2. Deploy ADF Pipeline
3. Swap between existing deployment and newly released deployment
3. Swap between existing deployment and newly released deployment