feat: Add Azure Data Factory module #397

Merged · 5 commits · Jul 31, 2020
Add ADF module
helayoty committed Jul 29, 2020
commit 5d387e7b01bc520372c2eb532ccf12932de77040
130 changes: 130 additions & 0 deletions infra/modules/providers/azure/data-factory/README.md
@@ -0,0 +1,130 @@
# Data Factory

This Terraform-based `data-factory` module grants templates the ability to create a Data Factory instance along with its main components.

## _More on Data Factory_

Azure Data Factory is the cloud-based ETL and data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores. You can build complex ETL processes that transform data visually with data flows or by using compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database.

Additionally, you can publish your transformed data to data stores such as Azure SQL Data Warehouse for business intelligence (BI) applications to consume. Ultimately, through Azure Data Factory, raw data can be organized into meaningful data stores and data lakes for better business decisions.

For more information, see the Microsoft Azure Data Factory [documentation](https://docs.microsoft.com/en-us/azure/data-factory/introduction).

## Characteristics

An instance of the `data-factory` module deploys the _**Data Factory**_ in order to provide templates with the following:

- Ability to provision a single Data Factory instance
- Ability to provision a configurable Pipeline
- Ability to configure a Trigger
- Ability to configure a SQL Server Dataset
- Ability to configure a SQL Server Linked Service

## Out Of Scope

The following are not supported at this time:

- Creating multiple pipelines
- Datasets and Linked Services other than SQL Server

## Definition

Terraform resources used to define the `data-factory` module include the following:

- [azurerm_data_factory](https://www.terraform.io/docs/providers/azurerm/r/data_factory.html)
- [azurerm_data_factory_integration_runtime_managed](https://www.terraform.io/docs/providers/azurerm/r/data_factory_integration_runtime_managed.html)
- [azurerm_data_factory_pipeline](https://www.terraform.io/docs/providers/azurerm/r/data_factory_pipeline.html)
- [azurerm_data_factory_trigger_schedule](https://www.terraform.io/docs/providers/azurerm/r/data_factory_trigger_schedule.html)
- [azurerm_data_factory_dataset_sql_server](https://www.terraform.io/docs/providers/azurerm/r/data_factory_dataset_sql_server_table.html)
- [azurerm_data_factory_linked_service_sql_server](https://www.terraform.io/docs/providers/azurerm/r/data_factory_linked_service_sql_server.html)

## Usage

Data Factory usage example:

```terraform
module "data_factory" {
  source                           = "../../modules/providers/azure/data-factory"
  data_factory_name                = "adf"
  resource_group_name              = "rg"
  data_factory_runtime_name        = "adfrt"
  node_size                        = "Standard_D2_v3"
  number_of_nodes                  = 1
  edition                          = "Standard"
  max_parallel_executions_per_node = 1

  vnet_integration = {
    vnet_id     = "/subscriptions/resourceGroups/providers/Microsoft.Network/virtualNetworks/testvnet"
    subnet_name = "default"
  }

  data_factory_pipeline_name                = "adfpipeline"
  data_factory_trigger_name                 = "adftrigger"
  data_factory_trigger_interval             = 1
  data_factory_trigger_frequency            = "Minute"
  data_factory_dataset_sql_name             = "adfsqldataset"
  data_factory_dataset_sql_table_name       = "adfsqldatasettable"
  data_factory_dataset_sql_folder           = ""
  data_factory_linked_sql_name              = "adfsqllinked"
  data_factory_linked_sql_connection_string = "Server=tcp:adfsql..."
}
```
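Other resources in a consuming template can then reference the module's exported values. A minimal sketch, assuming a hypothetical `storage_account_id` input (not part of this module) that grants the factory's managed identity read access to a blob store:

```terraform
# Hypothetical consumer: grant the Data Factory's system-assigned identity
# read access to a storage account using the module's exported principal ID.
# `var.storage_account_id` is an assumed input, not defined by this module.
resource "azurerm_role_assignment" "adf_blob_reader" {
  scope                = var.storage_account_id
  role_definition_name = "Storage Blob Data Reader"
  principal_id         = module.data_factory.adf_identity_principal_id
}
```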

## Outputs

The module exports the following outputs:

```terraform
output "resource_group_name" {
  description = "The resource group name of the Data Factory."
  value       = data.azurerm_resource_group.main.name
}

output "data_factory_name" {
  description = "The name of the Azure Data Factory created."
  value       = azurerm_data_factory.main.name
}

output "data_factory_id" {
  description = "The ID of the Azure Data Factory created."
  value       = azurerm_data_factory.main.id
}

output "identity_principal_id" {
  description = "The ID of the principal (client) in Azure Active Directory."
  value       = azurerm_data_factory.main.identity[0].principal_id
}

output "pipeline_name" {
  description = "The name of the pipeline created."
  value       = azurerm_data_factory_pipeline.main.name
}

output "trigger_interval" {
  description = "The trigger interval time for the pipeline created."
  value       = azurerm_data_factory_trigger_schedule.main.interval
}

output "sql_dataset_id" {
  description = "The ID of the SQL Server dataset created."
  value       = azurerm_data_factory_dataset_sql_server_table.main.id
}

output "sql_linked_service_id" {
  description = "The ID of the SQL Server linked service created."
  value       = azurerm_data_factory_linked_service_sql_server.main.id
}

output "adf_identity_principal_id" {
  description = "The ID of the principal (client) in Azure Active Directory."
  value       = azurerm_data_factory.main.identity[0].principal_id
}

output "adf_identity_tenant_id" {
  description = "The tenant ID for the service principal associated with the managed identity of this Data Factory."
  value       = azurerm_data_factory.main.identity[0].tenant_id
}
```

## Argument Reference

Supported arguments for this module are documented in [variables.tf](variables.tf).
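As an illustration of the declaration style only (the authoritative definitions live in variables.tf), a trigger-frequency variable might look like:

```terraform
# Illustrative sketch — see variables.tf for the actual definition and defaults.
variable "data_factory_trigger_frequency" {
  description = "Frequency of the schedule trigger. Valid values are Minute, Hour, Day, Week, and Month."
  type        = string
  default     = "Minute"
}
```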
8 changes: 8 additions & 0 deletions infra/modules/providers/azure/data-factory/datasets.tf
@@ -0,0 +1,8 @@

resource "azurerm_data_factory_dataset_sql_server_table" "main" {
  name                = var.data_factory_dataset_sql_name
  resource_group_name = data.azurerm_resource_group.main.name
  data_factory_name   = azurerm_data_factory.main.name
  linked_service_name = azurerm_data_factory_linked_service_sql_server.main.name
  table_name          = var.data_factory_dataset_sql_table_name
}
@@ -0,0 +1,8 @@

resource "azurerm_data_factory_linked_service_sql_server" "main" {
  name                     = var.data_factory_linked_sql_name
  resource_group_name      = data.azurerm_resource_group.main.name
  data_factory_name        = azurerm_data_factory.main.name
  connection_string        = var.data_factory_linked_sql_connection_string
  integration_runtime_name = azurerm_data_factory_integration_runtime_managed.main.name
}
51 changes: 51 additions & 0 deletions infra/modules/providers/azure/data-factory/main.tf
@@ -0,0 +1,51 @@
module "azure-provider" {
  source = "../provider"
}

data "azurerm_resource_group" "main" {
  name = var.resource_group_name
}

resource "azurerm_data_factory" "main" {
  # required
  name                = var.data_factory_name
  resource_group_name = data.azurerm_resource_group.main.name
  location            = data.azurerm_resource_group.main.location

  # This is static because "SystemAssigned" is the only identity type currently available.
  identity {
    type = "SystemAssigned"
  }
}

resource "azurerm_data_factory_integration_runtime_managed" "main" {
  name                             = var.data_factory_runtime_name
  data_factory_name                = azurerm_data_factory.main.name
  resource_group_name              = data.azurerm_resource_group.main.name
  location                         = data.azurerm_resource_group.main.location
  node_size                        = var.node_size
  number_of_nodes                  = var.number_of_nodes
  edition                          = var.edition
  max_parallel_executions_per_node = var.max_parallel_executions_per_node

  vnet_integration {
    vnet_id     = var.vnet_integration.vnet_id
    subnet_name = var.vnet_integration.subnet_name
  }
}

resource "azurerm_data_factory_pipeline" "main" {
  name                = var.data_factory_pipeline_name
  resource_group_name = data.azurerm_resource_group.main.name
  data_factory_name   = azurerm_data_factory.main.name
}

resource "azurerm_data_factory_trigger_schedule" "main" {
  name                = var.data_factory_trigger_name
  data_factory_name   = azurerm_data_factory.main.name
  resource_group_name = data.azurerm_resource_group.main.name
  pipeline_name       = azurerm_data_factory_pipeline.main.name

  interval  = var.data_factory_trigger_interval
  frequency = var.data_factory_trigger_frequency
}
50 changes: 50 additions & 0 deletions infra/modules/providers/azure/data-factory/output.tf
@@ -0,0 +1,50 @@
output "resource_group_name" {
  description = "The resource group name of the Data Factory."
  value       = data.azurerm_resource_group.main.name
}

output "data_factory_name" {
  description = "The name of the Azure Data Factory created."
  value       = azurerm_data_factory.main.name
}

output "data_factory_id" {
  description = "The ID of the Azure Data Factory created."
  value       = azurerm_data_factory.main.id
}

output "identity_principal_id" {
  description = "The ID of the principal (client) in Azure Active Directory."
  value       = azurerm_data_factory.main.identity[0].principal_id
}

output "pipeline_name" {
  description = "The name of the pipeline created."
  value       = azurerm_data_factory_pipeline.main.name
}

output "trigger_interval" {
  description = "The trigger interval time for the pipeline created."
  value       = azurerm_data_factory_trigger_schedule.main.interval
}

output "sql_dataset_id" {
  description = "The ID of the SQL Server dataset created."
  value       = azurerm_data_factory_dataset_sql_server_table.main.id
}

output "sql_linked_service_id" {
  description = "The ID of the SQL Server linked service created."
  value       = azurerm_data_factory_linked_service_sql_server.main.id
}

output "adf_identity_principal_id" {
  description = "The ID of the principal (client) in Azure Active Directory."
  value       = azurerm_data_factory.main.identity[0].principal_id
}

output "adf_identity_tenant_id" {
  description = "The tenant ID for the service principal associated with the managed identity of this Data Factory."
  value       = azurerm_data_factory.main.identity[0].tenant_id
}

12 changes: 12 additions & 0 deletions infra/modules/providers/azure/data-factory/terraform.tfvars
@@ -0,0 +1,12 @@
resource_group_name                       = ""
data_factory_name                         = ""
data_factory_runtime_name                 = ""
data_factory_pipeline_name                = ""
data_factory_dataset_sql_name             = ""
data_factory_dataset_sql_table_name       = ""
data_factory_linked_sql_name              = ""
data_factory_linked_sql_connection_string = ""

vnet_integration = {
  vnet_id     = ""
  subnet_name = ""
}
@@ -0,0 +1,3 @@
RESOURCE_GROUP_NAME="..."
STORAGE_ACCOUNT_NAME="..."
CONTAINER_NAME="..."
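These values typically map onto an `azurerm` remote-state backend. A sketch of the corresponding backend block (the `key` name is hypothetical; concrete values are usually supplied at `terraform init` time via `-backend-config` rather than hard-coded):

```terraform
terraform {
  backend "azurerm" {
    resource_group_name  = "..."                  # RESOURCE_GROUP_NAME
    storage_account_name = "..."                  # STORAGE_ACCOUNT_NAME
    container_name       = "..."                  # CONTAINER_NAME
    key                  = "data-factory.tfstate" # hypothetical state key
  }
}
```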
@@ -0,0 +1,41 @@
package integration

import (
	"os"
	"testing"

	"github.com/microsoft/cobalt/infra/modules/providers/azure/data-factory/tests"
	"github.com/microsoft/terratest-abstraction/integration"
)

var subscription = os.Getenv("ARM_SUBSCRIPTION_ID")

func TestDataFactory(t *testing.T) {
	testFixture := integration.IntegrationTestFixture{
		GoTest:                t,
		TfOptions:             tests.DataFactoryTFOptions,
		ExpectedTfOutputCount: 8,
		TfOutputAssertions: []integration.TerraformOutputValidation{
			VerifyCreatedDataFactory(subscription,
				"resource_group_name",
				"data_factory_name",
			),
			VerifyCreatedPipeline(subscription,
				"resource_group_name",
				"data_factory_name",
				"pipeline_name",
			),
			VerifyCreatedDataset(subscription,
				"resource_group_name",
				"data_factory_name",
				"sql_dataset_id",
			),
			VerifyCreatedLinkedService(subscription,
				"resource_group_name",
				"data_factory_name",
				"sql_linked_service_id",
			),
		},
	}
	integration.RunIntegrationTests(&testFixture)
}
@@ -0,0 +1,82 @@
package integration

import (
	"testing"

	"github.com/microsoft/cobalt/test-harness/terratest-extensions/modules/azure"
	"github.com/microsoft/terratest-abstraction/integration"
	"github.com/stretchr/testify/require"
)

// healthCheck - Asserts that the deployment was successful.
func healthCheck(t *testing.T, provisionState *string) {
	require.Equal(t, "Succeeded", *provisionState, "The deployment hasn't succeeded.")
}

// VerifyCreatedDataFactory - validate the created data factory
func VerifyCreatedDataFactory(subscriptionID, resourceGroupOutputName, dataFactoryOutputName string) func(goTest *testing.T, output integration.TerraformOutput) {
	return func(goTest *testing.T, output integration.TerraformOutput) {
		dataFactory := output[dataFactoryOutputName].(string)
		resourceGroup := output[resourceGroupOutputName].(string)

		dataFactoryNameFromAzure := azure.GetDataFactoryNameByResourceGroup(
			goTest,
			subscriptionID,
			resourceGroup)

		require.Equal(goTest, dataFactoryNameFromAzure, dataFactory, "The data factory does not exist")
	}
}

// VerifyCreatedPipeline - validate the pipeline name for the created data factory
func VerifyCreatedPipeline(subscriptionID, resourceGroupOutputName, dataFactoryOutputName, pipelineOutputName string) func(goTest *testing.T, output integration.TerraformOutput) {
	return func(goTest *testing.T, output integration.TerraformOutput) {
		pipelineNameFromOutput := output[pipelineOutputName].(string)

		dataFactory := output[dataFactoryOutputName].(string)
		resourceGroup := output[resourceGroupOutputName].(string)

		pipelineNameFromAzure := azure.GetPipeLineNameByDataFactory(
			goTest,
			subscriptionID,
			resourceGroup,
			dataFactory)

		require.Equal(goTest, pipelineNameFromAzure, pipelineNameFromOutput, "The pipeline does not exist in the data factory")
	}
}

// VerifyCreatedDataset - validate the SQL dataset for the created pipeline
func VerifyCreatedDataset(subscriptionID, resourceGroupOutputName, dataFactoryOutputName, datasetOutputID string) func(goTest *testing.T, output integration.TerraformOutput) {
	return func(goTest *testing.T, output integration.TerraformOutput) {
		datasetIDFromOutput := output[datasetOutputID].(string)

		dataFactory := output[dataFactoryOutputName].(string)
		resourceGroup := output[resourceGroupOutputName].(string)

		datasetIDFromAzure := azure.ListDatasetIDByDataFactory(goTest,
			subscriptionID,
			resourceGroup,
			dataFactory)

		require.Contains(goTest, *datasetIDFromAzure, datasetIDFromOutput, "The dataset does not exist")
	}
}

// VerifyCreatedLinkedService - validate the SQL linked service for the created pipeline
func VerifyCreatedLinkedService(subscriptionID, resourceGroupOutputName, dataFactoryOutputName, linkedServiceIDOutputName string) func(goTest *testing.T, output integration.TerraformOutput) {
	return func(goTest *testing.T, output integration.TerraformOutput) {
		linkedServiceIDFromOutput := output[linkedServiceIDOutputName].(string)

		dataFactory := output[dataFactoryOutputName].(string)
		resourceGroup := output[resourceGroupOutputName].(string)

		linkedServiceIDFromAzure := azure.ListLinkedServicesIDByDataFactory(goTest,
			subscriptionID,
			resourceGroup,
			dataFactory)

		require.Contains(goTest, *linkedServiceIDFromAzure, linkedServiceIDFromOutput, "The Linked Service does not exist")
	}
}