This repo contains a generalized solution for running a social post processor on …
For a description of the architecture and the investigation behind this solution, see the [Code Story](https://www.microsoft.com/developerblog/2018/12/12/databricks-ci-cd-pipeline-using-travis/).

The Data Pipeline consists of:

- Ingesting tweets from Twitter
- Enriching tweets with *Language* and *Associated Entities*
- Identifying recent trends (last 15 minutes)
- Identifying long-term trends (over the span of a week or a month)
- Saving historical data in an SQL database
- Sending an email (or triggering an Azure Function event) on new alerts

This repo also integrates a **CI/CD Pipeline** as part of the generalized solution with e-2-e testing.
The CI/CD Pipeline consists of:

- A TravisCI-based process (see [.travis.yml](.travis.yml))
- A Build Status Tag (to see if the last build/PR is successful or failing)
- Building of artifacts
- Deploying notebooks and artifacts into the Azure Databricks test environment (using databricks-cli; see the sketch after this list)
- Executing the pipeline on the test environment
- Observing the generated alerts to determine success/fail
- Cleanup solution
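
As a rough illustration of the deployment step above, here is a minimal sketch of how notebooks and artifacts might be pushed to the test workspace and the pipeline triggered with databricks-cli. The workspace path, jar name, and job id are hypothetical placeholders, not values taken from this repo.

```sh
# Illustrative only (hypothetical paths/ids): deploy notebooks and artifacts to
# the test workspace with databricks-cli, then trigger the pipeline job.

# Authenticate the CLI against the test workspace (prompts for host and token).
databricks configure --token

# Import all local notebooks into the workspace, overwriting existing ones.
databricks workspace import_dir -o ./notebooks /Shared/social-pipeline

# Copy a built artifact (e.g. a jar produced by the build step) to DBFS.
databricks fs cp --overwrite ./dist/social-pipeline.jar dbfs:/artifacts/social-pipeline.jar

# Kick off the pipeline job on the test environment.
databricks jobs run-now --job-id 42
```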

## Data Pipeline Architecture

![Pipeline Architecture](/docs/ci-cd-pipeline-cloud-architecture.png)

## CI/CD Pipeline Architecture

![CI/CD Pipeline Architecture](/docs/ci-cd-pipeline-ci-cd-diagram.png)

# Deployment
Ensure you are in the root of the repository and logged in to the Azure CLI by r…
- Check `requirements.txt` for the list of necessary Python packages (these will be installed by `make requirements`).
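
Since only part of the prerequisites text is visible here, the following is a hedged sketch of the Azure CLI login step it refers to; `az login` and `az account set` are standard Azure CLI commands, but the exact commands this repo expects are an assumption.

```sh
# Assumed Azure CLI login prerequisite (run from the repository root).
az login
az account set --subscription "<your-subscription-id>"

# The Python dependencies themselves are installed later, inside the virtual
# environment, via `make requirements` (see the Deployment Machine section).
```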

## Deployment Machine

The deployment is done using a [Python virtual environment](https://docs.python-guide.org/dev/virtualenvs/).

- The following works with [Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/install-win10)
- `virtualenv .` This creates a Python virtual environment to work in.
- `source bin/activate` This activates the virtual environment.
- TODO: Add _ext.env
- `make requirements`. This installs the Python dependencies in the virtual environment.
- WARNING: The line endings of the two shell scripts `deploy.sh` and `databricks/configure.sh` may cause errors in your interpreter. You can change the line endings by opening the files in VS Code and changing them in the bottom right of the editor, or use the command-line alternative sketched after this list.
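
For convenience, here is an end-to-end sketch of the setup steps above, including a command-line alternative for the line-ending fix; the `sed` invocation is an assumption of mine, not a script shipped with the repo.

```sh
# Sketch of the deployment-machine setup described above (WSL/Linux shell).

# Create and activate a Python virtual environment in the repository root.
virtualenv .
source bin/activate

# Install the Python dependencies listed in requirements.txt.
make requirements

# Optional alternative to the VS Code fix: strip Windows (CRLF) line endings
# from the two shell scripts so they run cleanly in a Unix shell.
sed -i 's/\r$//' deploy.sh databricks/configure.sh
```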

## Deploy Entire Solution

```
DBENV_TWITTER_OAUTH_TOKEN_SECRET={FROM_TWITTER}
```

# Integration Tests

Main Assumption: The current design of the integration test pipeline enables only one test to run e-2-e at any given moment, because of shared resources.
That said, if the integration tests were able to spin up/down an entire environment, this would not be an issue, since each test would run in an encapsulated environment. The ingest notebook allows you to input a custom source and run the pipeline on this source; a hedged example of doing so is sketched below.
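
As a rough illustration of that last point, the sketch below submits a one-off run of the ingest notebook against a custom source with databricks-cli; the notebook path, cluster id and `source` parameter name are hypothetical, since the notebook's actual parameter contract isn't shown here.

```sh
# Submit a one-off run of the ingest notebook against a custom source
# (the notebook path, cluster id and parameter name are placeholders).
databricks runs submit --json '{
  "run_name": "e2e-ingest-custom-source",
  "existing_cluster_id": "<test-cluster-id>",
  "notebook_task": {
    "notebook_path": "/Shared/social-pipeline/ingest",
    "base_parameters": { "source": "dbfs:/tests/sample_tweets.json" }
  }
}'
```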

## Deploying a Test environment

To create a new secondary environment that's ready for integration testing, you need to deploy a new environment; there's no need to configure it.
For that purpose you can run the following commands:

```sh
make deploy_resources resource-group-name=test-social-rg region=westeurope subscription-id=5b86ec85-0709-4021-b73c-7a089d413ff0
make create_secrets
```

Those two commands will deploy a new environment to Azure, then configure the Databricks environment with the appropriate secrets.
You will also need to create a local file `databricks.env` in the root of the project, containing:

```
DBENV_SQL_JDBC_PORT=1433
```
(You can use the full file with the Twitter production configuration as well; those keys will simply be ignored in the test environment.)
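
The internals of `make create_secrets` aren't shown in this README, but conceptually it makes values like these available to the Databricks workspace. Below is a sketch using the standard databricks-cli secrets commands; the scope and key names are assumptions, not the repo's actual naming.

```sh
# Pushing databricks.env values into a Databricks secret scope (scope and key
# names are assumed; `make create_secrets` encapsulates the repo's real logic).

# Export the local env file into the current shell.
set -a
source databricks.env
set +a

# Create a secret scope and store individual values in it.
databricks secrets create-scope --scope social-pipeline --initial-manage-principal users
databricks secrets put --scope social-pipeline --key sql-jdbc-port --string-value "$DBENV_SQL_JDBC_PORT"
```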

## Connect to Travis-CI

This project shows how to connect [Travis-CI](https://travis-ci.org) to enable continuous integration and e2e validation.
To achieve that, you need to perform the following tasks:

The [test.sh](/.travis/test.sh) script, run by Travis, activates the make command…
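
Since only a fragment of the Travis description is visible here, the following is a hedged sketch of what a Travis-driven e2e flow could look like, mirroring the CI/CD steps listed earlier; the make target names are hypothetical placeholders, not targets confirmed by this README.

```sh
# Travis-driven e2e flow (make target names are hypothetical): deploy to the
# test environment, run the pipeline, check for the expected alerts, clean up.
set -euo pipefail

make deploy_notebooks   # push notebooks and artifacts to the test workspace
make run_pipeline       # execute the pipeline on the test environment
make verify_alerts      # observe generated alerts to determine success/fail
make cleanup            # clean up the shared test resources
```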

> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 145.0 failed 4 times, most recent failure: Lost task 0.3 in stage 145.0 (TID 1958, 10.139.64.4, executor 0): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$9: (string) => string)

This issue may be `Caused by: org.apache.http.client.HttpResponseException: Too Many Requests` due to the Cognitive Services throttling limit on API requests.

> java.util.NoSuchElementException: An error occurred while enumerating the result, check the original exception for details.
