docs(ingestion): add data programmatically - EUBFR-127 (#80)
* Start docs programmatic ingestion

* Updates

* Updates

* Improve a bit

* Refer to existing guidelines

* Documentation updates

* Remove repeated information

* Correct explanation
kalinchernev authored Jan 24, 2018
1 parent c0aae25 commit e01bfdf
Showing 5 changed files with 99 additions and 38 deletions.
47 changes: 9 additions & 38 deletions HOW_TO_TEST.md → docs/HOW_TO_TEST.md
@@ -4,28 +4,7 @@ If the environment has already been set up, you can skip to [Send data](#send-data)

## Setup your environment

-Get latest version of eubfr (clone from github)
-
-Copy config.example.json to config.json and set the values according to your environment:
-
-```json
-{
-  "eubfr_env": "test",
-  "region": "eu-central-1",
-  "stage": "<username><n>"
-}
-```
-
-For example:
-
-```json
-{
-  "eubfr_env": "test",
-  "region": "eu-central-1",
-  "stage": "chernka3"
-}
-```
+Follow [these instructions](../README.md) to start your staged development environment.

## Get your AWS credentials

@@ -36,31 +15,23 @@ To get an account at the project's AWS namespace, please contact a PM to have an
When you receive your initial credentials for your account related to the project, please do not forget to change your password and enable MFA following the [best practices](http://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html).

Given you have access to the project's namespace in AWS, bear in mind the following regarding your account:

* We use a shared AWS account with several users. Each user has their own stage variable to separate their work.
* [Create an access key](http://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html) for your account to be able to use it programmatically.
* Also note down your secret access key; it goes hand in hand with your access key ID.

By the end of the process of setting up your AWS account, you should end up with 2 environment variables, `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, for managing sessions programmatically.

In your terminal application, write the following:

* `export AWS_ACCESS_KEY_ID=<your_access_key>`
* `export AWS_SECRET_ACCESS_KEY=<your_secret_key>`

For more information about access key setup, see [this serverless guide](https://serverless.com/framework/docs/providers/aws/guide/credentials/).

## Deploy

-Make sure your local environment meets the [requirements](https://github.com/ec-europa/eubfr-data-lake#requirements)
-
-Make sure your environment is ready
-
-- `yarn`
-
-Deploy on root
-
-- go to root of the project
-- `yarn deploy`
-- `yarn deploy-demo`
-
-It automatically creates everything you need (bucket, database, ...)
+Follow [these instructions](../README.md) to deploy a project under your name and stage.

## Send data

90 changes: 90 additions & 0 deletions docs/PUSHING_DATA.md
@@ -0,0 +1,90 @@
# Pushing information to EUBFR data lake

This is a high-level guide explaining the low-level approach to ingesting data programmatically. Because the work is done over HTTP via RESTful requests, you can either use the same tools as described here, or use any other custom script or tool that works with the protocol and is more convenient for you.

## Getting credentials

To receive an AWS access key ID and secret, please contact the EUBFR PM. These credentials will be provided privately.

## Ingestion-related endpoints

The information necessary for managing the ingestion process for a producer is currently split between 2 APIs:

* **Storage API**: managing the physical files used for the ingestion
* **Meta Index API**: managing the meta data for the physical files

Note that root endpoints change over time. Please request this information for the stage and endpoint you need for your implementation. You will be notified each time an address changes.

### Storage API

All request headers should comply with the [signature version 4 signing process](https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html), which gives producers secure, temporary access to AWS resources. On top of that, some operations require specific header keys identifying the information to be managed.

Here's an example endpoint for the `test` stage environment:

`API -> https://ti5rsoocwg.execute-api.eu-central-1.amazonaws.com/test/`

| # | Operation | Endpoint | Method | Headers on top of AWS signature |
| --- | --------------------- | -------------------------- | ------ | ------------------------------- |
| 1 | Get signed upload URL | {`API`}/storage/signed_url | GET | `x-amz-meta-producer-key` |
| 2 | Download | {`API`}/storage/download | GET | `x-amz-meta-computed-key` |
| 3 | Delete | {`API`}/storage/delete | GET | `x-amz-meta-computed-key` |
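
To illustrate, here is a minimal sketch of calling operation (1) from Node.js with SigV4-signed headers. It assumes the `aws4` npm package and the example `test` endpoint above; the producer key value is a placeholder to replace with your own file name.

```js
// Sketch: request a signed upload URL (operation 1) with SigV4-signed headers.
// Assumes the `aws4` npm package and the example `test` stage endpoint above;
// the producer key below is a placeholder.
const https = require('https');
const aws4 = require('aws4');

const request = {
  host: 'ti5rsoocwg.execute-api.eu-central-1.amazonaws.com',
  path: '/test/storage/signed_url',
  service: 'execute-api',
  region: 'eu-central-1',
  method: 'GET',
  headers: { 'x-amz-meta-producer-key': 'agri_history.csv' },
};

// Adds the Authorization and X-Amz-Date headers required by Signature Version 4.
aws4.sign(request, {
  accessKeyId: process.env.AWS_ACCESS_KEY_ID,
  secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
});

https.get(request, res => {
  let body = '';
  res.on('data', chunk => (body += chunk));
  res.on('end', () => console.log(res.statusCode, body));
});
```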

### Meta Index API

Here's an example endpoint for the `test` stage environment:

`API -> https://search-test-meta-{domainId}.eu-central-1.es.amazonaws.com/`

This endpoint is based on [AWS (managed) Elasticsearch](https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-supported-es-operations.html), so you can refer to [official API documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs.html) when working with this interface.
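
As a quick illustration, here is a minimal sketch of listing the files of a producer with a plain search request. The `{domainId}` placeholder must be replaced with the value for your stage, and depending on how the domain is configured the request may also need to be SigV4-signed (service name `es`), in the same way as the Storage API calls.

```js
// Sketch: list files ingested by a given producer through the meta index.
// Replace {domainId} with the actual value for your stage; adjust signing
// and access to the configuration of the Elasticsearch domain.
const https = require('https');

const META_API =
  'https://search-test-meta-{domainId}.eu-central-1.es.amazonaws.com';

https.get(`${META_API}/_search?q=producer_id:agri`, res => {
  let body = '';
  res.on('data', chunk => (body += chunk));
  res.on('end', () => {
    const results = JSON.parse(body);
    // Each hit's `_source` holds fields such as `computed_key` and `status`,
    // as shown in the example response later in this guide.
    results.hits.hits.forEach(hit => console.log(hit._source));
  });
});
```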

## Ingestion workflows

Below you can find a short overview of how to achieve CRUD operations while managing the ingestion of your projects.

### Uploading data

In order to upload new data, you will need to start with operation (1), which will provide you with a signed URL. This URL gives you temporary permission to add a new physical file to the AWS S3 bucket. The file formats supported so far are `csv`, `json`, `xls` and `xml`.

Adding a physical file to a specific S3 bucket on AWS is the first stage of the ingestion process; the rest of the extraction, transformation and loading operations run automatically against that S3 bucket. The producer does not need to do anything beyond managing their data files.

![Getting a signed upload URL](./assets/signed-upload-flow.gif)

When you get the signed URL, make a `PUT` request to it, attaching the file and the exact `x-amz-meta-producer-key` that was signed.

![Uploading data from a signed URL](./assets/upload-data.gif)

If header keys are missing or wrong, or if the request's validity has expired, you will get a response with a warning status code and a message describing the issue at hand. In case of success, the client will receive status code `200`.
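
Putting the two steps together, here is a minimal sketch of the upload flow in Node.js, again assuming the `aws4` npm package and the example `test` endpoint. The exact shape of the `signed_url` response body is an assumption, so adjust the parsing to what the endpoint actually returns.

```js
// Sketch of the upload flow: get a signed URL (operation 1), then PUT the file.
// Assumes the `aws4` npm package and the example `test` stage endpoint; the
// shape of the signed_url response body is an assumption — adjust as needed.
const fs = require('fs');
const https = require('https');
const { URL } = require('url');
const aws4 = require('aws4');

const producerKey = 'agri_history.csv'; // placeholder file name
const fileContents = fs.readFileSync(producerKey);

const credentials = {
  accessKeyId: process.env.AWS_ACCESS_KEY_ID,
  secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
};

const request = aws4.sign(
  {
    host: 'ti5rsoocwg.execute-api.eu-central-1.amazonaws.com',
    path: '/test/storage/signed_url',
    service: 'execute-api',
    region: 'eu-central-1',
    headers: { 'x-amz-meta-producer-key': producerKey },
  },
  credentials
);

https.get(request, res => {
  let body = '';
  res.on('data', chunk => (body += chunk));
  res.on('end', () => {
    // The signed URL is assumed to come back as the (possibly JSON-quoted) body.
    const signedUrl = body.trim().replace(/^"|"$/g, '');
    const target = new URL(signedUrl);

    // PUT the file to the signed URL, keeping the producer key that was signed.
    const upload = https.request(
      {
        host: target.host,
        path: target.pathname + target.search,
        method: 'PUT',
        headers: {
          'Content-Length': fileContents.length,
          'x-amz-meta-producer-key': producerKey,
        },
      },
      uploadResponse => console.log('Upload status:', uploadResponse.statusCode)
    );
    upload.end(fileContents);
  });
});
```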

### Getting information about existing data

When you plan to update existing information in the data lake, you first need to see which files already exist. For this, use the Meta Index API.

Here's an example response:

```json
[
  {
    "metadata": {},
    "content_length": 259051,
    "producer_id": "agri",
    "original_key": "agri_history.csv",
    "content_type": "binary/octet-stream",
    "computed_key": "agri/8e387bde-76b8-426b-afe4-c96d8b360b90.csv",
    "status": "parsed",
    "message": "ETL successful",
    "last_modified": "2018-01-23T12:49:17.000Z"
  }
]
```

### Downloading file with existing data

When you need to update information about your projects, you can download the file holding the data currently present in the data lake. Correcting and re-uploading that file will update the information about your projects in the data lake. To download the file with existing data for corrections, use operation (2).

![Download file with existing data](./assets/downloading-data.gif)

If you don't provide the correct header, you will get feedback in the response body. Double-check the information from the meta index when in doubt.
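
Here is a minimal sketch of the download call, under the same assumptions as the upload sketch; the computed key comes from the meta index, as in the example response above.

```js
// Sketch: download the file behind an existing computed key (operation 2).
// Assumes the `aws4` npm package and the example `test` stage endpoint; if the
// endpoint returns a download URL instead of the file contents, fetch that URL
// in a second request.
const fs = require('fs');
const https = require('https');
const aws4 = require('aws4');

const request = aws4.sign(
  {
    host: 'ti5rsoocwg.execute-api.eu-central-1.amazonaws.com',
    path: '/test/storage/download',
    service: 'execute-api',
    region: 'eu-central-1',
    headers: {
      'x-amz-meta-computed-key':
        'agri/8e387bde-76b8-426b-afe4-c96d8b360b90.csv',
    },
  },
  {
    accessKeyId: process.env.AWS_ACCESS_KEY_ID,
    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
  }
);

https.get(request, res => {
  // Save the downloaded contents locally for corrections and re-upload.
  res.pipe(fs.createWriteStream('agri_history.csv'));
});
```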

### Deleting data

When you would like to delete a file, and with it the related data from the data lake, use operation (3).
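
The call mirrors the download one, only against the delete endpoint; a minimal sketch under the same assumptions:

```js
// Sketch: delete the file (and the data related to it) behind a computed key
// (operation 3). Same assumptions as the download sketch: the `aws4` npm
// package, the example `test` stage endpoint, and a computed key taken from
// the meta index.
const https = require('https');
const aws4 = require('aws4');

const request = aws4.sign(
  {
    host: 'ti5rsoocwg.execute-api.eu-central-1.amazonaws.com',
    path: '/test/storage/delete',
    service: 'execute-api',
    region: 'eu-central-1',
    headers: {
      'x-amz-meta-computed-key':
        'agri/8e387bde-76b8-426b-afe4-c96d8b360b90.csv',
    },
  },
  {
    accessKeyId: process.env.AWS_ACCESS_KEY_ID,
    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
  }
);

https.get(request, res => console.log('Delete status:', res.statusCode));
```
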
Binary file added docs/assets/downloading-data.gif
Binary file added docs/assets/signed-upload-flow.gif
Binary file added docs/assets/upload-data.gif
