Optimizing blueprints of cloud data platforms

Research papers

This repository contains the implementation of the following research paper:

  • Francia, Matteo, Matteo Golfarelli, and Manuele Pasini. "Process-Driven Design of Cloud Data Platforms." Submitted to Information Systems (2024).

Getting Started

Main features

  • The application works with scenarios, each defining a data pipeline to be implemented with respect to a given service ecosystem and a taxonomy of tags.
  • The match and select algorithm is implemented in Python 3.9, which is therefore required to run it. The application relies on GraphDB (reachable via browser at http://127.0.0.1:7200) to store the knowledge graphs, and on SPARQL queries to perform the matching part of the algorithm (a query sketch follows this list).
  • The application returns all optimal solutions to the optimization problem.
  • Upon execution, the matched and selected graphs are available in the /dataplatform_design/src/test/scenarios/scenario_{scenario_name}/output directory as .json knowledge graphs and visual .png representations.
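
For reference, SPARQL queries can be issued against a GraphDB repository over plain HTTP. The snippet below is a minimal sketch using the requests library: the repository name my_repo is a placeholder for whatever is configured in config.yml, and the query itself is illustrative rather than one used by the algorithm.

    import requests

    # GraphDB exposes each repository's SPARQL endpoint under /repositories/{id}.
    # "my_repo" is a placeholder for the repository name configured in config.yml.
    endpoint = "http://127.0.0.1:7200/repositories/my_repo"

    # Illustrative query only: count the triples currently stored in the repository.
    query = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"

    response = requests.get(
        endpoint,
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
    )
    response.raise_for_status()

    for binding in response.json()["results"]["bindings"]:
        print(binding["n"]["value"])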

Running the approach

  • The steps necessary to run the approach can be found in the GitHub Actions build workflow.
  • The mandatory (Python) dependencies required to run the script are listed in the requirements.txt file in the project root directory.
  • Alternatively, the approach, data, and figures mentioned in the article can be reproduced through Docker by opening a shell in this project directory and running
    docker compose up --abort-on-container-exit

Scenario setup

A template for a design scenario can be found in /dataplatform_design/resources/scenario_template. Each scenario is organized as follows:

  • scenario_0/: main directory of a scenario.
  • scenario_0/configs/: contains config files.
    • config.yml: stores the algorithm parameters (e.g., the GraphDB IP address, ontology namespaces, etc.);
    • repo-config.ttl: stores the GraphDB repository configuration, such as the ruleset (default: OWL-Max).
  • scenario_0/input/: defines the scenario inputs.
    • adds_constraint/: represents additional constraints, such as preferences and mandatory constraints.
      • preferences.ttl: describes the services to be preferred in the algorithm's solution.
    • ontologies/: contains all ontologies needed to run the scenario.
      • DPDO.ttl: the Data Platform Design Ontology; should never be changed;
      • ServiceEcosystem.ttl: describes the services among which the optimal subset is chosen (default: the AWS service ecosystem);
      • TagTaxonomy.ttl: describes the tag taxonomy through which services and data pipelines are categorized;
      • DFD.ttl: describes the data pipeline to be implemented.
    • solution/:
      • solution.ttl: Expected scenario solution.
  • scenario_0/output/: where the algorithm stores the computed scenario solutions (a loading sketch follows this list).
    • matched_graph.json: describes the matched graph as a .json knowledge graph;
    • matched_graph.png: visual representation of the matched graph;
    • selected_graph_solution_{number}.json: .json representation of solution {number};
    • selected_graph_solution_{number}.png: visual representation of solution {number}.
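
The .json outputs can also be inspected programmatically. The sketch below assumes they are JSON-LD serializations of the knowledge graphs (an assumption, not something the repository states) and uses scenario_0 as a stand-in for an actual scenario name:

    from rdflib import Graph

    # Assumption: the .json outputs are JSON-LD serializations of the knowledge
    # graphs; scenario_0 is a placeholder for an actual scenario name.
    graph = Graph()
    graph.parse(
        "dataplatform_design/src/test/scenarios/scenario_0/output/matched_graph.json",
        format="json-ld",
    )
    print(f"matched graph: {len(graph)} triples")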

A new scenario can be created by running

python dataplatform_design/src/test/create_test_scenario.py --scenario_name {scenario_name}

Upon creation, users can define the DFD.ttl ontology to reflect the pipeline to be implemented. The script above optionally takes a further set of parameters defining the paths of user-defined .ttl files to be used for scenario creation (an example invocation is shown after the list):

  • --service_ecosystem
  • --tag_taxonomy
  • --solution
  • --preferences
  • --dfd
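
For example (the scenario name and file paths below are placeholders):

    python dataplatform_design/src/test/create_test_scenario.py --scenario_name my_scenario --dfd my_ontologies/DFD.ttl --preferences my_ontologies/preferences.ttl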

Please note that when using user-defined ontologies whose namespaces differ from the default ones, the corresponding namespaces and prefixes must be updated in /dataplatform_design/src/test/scenarios/scenario_{scenario_name}/configs/config.yml.
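
As a quick sanity check before running the algorithm, the configured entries can be inspected with PyYAML. This is only a sketch: the path and the namespace/prefix key names are hypothetical, since the actual structure of config.yml may differ.

    import yaml  # requires PyYAML

    # Hypothetical path and key names: adapt them to the actual config.yml.
    path = "dataplatform_design/src/test/scenarios/scenario_my_scenario/configs/config.yml"
    with open(path) as f:
        config = yaml.safe_load(f)

    # Print any entry that looks namespace- or prefix-related.
    for key, value in config.items():
        if "namespace" in str(key).lower() or "prefix" in str(key).lower():
            print(key, "->", value)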

Testing scenarios

Once scenarios have been defined, all of them can be tested by running:

    docker compose up --abort-on-container-exit

During execution, the script will compute the optimal set of services to implement the DFD for each scenario and compare the computed solutions to the proposed ones. The test is considered successful if, for each scenario, at least one computed solution matches the proposed one.
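
Conceptually, the comparison amounts to checking whether a computed solution graph and the proposed solution.ttl describe the same graph. A minimal sketch with rdflib (not the repository's actual test code; scenario_0 and the JSON-LD assumption are as above) could look like this:

    from rdflib import Graph
    from rdflib.compare import isomorphic

    # Paths use the hypothetical scenario_0; the computed solution is assumed
    # to be a JSON-LD serialization (see the loading sketch in "Scenario setup").
    computed = Graph().parse(
        "dataplatform_design/src/test/scenarios/scenario_0/output/selected_graph_solution_0.json",
        format="json-ld",
    )
    proposed = Graph().parse(
        "dataplatform_design/src/test/scenarios/scenario_0/input/solution/solution.ttl",
        format="turtle",
    )

    # isomorphic() compares the two graphs up to blank-node renaming.
    print("solution matches:", isomorphic(computed, proposed))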

Additionally, based on the iteration parameters specified in the .env file, the algorithm will be evaluated for each iteration and scenario. The results will include:

  • Detailed statistics for each scenario and iteration, saved as .csv files in: /dataplatform_design/dataplatform_design/run_statistics/
  • Charts summarizing the statistics, available in: /dataplatform_design/dataplatform_design/run_statistics/plots
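
Once a run has finished, the per-scenario statistics can be aggregated for further analysis. The sketch below uses pandas; the columns available in the resulting frame depend on the .csv files actually produced by the run.

    from pathlib import Path
    import pandas as pd

    # Gather every statistics file produced by the run.
    stats_dir = Path("dataplatform_design/dataplatform_design/run_statistics")
    frames = [pd.read_csv(p) for p in stats_dir.glob("*.csv")]

    # Concatenate and summarize; the available columns depend on the run.
    stats = pd.concat(frames, ignore_index=True)
    print(stats.describe())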
