With Kuwala, we want to enable the global liquid data economy. You probably also envision a future of smart cities, autonomously driving cars, and sustainable living. For all of that, we need to leverage the power of data. Unfortunately, many promising data projects fail because gathering and cleaning data consumes too many resources. Kuwala supports you as a data engineer, data scientist, or business analyst in creating a holistic view of your ecosystem by integrating third-party data seamlessly.
Kuwala explicitly focuses on integrating third-party data, i.e., data that is not under your company's influence, such as weather or population information. To make it easy to combine several domains, we further narrow the scope to data with a geo-component, which still covers a wide range of sources. To match data on different aggregation levels, such as POIs to a moving thunderstorm, we leverage Uber's H3 spatial indexing.
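As a rough illustration of how H3 enables matching across aggregation levels, here is a minimal Python sketch; it assumes the h3-py package (v3 API), and the coordinates and resolutions are made-up examples, not values from Kuwala's pipelines:

```python
# Minimal sketch of cross-resolution matching with H3, assuming the
# h3-py package (v3 API); coordinates and resolutions are made up.
import h3

# Index a POI at a fine resolution (small hexagons).
poi_cell = h3.geo_to_h3(38.7223, -9.1393, 11)

# Index a coarse weather observation (large hexagons).
storm_cell = h3.geo_to_h3(38.72, -9.14, 6)

# Match across aggregation levels by rolling the POI's cell
# up to the storm's resolution and comparing cell IDs.
if h3.h3_to_parent(poi_cell, 6) == storm_cell:
    print("The POI lies inside the thunderstorm cell")
```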
Pipelines wrap individual data sources. Within a pipeline, raw data is cleaned and preprocessed. The preprocessed data is then loaded into a graph to establish connections between the different data points. Based on the graph, Kuwala will create a data lake from which you can load the data into, for example, a data warehouse. Alternatively, it will also be possible to query the graph through a GraphQL endpoint.
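To make that flow concrete, here is a purely illustrative Python sketch of the pipeline pattern described above; the function names, columns, and Cypher query are assumptions for illustration, not Kuwala's actual API:

```python
# Purely illustrative sketch of the pipeline pattern described above;
# function names, columns, and the Cypher query are assumptions.
import pandas as pd

def preprocess(raw_path: str) -> pd.DataFrame:
    """Clean and preprocess the raw data of a single source."""
    raw = pd.read_csv(raw_path)
    cleaned = raw.dropna(subset=["lat", "lng"])    # drop unusable rows
    cleaned["name"] = cleaned["name"].str.strip()  # normalize values
    return cleaned

def load_into_graph(df: pd.DataFrame, session) -> None:
    """Load rows as nodes (via an open neo4j driver session) so that
    different sources can later be connected in the graph."""
    for row in df.itertuples():
        session.run(
            "MERGE (r:Record {id: $id, lat: $lat, lng: $lng})",
            id=row.id, lat=row.lat, lng=row.lng,
        )
```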
Installed versions of Python3, Docker, and docker-compose (go here for instructions)
Note: We recommend giving Docker at least 8 GB of RAM (on Docker Desktop, go to Settings -> Resources).
You can either [A] use a preprocessed demo with data for Portugal, or [B] build and run the pipelines yourself for whichever country you like.
- From inside the root directory, change directory to `kuwala/scripts`:

  ```bash
  cd kuwala/scripts
  ```
- Build the CLI and Docker images (this may take several minutes).

  For [A], the demo:

  ```bash
  sh build_cli.sh
  ```

  For [B], the individual pipelines:

  ```bash
  sh initialize_components.sh
  ```

- Run the CLI to download and process the data:

  ```bash
  sh run_cli.sh
  ```
WARNING: If you decide to run the google-poi pipeline, the scraper may run for several minutes up to several hours depending on the country. You can always see the requests made by the scraper in the logs of the google-poi-api container in Docker Desktop.
Errors of the following type can be ignored. This is a known bug in the neo4j-pyspark package. The queries are compiled and executed correctly.

```
ERROR SchemaService: Query not compiled because of the following exception:
org.neo4j.driver.exceptions.ClientException: Variable `event` not defined
```
Currently, you can query the graph database directly using Cypher. To launch the Neo4j instance, run the following:

- From inside the root directory, change directory to `kuwala/`:

  ```bash
  cd kuwala/
  ```

- Launch Neo4j:

  ```bash
  docker-compose --profile core up
  ```
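As a sketch of what such a query could look like from Python, the snippet below uses the official `neo4j` driver; the connection URI, credentials, and the node label/property are assumptions, not Kuwala's documented schema:

```python
# Minimal sketch of querying the running Neo4j instance from Python,
# using the official `neo4j` driver. The URI, credentials, and the
# node label/property below are assumptions, not Kuwala's schema.
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    "bolt://localhost:7687", auth=("neo4j", "password")
)

with driver.session() as session:
    # Run an arbitrary Cypher query against the graph.
    result = session.run("MATCH (p:Poi) RETURN p.name AS name LIMIT 10")
    for record in result:
        print(record["name"])

driver.close()
```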
There is already a PR (#55) open for a Jupyter Notebook environment with convenience functions to query and visualize the data.
The best first step to get involved is to join the Kuwala Community on Slack. There we discuss everything related to data integration and new pipelines. Every pipeline will be open-source. Which sources to integrate is decided entirely by you, our community. You can reach out to us on Slack or via email to request a new pipeline or to contribute yourself.
If you want to contribute yourself, you can use the programming language and database technology of your choice. Our only requirements are that the pipeline can be run locally and that it uses Uber's H3 functionality to handle geographical transformations (see the sketch below). We will then take on the responsibility of maintaining your pipeline.
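Assuming a pandas-based pipeline, a contributed geo step satisfying the H3 requirement could look like the following; the function and column names are hypothetical:

```python
# Hypothetical geo step for a contributed pipeline, assuming pandas
# and the h3-py package (v3 API); column names are made up.
import h3
import pandas as pd

def add_h3_index(df: pd.DataFrame, resolution: int = 9) -> pd.DataFrame:
    """Attach an H3 cell ID to every row with a lat/lng pair."""
    df["h3_index"] = df.apply(
        lambda row: h3.geo_to_h3(row["lat"], row["lng"], resolution),
        axis=1,
    )
    return df
```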
Note: To submit a pull request, please fork the project and then submit a PR to the base repo.
By working together as a community of data enthusiasts, we can create a network of seamlessly integrable pipelines. Today, integrating third-party data into applications causes headaches. But together, we will make it straightforward to combine, merge, and enrich data sources for powerful models.
Based on the use-cases we have discussed in the community and with potential users, we have identified a variety of data sources to connect with next:
Data that is already structured but not yet adapted to the Kuwala framework:
- Google Trends - https://github.com/GeneralMills/pytrends
- Instascraper - https://github.com/chris-greening/instascrape
- GDELT - https://www.gdeltproject.org/
- Worldwide Administrative boundaries - https://index.okfn.org/dataset/boundaries/
- Worldwide scaled calendar events (e.g. bank holidays, school holidays) - https://github.com/commenthol/date-holidays
Unstructured data that is converted into structured data:
- Building Footprints from satellite images
Data we would like to integrate, but a scalable approach is still missing:
- Small scale events (e.g., a festival, movie premiere, nightclub events)
To use our published pipelines, clone this repository and navigate to `./kuwala/pipelines`. There is a separate README for each pipeline on how to get started with it.
We currently have the following pipelines published:
- `osm-poi`: Global collection of points of interest (POIs)
- `population-density`: Detailed population and demographic data
- `google-poi`: Scraping API to retrieve POI information from Google (incl. popularity score)