This is an end-to-end data engineering project covering the design and implementation of a data warehouse for the Nigeria Health Facility Registry (HFR), a program of the Nigeria Federal Ministry of Health.
This project encompasses:
- The design of a data ingestion pipeline architecture showing the different Google Cloud services, tools and frameworks used.
- The design of the ERD (Entity Relationship Diagram) of the system and the schema of the final data mart used for reporting.
- The implementation of Data Extraction, Transformation and Loading (ETL) with Google Cloud Integration
- The Automation and Scheduling of the processes
- Basic analytics
The Nigeria Health Facility Registry (HFR) was developed in 2017 as part of efforts to dynamically manage the Master Health Facility List (MFL) in the country. The MFL "is a complete listing of health facilities in a country (both public and private) and is comprised of a set of identification items for each facility (signature domain) and basic information on the service capacity of each facility (service domain)".
The Federal Ministry of Health had previously identified the need for an information system to manage the MFL, given the shortcomings encountered in maintaining an up-to-date paper-based MFL. The benefits of the HFR are numerous, including serving as the hub for connecting different information systems (thereby enabling integration and interoperability), eliminating duplication of health facility lists, and supporting the planning of new health facilities.
This design was made using https://www.app.diagrams.net/.
- The data source is the HFR website
- Data extraction is carried out with the Selenium Python web automation framework (a hedged scraper sketch follows this list)
- Data staging on Google Cloud Storage (GCS)
- Data transformation using Dataproc (PySpark)
- The workflow orchestration tool of choice for scheduling and automation is Cloud Composer
- Google BigQuery (GBQ) serves as the final data warehouse, while Google Colaboratory with Python is used to analyse and visualize the data
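As a rough illustration of the extraction step, here is a minimal sketch of a multithreaded Selenium scraper. The URL pattern, CSS selectors, page count, and worker count are illustrative assumptions, not the actual HFR markup; the real scraper lives in this repository.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Assumed, illustrative URL pattern for the paginated facility listing.
BASE_URL = "https://hfr.health.gov.ng/facilities/search?page={}"


def scrape_page(page_number: int) -> pd.DataFrame:
    """Open one results page in a headless browser and parse its facility table."""
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(BASE_URL.format(page_number))
        soup = BeautifulSoup(driver.page_source, "html.parser")
        rows = [
            [cell.get_text(strip=True) for cell in row.find_all("td")]
            for row in soup.select("table tbody tr")  # assumed table structure
        ]
        return pd.DataFrame(rows)
    finally:
        driver.quit()


def scrape_all(last_page: int, workers: int = 8) -> pd.DataFrame:
    """Scrape pages concurrently; each thread drives its own browser instance."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        frames = list(pool.map(scrape_page, range(1, last_page + 1)))
    return pd.concat(frames, ignore_index=True)


if __name__ == "__main__":
    scrape_all(last_page=10).to_csv("raw_hfr_data.csv", index=False)
```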
The ERD of the system, as shown below, can also be accessed here
After considering the HFR requirements and studying the types and forms of data available on the HFR website:
- I have designed a star schema as the final data mart with six (6) dimension tables, having performed three levels of normalization (1NF, 2NF, 3NF) where applicable.
- The model has also been designed to provide information at all possible grains, i.e. the fact table rows capture the finest level of detail.
- This stage I would say has been the most tasking.
- The Python frameworks and packages leveraged for the web scraping are Selenium, Pandas, NumPy and BeautifulSoup.
- To speed up the web scraping process, multithreading has been employed
- Output of the scraper is saved as `raw_hfr_data.csv`, partitioned by date in Cloud Storage.
- Spark has been largely utilized for the whole ETL process (a hedged PySpark sketch follows this list)
- A doctors.parquet file was written (i.e. data for health institutions that have at least one doctor)
- Data Loading into GBQ has been performed as proposed in the ERD and schema design
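Below is a minimal sketch of what the PySpark transformation and loading steps could look like. The bucket, dataset, table, and column names are placeholder assumptions (the actual schema follows the ERD above), and it assumes the spark-bigquery connector is available on the Dataproc cluster.

```python
from datetime import date

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hfr-etl").getOrCreate()

run_date = date.today().isoformat()
# Assumed staging layout: one dated folder per scraper run.
raw_path = f"gs://hfr-staging-bucket/raw/{run_date}/raw_hfr_data.csv"

# Read the staged CSV produced by the scraper.
raw = spark.read.option("header", True).csv(raw_path)

# Keep facilities reporting at least one doctor and write them back as parquet.
doctors = raw.filter(F.col("number_of_doctors").cast("int") >= 1)
doctors.write.mode("overwrite").parquet(
    f"gs://hfr-staging-bucket/curated/{run_date}/doctors.parquet"
)

# Load the curated data into BigQuery via the spark-bigquery connector.
(doctors.write.format("bigquery")
    .option("table", "my-project.hfr_warehouse.fact_health_facility")  # assumed table
    .option("temporaryGcsBucket", "hfr-staging-bucket")
    .mode("append")
    .save())
```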
Two separate DAGs were written: one for the scraper to run daily, and the other to perform ingestion, transformation, staging, and loading into GBQ.
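For illustration, here is a minimal sketch of how the second (ETL) DAG might look on Cloud Composer; the project ID, cluster name, region, and job URI are placeholder assumptions, not the actual DAG code.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

PYSPARK_JOB = {
    "reference": {"project_id": "my-project"},          # assumed project
    "placement": {"cluster_name": "hfr-etl-cluster"},   # assumed cluster
    "pyspark_job": {"main_python_file_uri": "gs://hfr-staging-bucket/jobs/hfr_etl.py"},
}

with DAG(
    dag_id="hfr_etl",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Submit the PySpark transformation/loading job to Dataproc once the
    # day's scrape has landed in Cloud Storage.
    run_etl = DataprocSubmitJobOperator(
        task_id="run_spark_etl",
        job=PYSPARK_JOB,
        region="europe-west1",
        project_id="my-project",
    )
```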
The data warehouse was queried as shown below to get the names of institutions/health facilities and the number of doctors present:
The results were then exported to Colab for visualization and basic descriptive analysis. The plot below shows the "top ten institutions with the highest number of doctors":
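For reference, here is a hedged sketch of how such a query and plot can be produced from Colab with the BigQuery client; the project, dataset, table, and column names are assumptions rather than the actual warehouse objects.

```python
from google.cloud import bigquery
import matplotlib.pyplot as plt

client = bigquery.Client(project="my-project")  # assumed project ID

QUERY = """
SELECT facility_name, number_of_doctors
FROM `my-project.hfr_warehouse.fact_health_facility`  -- assumed table name
ORDER BY number_of_doctors DESC
LIMIT 10
"""

# Run the query and pull the result into a pandas DataFrame.
top_ten = client.query(QUERY).to_dataframe()

# Horizontal bar chart of the ten facilities with the most doctors.
top_ten.plot.barh(x="facility_name", y="number_of_doctors", legend=False)
plt.xlabel("Number of doctors")
plt.tight_layout()
plt.show()
```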