peskas.kenya.data.pipeline

The goal of peskas.kenya.data.pipeline is to implement, deploy, and execute the data and modelling pipelines that underpin Peskas in Kenya, a partnership between WorldFish and Wildlife Conservation Society.

The pipeline is an R package

peskas.kenya.data.pipeline is structured as an R package because it makes it easier to write production-grade software. Specifically, structuring the code as an R package:

  • lets us better handle system and package dependencies,
  • forces us to split the code into functions,
  • makes it easier to document the code, and
  • makes it easier to test the code.

We make heavy use of tidyverse style conventions and the usethis package to automate tasks during project setup and deployment.

For more information about the rationale for structuring the pipeline as a package, see Chapter 3 of Engineering Production-Grade Shiny Apps. The book focuses on Shiny applications, but the rationale also applies to data pipelines and production-ready code in general.

How the pipeline works

The pipeline is composed of different modules:

  1. Data Collection: On-site fish landing surveys and continuous, solar-powered GPS vessel trackers collect and send data in near real time, alongside fishery metadata, for a thorough data-gathering process.

  2. Pre-processing: Data formatting, shaping, and standardisation to prepare the raw data for analysis.

  3. Validation: Outlier detection and error identification, with an alert system to maintain data quality.

  4. Analytics: Modelling fisheries indicators, nutritional characterization, and data mining to extract valuable insights.

  5. Data export: Automated dissemination of processed and analysed fisheries data to ensure accessibility and comprehension. This involves restructuring data for dashboard integration and open publication.

  6. Visualisation: Tools for data reporting and sharing of insights through a comprehensive, dedicated web app dashboard (not hosted in this repository).

See Peskas: Automated analytics for small-scale, data-deficient fisheries for further details.
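
As a rough illustration of what the validation module's outlier detection and alerting could look like, here is a minimal base R sketch. It uses Tukey's IQR fences, which is an assumption chosen for simplicity and not necessarily the method this pipeline uses. The function name `flag_outliers`, the parameter `k`, and the catch weights are all hypothetical.

```r
# Flag values outside Tukey's fences: [Q1 - k * IQR, Q3 + k * IQR].
# `flag_outliers` and `k` are hypothetical names, not part of the package.
flag_outliers <- function(x, k = 1.5) {
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  # Missing values are left unflagged so they can be handled separately
  !is.na(x) & (x < lower | x > upper)
}

# Made-up catch weights (kg), for illustration only
catch_kg <- c(12, 15, 11, 14, 13, 250, NA, 16)
flags <- flag_outliers(catch_kg)

# A very simple "alert": report which records need manual review
if (any(flags)) {
  message("Records flagged for review: ", paste(which(flags), collapse = ", "))
}
```

In a real pipeline the flagged records would more likely be written to a review table or sent as a notification than printed to the console.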

Getting Started

This package uses a configuration file config.yml to manage environment-specific settings and connections. To get started, familiarize yourself with the package structure, particularly the R directory where the main functions are located.

Each function typically reads the configuration using read_config() to access the parameters it needs. To work on this package locally, you’ll need to set up the required authentication files in the auth/ directory and ensure your environment variables are properly set. Remember to run devtools::load_all() when testing changes locally. If you’re new to R package development, consider reviewing the R Packages book by Hadley Wickham and Jenny Bryan.
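
Before running anything locally, it can help to confirm that the auth/ directory and the environment variables are in place. The base R sketch below shows one way to do that. The helper `check_local_setup` and the variable names passed to it are placeholders invented for this example; the variables the pipeline actually requires are defined by config.yml and are not listed here.

```r
# Report which pieces of a local setup are missing.
# `check_local_setup` and the variable names used below are hypothetical.
check_local_setup <- function(auth_dir = "auth", required_vars = character()) {
  # TRUE for each variable that is set to a non-empty value
  is_set <- vapply(
    required_vars,
    function(v) nzchar(Sys.getenv(v, unset = "")),
    logical(1)
  )
  list(
    auth_dir_exists = dir.exists(auth_dir),
    missing_vars = required_vars[!is_set]
  )
}

status <- check_local_setup(
  auth_dir = "auth",
  required_vars = c("EXAMPLE_STORAGE_KEY", "EXAMPLE_API_TOKEN") # placeholder names
)

if (!status$auth_dir_exists) {
  message("No auth/ directory found in the working directory")
}
if (length(status$missing_vars) > 0) {
  message("Unset environment variables: ", paste(status$missing_vars, collapse = ", "))
}
```

Once both checks pass, `devtools::load_all()` makes the package functions available in the current session.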

Quick Guide for Contributors

To keep our repository clean and efficient, please keep these guidelines in mind:

  • Always work on a new branch, not directly on main.
  • Write clear, concise commit messages.
  • Avoid storing intermediate and garbage files, especially in the root folder.
  • Strive for soft-coded solutions.
  • Maintain consistent code style throughout the project.
  • Document your code well - future you (and others) will thank you.
  • Test your changes thoroughly before submitting a pull request.
  • Keep your fork synced with the main repository.

These practices help us maintain a clean, efficient codebase that’s easier for everyone to work with. For more detailed guidelines, check out our CONTRIBUTING.md file.
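
To make the "soft-coded" guideline above concrete, the sketch below contrasts a threshold written into a function body with the same threshold passed in through a settings list. The function names, the `validation$max_catch_kg` key, and the values are hypothetical, and the `conf` list is only a stand-in for whatever a configuration reader returns.

```r
# Hard-coded: the threshold is buried in the function body
keep_plausible_hard <- function(catch_kg) {
  catch_kg[catch_kg <= 100]
}

# Soft-coded: the threshold is read from a settings list, so it can change
# per environment without touching the function
keep_plausible <- function(catch_kg, conf) {
  catch_kg[catch_kg <= conf$validation$max_catch_kg]
}

# Stand-in for a parsed configuration; the keys are hypothetical
conf <- list(validation = list(max_catch_kg = 100))

kept <- keep_plausible(c(12, 250, 40), conf)
```

Here `kept` holds the values 12 and 40. Raising the limit means editing the configuration rather than the code.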