emeadows/event-pipeline
Event Pipeline (Scala 3 + Cats Effect + fs2)

A small, production-style data engineering pipeline written in Scala 3. The aim is to demonstrate clear functional design, separation of concerns, testing discipline, and general engineering practices.

This project provides:

  • A modern Cats / Cats Effect codebase
  • A streaming pipeline built with fs2
  • Clean domain modelling and algebras
  • Configurable and testable application structure
  • Unit and end-to-end tests
  • A VS Code dev container for consistent tooling

Architecture Overview

This project implements a small, realistic data-engineering pipeline using functional Scala. The pipeline ingests raw CSV transactions, validates the input, aggregates the data, and writes daily metrics to an output file in JSONL format. Everything is built using pure, testable components with clear separation of concerns.

 ┌──────────────────────────┐
 │      CSV File (input)    │
 └───────────────┬──────────┘
                 │
                 v
 ┌────────────────────────────────────┐
 │  FileTransactionSource             │
 │  (stream file line by line)        │
 └──────────────────┬─────────────────┘
                    │
                    v
 ┌────────────────────────────────────┐
 │  parseRow                          │
 │  CSV → RawTransaction              │
 └──────────────────┬─────────────────┘
                    │
                    v
 ┌────────────────────────────────────┐
 │  validateRow                       │
 │  RawTransaction → Transaction      │
 │  (logs errors, drops invalid rows) │
 └──────────────────┬─────────────────┘
                    │
                    v
 ┌────────────────────────────────────┐
 │       Aggregator.aggregate         │
 │   group, sum, count by (customer,  │
 │   date)                            │
 └──────────────────┬─────────────────┘
                    │
                    v
 ┌────────────────────────────────────┐
 │        FileMetricsSink             │
 │   write daily metrics as JSONL     │
 └──────────────────┬─────────────────┘
                    │
                    v
 ┌──────────────────────────┐
 │  Output File (JSONL)     │
 │  daily_metrics.jsonl     │
 └──────────────────────────┘
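The three middle stages of the diagram can be sketched with plain Scala. The case-class fields, CSV column layout, and aggregation output below are illustrative assumptions, not the repository's actual definitions (the real pipeline expresses these steps as fs2 stream stages over Cats Effect `IO`):

```scala
import java.time.LocalDate

// Hypothetical domain types mirroring the diagram's stage names.
final case class RawTransaction(customerId: String, date: String, amount: String)
final case class Transaction(customerId: String, date: LocalDate, amount: BigDecimal)
final case class DailyMetrics(customerId: String, date: LocalDate, total: BigDecimal, count: Int)

// parseRow: CSV line -> RawTransaction (assumed three-column layout).
def parseRow(line: String): Option[RawTransaction] =
  line.split(',') match
    case Array(c, d, a) => Some(RawTransaction(c.trim, d.trim, a.trim))
    case _              => None

// validateRow: RawTransaction -> Transaction; rows that fail to parse are dropped.
def validateRow(raw: RawTransaction): Option[Transaction] =
  for
    date   <- scala.util.Try(LocalDate.parse(raw.date)).toOption
    amount <- scala.util.Try(BigDecimal(raw.amount)).toOption
  yield Transaction(raw.customerId, date, amount)

// Aggregator.aggregate: group, sum, and count by (customer, date).
def aggregate(txs: List[Transaction]): List[DailyMetrics] =
  txs.groupBy(t => (t.customerId, t.date)).toList.map {
    case ((customer, date), group) =>
      DailyMetrics(customer, date, group.map(_.amount).sum, group.size)
  }
```

Because each stage is a pure function, the whole flow composes as `parseRow` then `validateRow` then `aggregate`, and each step can be unit-tested in isolation.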

Getting Started

This project uses a VS Code Dev Container to ensure a consistent, reproducible development environment for all contributors. The container provides:

  • a pre-configured Docker-based image
  • the correct versions of Java, sbt, Scala, and Metals
  • a predictable build environment regardless of host OS
  • isolation from local machine configuration issues

This means every developer runs the same toolchain, avoiding the usual “works on my machine” problems.
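A minimal `devcontainer.json` for a setup like this might look as follows. This is a sketch, not the repository's actual configuration: the image tag and extension list are assumptions, and installing sbt (via a dev container feature or a custom Dockerfile) is omitted here.

```json
{
  "name": "event-pipeline",
  "image": "mcr.microsoft.com/devcontainers/java:21",
  "customizations": {
    "vscode": {
      "extensions": ["scalameta.metals"]
    }
  }
}
```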

Testing

The project includes:

  • Unit tests for:
    • validation logic
    • aggregator business rules
    • parsing behaviour
    • file source and sink

  • Pipeline tests using in-memory stubs
  • End-to-end tests running the real file-based pipeline
  • A dev container for reproducible builds
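The in-memory-stub approach can be illustrated with a small sketch. The trait and class names below are hypothetical stand-ins for the project's actual algebras, and the validation rule is deliberately simplified:

```scala
// Hypothetical source algebra: the file-backed implementation streams a CSV
// file, while this stub serves lines from memory so tests need no I/O.
trait TransactionSource:
  def lines: List[String]

final class InMemorySource(input: List[String]) extends TransactionSource:
  def lines: List[String] = input

// A simplified pipeline step under test: count rows that pass a basic
// shape check (three comma-separated columns).
def countValid(source: TransactionSource): Int =
  source.lines.count(_.split(',').length == 3)
```

Swapping the stub for the real `FileTransactionSource` is then the only difference between the pipeline tests and the end-to-end tests.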

Running the Pipeline

sbt run

or inside the VS Code dev container:

sbt test
sbt run

The output metrics are written to the configured JSONL output file.

Possible Improvements

  • Generalise the pipeline from IO to any F[_]: Async to show deeper FP abstraction.
  • Align naming between the interfaces directory and the algebra package for clarity.
  • Add summary metrics (processed rows, validation failures, dropped lines) for better observability.
  • Make malformed-row handling configurable (drop, halt, or send to a separate sink).
  • Add property-based tests for the Aggregator to demonstrate stronger correctness guarantees.
  • Extend end-to-end tests to cover multiple customers, multiple days, and mixed valid/invalid data.
  • Add a second output sink (for example a snapshot file) to demonstrate fan-out.
  • Include a simple CI pipeline (formatting, tests) for completeness.
