A small, production-style data engineering pipeline written in Scala 3. The aim is to demonstrate clear functional design, separation of concerns, testing discipline, and general engineering practices.
This project provides:
- A modern Cats / Cats Effect codebase
- A streaming pipeline built with fs2
- Clean domain modelling and algebras
- Configurable and testable application structure
- Unit and end-to-end tests
- A VS Code dev container for consistent tooling
The pipeline ingests raw CSV transactions, validates the input, aggregates the results, and writes daily metrics to an output file in JSONL format. Everything is built from pure, testable components with a clear separation of concerns.
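The parse and validate stages can be sketched in plain Scala 3. The type and function names (`RawTransaction`, `Transaction`, `parseRow`, `validateRow`) follow the diagram, but the three-column schema (`customerId,date,amount`) is an assumption standing in for the project's actual layout:

```scala
import java.time.LocalDate
import scala.util.Try

// Assumed schema: customerId,date,amount — illustrative, not the real columns.
final case class RawTransaction(customerId: String, date: String, amount: String)
final case class Transaction(customerId: String, date: LocalDate, amount: BigDecimal)

// CSV line -> RawTransaction; rows with the wrong number of columns are rejected.
def parseRow(line: String): Option[RawTransaction] =
  line.split(",", -1).map(_.trim) match
    case Array(id, date, amount) => Some(RawTransaction(id, date, amount))
    case _                       => None

// RawTransaction -> Transaction; rows failing validation are dropped.
def validateRow(raw: RawTransaction): Option[Transaction] =
  for
    date   <- Try(LocalDate.parse(raw.date)).toOption
    amount <- Try(BigDecimal(raw.amount)).toOption
    if raw.customerId.nonEmpty
  yield Transaction(raw.customerId, date, amount)
```

In the real codebase these functions run inside an fs2 stream (and validation failures are logged rather than silently discarded); `Option` keeps the sketch dependency-free.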
```
     ┌──────────────────────────┐
     │     CSV File (input)     │
     └────────────┬─────────────┘
                  │
                  v
┌────────────────────────────────────┐
│       FileTransactionSource        │
│     (stream file line by line)     │
└─────────────────┬──────────────────┘
                  │
                  v
┌────────────────────────────────────┐
│              parseRow              │
│        CSV → RawTransaction        │
└─────────────────┬──────────────────┘
                  │
                  v
┌────────────────────────────────────┐
│            validateRow             │
│    RawTransaction → Transaction    │
│ (logs errors, drops invalid rows)  │
└─────────────────┬──────────────────┘
                  │
                  v
┌────────────────────────────────────┐
│        Aggregator.aggregate        │
│        group, sum, count by        │
│          (customer, date)          │
└─────────────────┬──────────────────┘
                  │
                  v
┌────────────────────────────────────┐
│          FileMetricsSink           │
│    write daily metrics as JSONL    │
└─────────────────┬──────────────────┘
                  │
                  v
     ┌──────────────────────────┐
     │   Output File (JSONL)    │
     │   daily_metrics.jsonl    │
     └──────────────────────────┘
```
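The aggregation and sink steps above can be sketched library-free. In the real codebase they run inside an fs2 stream; the `DailyMetrics` fields and the JSONL key names below are assumptions, not the project's actual output shape:

```scala
import java.time.LocalDate

final case class Transaction(customerId: String, date: LocalDate, amount: BigDecimal)

// One output row per (customer, date): total amount and transaction count.
final case class DailyMetrics(customerId: String, date: LocalDate, total: BigDecimal, count: Long)

// Group by (customer, date), then sum amounts and count rows per group.
def aggregate(txs: List[Transaction]): List[DailyMetrics] =
  txs
    .groupBy(t => (t.customerId, t.date))
    .map { case ((customer, date), group) =>
      DailyMetrics(customer, date, group.map(_.amount).sum, group.size.toLong)
    }
    .toList
    .sortBy(m => (m.customerId, m.date.toString))

// Render one metrics row as a JSONL line (hypothetical field names).
def toJsonl(m: DailyMetrics): String =
  s"""{"customerId":"${m.customerId}","date":"${m.date}","total":${m.total},"count":${m.count}}"""
```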
This project uses a VS Code Dev Container to ensure a consistent, reproducible development environment for all contributors. The container provides:
- a pre-configured Docker-based image
- the correct versions of Java, sbt, Scala, and Metals
- a predictable build environment regardless of host OS
- isolation from local machine configuration issues

This means every developer runs the same toolchain, avoiding the usual “works on my machine” problems.
The project includes:
- Unit tests for:
  - validation logic
  - aggregator business rules
  - parsing behaviour
  - file source and sink
- Pipeline tests using in-memory stubs
- End-to-end tests running the real file-based pipeline
- A dev container for reproducible builds
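The in-memory-stub approach can be sketched effect-free. The names below (`TransactionSource`, `MetricsSink`, `InMemorySource`, `InMemorySink`) are illustrative stand-ins for the project's actual algebras, with `List` and a buffer standing in for fs2 streams and `IO`:

```scala
// Effect-free stand-ins for the source and sink algebras, so pipeline wiring
// can be exercised without touching the filesystem.
trait TransactionSource:
  def lines: List[String]

trait MetricsSink:
  def write(line: String): Unit

final class InMemorySource(input: List[String]) extends TransactionSource:
  def lines: List[String] = input

final class InMemorySink extends MetricsSink:
  val written = scala.collection.mutable.ListBuffer.empty[String]
  def write(line: String): Unit = written += line

// A toy pipeline: keep well-formed 3-column rows, drop the rest
// (a hypothetical validation rule, just to show the wiring).
def runPipeline(source: TransactionSource, sink: MetricsSink): Unit =
  source.lines
    .filter(_.split(",", -1).length == 3)
    .foreach(sink.write)
```

A test then swaps in the stubs and asserts on the captured output, with no temporary files or real I/O involved.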
Run the pipeline with:

```
sbt run
```

or inside the VS Code dev container:

```
sbt test
sbt run
```
The output metrics are written to the configured JSONL output file.
- Generalise the pipeline from `IO` to any `F[_]: Async` to show deeper FP abstraction.
- Align naming between the `interfaces` directory and the `algebra` package for clarity.
- Add summary metrics (processed rows, validation failures, dropped lines) for better observability.
- Make malformed-row handling configurable (drop, halt, or send to a separate sink).
- Add property-based tests for the `Aggregator` to demonstrate stronger correctness guarantees.
- Extend end-to-end tests to cover multiple customers, multiple days, and mixed valid/invalid data.
- Add a second output sink (for example a snapshot file) to demonstrate fan-out.
- Include a simple CI pipeline (formatting, tests) for completeness.
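For the property-based testing item, one key invariant is easy to state: aggregation must preserve the grand total. A hand-rolled, seeded check sketches the idea (in practice ScalaCheck would generate the cases; all names and the schema here are hypothetical):

```scala
import java.time.LocalDate
import scala.util.Random

// Minimal self-contained model of the aggregator's grouping step.
final case class Transaction(customerId: String, date: LocalDate, amount: BigDecimal)

def totalsByKey(txs: List[Transaction]): Map[(String, LocalDate), BigDecimal] =
  txs.groupBy(t => (t.customerId, t.date))
    .view.mapValues(_.map(_.amount).sum).toMap

// Property: summing the per-key totals recovers the sum of all inputs,
// for randomly generated transaction lists (including the empty list).
def sumPreserved(seed: Long, rounds: Int = 100): Boolean =
  val rnd = new Random(seed)
  (1 to rounds).forall { _ =>
    val txs = List.fill(rnd.nextInt(50)) {
      Transaction(
        customerId = s"c${rnd.nextInt(5)}",
        date       = LocalDate.ofEpochDay(19700 + rnd.nextInt(10)),
        amount     = BigDecimal(rnd.nextInt(1000)) / 100
      )
    }
    totalsByKey(txs).values.sum == txs.map(_.amount).sum
  }
```

ScalaCheck would express the same invariant as a `forAll` property with shrinking on failure, which is what makes the "stronger correctness guarantees" claim concrete.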