
Observation Lakehouse

TL;DR - A data-lakehouse implementation that stores stimulus-response observations from controlled experiments (LASSO arena) and from real-world CI pipelines. It provides the first practical, scalable realization of the continual Stimulus‑Response Cube (SRC, formerly SRH) introduced in our TOSEM article "Morescient GAI for Software Engineering". This work has been accepted to the SANER 2026 Tool Demo Track.

Overview

Observation Lakehouse is a Python library that:

  • Ingests raw LASSO arena runs (or any CI-pipeline output) into three Iceberg-managed tables - observations, tests, and code_implementations.
  • Stores the data as partitioned Parquet files (data_set_id, problem_id) and keeps full Iceberg metadata for ACID guarantees and schema evolution.
  • Provides a single-query interface (DuckDB + Arrow) to materialise:
    • SRM output views (the "stimulus-response matrix" for a given coding problem),
    • Behavioural clustering (equivalence classes of implementations),
    • Consensus oracles (majority voting),
    • Three-way joins across all three tables.
  • Achieves ~155k records/s ingestion and sub-100-ms interactive query latency on a local laptop.

The repository contains everything needed to reproduce the performance assessment and to extend the platform with new datasets, query types, and other ideas.

Project Structure

observation-lakehouse/
├── olake/                         # Main Python package
│   ├── ingest/                    # Data-ingestion helpers
│   │   ├── arena.py               # Utilities to ingest data from LASSO arena
│   ├── lakehouse.py               # Main lakehouse implementation
├── notebooks/                     # Jupyter Notebooks
│   ├── analysis.ipynb             # End-to-end query walkthrough
│   └── benchmark_stats.ipynb      # Statistics for performance evaluation
├── benchmark_*.py                 # Replication scripts for the three performance assessments
├── lasso_arena_ingest.py          # Script to import data from a LASSO run (includes a simple ingestion-time tracker)
├── pyproject.toml                 # Project dependencies (managed by uv)
└── README.md                      # This file

All notebooks and benchmark scripts are self-contained - they spin up a temporary DuckDB connection to the data in warehouse/.

Setup Instructions

Prerequisites

| Tool   | Minimum version                       |
|--------|---------------------------------------|
| Python | 3.12 (or newer)                       |
| Git    | any recent version                    |
| uv     | fast dependency manager (recommended) |

Installation

This project uses uv for dependency management:

# Install uv
# https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone & enter the repo
git clone https://github.com/SoftwareObservatorium/observation-lakehouse.git
cd observation-lakehouse

# Resolve & install all dependencies
uv sync                # creates .venv and writes a lockfile
source .venv/bin/activate   # Linux/macOS
# .venv\Scripts\activate    # Windows

Importing the LASSO Replication Dataset

The dataset used in the paper (~1.8GiB) is provided as a single zip containing raw exported data from a recent test-driven software experiment with LASSO (an assessment of four code LLMs on two benchmarks: HumanEval and MBPP).

  1. Download the replication package
    👉 https://zenodo.org/records/17791444
  2. Unzip its contents directly into the repository root.
  3. Run the ingestion script to create the Iceberg catalog, load the tables and ingest the dataset above:
# depending on your local setup, run "lasso_arena_ingest.py" directly
python3 lasso_arena_ingest.py

# or using 'uv'
uv run lasso_arena_ingest.py

What happens?

  • arena.py parses the JSON-encoded tests and implementations, and reads the raw observation data from the large Parquet files generated by the LASSO arena.
  • lakehouse.py writes the data as partitioned Parquet files and creates the accompanying Iceberg metadata under warehouse/db/.

After completion you will see the following layout based on Hive partitioning:

observation-lakehouse/
├─ warehouse/
│  └─ db/
│     ├─ observations/
│     │   ├─ data/ # partitioned by data_set_id and problem_id
│     │   │   ├─ data_set_id=HumanEval/
│     │   │   │   └─ problem_id=HumanEval_0_has_close_elements/
│     │   │   │        └─ *.parquet # SRM for problem
│     │   │   └─ ...
│     │   └─ metadata/          # Iceberg snapshots, manifests, ...
│     ├─ code_implementations/
│     │   └─ (same partitioning as observations)
│     └─ tests/
│         └─ (same partitioning as observations)
└─ iceberg_catalog.db            # SQLite-based Iceberg catalog used by DuckDB

Note - The data_set_id and problem_id partitions enable partition pruning, which is the main reason for the sub-100-ms query latencies reported.

Reproducing the Performance Benchmarks

All benchmark scripts follow the naming pattern benchmark_*.py. They report the numbers that appear in Table 2 of the paper (average latency per problem, cold-cache run).

# SRM output view reconstruction (Q1)
uv run benchmark_srm_output_view.py

# Example: behavioural clustering (Q2 in the paper)
uv run benchmark_behavioral_clustering.py

# Full three-way join (Q3)
uv run benchmark_three_way_join.py

Each script produces a CSV. The notebook notebooks/benchmark_stats.ipynb computes descriptive statistics from these files.
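Summarising such a CSV takes only a few lines of pandas. The column names used here (`problem_id`, `latency_ms`) are illustrative assumptions; check the actual headers emitted by the benchmark_*.py scripts:

```python
import io

import pandas as pd

# A stand-in for one benchmark CSV; in practice, pass the file path
# produced by a benchmark_*.py run to pd.read_csv instead.
csv = io.StringIO(
    "problem_id,latency_ms\n"
    "HumanEval_0,42.0\n"
    "HumanEval_1,55.5\n"
    "MBPP_1,61.5\n"
)
df = pd.read_csv(csv)

# Descriptive statistics over per-problem query latencies.
stats = df["latency_ms"].agg(["mean", "median", "max"])
print(stats)
```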

Querying the Observation Lakehouse

The analysis notebook (notebooks/analysis.ipynb) walks you through the exact SQL statements:

| Section                | Goal                                                                   |
|------------------------|------------------------------------------------------------------------|
| SRM output view        | Re-creates the SRM output view                                          |
| Behavioural clustering | Groups implementations by identical output trace                        |
| Three‑way join         | Joins observations, tests, code_implementations                         |
| Consensus oracle       | Computes the majority output per test case (see Behavioural clustering) |

To run the notebook:

uv run jupyter-lab

# In the browser, open notebooks/analysis.ipynb and follow the markdown cells.

All queries are pure SQL (DuckDB).

Adding a New Dataset

  1. Analyze the table schemata in olake/lakehouse.py
  2. Write an ingestion driver in olake/ingest/ that yields rows for the three tables (observations, tests, code_implementations). The existing arena.py can be used as a template.
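A new driver boils down to a generator per table. The function name, report structure, and field names below are hypothetical; mirror the real schemata in olake/lakehouse.py and the structure of olake/ingest/arena.py:

```python
from typing import Iterator


def iter_observations(report: dict) -> Iterator[dict]:
    """Yield one row per (implementation, test) stimulus-response pair.

    Hypothetical skeleton: `report` is whatever your CI pipeline exports;
    the emitted keys must match the `observations` schema in lakehouse.py.
    """
    for impl in report["implementations"]:
        for test in report["tests"]:
            yield {
                "data_set_id": report["data_set_id"],
                "problem_id": report["problem_id"],
                "implementation_id": impl["id"],
                "test_id": test["id"],
                "output": impl["outputs"][test["id"]],
            }


# Minimal usage example with a fabricated report.
report = {
    "data_set_id": "MyCI",
    "problem_id": "p1",
    "implementations": [{"id": "i1", "outputs": {"t1": "42"}}],
    "tests": [{"id": "t1"}],
}
rows = list(iter_observations(report))
print(rows)
```

Analogous generators for the tests and code_implementations tables complete the driver.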

Known Limitations

  • Serialization - The current JSON-based stimulus/response format cannot natively represent binary streams or non-JSON-friendly objects (e.g., file handles).
  • Equivalence checking - Complex values (e.g., exceptions with different stack traces) are treated as different unless preprocessed, as done in LASSO's Sequence Sheet protocol.
  • Real-world CI ingestion - The provided driver only supports the LASSO arena. Test-driver extension prototypes for JUnit5 and PyTest are under development and will be made available in a future release.

All of these will be addressed in future releases.

Contributing

We welcome community contributions:

  1. Fork the repository.
  2. Create a feature branch (git checkout -b feature/my-new-driver).
  3. Submit a Pull Request with a clear description and, if applicable, updated benchmark numbers.

Citation

If you use Observation Lakehouse in your research, please cite our tool paper (SANER 2026 Tool Demo Track) and/or the work that introduces the continual SRC:

@misc{kessel2025observationlakehouseslivinginteractive,
      title={Towards Observation Lakehouses: Living, Interactive Archives of Software Behavior}, 
      author={Marcus Kessel},
      year={2025},
      eprint={2512.02795},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2512.02795}, 
}


@article{kessel2025,
author = {Kessel, Marcus and Atkinson, Colin},
title = {Morescient {GAI} for Software Engineering},
year = {2025},
issue_date = {June 2025},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {34},
number = {5},
issn = {1049-331X},
url = {https://doi.org/10.1145/3709354},
doi = {10.1145/3709354},
journal = {ACM Trans. Softw. Eng. Methodol.},
month = may,
articleno = {123},
numpages = {17},
}

A pre-print is available on arXiv.

