TL;DR - A data-lakehouse implementation that stores stimulus-response observations from controlled experiments (LASSO arena) and from real-world CI pipelines. It provides the first practical, scalable realization of the continual Stimulus‑Response Cube (SRC, formerly SRH) introduced in our TOSEM article "Morescient GAI for Software Engineering". This work has been accepted to the SANER 2026 Tool Demo Track.
Observation Lakehouse is a Python library that:
- Ingests raw LASSO arena runs (or any CI-pipeline output) into three Iceberg-managed tables: `observations`, `tests`, and `code_implementations`.
- Stores the data as partitioned Parquet files (partitioned by `data_set_id` and `problem_id`) and keeps full Iceberg metadata for ACID guarantees and schema evolution.
- Provides a single-query interface (DuckDB + Arrow) to materialise:
- SRM output views (the "stimulus-response matrix" for a given coding problem),
- Behavioural clustering (equivalence classes of implementations),
- Consensus oracles (majority voting),
- Three-way joins across all three tables.
- Achieves ~155k records/s ingestion and sub-100-ms interactive query latency on a local laptop.
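To make the SRM output view concrete, here is a minimal pure-Python sketch that pivots flat observation records into a stimulus-response matrix (rows = tests, columns = implementations). The field names (`test_id`, `implementation_id`, `output`) are illustrative assumptions; the actual schemata live in `olake/lakehouse.py`, and the real views are materialised via DuckDB SQL.

```python
# Illustrative sketch only: pivots flat observation rows into an
# SRM-style matrix. Field names are assumptions, not the real schema.
from collections import defaultdict

def build_srm(observations):
    """Pivot observation rows into {test_id: {implementation_id: output}}."""
    srm = defaultdict(dict)
    for obs in observations:
        srm[obs["test_id"]][obs["implementation_id"]] = obs["output"]
    return dict(srm)

observations = [
    {"test_id": "t1", "implementation_id": "impl_a", "output": "True"},
    {"test_id": "t1", "implementation_id": "impl_b", "output": "False"},
    {"test_id": "t2", "implementation_id": "impl_a", "output": "3"},
]

srm = build_srm(observations)
# srm["t1"] maps each implementation to its observed output for test t1
```

In the lakehouse itself this pivot is expressed as a single DuckDB SQL query over the partitioned `observations` table rather than in Python.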
The repository contains everything needed to reproduce the performance assessment and to extend the platform with new datasets, ingestion drivers, and analyses.
observation-lakehouse/
├── olake/ # Main Python package
│ ├── ingest/ # Data-ingestion helpers
│ │ ├── arena.py # Utilities to ingest data from LASSO arena
│ ├── lakehouse.py # Main lakehouse implementation
├── notebooks/ # Jupyter Notebooks
│ ├── analysis.ipynb # End-to-end query walkthrough
│ └── benchmark_stats.ipynb # Statistics for performance evaluation
├── benchmark_*.py # replication scripts for the three performance assessments
├── lasso_arena_ingest.py # script to import data from a LASSO run (also contains a simple timer for ingestion times)
├── pyproject.toml # Project dependencies (managed by uv)
└── README.md # This file
All notebooks and benchmark scripts are self-contained - they spin up a temporary DuckDB connection to the data in warehouse/.
| Tool | Minimum version |
|---|---|
| Python | 3.12 (or newer) |
| Git | any recent version |
| uv | fast dependency manager (recommended) |
This project uses uv for dependency management:
# Install uv
# https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone & enter the repo
git clone https://github.com/SoftwareObservatorium/observation-lakehouse.git
cd observation-lakehouse
# Resolve & install all dependencies
uv sync # creates .venv and writes a lockfile
source .venv/bin/activate # Linux/macOS
# .venv\Scripts\activate   # Windows

The dataset used in the paper (~1.8 GiB) is provided as a single zip containing raw exported data from a recent test-driven software experiment with LASSO (an assessment of four code LLMs on two benchmarks: HumanEval and MBPP):
- Download the replication package: 👉 https://zenodo.org/records/17791444
- Unzip its contents directly into the repository root.
- Run the ingestion script to create the Iceberg catalog, load the tables, and ingest the dataset above:
# depending on your local setup, run "lasso_arena_ingest.py" directly
python3 lasso_arena_ingest.py
# or using 'uv'
uv run lasso_arena_ingest.py

What happens?
`arena.py` parses the JSON-encoded tests and implementations, and reads the raw observation data from large Parquet files generated by the LASSO arena. `lakehouse.py` writes the data as partitioned Parquet files and creates the accompanying Iceberg metadata under `warehouse/db/`.
After completion you will see the following layout, based on Hive partitioning:
observation-lakehouse/
├─ warehouse/
│ └─ db/
│ ├─ observations/
│ │ ├─ data/ # partitioned by data_set_id and problem_id
│ │ │ ├─ data_set_id=HumanEval/
│ │ │ │ └─ problem_id=HumanEval_0_has_close_elements/
│ │ │ │ └─ *.parquet # SRM for problem
│ │ │ └─ ...
│ │ └─ metadata/ # Iceberg snapshots, manifests, ...
│ ├─ code_implementations/
│ │ └─ (same partitioning as observations)
│ └─ tests/
│ └─ (same partitioning as observations)
└─ iceberg_catalog.db # SQLite-based Iceberg catalog used by DuckDB
Note - The `data_set_id` and `problem_id` partitions allow partition pruning, which is the core reason for the sub-100 ms query latencies reported.
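The pruning effect is easy to see from the directory layout alone. The following stdlib-only sketch rebuilds a miniature Hive-partitioned tree and shows that an equality filter on the partition columns maps directly to a single directory, which is what lets DuckDB skip every other Parquet file (the directory names mirror the `warehouse/db/` layout above; the temp-dir setup is purely illustrative):

```python
# Sketch of why Hive partitioning enables pruning: a filter on
# data_set_id/problem_id maps directly to a directory path, so only
# matching Parquet files are ever touched. Layout mirrors warehouse/db/.
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp()) / "observations" / "data"
for ds, pid in [("HumanEval", "HumanEval_0_has_close_elements"),
                ("HumanEval", "HumanEval_1_separate_paren_groups"),
                ("MBPP", "MBPP_2")]:
    part = root / f"data_set_id={ds}" / f"problem_id={pid}"
    part.mkdir(parents=True)
    (part / "part-0.parquet").touch()  # stand-in for real Parquet data

# A query with
#   WHERE data_set_id = 'HumanEval'
#     AND problem_id = 'HumanEval_0_has_close_elements'
# reduces to reading exactly one partition directory:
pruned = list(root.glob(
    "data_set_id=HumanEval/problem_id=HumanEval_0_has_close_elements/*.parquet"))
print(len(pruned))  # 1 file touched instead of scanning all three partitions
```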
All benchmark scripts are located in benchmark_*.py. They report the numbers that appear in Table 2 of the paper (average latency per problem, cold-cache run).
# SRM output view reconstruction (Q1)
uv run benchmark_srm_output_view.py
# Example: behavioural clustering (Q2 in the paper)
uv run benchmark_behavioral_clustering.py
# Full three-way join (Q3)
uv run benchmark_three_way_join.py

Each script produces a CSV. The notebook in notebooks/benchmark_stats.ipynb can be used to compute descriptive statistics for these results.
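If you prefer not to open the notebook, descriptive statistics can be computed from a benchmark CSV with the standard library alone. The column name `latency_ms` is an assumption for illustration; check the headers the benchmark scripts actually emit.

```python
# Sketch: descriptive statistics over a benchmark CSV using only the
# standard library. The "latency_ms" column name is an assumption; the
# in-memory CSV stands in for a file produced by benchmark_*.py.
import csv
import io
import statistics

sample_csv = io.StringIO(
    "problem_id,latency_ms\n"
    "HumanEval_0,42.1\n"
    "HumanEval_1,55.3\n"
    "HumanEval_2,48.0\n"
)

latencies = [float(row["latency_ms"]) for row in csv.DictReader(sample_csv)]
mean_ms = statistics.mean(latencies)
median_ms = statistics.median(latencies)
print(f"mean={mean_ms:.1f} ms, median={median_ms:.1f} ms")
```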
The analysis notebook (notebooks/analysis.ipynb) walks you through the exact SQL statements:
| Section | Goal |
|---|---|
| SRM output view | Re-creates the SRM output view |
| Behavioural clustering | Groups implementations by identical output trace |
| Three‑way join | Joins observations, tests, code_implementations |
| Consensus oracle | Computes the majority output per test case (see Behavioural clustering) |
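The notebook expresses these analyses in DuckDB SQL; the core logic of the last two sections can be sketched in a few lines of pure Python. The table below the comments shows the idea: implementations with identical output traces fall into one equivalence class, and the consensus oracle is the majority output per test. Identifiers and field names here are assumptions for illustration, not the real schema.

```python
# Pure-Python sketch of behavioural clustering and the consensus oracle.
# Keys/fields are illustrative assumptions; the notebook does this in SQL.
from collections import Counter, defaultdict

# (test_id, implementation_id) -> observed output, as in `observations`
obs = {
    ("t1", "a"): "1", ("t1", "b"): "1", ("t1", "c"): "2",
    ("t2", "a"): "x", ("t2", "b"): "x", ("t2", "c"): "x",
}
tests = ["t1", "t2"]
impls = ["a", "b", "c"]

# Behavioural clustering: identical output trace => same equivalence class
clusters = defaultdict(list)
for impl in impls:
    trace = tuple(obs[(t, impl)] for t in tests)
    clusters[trace].append(impl)

# Consensus oracle: majority output per test case
consensus = {t: Counter(obs[(t, i)] for i in impls).most_common(1)[0][0]
             for t in tests}
```

Here `a` and `b` share the trace `("1", "x")` and cluster together, while `c` is alone; the majority outputs are `"1"` for `t1` and `"x"` for `t2`.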
To run the notebook:
uv run jupyter-lab
# In the browser, open notebooks/analysis.ipynb and follow the markdown cells.

All queries are pure SQL (DuckDB).
- Analyze the table schemata in `olake/lakehouse.py`.
- Write an ingestion driver in `olake/ingest/` that yields rows for the three tables (`observations`, `tests`, `code_implementations`). The existing `arena.py` can be used as a template.
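To show roughly what such a driver looks like, here is a hypothetical sketch that reads JSON run reports and yields `(table_name, row)` pairs for the three tables. The function name, row fields, and generator-based protocol are all assumptions inferred from the description above; consult `olake/ingest/arena.py` and `olake/lakehouse.py` for the actual interface and schemata.

```python
# Hypothetical ingestion-driver sketch: yields row dicts for the three
# tables. Field names and the generator protocol are assumptions; see
# olake/ingest/arena.py and olake/lakehouse.py for the real interface.
import json
import tempfile
from pathlib import Path

def ingest_ci_reports_stub(report_dir: str):
    """Yield (table_name, row) pairs from a hypothetical CI report dump."""
    for report in sorted(Path(report_dir).glob("*.json")):
        run = json.loads(report.read_text())
        for case in run.get("cases", []):
            yield "tests", {
                "data_set_id": run["data_set_id"],
                "problem_id": run["problem_id"],
                "test_id": case["test_id"],
            }
            yield "observations", {
                "data_set_id": run["data_set_id"],
                "problem_id": run["problem_id"],
                "test_id": case["test_id"],
                "implementation_id": run["implementation_id"],
                "output": case["output"],
            }
        yield "code_implementations", {
            "data_set_id": run["data_set_id"],
            "problem_id": run["problem_id"],
            "implementation_id": run["implementation_id"],
            "source": run["source"],
        }

# Tiny smoke test with a temporary report directory
tmp = Path(tempfile.mkdtemp())
(tmp / "run0.json").write_text(json.dumps({
    "data_set_id": "HumanEval", "problem_id": "HumanEval_0",
    "implementation_id": "impl_a", "source": "def f(): ...",
    "cases": [{"test_id": "t1", "output": "True"}],
}))
rows = list(ingest_ci_reports_stub(str(tmp)))
```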
- Serialization - The current JSON-based stimulus/response format cannot natively represent binary streams or non-JSON-friendly objects (e.g., file handles).
- Equivalence checking - Complex values (e.g., exceptions with different stack traces) are treated as different unless preprocessed, as we do in LASSO's Sequence Sheet protocol.
- Real-world CI ingestion - The provided driver only supports the LASSO arena. Test-driver extension prototypes for JUnit 5 and pytest are under development and will be made available in a future release.
All of these will be addressed in future releases.
We welcome community contributions:
- Fork the repository.
- Create a feature branch (`git checkout -b feature/my-new-driver`).
- Submit a Pull Request with a clear description and, if applicable, updated benchmark numbers.
If you use Observation Lakehouse in your research, please cite our tool paper (SANER 2026 Tool Demo Track) and/or the work that introduces the continual SRC:
@misc{kessel2025observationlakehouseslivinginteractive,
title={Towards Observation Lakehouses: Living, Interactive Archives of Software Behavior},
author={Marcus Kessel},
year={2025},
eprint={2512.02795},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2512.02795},
}
@article{kessel2025,
author = {Kessel, Marcus and Atkinson, Colin},
title = {Morescient {GAI} for Software Engineering},
year = {2025},
issue_date = {June 2025},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {34},
number = {5},
issn = {1049-331X},
url = {https://doi.org/10.1145/3709354},
doi = {10.1145/3709354},
journal = {ACM Trans. Softw. Eng. Methodol.},
month = may,
articleno = {123},
numpages = {17},
}

A pre-print is available on arXiv.
- Project Lead - Marcus Kessel (marcus.kessel@uni-mannheim.de)
- GitHub - https://github.com/SoftwareObservatorium/observation-lakehouse
- Issue Tracker - Use the GitHub Issues tab for bugs, feature requests, or questions.