Distributed coffee shop data analysis system using Docker, RabbitMQ, and Go.
| Name | Padrón | Email |
|---|---|---|
| Castro Martinez, Jose Ignacio | 106957 | jcastrom@fi.uba.ar |
| Diem, Walter Gabriel | 105618 | wdiem@fi.uba.ar |
| Gestoso, Ramiro | 105950 | rgestoso@fi.uba.ar |
You don't need Go installed to run the project, since all the nodes are Dockerized. Go is only required on the local machine for certain tasks (e.g., running tests natively or running `go mod tidy` manually).
You need to have the following dependencies installed:
- Docker: to run each node of the system in an isolated container with all of its dependencies.
- Docker Compose v2: to orchestrate and simplify the startup of the system and configure the environment variables properly. You can distinguish v1 from v2 by the command name: `docker-compose` is v1 and `docker compose` is v2.
- Make: to simplify and automate the commands to run.
- Python v3.12+: to run the end2end tests that exercise the entire system. A basic version of Python is needed as well to generate the Docker Compose YAML file.
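Before booting the system, a short script can verify that the required tools are on the PATH. This is an illustrative helper, not part of the repository:

```python
# Illustrative dependency check (not part of the repository): verify
# that the tools this README requires are available on the PATH.
import shutil


def find_missing(tools):
    """Return the subset of tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(["docker", "make", "python3"])
    if missing:
        print("Missing required tools:", ", ".join(missing))
    else:
        print("All required tools found")
```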
The system is designed to work with the following Kaggle dataset: https://www.kaggle.com/datasets/geraldooizx/g-coffee-shop-transaction-202307-to-202506/data, from now on referred to as the full dataset.
Some info regarding the data:
This dataset provides a synthetically generated, comprehensive record of coffee shop transactions spanning July 2023 to June 2025. It is specifically designed to simulate the crucial period following the launch of a new customer membership program and mobile application, offering a unique lens into the evolving dynamics of customer engagement and purchasing behavior.
For testing purposes, a reduced set of data is provided containing all necessary metadata (menu_items, payment_methods, stores, vouchers and users) and a subset of transactions and transaction_items. This dataset will be called the reduced dataset:
.
├── menu_items.csv
├── payment_methods.csv
├── stores.csv
├── transaction_items_202401.csv
├── transaction_items_202501.csv
├── transactions_202401.csv
├── transactions_202501.csv
├── users_202307.csv
├── users_202308.csv
├── users_202309.csv
├── users_202310.csv
├── users_202311.csv
├── users_202312.csv
├── users_202401.csv
├── users_202402.csv
├── users_202403.csv
├── users_202404.csv
├── users_202405.csv
├── users_202406.csv
├── users_202407.csv
├── users_202408.csv
├── users_202409.csv
├── users_202410.csv
├── users_202411.csv
├── users_202412.csv
├── users_202501.csv
├── users_202502.csv
├── users_202503.csv
├── users_202504.csv
├── users_202505.csv
├── users_202506.csv
└── vouchers.csv
- `./gen.sh`: Provides a user-friendly interface to the Python Docker Compose generator. It calls the Python script with the provided arguments and interprets the exit codes to give clear feedback, generating a Docker Compose YAML file with the number of nodes passed as arguments.

Usage:

./gen.sh <output_file> <num_clients> <num_filters_by_year> <num_filters_by_hour> <num_filters_by_amount> <num_group_by_year_month> <num_group_by_semester> <num_join_items> <num_join_store> <num_topk>
Usage example:
./gen.sh docker-compose-dev.yaml 1 1 1 1 1 1 1 1 1
Expected output:
Compose file 'docker-compose-dev.yaml' generated with:
 - Clients: 1
 - Filters by Year: 1
 - Filters by Hour: 1
 - Filters by Amount: 1
 - Group by Year: 1
 - Group by Semester: 1
 - Join Items: 1
 - Join Store: 1
 - Join Users: 1
 - Top K: 1
✅ docker compose file generated successfully

If you want to generate a compose file with the crasher enabled, you can use the following command:
CRASHER_ENABLED=true ./gen.sh docker-compose-dev.yaml 1 1 1 1 1 1 1 1 1
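Conceptually, a generator like this expands each node count into numbered service entries. The sketch below is a simplified illustration; the service names and dictionary layout are assumptions, not the real generator's output:

```python
# Simplified sketch of how a compose generator can expand a node count
# into numbered service entries (not the real generator).
def expand_services(name, count, image):
    """Map e.g. ("filter-year", 2) to services filter-year1, filter-year2."""
    return {f"{name}{i}": {"image": image} for i in range(1, count + 1)}


services = {}
services.update(expand_services("filter-year", 2, "filter-year:latest"))
services.update(expand_services("client", 1, "client:latest"))
print(sorted(services))
```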
- `make up`: Start all services with rebuild. Uses `docker-compose-dev.yaml` by default and runs containers in detached mode with the `--build` flag.
- `make down`: Stop and remove all services. Gracefully stops containers with a 3s timeout, then removes them.
- `make logs`: View real-time logs from services. Follows log output continuously.
  - `make logs`: all services (default)
  - `make logs no-rabbit`: all services except rabbitmq
  - `make logs only-rabbit`: only the rabbitmq service
./gen.sh docker-compose-dev.yaml 1 1 1 1 1 1 1 1 1

(or any other node number combination)

make up

docker compose -f ./docker-compose-dev.yaml logs client1 --follow

This boots the whole system and tails the live logs coming from client1.
- `make test`: Run tests for common/middleware using Docker (no Go installation required). Runs containerized tests with testcontainers support and includes coverage reporting.
- `make test-v`: Same as `make test` but with detailed test output.
- `make raw-test`: Run tests directly with Go (requires Go installation). Runs tests for common/middleware and filters/lib.
- `make raw-test-v`: Same as `make raw-test` but with detailed test output.
Note: Containerized tests (`make test`) run in a Docker container and require Docker socket access for testcontainers. Raw tests (`make raw-test`) require Go to be installed locally.
- `make init-env`: Initialize the Python virtual environment and install dependencies. Creates a `.venv` directory and installs packages from `tests/requirements.txt`.
- `make activate-env`: Display instructions to activate the virtual environment. Prints the command to run: `source .venv/bin/activate`
- `make deactivate-env`: Display instructions to deactivate the virtual environment. Prints the command to run: `deactivate`
- `make pytest`: Run Python tests using pytest. Runs all tests in the `tests/` directory.
- `make pytest-verbose`: Same as `make pytest` but with `-v -s` flags for detailed output.
Note: Python testing commands are used for full client execution end-to-end tests.
`make pytest-verbose` runs a series of tests that exercise the system end to end, regenerating the docker-compose YAML files to try different node and client combinations. The tests are:
- `test_server_with_one_node_each`: 1 client, 1 node in each pipeline stage
- `test_server_with_two_nodes_each_full`: 1 client, 2 nodes in each pipeline stage
- `test_two_clients_one_each`: 2 clients, 1 node in each pipeline stage
- `test_three`: 3 clients, 3 nodes in each pipeline stage
- `test_five`: 5 clients, 5 nodes in each pipeline stage
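The scenarios can be summarized as (clients, nodes per stage) pairs. The helper below is hypothetical, written for illustration only, and is not code from the test suite:

```python
# Hypothetical summary of the end-to-end scenarios (not test-suite code):
# each test regenerates the compose file for a (clients, nodes) pair.
SCENARIOS = {
    "test_server_with_one_node_each": (1, 1),
    "test_server_with_two_nodes_each_full": (1, 2),
    "test_two_clients_one_each": (2, 1),
    "test_three": (3, 3),
    "test_five": (5, 5),
}


def describe(name):
    clients, nodes = SCENARIOS[name]
    return f"{clients} client(s), {nodes} node(s) per pipeline stage"


print(describe("test_five"))
```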
First, you need the dataset in a directory called `testData` in the root of the project. The actual client results are then compared against the expected results for that dataset, which consists of ~30% of the full dataset and is composed of the following files:
.
├── menu_items.csv
├── payment_methods.csv
├── stores.csv
├── transaction_items_202401.csv
├── transaction_items_202402.csv
├── transaction_items_202403.csv
├── transaction_items_202404.csv
├── transaction_items_202501.csv
├── transaction_items_202502.csv
├── transaction_items_202503.csv
├── transaction_items_202504.csv
├── transactions_202401.csv
├── transactions_202402.csv
├── transactions_202403.csv
├── transactions_202404.csv
├── transactions_202501.csv
├── transactions_202502.csv
├── transactions_202503.csv
├── transactions_202504.csv
├── users_202307.csv
├── users_202308.csv
├── users_202309.csv
├── users_202310.csv
├── users_202311.csv
├── users_202312.csv
├── users_202401.csv
├── users_202402.csv
├── users_202403.csv
├── users_202404.csv
├── users_202405.csv
├── users_202406.csv
├── users_202407.csv
├── users_202408.csv
├── users_202409.csv
├── users_202410.csv
├── users_202411.csv
├── users_202412.csv
├── users_202501.csv
├── users_202502.csv
├── users_202503.csv
├── users_202504.csv
├── users_202505.csv
├── users_202506.csv
└── vouchers.csv
The expected results are located at ./tests/expected_results.
1. `make init-env`
2. `make activate-env`: it will prompt you to run:
   `source .venv/bin/activate`
3. `export REPO_PATH=$(pwd)`
4. `make pytest-verbose`

The tests run on the ~30% dataset, so it may take a while for all the tests to finish. The results taken directly from the clients' output files are compared to the expected results for these tests.
This section describes manual testing procedures to verify the system's functionality, particularly focusing on client restart capabilities and result consistency.
This test verifies that the system produces consistent results when a client is restarted after initial processing.
Prerequisites:
- Ensure the `testData` directory contains the reduced dataset (see the Dataset section)
- Ensure expected results are available in `./tests/expected_results/`
Test Steps:
1. Start the normal system:

   make up

2. Wait for client 1 to complete processing. Monitor the logs to ensure client 1 has finished processing all data:

   docker compose -f ./docker-compose-dev.yaml logs client1 --follow

3. Compare results with the expected output. Run the comparison script to verify correctness:

   python3 ./scripts/compare_results.py 1

   The expected output should show all results matching (✅ indicators for each query).

4. While the rest of the system is up, restart client 1 using the `bootc.sh` (from "boot client") script to start a standalone client with the test data:

   ./bootc.sh 1 ./testData

5. Verify result consistency. Run the comparison script again to ensure the restarted client produces identical results:

   python3 ./scripts/compare_results.py 1
Expected Behavior:
- Initial system run should produce correct results matching expected output
- Restarted client should produce identical results, confirming system consistency
- All comparison outputs should show ✅ for successful matches
Warning: use with caution.

- `make clean`: Basic cleanup (containers + unused images)
- `make clean-containers`: Remove all stopped containers
- `make clean-images`: Remove unused Docker images only
- `make clean-all-images`: Remove ALL Docker images (use with caution)
- `make clean-system`: Complete system cleanup including volumes. Removes everything: containers, images, volumes, networks.
In both the root and scripts directories, there are tools that allow testing of the system.
The Chaos Monkey is a fault injection tool designed to test the system's resilience by randomly terminating containers during execution. This tool helps validate the system's fault tolerance capabilities by simulating real-world failures and ensuring the system can recover gracefully.
The chaos monkey script (chaos_monkey.sh) can be used to randomly kill containers while the system is running, allowing you to observe how the system handles unexpected failures and validates the robustness of the distributed processing pipeline.
In order to use this tool, you should run the system first and then invoke the script with:
./chaos_monkey.sh <docker compose file> <amount of rounds> [optional: time between attacks]
./chaos_monkey.sh docker-compose-dev.yaml 5
The default time between attacks is 15 seconds.
This script will attack filters, groupers, and joins of all types.
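The core of such a tool can be sketched as one random pick per round. This is an illustrative sketch with made-up service names, not the real chaos_monkey.sh:

```python
# Illustrative sketch of a chaos-monkey round loop (not the real
# chaos_monkey.sh): pick a random eligible container each round.
import random

# Containers a fault-injection tool would typically spare.
PROTECTED = {"rabbitmq", "client1"}


def pick_victims(containers, rounds, rng):
    """Pick one victim per round; the real script would `docker kill` it."""
    eligible = [c for c in containers if c not in PROTECTED]
    return [rng.choice(eligible) for _ in range(rounds)]


victims = pick_victims(
    ["rabbitmq", "client1", "filter-year1", "group-semester1", "join-items1"],
    rounds=5,
    rng=random.Random(42),  # seeded for reproducibility
)
print(victims)
```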
The Boom script is a targeted fault injection tool that allows precise control over container termination for testing system resilience. Unlike the Chaos Monkey which attacks containers automatically in sequences, Boom provides manual control for strategic testing scenarios.
- Multiple operation modes: random, target, and group-based container termination
- Smart filtering: Automatically excludes critical containers (RabbitMQ, clients) from random selection
- Group operations: Target multiple instances of the same service type
- Docker Compose integration: Works with any Docker Compose file
./scripts/boom.sh [options]

Available options:

- `-t <container_name>`: Target a specific container by name
- `--mode <mode>`: Operation mode (random, target, or group)
- `-f <compose_file>`: Specify the Docker Compose file (defaults to docker-compose-dev.yaml)
1. Random Mode (default)
./scripts/boom.sh
./scripts/boom.sh --mode random

Randomly selects and kills a container from the eligible services (excludes RabbitMQ and clients).
2. Target Mode
./scripts/boom.sh -t filter-year1
./scripts/boom.sh --mode target -t group-semester2

Kills a specific container by name. The mode is automatically inferred when using -t.
3. Group Mode
./scripts/boom.sh --mode group -t filter-year

Randomly kills one container from a group of services with the same base name (e.g., filter-year1, filter-year2, etc.).
# Kill a random eligible container
./scripts/boom.sh
# Kill a specific container
./scripts/boom.sh -t filter-amount2
# Kill a random container from the year filter group
./scripts/boom.sh --mode group -t filter-year
# Use with custom compose file
./scripts/boom.sh -f custom-compose.yaml -t join-items1

- Protected containers: RabbitMQ and client containers are excluded from random selection to maintain system core functionality
- Group validation: Ensures multiple containers exist in a group before random selection
- Error handling: Graceful handling of invalid targets or missing containers
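Group selection boils down to matching containers whose name is the base name plus an instance number. The sketch below assumes that naming convention and is not the actual boom.sh logic:

```python
# Sketch of group-mode matching (not the actual boom.sh logic): find
# containers named <base><number> and pick one at random.
import random
import re


def group_members(containers, base):
    """Return all containers matching e.g. filter-year1, filter-year2, ..."""
    pattern = re.compile(re.escape(base) + r"\d+$")
    return [c for c in containers if pattern.fullmatch(c)]


containers = ["filter-year1", "filter-year2", "filter-hour1", "rabbitmq"]
members = group_members(containers, "filter-year")
print(random.Random(7).choice(members))
```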
The Compare Results tool validates the correctness of the distributed system's output by comparing actual results against expected results. This tool is essential for ensuring data integrity and verifying that the system produces accurate analytics across all four queries.
- Multi-query validation: Compares results for all four analytical queries (Q1-Q4)
- Detailed difference reporting: Shows exactly which results differ between actual and expected outputs
- Error tolerance: Continues validation even if individual queries fail
- Format normalization: Handles floating-point precision and formatting differences automatically
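Format normalization typically means re-rendering numeric fields at a fixed precision before comparing rows. An illustrative sketch, not the actual compare_results.py:

```python
# Illustrative normalization sketch (not the actual compare_results.py):
# render numeric fields at a fixed precision so "3.140000" and "3.14"
# compare equal, and strip stray whitespace from text fields.
def normalize_row(row, decimals=2):
    normalized = []
    for field in row:
        try:
            # Numeric field: reformat at a fixed number of decimals.
            normalized.append(f"{float(field):.{decimals}f}")
        except ValueError:
            # Text field: just trim surrounding whitespace.
            normalized.append(field.strip())
    return normalized


print(normalize_row(["3.140000", " Latte ", "40"]))
```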
./scripts/compare_results.sh <client_id>

or

python3 ./scripts/compare_results.py <client_id>

Parameters:

- `<client_id>`: The client ID to validate results for (e.g., 1, 2, 3)
# Compare results for client 1
./scripts/compare_results.sh 1
# Compare results for client 3
python3 ./scripts/compare_results.py 3
# Example output showing successful validation
Comparando resultados para client_id 1
✅ results_q1: Todos los resultados coinciden (8 filas).
✅ results_q2_best_sellers: Todos los resultados coinciden (24 filas).
✅ results_q3: Todos los resultados coinciden (16 filas).
✅ results_q4: Todos los resultados coinciden (40 filas).

The tool expects the following file structure:
- Actual results: `./results/results_q{1-4}_{client_id}.txt`
- Expected results: `./scripts/expected_results/results_q{1-4}.csv`
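Given that layout, the path pair for one query can be derived mechanically. A hypothetical helper mirroring the paths above (not code from the repository):

```python
# Hypothetical helper (not from the repository) that mirrors the file
# layout above: build the actual/expected path pair for one query.
def result_paths(client_id, query):
    actual = f"./results/results_q{query}_{client_id}.txt"
    expected = f"./scripts/expected_results/results_q{query}.csv"
    return actual, expected


# The four analytical queries for client 1.
for q in range(1, 5):
    print(result_paths(1, q))
```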
