This project demonstrates how to process and analyze web server logs using Apache Beam.
It simulates log data, parses it, validates it, and generates meaningful insights such as request counts, response times, and top endpoints.
- Simulates or ingests log lines in a space-delimited access-log format.
- Log validation: detects malformed lines (missing fields, bad timestamps, etc.).
- Parses valid lines into structured records: `timestamp`, `method`, `endpoint`, `status_code`, `response_time`.
- Classifies status codes into categories: 2xx (success), 4xx (client error), 5xx (server error).
- Computes metrics such as:
- Requests per minute
- Average response time
- Top N most requested endpoints
- Outputs reports in CSV and JSON formats:
  - `logs_summary.csv` → per-minute metrics.
  - `top_endpoints.csv` → ranking of endpoints.
  - `log_validation_report.csv` / `.json` → list of invalid log lines with details.
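The parsing and classification steps above can be sketched in plain Python. The regex, function names, and category labels mirror the description in this README but are illustrative assumptions, not the project's exact code:

```python
import re

# Hypothetical pattern for the documented line format:
# TIMESTAMP METHOD ENDPOINT STATUS_CODE RESPONSE_TIME
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<method>GET|POST|PUT|DELETE|PATCH) "
    r"(?P<endpoint>\S+) "
    r"(?P<status_code>\d{3}) "
    r"(?P<response_time>\d+)ms$"
)

def parse_line(line):
    """Return a structured record, or None for malformed lines."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["status_code"] = int(record["status_code"])
    record["response_time"] = int(record["response_time"])
    return record

def classify_status(status_code):
    """Map an HTTP status code to the category used in the reports."""
    if 200 <= status_code < 300:
        return "success"
    if 400 <= status_code < 500:
        return "client error"
    if 500 <= status_code < 600:
        return "server error"
    return "other"
```

In the actual pipeline, each of these functions would run per element inside a Beam transform (e.g. via `beam.Map`), with `None` results routed to the validation report.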
- Python (>= 3.11).
- Apache Beam (batch mode).
- Faker (for synthetic log generation).
- CSV / JSON for report output.
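A minimal `requirements.txt` matching the stack above might look like this (package names only; pin versions as your environment requires):

```
apache-beam
Faker
```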
```
.
├── src
│   ├── data
│   │   ├── clean
│   │   │   ├── clean_logs_report.log
│   │   │   └── logs_clean.log
│   │   ├── raw
│   │   │   └── logs.log
│   │   └── validation
│   │       ├── log_validation_report.csv
│   │       └── log_validation_report.json
│   ├── clean_logs.py
│   ├── generate_logs.py
│   └── validation.py
├── LICENSE
├── README.md
└── requirements.txt
```
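A self-contained stand-in for `src/generate_logs.py` is sketched below. It uses the stdlib `random` module instead of Faker so the example has no dependencies; all names and the corruption mechanism are assumptions for illustration:

```python
import random
from datetime import datetime, timedelta

METHODS = ["GET", "POST", "PUT", "DELETE"]
ENDPOINTS = ["/api/cart", "/api/payments", "/api/orders", "/api/reviews"]

def generate_log_lines(n, corruption_rate=0.05, seed=None):
    """Produce n synthetic log lines, occasionally injecting bad ones."""
    rng = random.Random(seed)
    ts = datetime(2025, 9, 7, 22, 8, 0)
    lines = []
    for _ in range(n):
        ts += timedelta(seconds=rng.randint(1, 30))
        if rng.random() < corruption_rate:
            # Intentionally emit a malformed line for the validation step.
            lines.append("BAD_LOG_LINE")
            continue
        lines.append(
            f"{ts:%Y-%m-%d %H:%M:%S} {rng.choice(METHODS)} "
            f"{rng.choice(ENDPOINTS)} {rng.choice([200, 201, 404, 500])} "
            f"{rng.randint(20, 500)}ms"
        )
    return lines
```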
- Logs are simulated in a `logs.log` file (generated with Faker or taken from an Apache/Nginx dataset).

⚠️ The dataset is not clean and intentionally contains some invalid or corrupted log lines, such as:
- Missing fields.
- Malformed timestamps.

This ensures the pipeline also handles data quality validation.
```
2025-09-07 22:08:59 DELETE /api/cart 200 241ms
2025-09-07 22:08:18 POST /api/payments 200 94ms
BAD_LOG_LINE
2025-09-07 22:08:44 POST /api/reviews 200 66ms
2025/09/07-22:10:23 PUT /api/orders 200 74ms
```
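Lines like `BAD_LOG_LINE` or the slash-formatted timestamp above end up in the validation report. A minimal sketch of the report writer follows; the function name, input shape, and paths are assumptions mirroring the files under `src/data/validation/`:

```python
import csv
import json

def write_validation_report(invalid_lines, csv_path, json_path):
    """Persist invalid log lines with an error description.

    `invalid_lines` is a list of (line, reason) pairs collected
    during parsing.
    """
    rows = [{"line": line, "error": reason} for line, reason in invalid_lines]
    # CSV report: one row per invalid line.
    with open(csv_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["line", "error"])
        writer.writeheader()
        writer.writerows(rows)
    # JSON report: same records as a list of objects.
    with open(json_path, "w") as fh:
        json.dump(rows, fh, indent=2)
```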
- Format: `TIMESTAMP METHOD ENDPOINT STATUS_CODE RESPONSE_TIME`
Pipeline:

1. Parse each line into a dictionary: `timestamp`, `method`, `endpoint`, `status_code`, `response_time`.
2. Classify status codes: `2xx` → success, `4xx` → client error, `5xx` → server error.
3. Metrics & analysis:
   - Count requests per minute.
   - Find the top 3 most requested endpoints.
   - Calculate the average `response_time`.
4. Output:
   - `logs_summary.csv` → metrics per minute.
   - `top_endpoints.csv` → endpoint ranking.
   - `log_validation_report.csv` / `.json` → list of invalid log lines with error details.
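The metrics step can be sketched outside Beam as plain Python; the real pipeline expresses the same logic as Beam transforms (e.g. `Count` and `Top` combiners), and the function name and record shape here are illustrative:

```python
from collections import Counter, defaultdict

def summarize(records):
    """Compute per-minute metrics and the top-3 endpoints.

    `records` are dicts as produced by the parsing step, e.g.
    {"timestamp": "2025-09-07 22:08:59", "endpoint": "/api/cart",
     "response_time": 241, ...}.
    """
    response_times = defaultdict(list)
    endpoint_hits = Counter()
    for rec in records:
        minute = rec["timestamp"][:16]  # e.g. "2025-09-07 22:08"
        response_times[minute].append(rec["response_time"])
        endpoint_hits[rec["endpoint"]] += 1
    summary = {
        minute: {
            "requests": len(times),
            "avg_response_time": sum(times) / len(times),
        }
        for minute, times in response_times.items()
    }
    top_endpoints = endpoint_hits.most_common(3)
    return summary, top_endpoints
```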
- You can customize the log pattern (`LOG_PATTERN`) if your log format differs.
- Adjust the windowing or aggregation logic in Apache Beam if you want, e.g., sliding windows or hourly metrics.
- Modify the thresholds or top-N value for endpoint ranking.
- Extend output formats or integrate with a streaming runner if needed.
- 🚀 Rapid setup of a log-processing pipeline using Apache Beam.
- ✅ Includes validation logic so you don’t just parse logs blindly — you also catch bad/malformed lines.
- 📊 Produces actionable metrics:
- counts.
- response times.
- endpoint popularity.
- 🔧 Easily extensible to other log formats or more advanced analytics.
Author: Camila Javiera
Feel free to open an issue or contact me. Contributions are welcome!