Description
Documentation Enhancement: Testing, Templates, and Configuration Guidance
Summary
Based on a comprehensive audit of documentation coverage vs. repository content, this issue tracks several high-value documentation improvements that would help users leverage the full power of our production-ready configurations without needing to dig through the repository.
Background
Currently, users can learn concepts from our docs but must navigate to the repository to find:
- Complete alert rule definitions with production-tested thresholds
- Alert testing infrastructure and methodology
- Working docker-compose demo environments
- Complete OpenTelemetry Collector configurations
Proposed Improvements
1. Testing & Validation Guide (High Priority)
Create: docs/modules/ROOT/pages/guides/testing-alerts.adoc
Content:
- How to use the test files in prometheus_v2/tests/
- Document testing methodology (using promtool test rules)
- Best practices for validating alerts before production deployment
- Example test workflow:
# Validate alert rules syntax
promtool check rules prometheus_v2/rules/*.yml
# Run alert rule tests
promtool test rules prometheus_v2/tests/*.yml
- Link to all 20+ test files with explanations of what each validates
- CI/CD integration examples for automated alert testing
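As a starting point for the CI/CD integration examples, the two promtool commands above could be wrapped in a small script that checks every file and reports a single pass/fail status. This is a minimal sketch, not the repository's actual CI setup: the function name `run_alert_checks` and the `PROMTOOL` override variable are hypothetical, and it assumes rule and test files use the `.yml` extension as in the commands above.

```shell
#!/usr/bin/env bash
# Sketch of a CI step: validate every rule file, then run every rule test.
# PROMTOOL is a hypothetical override so a pinned binary can be substituted.
PROMTOOL="${PROMTOOL:-promtool}"

# run_alert_checks RULES_DIR TESTS_DIR
# Returns 0 only if every check and every test passes.
run_alert_checks() {
  local rules_dir="$1" tests_dir="$2" status=0 f

  for f in "$rules_dir"/*.yml; do
    [ -e "$f" ] || continue                 # skip when the glob matches nothing
    "$PROMTOOL" check rules "$f" || status=1
  done

  for f in "$tests_dir"/*.yml; do
    [ -e "$f" ] || continue
    "$PROMTOOL" test rules "$f" || status=1
  done

  return "$status"
}
```

A CI job would then call `run_alert_checks prometheus_v2/rules prometheus_v2/tests` and fail the build on a non-zero exit status. Running each file separately keeps one broken file from hiding failures in the others.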
Why: The repository contains a complete alert testing infrastructure (prometheus_v2/tests/) that isn't mentioned in docs. Users should know this exists and how to use it.
References:
- Existing tests: prometheus_v2/tests/
- Alert rules: prometheus_v2/rules/
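To show readers what a promtool rule test looks like before they open the repository's own files, the guide could include a self-contained example along these lines. Everything in it is illustrative: the alert name `ExampleTargetDown`, the `redis-enterprise` job label, the `node1:8070` instance, and the scratch directory are assumptions and do not correspond to files in prometheus_v2/.

```shell
#!/usr/bin/env bash
# Writes a hypothetical rule file and a matching promtool test to a scratch
# directory, then runs the test if promtool is installed.
workdir="${WORKDIR:-$(mktemp -d)}"

cat > "$workdir/example-rules.yml" <<'EOF'
groups:
  - name: example
    rules:
      - alert: ExampleTargetDown          # hypothetical alert name
        expr: up{job="redis-enterprise"} == 0
        for: 2m
        labels:
          severity: critical
EOF

cat > "$workdir/example-test.yml" <<'EOF'
rule_files:
  - example-rules.yml                     # resolved relative to this test file
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # The target reports "down" (0) for the whole test window.
      - series: 'up{job="redis-enterprise", instance="node1:8070"}'
        values: '0+0x10'
    alert_rule_test:
      # After 5 minutes the 2-minute "for" clause has elapsed, so the alert fires.
      - eval_time: 5m
        alertname: ExampleTargetDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: redis-enterprise
              instance: "node1:8070"
EOF

if command -v promtool >/dev/null 2>&1; then
  promtool test rules "$workdir/example-test.yml"
else
  echo "promtool not found; wrote example files to $workdir"
fi
```

The test feeds synthetic samples into the rule and asserts on the alerts that fire at a given time, so thresholds and `for` durations can be checked without a running Prometheus.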
2. Quick Start Templates Page (High Priority)
Create: docs/modules/ROOT/pages/quick-start-templates.adoc
Content:
- "Copy this entire config and customize" approach
- Working docker-compose examples from grafana_v2/demo_v2/
- Complete setup walkthroughs:
- Local Demo Environment: Step-by-step using docker-compose
- Kubernetes Setup: Using provided manifests
- Cloud Deployments: Using Terraform modules in newrelic_v2/kickstarter/terraform/
- Each template should include:
- What it does
- Prerequisites
- Copy-paste ready commands
- Customization points
- Expected output/verification
Example structure:
== Local Demo Environment
=== What You Get
* Prometheus scraping Redis Enterprise (v1 and v2 endpoints)
* Grafana with pre-loaded dashboards
* Pre-configured alerts
=== Setup (5 minutes)
[source,bash]
----
git clone https://github.com/redis-field-engineering/redis-enterprise-observability
cd redis-enterprise-observability/grafana_v2/demo_v2
docker-compose up -d
----
=== Access Points
* Grafana: http://localhost:3000 (admin/admin)
* Prometheus: http://localhost:9090
* Alertmanager: http://localhost:9093
Why: Users want working examples they can deploy immediately, then customize. Our demo environments provide this but aren't prominent in docs.
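For the "Expected output/verification" part of each template, the local demo page could offer a quick health check against the three access points listed in the example. The sketch below assumes the demo exposes the default ports shown above and that the services answer on their standard health endpoints (`/api/health` for Grafana, `/-/healthy` for Prometheus and Alertmanager). The function names are hypothetical.

```shell
#!/usr/bin/env bash
# check_endpoint NAME URL
# Prints OK or FAIL for one service; returns non-zero on failure.
check_endpoint() {
  local name="$1" url="$2"
  if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
    echo "OK   $name ($url)"
  else
    echo "FAIL $name ($url)"
    return 1
  fi
}

# verify_demo: checks all three demo services and returns non-zero if any fail.
verify_demo() {
  local status=0
  check_endpoint "Grafana"      "http://localhost:3000/api/health" || status=1
  check_endpoint "Prometheus"   "http://localhost:9090/-/healthy"  || status=1
  check_endpoint "Alertmanager" "http://localhost:9093/-/healthy"  || status=1
  return "$status"
}
```

Running `verify_demo` a short while after `docker-compose up -d` gives users a clear signal that the stack is up before they start customizing it.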
3. Configuration Decision Tree & Production Checklist (Medium Priority)
Create: docs/modules/ROOT/pages/guides/production-deployment.adoc
Content:
Configuration Decision Tree
Help users choose the right approach:
- Alerts: When to use consolidated alerts.yml vs. separate category files
- Metrics: v1 vs. v2 endpoint differences and migration path
- Service Discovery: Static configs vs. Kubernetes vs. Consul
- Data Source: Direct Prometheus vs. remote write vs. federation
Example decision tree:
Are you monitoring multiple clusters?
├─ Yes: Use separate scrape jobs with relabeling
└─ No: Single scrape job is sufficient
Do you need long-term retention?
├─ Yes: Configure remote write to external storage
└─ No: Local TSDB with retention policy
Is Redis Enterprise on Kubernetes?
├─ Yes: Use kubernetes_sd_configs
└─ No: Use static_configs or consul_sd_configs
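The same three questions could also be expressed as a tiny helper, which may be a useful seed for the configuration generator mentioned later in this issue. This is only an illustration of the tree above: the function name `recommend_config` and its yes/no argument convention are hypothetical.

```shell
#!/usr/bin/env bash
# recommend_config MULTI_CLUSTER LONG_TERM_RETENTION ON_KUBERNETES
# Each argument is "yes" or "no"; prints one recommendation per question.
recommend_config() {
  local multi_cluster="$1" long_term="$2" on_k8s="$3"

  if [ "$multi_cluster" = "yes" ]; then
    echo "scrape: separate scrape jobs with relabeling"
  else
    echo "scrape: single scrape job"
  fi

  if [ "$long_term" = "yes" ]; then
    echo "storage: remote write to external storage"
  else
    echo "storage: local TSDB with retention policy"
  fi

  if [ "$on_k8s" = "yes" ]; then
    echo "discovery: kubernetes_sd_configs"
  else
    echo "discovery: static_configs or consul_sd_configs"
  fi
}
```

For example, `recommend_config no yes yes` describes a single cluster on Kubernetes that needs long-term retention.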
Production Readiness Checklist
Essential Alert Rules (must have):
- Database down alerts (prometheus_v2/rules/node-alerts.yml)
- Memory capacity alerts (prometheus_v2/rules/capacity-alerts.yml)
- Connection monitoring (prometheus_v2/rules/connection-alerts.yml)
- Latency thresholds (prometheus_v2/rules/latency-alerts.yml)
Configuration Requirements:
- Alert rules tested with promtool test rules
- Alertmanager notification channels configured
- Recording rules for common queries (reduces query load)
- Retention policy matches data requirements
- TLS/authentication configured for production
Recommended Thresholds by Deployment Size:
| Metric | Small (<10 DBs) | Medium (10-50 DBs) | Large (>50 DBs) |
|---|---|---|---|
| Scrape interval | 30s | 30s | 60s |
| Retention | 15d | 30d | 90d |
| Memory alert threshold | 85% | 80% | 75% |
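To make the tier boundaries unambiguous, the table can be read as a lookup keyed on database count. The sketch below encodes the values exactly as listed, treating 10 and 50 as part of the Medium tier. The function name `recommended_settings` is hypothetical.

```shell
#!/usr/bin/env bash
# recommended_settings DB_COUNT
# Prints the table row values for the deployment size that DB_COUNT falls into.
recommended_settings() {
  local dbs="$1"
  if [ "$dbs" -lt 10 ]; then
    echo "size=small scrape_interval=30s retention=15d memory_alert=85%"
  elif [ "$dbs" -le 50 ]; then
    echo "size=medium scrape_interval=30s retention=30d memory_alert=80%"
  else
    echo "size=large scrape_interval=60s retention=90d memory_alert=75%"
  fi
}
```

For instance, `recommended_settings 25` selects the Medium column.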
Testing Requirements Before Production:
- All alert rules validated against test files
- Test alerts sent to notification channels
- Grafana dashboards load without errors
- Prometheus scrape targets all healthy
- Recording rules evaluated successfully
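The "scrape targets all healthy" item lends itself to an automated check against the Prometheus HTTP API. The following is a rough sketch that assumes Prometheus is reachable at `PROM_URL` (a hypothetical variable, defaulting to the demo address) and that the `/api/v1/targets` response is compact JSON in which each active target carries a `"health"` field. It avoids a JSON parser for portability, so treat it as a smoke test rather than a robust client.

```shell
#!/usr/bin/env bash
# PROM_URL is a hypothetical override for the Prometheus base URL.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# unhealthy_target_count
# Prints the number of active targets whose health is not "up".
# Returns 2 if the API cannot be reached.
unhealthy_target_count() {
  local body
  body="$(curl --silent --fail "$PROM_URL/api/v1/targets?state=active")" || return 2
  printf '%s\n' "$body" \
    | { grep -o '"health":"[a-z]*"' || true; } \
    | awk '$0 != "\"health\":\"up\"" { n++ } END { print n + 0 }'
}
```

A pre-production gate could then require that `unhealthy_target_count` prints `0` before the deployment proceeds.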
Why: Users need guidance on configuration choices and confidence that they haven't missed critical setup steps.
4. Enhanced Platform Pages
Updates to existing platform pages:
Dynatrace
- Add section: "Building and Deploying the Extension"
- How to use extension.yaml
- Customizing metric definitions
- Deployment process
- Metric tagging and categorization explained
All Platform Pages
- Add section: "Version Compatibility"
- Which config files to use for v1 vs. v2 metrics
- Migration guidance
- Breaking changes between versions
Splunk
- Add section: "OpenTelemetry Collector Configuration Walkthrough"
- Explain each section of the 300-line config
- Receivers vs. processors vs. exporters
- Pipeline configuration
- Environment variable customization
Why: Platform-specific details exist in repo configs but lack explanatory documentation.
Implementation Plan
Phase 1 (Immediate)
- Add "Example Configurations" sections to platform pages (COMPLETED)
- Create testing-alerts.adoc guide
Phase 2 (Short-term)
- Create quick-start-templates.adoc
- Add production-deployment.adoc with decision tree
Phase 3 (Medium-term)
- Enhance platform pages with detailed configuration walkthroughs
- Add video/tutorial demonstrating complete setup
Phase 4 (Future/Nice-to-have)
- Interactive configuration generator tool
- Automated config validation in CI/CD
- Example integration tests
Success Metrics
Documentation quality:
- Users can complete a production deployment using only documentation
- Reduced questions about "which config file should I use"
- Fewer issues related to misconfiguration
User journey:
- Read platform page → understand concepts ✓ (already good)
- Get working config → deploy immediately ✓ (improved with examples section)
- Test configuration → validate before production ⚠ (needs testing guide)
- Customize for environment → make informed decisions ⚠ (needs decision tree)
- Deploy to production → confidence checklist ⚠ (needs production guide)
References
- Audit findings: [Link to analysis]
- Repository structure: https://github.com/redis-field-engineering/redis-enterprise-observability
- Current documentation: https://redis-field-engineering.github.io/redis-enterprise-observability/
Labels
documentation, enhancement, good-first-issue (for individual sub-tasks)