# Documentation Enhancement: Testing, Templates, and Configuration Guidance

## Summary

An audit comparing documentation coverage against repository content found several gaps. This issue tracks the documentation improvements that would let users adopt the repository's production-ready configurations without having to search the repository for them.

## Background

Currently, users can learn concepts from our docs but must navigate to the repository to find:
- Complete alert rule definitions with production-tested thresholds
- Alert testing infrastructure and methodology
- Working docker-compose demo environments
- Complete OpenTelemetry Collector configurations

## Proposed Improvements

### 1. Testing & Validation Guide (High Priority)

**Create:** `docs/modules/ROOT/pages/guides/testing-alerts.adoc`

**Content:**
- How to use the test files in `prometheus_v2/tests/`
- The testing methodology, based on `promtool test rules`
- Best practices for validating alerts before production deployment
- Example test workflow:
  ```bash
  # Validate alert rules syntax
  promtool check rules prometheus_v2/rules/*.yml
  
  # Run alert rule tests
  promtool test rules prometheus_v2/tests/*.yml
  ```
- Link to all 20+ test files with explanations of what each validates
- CI/CD integration examples for automated alert testing
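
For the CI/CD integration item above, one possible shape is a small Python wrapper that a pipeline step can call. This is only a sketch: the function names `build_promtool_commands` and `run_checks` are hypothetical, and it assumes `promtool` is on the `PATH` and that the step runs from the repository root so the two globs from the example workflow resolve.

```python
"""CI helper sketch: validate and unit-test Prometheus alert rules."""
import glob
import subprocess
import sys


def build_promtool_commands(rule_files, test_files):
    """Return the promtool invocations for the given rule and test files."""
    commands = []
    if rule_files:
        # Same check as `promtool check rules prometheus_v2/rules/*.yml`.
        commands.append(["promtool", "check", "rules", *rule_files])
    if test_files:
        # Same run as `promtool test rules prometheus_v2/tests/*.yml`.
        commands.append(["promtool", "test", "rules", *test_files])
    return commands


def run_checks(rules_glob="prometheus_v2/rules/*.yml",
               tests_glob="prometheus_v2/tests/*.yml"):
    """Run every promtool command; return 0 on success, 1 on any failure."""
    rule_files = sorted(glob.glob(rules_glob))
    test_files = sorted(glob.glob(tests_glob))
    if not rule_files or not test_files:
        # An empty glob usually means the wrong working directory.
        print("no rule or test files matched", file=sys.stderr)
        return 1
    for command in build_promtool_commands(rule_files, test_files):
        print("+", " ".join(command))
        # A non-zero exit code from promtool fails the pipeline step.
        if subprocess.run(command).returncode != 0:
            return 1
    return 0
```

A pipeline step would end the script with `sys.exit(run_checks())` so that a failed check or test fails the build.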

**Why:** The repository contains a complete alert testing infrastructure (`prometheus_v2/tests/`) that the docs never mention. Users should know that it exists and how to use it.

**References:**
- Existing tests: `prometheus_v2/tests/`
- Alert rules: `prometheus_v2/rules/`

---

### 2. Quick Start Templates Page (High Priority)

**Create:** `docs/modules/ROOT/pages/quick-start-templates.adoc`

**Content:**
- "Copy this entire config and customize" approach
- Working docker-compose examples from `grafana_v2/demo_v2/`
- Complete setup walkthroughs:
  - **Local Demo Environment**: Step-by-step using docker-compose
  - **Kubernetes Setup**: Using provided manifests
  - **Cloud Deployments**: Using Terraform modules in `newrelic_v2/kickstarter/terraform/`
- Each template should include:
  - What it does
  - Prerequisites
  - Copy-paste ready commands
  - Customization points
  - Expected output/verification

**Example structure:**
```
== Local Demo Environment

=== What You Get
* Prometheus scraping Redis Enterprise (v1 and v2 endpoints)
* Grafana with pre-loaded dashboards
* Pre-configured alerts

=== Setup (5 minutes)
[source,bash]
----
git clone https://github.com/redis-field-engineering/redis-enterprise-observability
cd redis-enterprise-observability/grafana_v2/demo_v2
docker-compose up -d
----

=== Access Points
* Grafana: http://localhost:3000 (admin/admin)
* Prometheus: http://localhost:9090
* Alertmanager: http://localhost:9093
```

**Why:** Users want working examples they can deploy immediately, then customize. Our demo environments provide this but aren't prominent in docs.

---

### 3. Configuration Decision Tree & Production Checklist (Medium Priority)

**Create:** `docs/modules/ROOT/pages/guides/production-deployment.adoc`

**Content:**

#### Configuration Decision Tree
Help users choose the right approach:
- **Alerts**: When to use consolidated `alerts.yml` vs. separate category files
- **Metrics**: v1 vs. v2 endpoint differences and migration path
- **Service Discovery**: Static configs vs. Kubernetes vs. Consul
- **Data Source**: Direct Prometheus vs. remote write vs. federation

Example decision tree:
```
Are you monitoring multiple clusters?
├─ Yes: Use separate scrape jobs with relabeling
└─ No: Single scrape job is sufficient

Do you need long-term retention?
├─ Yes: Configure remote write to external storage
└─ No: Local TSDB with retention policy

Is Redis Enterprise on Kubernetes?
├─ Yes: Use kubernetes_sd_configs
└─ No: Use static_configs or consul_sd_configs
```
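
The same three questions can also be written as a small function, which may help when reviewing the tree for gaps or reusing it in tooling. This is an illustrative sketch only: the `Deployment` type and `recommend` function are hypothetical names, and the returned strings simply restate the branches of the tree above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Answers to the three questions in the decision tree."""
    multiple_clusters: bool
    long_term_retention: bool
    on_kubernetes: bool


def recommend(deployment):
    """Map each answer to the recommendation on the matching branch."""
    return {
        "scrape_jobs": (
            "separate scrape jobs with relabeling"
            if deployment.multiple_clusters
            else "single scrape job"
        ),
        "storage": (
            "remote write to external storage"
            if deployment.long_term_retention
            else "local TSDB with retention policy"
        ),
        "service_discovery": (
            "kubernetes_sd_configs"
            if deployment.on_kubernetes
            else "static_configs or consul_sd_configs"
        ),
    }


# Example: one cluster on Kubernetes that needs long-term retention.
print(recommend(Deployment(False, True, True)))
```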

#### Production Readiness Checklist

**Essential Alert Rules** (must have):
- [ ] Database down alerts (`prometheus_v2/rules/node-alerts.yml`)
- [ ] Memory capacity alerts (`prometheus_v2/rules/capacity-alerts.yml`)
- [ ] Connection monitoring (`prometheus_v2/rules/connection-alerts.yml`)
- [ ] Latency thresholds (`prometheus_v2/rules/latency-alerts.yml`)

**Configuration Requirements**:
- [ ] Alert rules tested with `promtool test rules`
- [ ] Alertmanager notification channels configured
- [ ] Recording rules for common queries (reduces query load)
- [ ] Retention policy matches data requirements
- [ ] TLS/authentication configured for production

**Recommended Thresholds by Deployment Size**:
| Metric | Small (<10 DBs) | Medium (10-50 DBs) | Large (>50 DBs) |
|--------|-----------------|-------------------|-----------------|
| Scrape interval | 30s | 30s | 60s |
| Retention | 15d | 30d | 90d |
| Memory alert threshold | 85% | 80% | 75% |
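
To make the size boundaries in the table unambiguous, the lookup can be sketched in Python. The `Profile` type and `profile_for` function are hypothetical names. The values are copied from the table, with 10 and 50 databases both treated as "Medium" as the column headers state.

```python
from typing import NamedTuple


class Profile(NamedTuple):
    size: str
    scrape_interval: str
    retention: str
    memory_alert_threshold_pct: int


def profile_for(database_count):
    """Return the recommended settings for a deployment of this size."""
    if database_count < 0:
        raise ValueError("database_count must not be negative")
    if database_count < 10:       # Small: fewer than 10 DBs
        return Profile("small", "30s", "15d", 85)
    if database_count <= 50:      # Medium: 10 to 50 DBs inclusive
        return Profile("medium", "30s", "30d", 80)
    return Profile("large", "60s", "90d", 75)  # Large: more than 50 DBs
```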

**Testing Requirements Before Production**:
- [ ] All alert rules validated against test files
- [ ] Test alerts sent to notification channels
- [ ] Grafana dashboards load without errors
- [ ] Prometheus scrape targets all healthy
- [ ] Recording rules evaluated successfully
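
The "scrape targets all healthy" item could be automated with the Prometheus HTTP API, whose `/api/v1/targets` endpoint reports a `health` value for each active target. The sketch below is an assumption about how the guide might do this, not an existing script in the repository. The names `unhealthy_targets` and `fetch_targets` are hypothetical, and the default URL is the demo address listed under Access Points.

```python
import json
from urllib.request import urlopen


def unhealthy_targets(payload):
    """Return (job, scrape_url, health) for each active target not 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job", ""),
            target.get("scrapeUrl", ""),
            target.get("health", "unknown"),
        )
        for target in targets
        if target.get("health") != "up"
    ]


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the target list from a running Prometheus server."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=10) as response:
        return json.load(response)
```

Calling `unhealthy_targets(fetch_targets())` against the demo environment should return an empty list once every scrape target is up.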

**Why:** Users need guidance on configuration choices and confidence that they haven't missed critical setup steps.

---

### 4. Enhanced Platform Pages

**Updates to existing platform pages:**

#### Dynatrace
- **Add section:** "Building and Deploying the Extension"
  - How to use `extension.yaml`
  - Customizing metric definitions
  - Deployment process
  - Metric tagging and categorization explained

#### All Platform Pages
- **Add section:** "Version Compatibility"
  - Which config files to use for v1 vs. v2 metrics
  - Migration guidance
  - Breaking changes between versions

#### Splunk
- **Add section:** "OpenTelemetry Collector Configuration Walkthrough"
  - Explain each section of the 300-line config
  - Receivers vs. processors vs. exporters
  - Pipeline configuration
  - Environment variable customization

**Why:** Platform-specific details exist in repo configs but lack explanatory documentation.

---

## Implementation Plan

### Phase 1 (Immediate)
- [x] Add "Example Configurations" sections to platform pages (COMPLETED)
- [ ] Create testing-alerts.adoc guide

### Phase 2 (Short-term)
- [ ] Create quick-start-templates.adoc
- [ ] Add production-deployment.adoc with decision tree

### Phase 3 (Medium-term)
- [ ] Enhance platform pages with detailed configuration walkthroughs
- [ ] Add video/tutorial demonstrating complete setup

### Phase 4 (Future/Nice-to-have)
- [ ] Interactive configuration generator tool
- [ ] Automated config validation in CI/CD
- [ ] Example integration tests

## Success Metrics

**Documentation quality:**
- Users can complete a production deployment using only documentation
- Reduced questions about "which config file should I use"
- Fewer issues related to misconfiguration

**User journey:**
1. Read platform page → understand concepts ✓ (already good)
2. Get working config → deploy immediately ✓ (improved with examples section)
3. Test configuration → validate before production ⚠ (needs testing guide)
4. Customize for environment → make informed decisions ⚠ (needs decision tree)
5. Deploy to production → confidence checklist ⚠ (needs production guide)

## References

- Audit findings: [Link to analysis]
- Repository structure: https://github.com/redis-field-engineering/redis-enterprise-observability
- Current documentation: https://redis-field-engineering.github.io/redis-enterprise-observability/

## Labels

documentation, enhancement, good-first-issue (for individual sub-tasks)

