Documentation Enhancement: Testing, Templates, and Configuration Guidance #75

@joshrotenberg

Summary

This issue tracks several high-value documentation improvements identified by auditing documentation coverage against repository content. The goal is to let users take full advantage of our production-ready configurations without digging through the repository.

Background

Currently, users can learn concepts from our docs but must navigate to the repository to find:

  • Complete alert rule definitions with production-tested thresholds
  • Alert testing infrastructure and methodology
  • Working docker-compose demo environments
  • Complete OpenTelemetry Collector configurations

Proposed Improvements

1. Testing & Validation Guide (High Priority)

Create: docs/modules/ROOT/pages/guides/testing-alerts.adoc

Content:

  • How to use the test files in prometheus_v2/tests/
  • Document testing methodology (using promtool test rules)
  • Best practices for validating alerts before production deployment
  • Example test workflow:
    # Validate alert rules syntax
    promtool check rules prometheus_v2/rules/*.yml
    
    # Run alert rule tests
    promtool test rules prometheus_v2/tests/*.yml
  • Link to all 20+ test files with explanations of what each validates
  • CI/CD integration examples for automated alert testing
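
To make the methodology concrete, the guide could open with a minimal test file in the format that promtool test rules reads. The sketch below is illustrative only and is not one of the files in prometheus_v2/tests/. It assumes a hypothetical rule file example-alerts.yml that defines an alert named ExampleNodeDown with expr `up{job="redis-enterprise"} == 0`, `for: 2m`, a `severity: critical` label, and no annotations.

```yaml
# example-alerts_test.yml (hypothetical file name and rule, for illustration only)
rule_files:
  - ../rules/example-alerts.yml   # assumed to define ExampleNodeDown as described above

evaluation_interval: 1m

tests:
  - interval: 1m
    # Synthetic input: the target is up for two samples, then down.
    input_series:
      - series: 'up{job="redis-enterprise", instance="node1:8070"}'
        values: '1 1 0 0 0 0'

    alert_rule_test:
      # At 3m the condition has held for only 1m, so the alert is still
      # pending and nothing is expected to fire.
      - eval_time: 3m
        alertname: ExampleNodeDown
        exp_alerts: []

      # At 5m the condition has held for longer than the 2m "for" window,
      # so the alert is expected to be firing.
      - eval_time: 5m
        alertname: ExampleNodeDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: redis-enterprise
              instance: node1:8070
```

Checking both the pending and the firing evaluation time is a cheap way to catch a `for:` duration that was changed without updating the test.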

Why: The repository contains a complete alert testing infrastructure (prometheus_v2/tests/) that isn't mentioned in docs. Users should know this exists and how to use it.

References:

  • Existing tests: prometheus_v2/tests/
  • Alert rules: prometheus_v2/rules/

2. Quick Start Templates Page (High Priority)

Create: docs/modules/ROOT/pages/quick-start-templates.adoc

Content:

  • "Copy this entire config and customize" approach
  • Working docker-compose examples from grafana_v2/demo_v2/
  • Complete setup walkthroughs:
    • Local Demo Environment: Step-by-step using docker-compose
    • Kubernetes Setup: Using provided manifests
    • Cloud Deployments: Using Terraform modules in newrelic_v2/kickstarter/terraform/
  • Each template should include:
    • What it does
    • Prerequisites
    • Copy-paste ready commands
    • Customization points
    • Expected output/verification

Example structure:

== Local Demo Environment

=== What You Get
* Prometheus scraping Redis Enterprise (v1 and v2 endpoints)
* Grafana with pre-loaded dashboards
* Pre-configured alerts

=== Setup (5 minutes)
[source,bash]
----
git clone https://github.com/redis-field-engineering/redis-enterprise-observability
cd redis-enterprise-observability/grafana_v2/demo_v2
docker-compose up -d
----

=== Access Points
* Grafana: http://localhost:3000 (admin/admin)
* Prometheus: http://localhost:9090
* Alertmanager: http://localhost:9093

Why: Users want working examples they can deploy immediately, then customize. Our demo environments provide this but aren't prominent in docs.
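
For readers who have not opened the demo directory, a rough picture of what such a compose file wires together may help. This is a hypothetical sketch and not the actual file in grafana_v2/demo_v2/. The image names are the public Docker Hub images, the ports match the access points listed above, and the mounted prometheus.yml is assumed to sit next to the compose file.

```yaml
# docker-compose.yml (hypothetical sketch, not the repository's actual demo file)
services:
  prometheus:
    image: prom/prometheus
    volumes:
      # Assumes a prometheus.yml next to this file
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    ports:
      - "9090:9090"

  alertmanager:
    image: prom/alertmanager
    ports:
      - "9093:9093"

  grafana:
    image: grafana/grafana
    environment:
      # Demo-only credentials; change these outside a local sandbox
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: admin
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
```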


3. Configuration Decision Tree & Production Checklist (Medium Priority)

Create: docs/modules/ROOT/pages/guides/production-deployment.adoc

Content:

Configuration Decision Tree

Help users choose the right approach:

  • Alerts: When to use consolidated alerts.yml vs. separate category files
  • Metrics: v1 vs. v2 endpoint differences and migration path
  • Service Discovery: Static configs vs. Kubernetes vs. Consul
  • Data Source: Direct Prometheus vs. remote write vs. federation

Example decision tree:

Are you monitoring multiple clusters?
├─ Yes: Use separate scrape jobs with relabeling
└─ No: Single scrape job is sufficient

Do you need long-term retention?
├─ Yes: Configure remote write to external storage
└─ No: Local TSDB with retention policy

Is Redis Enterprise on Kubernetes?
├─ Yes: Use kubernetes_sd_configs
└─ No: Use static_configs or consul_sd_configs
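
Each branch of the tree maps onto a specific part of prometheus.yml, and the guide could show that mapping directly. The sketch below follows the "multiple clusters", "long-term retention", and "not on Kubernetes" branches. Host names, the remote write URL, port 8070, and the /v2 metrics path are assumptions for illustration and should be checked against the actual deployment.

```yaml
# prometheus.yml (illustrative sketch; hosts, port, path, and URLs are assumptions)
global:
  scrape_interval: 30s

scrape_configs:
  # Multiple clusters: one scrape job per cluster, each stamped with a cluster label.
  - job_name: redis-enterprise-cluster-a
    scheme: https
    metrics_path: /v2                # assumed v2 endpoint path
    tls_config:
      insecure_skip_verify: true     # demo only; configure proper CA trust in production
    static_configs:
      - targets: ["cluster-a.example.com:8070"]
    relabel_configs:
      # No source_labels: unconditionally sets cluster="cluster-a" on every target.
      - target_label: cluster
        replacement: cluster-a

  - job_name: redis-enterprise-cluster-b
    scheme: https
    metrics_path: /v2
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets: ["cluster-b.example.com:8070"]
    relabel_configs:
      - target_label: cluster
        replacement: cluster-b

# Long-term retention: forward samples to external storage.
remote_write:
  - url: https://metrics-store.example.com/api/v1/write

# Local-only retention is not set in this file; it is a server flag,
# for example: --storage.tsdb.retention.time=15d
```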

Production Readiness Checklist

Essential Alert Rules (must have):

  • Database down alerts (prometheus_v2/rules/node-alerts.yml)
  • Memory capacity alerts (prometheus_v2/rules/capacity-alerts.yml)
  • Connection monitoring (prometheus_v2/rules/connection-alerts.yml)
  • Latency thresholds (prometheus_v2/rules/latency-alerts.yml)

Configuration Requirements:

  • Alert rules tested with promtool test rules
  • Alertmanager notification channels configured
  • Recording rules for common queries (reduces query load)
  • Retention policy matches data requirements
  • TLS/authentication configured for production
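
As an illustration of the notification-channel item, a minimal Alertmanager configuration might look like the sketch below. The receiver name and webhook URL are placeholders, and the grouping and repeat intervals are assumptions rather than recommendations from the repository.

```yaml
# alertmanager.yml (minimal sketch; receiver name and URL are placeholders)
route:
  receiver: oncall-webhook
  group_by: [alertname, cluster]
  group_wait: 30s
  repeat_interval: 4h
  routes:
    # Re-notify more often for critical alerts
    - matchers:
        - severity="critical"
      receiver: oncall-webhook
      repeat_interval: 1h

receivers:
  - name: oncall-webhook
    webhook_configs:
      - url: http://alert-receiver.example.internal:8080/hook
        send_resolved: true
```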

Recommended Thresholds by Deployment Size:

| Metric | Small (<10 DBs) | Medium (10-50 DBs) | Large (>50 DBs) |
|---|---|---|---|
| Scrape interval | 30s | 30s | 60s |
| Retention | 15d | 30d | 90d |
| Memory alert threshold | 85% | 80% | 75% |
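
To show how a threshold from the table ends up in a rule, the page could include a short example such as the one below, which uses the 85% value from the Small column. The metric names are hypothetical placeholders and not the names exported by Redis Enterprise; the real expressions live in prometheus_v2/rules/capacity-alerts.yml. The alert name and the 5m `for` window are also assumptions.

```yaml
# Illustrative rule only: metric names, alert name, and "for" duration are placeholders
groups:
  - name: capacity-example
    rules:
      - alert: ExampleMemoryUsageHigh
        # 0.85 corresponds to the 85% threshold for small deployments;
        # use 0.80 or 0.75 for medium or large deployments.
        expr: |
          example_memory_used_bytes / example_memory_limit_bytes > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 85% on {{ $labels.instance }}"
```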

Testing Requirements Before Production:

  • All alert rules validated against test files
  • Test alerts sent to notification channels
  • Grafana dashboards load without errors
  • Prometheus scrape targets all healthy
  • Recording rules evaluated successfully

Why: Users need guidance on configuration choices and confidence that they haven't missed critical setup steps.


4. Enhanced Platform Pages

Updates to existing platform pages:

Dynatrace

  • Add section: "Building and Deploying the Extension"
    • How to use extension.yaml
    • Customizing metric definitions
    • Deployment process
    • Metric tagging and categorization explained

All Platform Pages

  • Add section: "Version Compatibility"
    • Which config files to use for v1 vs. v2 metrics
    • Migration guidance
    • Breaking changes between versions

Splunk

  • Add section: "OpenTelemetry Collector Configuration Walkthrough"
    • Explain each section of the 300-line config
    • Receivers vs. processors vs. exporters
    • Pipeline configuration
    • Environment variable customization

Why: Platform-specific details exist in repo configs but lack explanatory documentation.
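
For the Splunk walkthrough, a stripped-down collector configuration can introduce the receiver, processor, exporter, and pipeline vocabulary before the full 300-line file. The sketch below is not the repository's config. It assumes the contrib distribution of the OpenTelemetry Collector (for the prometheus receiver and signalfx exporter), and the environment variable names, port 8070, and /v2 path are placeholders.

```yaml
# otel-collector.yaml (reduced sketch; env var names, port, and path are assumptions)
receivers:
  # Receivers pull or accept telemetry; here a Prometheus-style scrape.
  prometheus:
    config:
      scrape_configs:
        - job_name: redis-enterprise
          scrape_interval: 30s
          scheme: https
          metrics_path: /v2
          tls_config:
            insecure_skip_verify: true   # demo only
          static_configs:
            - targets: ["${env:REDIS_ENTERPRISE_HOST}:8070"]

processors:
  # Processors transform data in flight; batch groups metrics before export.
  batch: {}

exporters:
  # Exporters send data to a backend; credentials come from the environment.
  signalfx:
    access_token: ${env:SPLUNK_ACCESS_TOKEN}
    realm: ${env:SPLUNK_REALM}

service:
  pipelines:
    # A pipeline connects receivers to exporters through processors, in order.
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [signalfx]
```

Components defined at the top level do nothing until they are referenced in a pipeline under `service`, which is a common source of confusion worth calling out in the walkthrough.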


Implementation Plan

Phase 1 (Immediate)

  • Add "Example Configurations" sections to platform pages (COMPLETED)
  • Create testing-alerts.adoc guide

Phase 2 (Short-term)

  • Create quick-start-templates.adoc
  • Add production-deployment.adoc with decision tree

Phase 3 (Medium-term)

  • Enhance platform pages with detailed configuration walkthroughs
  • Add video/tutorial demonstrating complete setup

Phase 4 (Future/Nice-to-have)

  • Interactive configuration generator tool
  • Automated config validation in CI/CD
  • Example integration tests
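
As a starting point for the automated validation item, a CI job could run the same two promtool commands shown in the testing guide on every pull request. The GitHub Actions workflow below is one possible sketch. The workflow name and path filter are assumptions, and it runs promtool from the public prom/prometheus image instead of installing it on the runner.

```yaml
# .github/workflows/validate-alert-rules.yml (hypothetical workflow, for illustration)
name: validate-alert-rules

on:
  pull_request:
    paths:
      - "prometheus_v2/**"

jobs:
  promtool:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # The prom/prometheus image ships promtool; mount the checkout and run it there.
      - name: Check rule syntax
        run: |
          docker run --rm -v "$PWD:/work" -w /work \
            --entrypoint promtool prom/prometheus \
            check rules prometheus_v2/rules/*.yml

      - name: Run rule unit tests
        run: |
          docker run --rm -v "$PWD:/work" -w /work \
            --entrypoint promtool prom/prometheus \
            test rules prometheus_v2/tests/*.yml
```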

Success Metrics

Documentation quality:

  • Users can complete a production deployment using only documentation
  • Reduced questions about "which config file should I use"
  • Fewer issues related to misconfiguration

User journey:

  1. Read platform page → understand concepts ✓ (already good)
  2. Get working config → deploy immediately ✓ (improved with examples section)
  3. Test configuration → validate before production ⚠ (needs testing guide)
  4. Customize for environment → make informed decisions ⚠ (needs decision tree)
  5. Deploy to production → confidence checklist ⚠ (needs production guide)

Labels

documentation, enhancement, good-first-issue (for individual sub-tasks)
