LifeRPG_v2.0/modern/ops/RUNBOOK.md
TLimoges33 7fe4ae5365
2025-08-30 17:32:42 +00:00

LifeRPG Ops Runbook

This runbook summarizes common operational signals and actions.

Key metrics and dashboards

  • HTTP: request rate (http_requests_total), p95 latency (http_request_duration_seconds), in-progress gauge.
  • Jobs: jobs_processed_total{status}.
  • Integrations: integration_sync_total{provider,result}, integration_sync_by_integration_total{integration_id,result}.
  • Backpressure: sync_enqueue_skips_total{reason}, sync_queue_depth{provider}, sync_inflight{provider}.
  • Logs: structured JSON logs for requests and jobs; ship via Promtail to Loki.

Grafana dashboard: ops/grafana-dashboard.json (import into Grafana and configure PROM_DS and LOKI_DS).
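
When a panel is unavailable, the same signals can be pulled straight from Prometheus. The sketch below builds instant-query URLs for a few of the metrics listed above. It assumes http_request_duration_seconds is exported as a histogram (so a `_bucket` series with an `le` label exists) and uses a 5m rate window; both are assumptions, not facts from this runbook. The helper name `instant_query_url` and the base URL are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: http_request_duration_seconds is a Prometheus histogram,
# so the *_bucket series with an `le` label is available.
P95_LATENCY = (
    "histogram_quantile(0.95, "
    "sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
)
REQUEST_RATE = "sum(rate(http_requests_total[5m]))"
JOB_RATE_BY_STATUS = "sum by (status) (rate(jobs_processed_total[5m]))"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint (/api/v1/query)."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


if __name__ == "__main__":
    # Placeholder address; substitute the Prometheus instance you actually use.
    print(instant_query_url("http://localhost:9090", P95_LATENCY))
```

The printed URL can be opened with any HTTP client; the response format is the standard Prometheus JSON envelope.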

Common symptoms

  1. High enqueue skips
  • Symptom: sync_enqueue_skips_total rate > 0.2 for >10m.
  • Likely causes: provider concurrency cap reached, duplicate enqueues rejected by the guard, or downstream slowness.
  • Actions:
    • Check sync_inflight{provider} vs cap (env SYNC_MAX_CONCURRENCY_PER_PROVIDER).
    • Temporarily raise the cap if safe, or slow the scheduler by increasing sync_interval_seconds.
    • Inspect job logs in Loki for adapter errors or rate limits.
  2. Queue depth rising
  • Symptom: increase(sync_queue_depth[15m]) > 50.
  • Actions:
    • Scale workers or increase per-provider cap cautiously.
    • Pause non-critical providers by increasing intervals.
    • Check external API health/rate limits.
  3. Elevated request latency
  • Symptom: p95 latency > 500ms, sustained.
  • Actions:
    • Inspect recent deployments, DB CPU/IO, and external dependencies.
    • Enable sampling/profiling; consider caching.

Configuration

  • Concurrency cap per provider: SYNC_MAX_CONCURRENCY_PER_PROVIDER (default 4).
  • Default scheduler interval: DEFAULT_SYNC_INTERVAL_SECONDS (default 900s). Per-integration override: integration.config.sync_interval_seconds.
  • Close mode: INTEGRATION_CLOSE_MODE (archive default; delete opt-in).
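
To make the defaults and the override order concrete, here is a sketch of how these settings could be resolved. The function names are hypothetical and this is not the application's actual code; it assumes the per-integration sync_interval_seconds takes precedence over the environment default, and that any close mode other than archive or delete should be rejected.

```python
import os
from typing import Mapping


def provider_cap(env: Mapping[str, str] = os.environ) -> int:
    """Concurrency cap per provider; the runbook states a default of 4."""
    return int(env.get("SYNC_MAX_CONCURRENCY_PER_PROVIDER", "4"))


def sync_interval_seconds(
    integration_config: Mapping[str, object],
    env: Mapping[str, str] = os.environ,
) -> int:
    """Per-integration override first, then the environment default (900s)."""
    override = integration_config.get("sync_interval_seconds")
    if override is not None:
        return int(override)
    return int(env.get("DEFAULT_SYNC_INTERVAL_SECONDS", "900"))


def close_mode(env: Mapping[str, str] = os.environ) -> str:
    """archive is the default; delete is opt-in. Validation is an assumption."""
    mode = env.get("INTEGRATION_CLOSE_MODE", "archive")
    if mode not in ("archive", "delete"):
        raise ValueError(f"unsupported INTEGRATION_CLOSE_MODE: {mode!r}")
    return mode


if __name__ == "__main__":
    print(provider_cap(), sync_interval_seconds({}), close_mode())
```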

On-call checklist

  • Confirm alerts and correlate with Grafana panels.
  • Review recent logs for event=enqueued|start|success|fail in Loki.
  • Take one mitigating action at a time; document in the incident log.
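
If Loki is unreachable and only raw log files are at hand, the lifecycle events named above can be filtered locally. The sketch assumes each log line is a JSON object with a top-level `event` field; the function name `lifecycle_events` and the other fields in the sample lines are hypothetical.

```python
import json
from typing import Iterable, Iterator

# Event names taken from the checklist above.
LIFECYCLE_EVENTS = {"enqueued", "start", "success", "fail"}


def lifecycle_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed log records whose `event` is one of the lifecycle events."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        event = record.get("event")
        if isinstance(event, str) and event in LIFECYCLE_EVENTS:
            yield record


if __name__ == "__main__":
    sample = [
        '{"event": "start", "job_id": 1}',   # job_id is an invented field
        "plain text line",
        '{"event": "fail", "job_id": 1}',
    ]
    for rec in lifecycle_events(sample):
        print(rec["event"])
```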

Playbooks

  • Raise provider cap:
    • Set SYNC_MAX_CONCURRENCY_PER_PROVIDER and restart the worker.
  • Slow the scheduler:
    • PATCH integration config {"sync_interval_seconds": <value>} for noisy integrations.
  • Toggle close policy:
    • POST /api/v1/admin/settings { "integration_close_mode": "archive|delete" }.
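
The two HTTP playbooks can be scripted with the standard library. The sketch below only builds the requests and does not send them. The POST path comes from the playbook above; the runbook does not give the URL for the integration PATCH, so `integration_url` is left to the caller and is hypothetical. Authentication is not covered here and would need to be added for a real deployment.

```python
import json
import urllib.request


def _json_request(method: str, url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON request with the given method."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


def slow_scheduler_request(integration_url: str, interval_seconds: int) -> urllib.request.Request:
    # integration_url is hypothetical: the runbook does not state the path
    # of the integration config endpoint.
    return _json_request("PATCH", integration_url, {"sync_interval_seconds": interval_seconds})


def close_mode_request(base_url: str, mode: str) -> urllib.request.Request:
    if mode not in ("archive", "delete"):
        raise ValueError(f"mode must be 'archive' or 'delete', got {mode!r}")
    url = f"{base_url.rstrip('/')}/api/v1/admin/settings"
    return _json_request("POST", url, {"integration_close_mode": mode})


if __name__ == "__main__":
    req = close_mode_request("http://localhost:8000", "archive")  # placeholder host
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

A request built this way can be sent with `urllib.request.urlopen(req)` once the host and credentials are known.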