# Monitoring
Effective monitoring ensures Rack Gateway remains available and helps identify issues before they impact users.
## Health Endpoints

Rack Gateway exposes health endpoints for monitoring and orchestration:
```shell
curl https://gateway.example.com/api/v1/health
```

Response:

```json
{ "status": "ok" }
```

Use this endpoint for:
- Container/pod liveness probes
- Basic uptime monitoring
- Load balancer health checks
## Kubernetes Probes

Configure health checks in your Kubernetes deployment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rack-gateway
spec:
  template:
    spec:
      containers:
        - name: gateway
          livenessProbe:
            httpGet:
              path: /api/v1/health
              port: 8443
            initialDelaySeconds: 10
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /api/v1/health
              port: 8443
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
```

## Convox Health Checks

Configure in `convox.yml`:

```yaml
services:
  gateway:
    health:
      path: /api/v1/health
      interval: 15
      timeout: 5
```

## Error Tracking with Sentry
Rack Gateway integrates with Sentry for error tracking and performance monitoring.
### Configuration

```shell
# Backend error tracking
SENTRY_DSN=https://abc123@sentry.io/123456

# Frontend error tracking
SENTRY_JS_DSN=https://def456@sentry.io/789012

# Environment tag
SENTRY_ENVIRONMENT=production
```

### What Gets Tracked
| Event Type | Details |
|---|---|
| Panics | Unhandled errors with stack traces |
| HTTP Errors | 5xx responses with request context |
| Database Errors | Connection failures, query timeouts |
| External Failures | OAuth, Convox API errors |
### Filtering Sensitive Data

Sentry automatically filters sensitive fields. Additional scrubbing is configured for:
- Session tokens
- API tokens
- OAuth tokens
- Environment variable values
## Log Monitoring

### Log Format

Rack Gateway emits two kinds of logs:
- Application logs use standard text output (Go log format)
- Audit logs are structured JSON written to stdout for CloudWatch ingestion
Example application log:

```text
2025/01/29 12:34:56 WebAuthn enabled: rpid=gateway.example.com origin=https://gateway.example.com
```

Example audit log (JSON):

```json
{
  "ts": "2024-01-15T10:30:00Z",
  "user_email": "alice@example.com",
  "action_type": "convox",
  "action": "deploy.create",
  "resource": "myapp",
  "resource_type": "app",
  "status": "success",
  "rbac_decision": "allow",
  "http_status": 200,
  "latency_ms": 1250,
  "ip_address": "192.168.1.100",
  "user_agent": "rack-gateway-cli/1.0.0",
  "event_count": 1
}
```
### Log Levels

| Level | When Used |
|---|---|
| `error` | Failures requiring attention |
| `warn` | Potential issues, degraded functionality |
| `info` | Normal operations, request logging |
| `debug` | Detailed diagnostic information |
Configure with:
```shell
LOG_LEVEL=info  # Options: debug, info, warn, error
```

### Log Aggregation
For AWS deployments, logs are automatically available in CloudWatch:

```shell
# View logs in CloudWatch
aws logs tail /ecs/rack-gateway --follow
```

For Datadog, add the Datadog agent and configure log collection:

```yaml
logs:
  - type: docker
    source: rack-gateway
    service: rack-gateway
```

For an ELK stack, use Filebeat to ship logs:

```yaml
filebeat.inputs:
  - type: container
    paths:
      - /var/lib/docker/containers/*/*.log
    processors:
      - add_kubernetes_metadata: ~
```

## Key Metrics
Monitor these metrics for Rack Gateway health:
### Application Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| Request latency | p50, p95, p99 response times | p99 > 2s |
| Error rate | 5xx responses / total requests | > 1% |
| Request volume | Requests per second | Baseline deviation |
### Infrastructure Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| CPU usage | Container CPU utilization | > 80% sustained |
| Memory usage | Container memory utilization | > 80% |
| Database connections | Active connection count | Near max_connections |
### Business Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| Active sessions | Concurrent authenticated users | Capacity planning |
| API token usage | Requests per token | Anomaly detection |
| Failed authentications | OAuth/MFA failures | Spike detection |
## Alerting

### Critical Alerts (Page On-Call)

- Health endpoint returns non-200
- Error rate exceeds 5%
- Database connection failures
- Authentication system unavailable
### Warning Alerts (Notify Team)

- Error rate exceeds 1%
- Latency p99 exceeds SLA
- CPU/memory approaching limits
- Unusual access patterns detected
### Example Alerting Rules

Prometheus:

```yaml
groups:
  - name: rack-gateway
    rules:
      - alert: GatewayHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="rack-gateway",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="rack-gateway"}[5m])) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on Rack Gateway"

      - alert: GatewayDown
        expr: up{job="rack-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Rack Gateway is down"
```

CloudWatch:

```shell
aws cloudwatch put-metric-alarm \
  --alarm-name rack-gateway-health \
  --metric-name HealthCheckStatus \
  --namespace AWS/Route53 \
  --statistic Minimum \
  --period 60 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789:alerts
```

## Dashboard Example
Key panels for a monitoring dashboard:
| Category | Metric | Value |
|---|---|---|
| Traffic | Request Rate | 100 req/s |
| Traffic | Error Rate | 0.5% |
| Traffic | Latency (p99) | 150ms |
| Users | Active Users | 50 |
| Users | API Token Usage | 200 req/min |
| Users | MFA Events | 10 |
| Resources | Database Connections | 15/50 |
| Resources | CPU Usage | 45% |
| Resources | Memory | 60% |
## Best Practices

- Set up health check monitoring first - Basic availability is critical
- Configure error tracking early - Catch issues before users report them
- Create runbooks for alerts - Document response procedures
- Review logs regularly - Don’t just alert, understand patterns
- Test alerting - Verify alerts reach the right people
## Further Reading

- Troubleshooting - Common issues and solutions
- Production Checklist - Pre-deployment verification