# Monitoring
Effective monitoring ensures Rack Gateway remains available and helps identify issues before they impact users.
## Health Endpoints

Rack Gateway exposes health endpoints for monitoring and orchestration:
```shell
curl https://gateway.example.com/api/v1/health
```

Response:

```json
{ "status": "ok" }
```

Use this endpoint for:
- Container/pod liveness probes
- Basic uptime monitoring
- Load balancer health checks
## Kubernetes Probes

Configure health checks in your Kubernetes deployment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rack-gateway
spec:
  template:
    spec:
      containers:
        - name: gateway
          livenessProbe:
            httpGet:
              path: /api/v1/health
              port: 8443
            initialDelaySeconds: 10
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /api/v1/health
              port: 8443
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
```

## Convox Health Checks

Configure in `convox.yml`:

```yaml
services:
  gateway:
    health:
      path: /api/v1/health
      interval: 15
      timeout: 5
```

## Error Tracking with Sentry
Rack Gateway integrates with Sentry for error tracking and performance monitoring.
### Configuration

```shell
# Backend error tracking
SENTRY_DSN=https://abc123@sentry.io/123456

# Frontend error tracking
SENTRY_JS_DSN=https://def456@sentry.io/789012

# Environment tag
SENTRY_ENVIRONMENT=production
```

### What Gets Tracked
| Event Type | Details |
|---|---|
| Panics | Unhandled errors with stack traces |
| HTTP Errors | 5xx responses with request context |
| Database Errors | Connection failures, query timeouts |
| External Failures | OAuth, Convox API errors |
### Filtering Sensitive Data

Sentry automatically filters sensitive fields. Additional scrubbing is configured for:
- Session tokens
- API tokens
- OAuth tokens
- Environment variable values
## Log Monitoring

### Log Format

Rack Gateway emits two kinds of logs:
- Application logs use standard text output (Go log format)
- Audit logs are structured JSON written to stdout for CloudWatch ingestion
Example application log:

```text
2025/01/29 12:34:56 WebAuthn enabled: rpid=gateway.example.com origin=https://gateway.example.com
```

Example audit log (JSON):

```json
{
  "ts": "2024-01-15T10:30:00Z",
  "user_email": "alice@example.com",
  "action_type": "convox",
  "action": "deploy.create",
  "resource": "myapp",
  "resource_type": "app",
  "status": "success",
  "rbac_decision": "allow",
  "http_status": 200,
  "latency_ms": 1250,
  "ip_address": "192.168.1.100",
  "user_agent": "rack-gateway-cli/1.0.0",
  "event_count": 1
}
```
### Log Levels

| Level | When Used |
|---|---|
| `error` | Failures requiring attention |
| `warn` | Potential issues, degraded functionality |
| `info` | Normal operations, request logging |
| `debug` | Detailed diagnostic information |
Configure with:
```shell
LOG_LEVEL=info  # Options: debug, info, warn, error
```

### Log Aggregation
For AWS deployments, logs are automatically available in CloudWatch:

```shell
# View logs in CloudWatch
aws logs tail /ecs/rack-gateway --follow
```

For Datadog, add the Datadog agent and configure log collection:

```yaml
logs:
  - type: docker
    source: rack-gateway
    service: rack-gateway
```

For an ELK stack, use Filebeat to ship logs:

```yaml
filebeat.inputs:
  - type: container
    paths:
      - /var/lib/docker/containers/*/*.log
    processors:
      - add_kubernetes_metadata: ~
```

## Key Metrics
Monitor these metrics for Rack Gateway health:
### Application Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| Request latency | p50, p95, p99 response times | p99 > 2s |
| Error rate | 5xx responses / total requests | > 1% |
| Request volume | Requests per second | Baseline deviation |
### Infrastructure Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| CPU usage | Container CPU utilization | > 80% sustained |
| Memory usage | Container memory utilization | > 80% |
| Database connections | Active connection count | Near max_connections |
### Business Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| Active sessions | Concurrent authenticated users | Capacity planning |
| API token usage | Requests per token | Anomaly detection |
| Failed authentications | OAuth/MFA failures | Spike detection |
## Alerting

### Critical Alerts (Page On-Call)

- Health endpoint returns non-200
- Error rate exceeds 5%
- Database connection failures
- Authentication system unavailable
### Warning Alerts (Notify Team)

- Error rate exceeds 1%
- Latency p99 exceeds SLA
- CPU/memory approaching limits
- Unusual access patterns detected
### Example Alerting Rules

Prometheus:

```yaml
groups:
  - name: rack-gateway
    rules:
      - alert: GatewayHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="rack-gateway",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="rack-gateway"}[5m])) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on Rack Gateway"

      - alert: GatewayDown
        expr: up{job="rack-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Rack Gateway is down"
```

CloudWatch:

```shell
aws cloudwatch put-metric-alarm \
  --alarm-name rack-gateway-health \
  --metric-name HealthCheckStatus \
  --namespace AWS/Route53 \
  --statistic Minimum \
  --period 60 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789:alerts
```

## Dashboard Example
Key panels for a monitoring dashboard:
| Category | Metric | Value |
|---|---|---|
| Traffic | Request Rate | 100 req/s |
| Traffic | Error Rate | 0.5% |
| Traffic | Latency (p99) | 150ms |
| Users | Active Users | 50 |
| Users | API Token Usage | 200 req/min |
| Users | MFA Events | 10 |
| Resources | Database Connections | 15/50 |
| Resources | CPU Usage | 45% |
| Resources | Memory | 60% |
## Best Practices

- Set up health check monitoring first - Basic availability is critical
- Configure error tracking early - Catch issues before users report them
- Create runbooks for alerts - Document response procedures
- Review logs regularly - Don’t just alert, understand patterns
- Test alerting - Verify alerts reach the right people
## Further Reading

- Troubleshooting - Common issues and solutions
- Production Checklist - Pre-deployment verification