Health Monitoring

This guide covers Anava's health check endpoints, integration with monitoring systems, and alert configuration for maintaining operational visibility.

Overview

Anava exposes standardized health endpoints for monitoring system availability and readiness. These endpoints follow industry best practices and integrate with common monitoring platforms.

Health Endpoint Types

| Endpoint | Purpose | When to Use |
|----------|---------|-------------|
| /health | Liveness check | Verify service is running |
| /ready | Readiness check | Verify service can handle requests |

Health Endpoints

Health Endpoint (/health)

The /health endpoint provides a lightweight liveness check. It verifies the service process is running and can respond to requests.

Request

curl -X GET https://api.anava.ai/health

Response Format

Healthy Response (200 OK):

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "version": "2.4.1",
  "uptime": 86400
}

Unhealthy Response (503 Service Unavailable):

{
  "status": "unhealthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "version": "2.4.1",
  "error": "Service initialization failed"
}

Response Fields

| Field | Type | Description |
|-------|------|-------------|
| status | string | healthy or unhealthy |
| timestamp | string (ISO 8601) | Current server time |
| version | string | Service version |
| uptime | integer | Seconds since service start |
| error | string | Error message (only when unhealthy) |

Status Codes

| Code | Meaning | Action |
|------|---------|--------|
| 200 | Service healthy | No action needed |
| 503 | Service unhealthy | Investigate immediately |
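The two status codes above can be turned into a scripted check. A minimal sketch of the decision logic (the function name is illustrative; in practice the code would come from curl, as shown in the comment):

```shell
# Map an HTTP status code from /health to the operator action in the
# table above. Any other code usually points at routing or TLS issues.
interpret_health_code() {
  case $1 in
    200) echo "healthy: no action needed" ;;
    503) echo "unhealthy: investigate immediately" ;;
    *)   echo "unexpected code $1: check routing/TLS" ;;
  esac
}

# In practice, obtain the code with curl:
#   code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 https://api.anava.ai/health)
interpret_health_code 200   # -> healthy: no action needed
interpret_health_code 503   # -> unhealthy: investigate immediately
```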

Readiness Endpoint (/ready)

The /ready endpoint performs comprehensive dependency checks. It verifies the service can successfully handle requests by validating all critical dependencies.

Request

curl -X GET https://api.anava.ai/ready

Response Format

Ready Response (200 OK):

{
  "status": "ready",
  "timestamp": "2024-01-15T10:30:00Z",
  "version": "2.4.1",
  "checks": {
    "database": {
      "status": "healthy",
      "latency_ms": 12
    },
    "mqtt": {
      "status": "healthy",
      "latency_ms": 8
    },
    "storage": {
      "status": "healthy",
      "latency_ms": 25
    },
    "auth": {
      "status": "healthy",
      "latency_ms": 15
    }
  }
}

Not Ready Response (503 Service Unavailable):

{
  "status": "not_ready",
  "timestamp": "2024-01-15T10:30:00Z",
  "version": "2.4.1",
  "checks": {
    "database": {
      "status": "healthy",
      "latency_ms": 12
    },
    "mqtt": {
      "status": "unhealthy",
      "error": "Connection refused",
      "latency_ms": null
    },
    "storage": {
      "status": "healthy",
      "latency_ms": 25
    },
    "auth": {
      "status": "healthy",
      "latency_ms": 15
    }
  }
}

Dependency Checks

| Check | What It Validates |
|-------|-------------------|
| database | Firestore connection and read access |
| mqtt | MQTT broker connectivity |
| storage | Cloud Storage bucket access |
| auth | Firebase Auth service availability |

Status Codes

| Code | Meaning | Action |
|------|---------|--------|
| 200 | All dependencies healthy | No action needed |
| 503 | One or more dependencies unhealthy | Check failed dependencies |

Latency Thresholds

| Check | Warning | Critical |
|-------|---------|----------|
| Database | > 100ms | > 500ms |
| MQTT | > 50ms | > 200ms |
| Storage | > 200ms | > 1000ms |
| Auth | > 100ms | > 500ms |
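The thresholds above are straightforward to encode in a monitoring script. A minimal sketch (the function name is illustrative; the threshold values mirror the table):

```shell
# classify_latency CHECK LATENCY_MS -> prints ok | warning | critical
# Warning/critical cutoffs match the latency threshold table above.
classify_latency() {
  name=$1; ms=$2
  case $name in
    database|auth) warn=100; crit=500 ;;
    mqtt)          warn=50;  crit=200 ;;
    storage)       warn=200; crit=1000 ;;
    *)             echo "unknown check: $name"; return 1 ;;
  esac
  if [ "$ms" -gt "$crit" ]; then
    echo critical
  elif [ "$ms" -gt "$warn" ]; then
    echo warning
  else
    echo ok
  fi
}

classify_latency database 12    # -> ok
classify_latency mqtt 80        # -> warning
classify_latency storage 1500   # -> critical
```

The latency values fed in would come from the `latency_ms` fields of the /ready response.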

Monitoring Integration

Google Cloud Monitoring

Create an uptime check in Cloud Monitoring:

Console Setup:

  1. Go to Monitoring > Uptime checks
  2. Click Create Uptime Check
  3. Configure:
    • Protocol: HTTPS
    • Resource Type: URL
    • Hostname: api.anava.ai
    • Path: /health
    • Check frequency: 1 minute
  4. Configure alerting policy

Terraform Configuration:

resource "google_monitoring_uptime_check_config" "health" {
  display_name = "Anava Health Check"
  timeout      = "10s"
  period       = "60s"

  http_check {
    path         = "/health"
    port         = 443
    use_ssl      = true
    validate_ssl = true

    accepted_response_status_codes {
      status_class = "STATUS_CLASS_2XX"
    }
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      project_id = var.project_id
      host       = "api.anava.ai"
    }
  }
}

resource "google_monitoring_uptime_check_config" "readiness" {
  display_name = "Anava Readiness Check"
  timeout      = "30s"
  period       = "300s"

  http_check {
    path         = "/ready"
    port         = 443
    use_ssl      = true
    validate_ssl = true

    accepted_response_status_codes {
      status_class = "STATUS_CLASS_2XX"
    }
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      project_id = var.project_id
      host       = "api.anava.ai"
    }
  }
}

Datadog

Configure Datadog HTTP checks:

http_check.d/conf.yaml (the Agent's HTTP check integration config):

init_config:

instances:
  - name: Anava Health
    url: https://api.anava.ai/health
    method: GET
    timeout: 10
    http_response_status_code: 200
    collect_response_time: true
    tags:
      - service:anava
      - env:production
      - check_type:liveness

  - name: Anava Readiness
    url: https://api.anava.ai/ready
    method: GET
    timeout: 30
    http_response_status_code: 200
    collect_response_time: true
    tags:
      - service:anava
      - env:production
      - check_type:readiness

Dashboard Query:

avg:http.can_connect{service:anava,check_type:liveness} by {env}

Prometheus

Configure Prometheus blackbox exporter:

blackbox.yml:

modules:
  http_health:
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      method: GET
      preferred_ip_protocol: "ip4"

  http_ready:
    prober: http
    timeout: 30s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      method: GET
      preferred_ip_protocol: "ip4"

prometheus.yml:

scrape_configs:
  - job_name: 'anava-health'
    metrics_path: /probe
    params:
      module: [http_health]
    static_configs:
      - targets:
          - https://api.anava.ai/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  - job_name: 'anava-readiness'
    metrics_path: /probe
    params:
      module: [http_ready]
    static_configs:
      - targets:
          - https://api.anava.ai/ready
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

Prometheus Alert Rules:

groups:
  - name: anava-health
    rules:
      - alert: AnavaServiceDown
        expr: probe_success{job="anava-health"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Anava service is down"
          description: "Health check has been failing for more than 2 minutes"

      - alert: AnavaServiceNotReady
        expr: probe_success{job="anava-readiness"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Anava service is not ready"
          description: "Readiness check has been failing for more than 5 minutes"

      - alert: AnavaHighLatency
        expr: probe_duration_seconds{job="anava-health"} > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Anava health check latency is high"
          description: "Health check latency exceeds 1 second"

Alert Configuration

Alert Severity Levels

| Severity | Condition | Response Time | Notification |
|----------|-----------|---------------|--------------|
| Critical | Service down > 2 min | Immediate | PagerDuty, SMS |
| Warning | Degraded > 5 min | 15 minutes | Email, Slack |
| Info | Latency spike | Next business day | Email |
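The severity-to-channel mapping above can be expressed as a small routing helper; a sketch in shell (the function name and channel identifiers are illustrative, not part of any Anava API):

```shell
# Route an alert severity to notification channels, per the table above.
route_alert() {
  case $1 in
    critical) echo "pagerduty sms" ;;
    warning)  echo "email slack" ;;
    info)     echo "email" ;;
    *)        echo "unknown severity: $1"; return 1 ;;
  esac
}

route_alert critical   # -> pagerduty sms
route_alert warning    # -> email slack
```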

Alert Policies

Cloud Monitoring Alert Policy (Terraform):

resource "google_monitoring_alert_policy" "health_alert" {
  display_name = "Anava Health Alert"
  combiner     = "OR"

  conditions {
    display_name = "Health Check Failure"

    condition_threshold {
      filter          = "metric.type=\"monitoring.googleapis.com/uptime_check/check_passed\" AND resource.type=\"uptime_url\""
      comparison      = "COMPARISON_LT"
      threshold_value = 1
      duration        = "120s"

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_NEXT_OLDER"
      }
    }
  }

  notification_channels = [
    google_monitoring_notification_channel.email.name,
    google_monitoring_notification_channel.pagerduty.name,
  ]

  alert_strategy {
    auto_close = "1800s"
  }
}

Notification Channels

Configure multiple notification channels for redundancy:

| Channel | Use Case | Configuration |
|---------|----------|---------------|
| Email | All alerts | Team distribution list |
| Slack | Warning+ | #ops-alerts channel |
| PagerDuty | Critical | On-call rotation |
| SMS | Critical | On-call phone |

Example Slack Webhook Integration:

# Test notification
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
  -H 'Content-Type: application/json' \
  -d '{
    "text": "Anava Health Alert",
    "attachments": [{
      "color": "danger",
      "fields": [{
        "title": "Status",
        "value": "Service Unavailable",
        "short": true
      }, {
        "title": "Environment",
        "value": "Production",
        "short": true
      }]
    }]
  }'

Best Practices

Health Check Frequency

| Environment | /health Interval | /ready Interval |
|-------------|------------------|-----------------|
| Production | 30-60 seconds | 1-5 minutes |
| Staging | 1-2 minutes | 5-10 minutes |
| Development | 5 minutes | 15 minutes |

Timeout Configuration

| Endpoint | Recommended Timeout | Max Timeout |
|----------|---------------------|-------------|
| /health | 5 seconds | 10 seconds |
| /ready | 15 seconds | 30 seconds |

Monitoring Dashboard

Create a unified dashboard with these panels:

  1. Service Availability - Uptime percentage over time
  2. Response Latency - P50, P95, P99 latency
  3. Dependency Health - Individual dependency status
  4. Error Rate - Failed checks over time
  5. Alerts - Active and recent alerts
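Panel 1's availability percentage is just successful checks divided by total checks. A quick sketch with awk over a series of probe results (the 0/1 values here are illustrative; in practice they would come from your monitoring backend, e.g. Prometheus `probe_success` samples):

```shell
# Compute an uptime percentage from a series of probe results (1 = pass, 0 = fail).
results="1 1 1 0 1 1 1 1 1 1"

echo "$results" | tr ' ' '\n' | awk '
  { total++; passed += $1 }
  END { printf "availability: %.1f%%\n", 100 * passed / total }'
# -> availability: 90.0%
```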

[Screenshot: Health Dashboard]

Compliance Coverage

Health monitoring supports the following compliance requirements:

SOC 2 Type II

| Control | Requirement | How Health Monitoring Addresses It |
|---------|-------------|------------------------------------|
| A1.2 | System availability monitoring | /health endpoint provides continuous availability verification with configurable alerting |
| A1.2 | Recovery procedures | Alert policies enable rapid incident response; documented escalation paths |
| CC7.2 | Monitoring for security events | Health checks detect service disruptions that may indicate security incidents |

ISO 27001:2022

| Control | Requirement | How Health Monitoring Addresses It |
|---------|-------------|------------------------------------|
| A.12.1.3 | Capacity monitoring | /ready endpoint monitors dependency health and latency thresholds |
| A.12.4.1 | Event logging | Health check results logged for audit trail |
| A.17.1.1 | Information security continuity | Automated monitoring ensures rapid detection of availability issues |

Audit Evidence

Health monitoring provides audit evidence through:

  1. Uptime Reports - Historical availability metrics
  2. Alert History - Record of incidents and response times
  3. Latency Trends - Performance over time
  4. Dependency Status - Component-level health history

Export Health Metrics for Audit:

# Export last 30 days of health check results
gcloud monitoring time-series list \
  --project=anava-ai \
  --filter='metric.type="monitoring.googleapis.com/uptime_check/check_passed"' \
  --interval='start="2024-01-01T00:00:00Z",end="2024-01-31T23:59:59Z"' \
  --format=json > health-audit-report.json

Troubleshooting

Health Check Failing

  1. Check service logs

    gcloud functions logs read --project=anava-ai --limit=50
  2. Verify network connectivity

    curl -v https://api.anava.ai/health
  3. Check for deployment issues

    firebase functions:list --project=anava-ai

Readiness Check Failing

  1. Identify failed dependency

    curl -s https://api.anava.ai/ready | jq '.checks | to_entries[] | select(.value.status != "healthy")'
  2. Check individual service status

    • Database: Check Firestore console
    • MQTT: Verify broker VM status
    • Storage: Check bucket permissions
    • Auth: Verify Firebase Auth service
  3. Review dependency latency

    curl -s https://api.anava.ai/ready | jq '.checks | to_entries[] | {name: .key, latency: .value.latency_ms}'

High Latency

  1. Check regional performance

    • Run checks from multiple regions
    • Compare latency across locations
  2. Review dependency performance

    • Check individual component latency
    • Identify bottlenecks
  3. Scale resources if needed

    • Increase function memory/instances
    • Review database indexes