Health Checks
Overview
Health Checks is a proactive monitoring pattern for enterprise integration architectures. It uses automated diagnostic procedures to verify the operational status, availability, and functional capability of system components, services, and dependencies. Much as a medical examination covers everything from vital signs to organ function, health checks continuously evaluate the indicators that show whether a system is healthy, performing well, and ready to serve requests. The pattern underpins system reliability: it enables automated failure detection, informs load balancing decisions, supports graceful degradation, and keeps operations manageable in complex distributed environments where manual monitoring is impractical.
Theoretical Foundation
Health Checks draws on systems monitoring theory, fault detection principles, availability engineering, and proactive maintenance strategies. It combines heartbeat monitoring, synthetic transaction testing, dependency verification, and operational readiness assessment into a framework for automated health assessment. The underlying need is continuous, automated verification that a system can operate, together with early detection of issues that could affect availability, performance, or functionality.
Core Principles
1. Multi-Level Health Assessment
Comprehensive evaluation of system health at different layers and granularities:
- Service-level health - overall service operational status and readiness
- Component-level health - individual component functionality and performance
- Dependency health - external service and infrastructure dependency status
- Business function health - end-to-end business capability verification
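The layering described above can be pictured as a tree in which each level reports the worst status found beneath it. The following is a minimal sketch under that assumption; the class `LayeredHealth`, the three-state `Status` enum, and the component names are hypothetical and not part of any framework.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: a tree of health nodes whose status is the worst status beneath it. */
final class LayeredHealth {

    /** Ordered from best to worst, so a higher ordinal means a more severe state. */
    enum Status { UP, DEGRADED, DOWN }

    /** A node is either a leaf with its own status or a group of child nodes. */
    static final class Node {
        final String name;
        final Status ownStatus;                 // only used when the node has no children
        final List<Node> children = new ArrayList<>();

        Node(String name, Status ownStatus) {
            this.name = name;
            this.ownStatus = ownStatus;
        }

        static Node group(String name, Node... children) {
            Node node = new Node(name, Status.UP);
            node.children.addAll(Arrays.asList(children));
            return node;
        }

        /** Worst-of aggregation: a group is only as healthy as its least healthy child. */
        Status status() {
            if (children.isEmpty()) {
                return ownStatus;
            }
            Status worst = Status.UP;
            for (Node child : children) {
                Status childStatus = child.status();
                if (childStatus.ordinal() > worst.ordinal()) {
                    worst = childStatus;
                }
            }
            return worst;
        }
    }

    public static void main(String[] args) {
        Node service = Node.group("order-service",
                Node.group("components",
                        new Node("order-validator", Status.UP),
                        new Node("pricing-engine", Status.UP)),
                Node.group("dependencies",
                        new Node("database", Status.UP),
                        new Node("payment-gateway", Status.DEGRADED)),
                Node.group("business-functions",
                        new Node("place-order", Status.UP)));
        System.out.println(service.name + " is " + service.status()); // order-service is DEGRADED
    }
}
```

Worst-of aggregation is only one possible policy. A service that can tolerate a failed optional dependency might weight or ignore some branches instead.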
2. Automated Diagnostic Procedures
Systematic execution of health verification procedures:
- Synthetic transactions - automated execution of representative business transactions
- Connectivity testing - verification of network connectivity and communication paths
- Resource availability - assessment of critical system resources and capacity
- Data integrity verification - validation of data consistency and accessibility
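Two of the procedures listed above, connectivity testing and resource availability, need nothing beyond the JDK. The sketch below is illustrative only; the class `BasicDiagnostics` and its method names are assumptions, and a real check would pick thresholds to suit the deployment.

```java
import java.io.File;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical sketch of connectivity and resource checks using only the JDK. */
final class BasicDiagnostics {

    /** Connectivity test: true if a TCP connection can be opened within the timeout. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts, and unresolvable hosts
            return false;
        }
    }

    /** Resource availability: true if the volume holding 'path' has at least minFreeBytes usable. */
    static boolean hasFreeDisk(File path, long minFreeBytes) {
        return path.getUsableSpace() >= minFreeBytes;
    }

    /** Resource availability: fraction of the maximum heap currently in use. */
    static double heapUsage() {
        Runtime runtime = Runtime.getRuntime();
        long used = runtime.totalMemory() - runtime.freeMemory();
        return (double) used / runtime.maxMemory();
    }

    public static void main(String[] args) {
        System.out.println("localhost:5432 reachable: " + canConnect("localhost", 5432, 500));
        System.out.println("1 GiB free on working dir: " + hasFreeDisk(new File("."), 1L << 30));
        System.out.printf("heap usage: %.1f%%%n", heapUsage() * 100);
    }
}
```

A successful TCP connect only shows that something is listening. It does not show that the service behind the port works, which is why the pattern also lists synthetic transactions.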
3. Continuous Monitoring and Assessment
Regular, ongoing health evaluation and status reporting:
- Periodic health checks - scheduled execution of health verification procedures
- Real-time health monitoring - continuous assessment of system health indicators
- Health status aggregation - consolidation of multiple health indicators into overall status
- Health history tracking - maintenance of health status history for trend analysis
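History tracking and trend analysis can be kept very small. The sketch below assumes each periodic check yields a score between 0.0 and 1.0 and compares the older half of a bounded window with the newer half. The class `HealthHistory`, the window size, and the tolerance are illustrative assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch: bounded history of health scores with a simple trend comparison. */
final class HealthHistory {
    private final int capacity;
    private final Deque<Double> scores = new ArrayDeque<>();

    HealthHistory(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Records one score, evicting the oldest entry once the window is full. */
    synchronized void record(double score) {
        if (scores.size() == capacity) {
            scores.removeFirst();
        }
        scores.addLast(score);
    }

    /** Compares the average of the older half of the window with the newer half. */
    synchronized String trend(double tolerance) {
        if (scores.size() < 4) {
            return "INSUFFICIENT_DATA";
        }
        List<Double> window = new ArrayList<>(scores);
        int middle = window.size() / 2;
        double older = average(window.subList(0, middle));
        double newer = average(window.subList(middle, window.size()));
        if (newer < older - tolerance) {
            return "DECLINING";
        }
        if (newer > older + tolerance) {
            return "IMPROVING";
        }
        return "STABLE";
    }

    private static double average(List<Double> values) {
        double sum = 0.0;
        for (double value : values) {
            sum += value;
        }
        return sum / values.size();
    }

    public static void main(String[] args) {
        HealthHistory history = new HealthHistory(60);
        for (double score : new double[] {1.0, 1.0, 0.75, 0.5}) {
            history.record(score);
        }
        System.out.println(history.trend(0.05)); // DECLINING
    }
}
```

For periodic execution, `record` could be driven by a `ScheduledExecutorService` through `scheduleAtFixedRate`, with the window sized to cover the period of interest.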
4. Actionable Health Information
Provision of meaningful, actionable health status information:
- Health status reporting - clear communication of current system health state
- Failure diagnosis - detailed information about detected health issues
- Recovery guidance - recommendations for addressing identified health problems
- Impact assessment - evaluation of health issues' impact on system operations
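One way to make a health result actionable is to attach the diagnosis, a suggested next step, and the affected capability to the status itself. The sketch below is an assumption-laden illustration: the class `HealthFinding`, the mapping from exception type to guidance, and the wording of the messages are all hypothetical.

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;

/** Hypothetical sketch: a health result that carries diagnosis, guidance, and impact. */
final class HealthFinding {
    final String component;
    final boolean healthy;
    final String diagnosis;   // what was detected
    final String guidance;    // what an operator could try next
    final String impact;      // which capability is affected

    private HealthFinding(String component, boolean healthy,
                          String diagnosis, String guidance, String impact) {
        this.component = component;
        this.healthy = healthy;
        this.diagnosis = diagnosis;
        this.guidance = guidance;
        this.impact = impact;
    }

    static HealthFinding ok(String component) {
        return new HealthFinding(component, true, "No issues detected", "None", "None");
    }

    /** Translates a failure into a finding; the exception-to-guidance mapping is illustrative. */
    static HealthFinding fromFailure(String component, String affectedCapability, Exception cause) {
        String diagnosis;
        String guidance;
        if (cause instanceof SocketTimeoutException) {
            diagnosis = "Dependency did not answer within the timeout";
            guidance = "Check dependency load and network latency";
        } else if (cause instanceof ConnectException) {
            diagnosis = "Connection refused or host unreachable";
            guidance = "Verify the dependency is running and the address is correct";
        } else {
            diagnosis = "Unexpected failure: " + cause.getClass().getSimpleName();
            guidance = "Inspect the component logs for the full stack trace";
        }
        return new HealthFinding(component, false, diagnosis, guidance,
                affectedCapability + " may be unavailable");
    }

    @Override
    public String toString() {
        return component + " [" + (healthy ? "UP" : "DOWN") + "] " + diagnosis
                + " | next step: " + guidance + " | impact: " + impact;
    }

    public static void main(String[] args) {
        System.out.println(fromFailure("payment-gateway", "Checkout",
                new ConnectException("Connection refused")));
    }
}
```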
Why Health Checks are Essential in Integration Architecture
1. Proactive Issue Detection
In complex distributed systems, health checks provide:
- Early failure detection - identification of issues before they impact users
- Cascade failure prevention - detection of dependency failures before they spread
- Performance degradation alerts - early warning of performance issues
- Capacity threshold monitoring - alerting when resource limits are approached
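Capacity threshold monitoring usually distinguishes an early warning from a critical limit. The sketch below assumes utilization is expressed as used units over total capacity; the class `CapacityThreshold` and the 80% and 95% ratios in the example are illustrative choices, not recommendations.

```java
/** Hypothetical sketch: classifies resource utilization against warning and critical ratios. */
final class CapacityThreshold {

    enum Level { OK, WARNING, CRITICAL }

    private final double warningRatio;
    private final double criticalRatio;

    CapacityThreshold(double warningRatio, double criticalRatio) {
        if (warningRatio <= 0.0 || warningRatio >= criticalRatio || criticalRatio > 1.0) {
            throw new IllegalArgumentException("require 0 < warning < critical <= 1");
        }
        this.warningRatio = warningRatio;
        this.criticalRatio = criticalRatio;
    }

    /** Returns the alert level for 'used' units out of 'capacity' units. */
    Level evaluate(long used, long capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        double ratio = (double) used / capacity;
        if (ratio >= criticalRatio) {
            return Level.CRITICAL;
        }
        if (ratio >= warningRatio) {
            return Level.WARNING;
        }
        return Level.OK;
    }

    public static void main(String[] args) {
        CapacityThreshold pool = new CapacityThreshold(0.80, 0.95);
        System.out.println(pool.evaluate(45, 50)); // WARNING: 90% of the pool is in use
    }
}
```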
2. Operational Automation Support
Supporting automated operational procedures and decisions:
- Load balancer integration - informing load balancing decisions about service availability
- Auto-scaling triggers - providing data for automated scaling decisions
- Circuit breaker coordination - supporting circuit breaker pattern implementation
- Service mesh integration - enabling intelligent traffic routing and failure handling
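Load balancers and service meshes typically probe an HTTP endpoint and route traffic only to instances that answer with a success status. The sketch below uses the JDK's built-in `com.sun.net.httpserver.HttpServer` to answer 200 when the instance is ready and 503 otherwise. The class `ReadinessEndpoint` and the path `/health/ready` are assumptions chosen for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch: readiness endpoint that a load balancer could probe. */
final class ReadinessEndpoint {
    private final AtomicBoolean ready = new AtomicBoolean(false);
    private final HttpServer server;

    /** Binds to the given port; port 0 lets the operating system pick a free one. */
    ReadinessEndpoint(int port) throws IOException {
        server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/health/ready", exchange -> {
            boolean up = ready.get();
            byte[] body = (up ? "{\"status\":\"UP\"}" : "{\"status\":\"DOWN\"}")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            // 503 tells the prober to take this instance out of rotation
            exchange.sendResponseHeaders(up ? 200 : 503, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
    }

    void start() { server.start(); }

    void stop() { server.stop(0); }

    /** Called by the application once startup work (or a failed dependency check) changes readiness. */
    void setReady(boolean value) { ready.set(value); }

    int port() { return server.getAddress().getPort(); }

    public static void main(String[] args) throws IOException {
        ReadinessEndpoint endpoint = new ReadinessEndpoint(8081);
        endpoint.start();
        endpoint.setReady(true);
        System.out.println("Probe http://localhost:" + endpoint.port() + "/health/ready");
    }
}
```

Keeping readiness separate from a plain liveness signal lets an instance stay running while it is temporarily removed from rotation.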
3. Service Level Agreement (SLA) Management
Ensuring compliance with service level commitments:
- Availability monitoring - tracking service availability against SLA commitments
- Performance verification - ensuring response times meet SLA requirements
- Quality assurance - monitoring service quality characteristics
- Compliance reporting - generating reports for SLA compliance verification
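Availability tracking reduces to simple arithmetic over health check results. As a worked example, a 99.9% target over a 30-day period (43,200 minutes) leaves a downtime budget of about 43.2 minutes. The sketch below shows that calculation; the class `SlaAvailability` and its method names are hypothetical, and it assumes checks are evenly spaced so that the share of successful checks approximates availability.

```java
/** Hypothetical sketch: availability arithmetic derived from health check counts. */
final class SlaAvailability {

    /** Percentage of checks that succeeded, for example 99.9. */
    static double availabilityPercent(long successfulChecks, long totalChecks) {
        if (totalChecks <= 0) {
            throw new IllegalArgumentException("totalChecks must be positive");
        }
        return 100.0 * successfulChecks / totalChecks;
    }

    /** Downtime budget in minutes that a target percentage allows over a period. */
    static double allowedDowntimeMinutes(double targetPercent, long periodMinutes) {
        return periodMinutes * (100.0 - targetPercent) / 100.0;
    }

    static boolean meetsSla(double measuredPercent, double targetPercent) {
        return measuredPercent >= targetPercent;
    }

    public static void main(String[] args) {
        long checksPerDay = 24 * 60 * 2;                    // one check every 30 seconds
        double measured = availabilityPercent(checksPerDay - 3, checksPerDay);
        System.out.printf("measured availability: %.3f%%%n", measured);
        System.out.printf("monthly budget at 99.9%%: %.1f minutes%n",
                allowedDowntimeMinutes(99.9, 30L * 24 * 60));
        System.out.println("meets 99.9% target: " + meetsSla(measured, 99.9));
    }
}
```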
4. Development and Operations Integration
Supporting DevOps practices and continuous delivery:
- Deployment verification - validating successful deployments through health checks
- Rollback triggers - automatically triggering rollbacks when health checks fail
- Environment validation - verifying environment readiness before deployments
- Testing automation - incorporating health checks into automated testing pipelines
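Deployment verification and rollback triggering can share one loop: probe the new version repeatedly and require a run of consecutive healthy results before declaring success. The sketch below is a minimal illustration; the class `DeploymentVerifier`, the `Outcome` values, and the idea of passing the probe as a `BooleanSupplier` are assumptions, and a real pipeline would call its own health endpoint inside the probe.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: gate a deployment on consecutive healthy probes. */
final class DeploymentVerifier {

    enum Outcome { VERIFIED, ROLLBACK }

    /**
     * Probes up to maxAttempts times, waiting 'delay' between probes.
     * Returns VERIFIED once requiredSuccesses consecutive probes are healthy, otherwise ROLLBACK.
     */
    static Outcome verify(BooleanSupplier healthProbe, int requiredSuccesses,
                          int maxAttempts, Duration delay) throws InterruptedException {
        int consecutive = 0;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            boolean healthy;
            try {
                healthy = healthProbe.getAsBoolean();
            } catch (RuntimeException e) {
                healthy = false;   // a probe that throws counts as unhealthy
            }
            consecutive = healthy ? consecutive + 1 : 0;
            if (consecutive >= requiredSuccesses) {
                return Outcome.VERIFIED;
            }
            if (attempt < maxAttempts) {
                Thread.sleep(delay.toMillis());
            }
        }
        return Outcome.ROLLBACK;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] calls = {0};
        // Simulated service that becomes healthy from the third probe onwards
        Outcome outcome = verify(() -> ++calls[0] >= 3, 2, 10, Duration.ofMillis(100));
        System.out.println(outcome + " after " + calls[0] + " probes"); // VERIFIED after 4 probes
    }
}
```

Requiring consecutive successes guards against a service that answers once during startup and then fails.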
Benefits in Integration Contexts
1. Technical Advantages
- Automated monitoring - continuous, automated assessment of system health
- Early detection - identification of issues before they impact operations
- Dependency visibility - clear understanding of external dependency health
- Integration reliability - improved reliability through proactive issue detection
2. Operational Benefits
- Reduced downtime - faster issue detection and resolution
- Operational efficiency - automated health assessment reducing manual monitoring
- Improved reliability - better system reliability through proactive monitoring
- Cost optimization - reduced operational costs through automation and early detection
3. Integration Enablement
- Service discovery - supporting dynamic service discovery and registration
- Traffic management - enabling intelligent traffic routing based on health status
- Integration monitoring - comprehensive monitoring of integration flows and dependencies
- Quality assurance - ensuring integration quality through systematic health verification
4. Business Value
- Service reliability - improved service reliability and availability for customers
- Risk mitigation - reduced business risk through proactive issue detection
- Customer satisfaction - better customer experience through reliable services
- Compliance assurance - meeting regulatory and business compliance requirements
Integration Architecture Applications
1. Microservices Health Monitoring
Comprehensive health checks for microservices architecture:
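// Health Check Configuration
// Note: HealthIndicatorRegistry, HealthAggregator, and CompositeHealthIndicator are Spring Boot 2.1-era actuator types;
// Spring Boot 2.2 deprecated them in favor of HealthContributorRegistry and StatusAggregator.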
@Configuration
@EnableConfigurationProperties(HealthCheckProperties.class)
public class HealthCheckConfiguration {
@Bean
public HealthIndicatorRegistry healthIndicatorRegistry() {
return new DefaultHealthIndicatorRegistry();
}
@Bean
public HealthAggregator healthAggregator() {
return new OrderedHealthAggregator();
}
@Bean
public CompositeHealthIndicator compositeHealthIndicator(
HealthAggregator healthAggregator,
HealthIndicatorRegistry healthIndicatorRegistry) {
return new CompositeHealthIndicator(healthAggregator, healthIndicatorRegistry);
}
// HealthCheckManager is annotated with @Service and registered by component scanning,
// so it needs no @Bean method here; declaring both would register the same bean twice.
}
// Custom Health Indicators
@Component
public class DatabaseHealthIndicator implements HealthIndicator {
@Autowired
private DataSource dataSource;
@Override
public Health health() {
try {
// Test database connectivity
try (Connection connection = dataSource.getConnection()) {
// Execute a simple query to verify database functionality
try (PreparedStatement statement = connection.prepareStatement("SELECT 1")) {
ResultSet resultSet = statement.executeQuery();
if (resultSet.next()) {
long startTime = System.currentTimeMillis();
// Test database performance
try (PreparedStatement perfStatement = connection.prepareStatement(
"SELECT COUNT(*) FROM orders WHERE created_date > ?")) {
perfStatement.setTimestamp(1, Timestamp.valueOf(LocalDateTime.now().minusMinutes(5)));
ResultSet perfResult = perfStatement.executeQuery();
long queryTime = System.currentTimeMillis() - startTime;
if (perfResult.next()) {
int recentOrders = perfResult.getInt(1);
return Health.up()
.withDetail("database", "PostgreSQL")
.withDetail("connection_pool_active", getActiveConnections())
.withDetail("connection_pool_idle", getIdleConnections())
.withDetail("query_time_ms", queryTime)
.withDetail("recent_orders", recentOrders)
.withDetail("last_check", LocalDateTime.now())
.build();
}
}
}
}
}
return Health.down()
.withDetail("error", "Database query failed")
.withDetail("last_check", LocalDateTime.now())
.build();
} catch (Exception e) {
return Health.down()
.withDetail("error", e.getMessage())
.withDetail("exception", e.getClass().getSimpleName())
.withDetail("last_check", LocalDateTime.now())
.build();
}
}
private int getActiveConnections() {
// Implementation to get active connection count
try {
if (dataSource instanceof HikariDataSource) {
return ((HikariDataSource) dataSource).getHikariPoolMXBean().getActiveConnections();
}
return -1;
} catch (Exception e) {
return -1;
}
}
private int getIdleConnections() {
// Implementation to get idle connection count
try {
if (dataSource instanceof HikariDataSource) {
return ((HikariDataSource) dataSource).getHikariPoolMXBean().getIdleConnections();
}
return -1;
} catch (Exception e) {
return -1;
}
}
}
@Component
public class ExternalServiceHealthIndicator implements HealthIndicator {
@Autowired
private RestTemplate restTemplate;
@Value("${external.inventory.service.url}")
private String inventoryServiceUrl;
@Value("${external.payment.service.url}")
private String paymentServiceUrl;
@Override
public Health health() {
Health.Builder health = Health.up();
// Check inventory service
ServiceHealthStatus inventoryHealth = checkService("inventory", inventoryServiceUrl + "/health");
health.withDetail("inventory_service", inventoryHealth);
// Check payment service
ServiceHealthStatus paymentHealth = checkService("payment", paymentServiceUrl + "/health");
health.withDetail("payment_service", paymentHealth);
// Check overall external service health
boolean allServicesHealthy = inventoryHealth.isHealthy() && paymentHealth.isHealthy();
if (allServicesHealthy) {
health.withDetail("external_services_status", "ALL_HEALTHY");
} else {
// Switch the existing builder to DOWN so the per-service details added above are kept
health.down();
health.withDetail("external_services_status", "SOME_UNHEALTHY");
}
health.withDetail("last_check", LocalDateTime.now());
return health.build();
}
private ServiceHealthStatus checkService(String serviceName, String healthUrl) {
try {
long startTime = System.currentTimeMillis();
ResponseEntity<Map> response = restTemplate.getForEntity(healthUrl, Map.class);
long responseTime = System.currentTimeMillis() - startTime;
boolean isHealthy = response.getStatusCode().is2xxSuccessful();
ServiceHealthStatus status = new ServiceHealthStatus();
status.setServiceName(serviceName);
status.setHealthy(isHealthy);
status.setResponseTime(responseTime);
status.setStatusCode(response.getStatusCode().value());
status.setLastCheck(LocalDateTime.now());
if (response.getBody() != null) {
status.setDetails(response.getBody());
}
return status;
} catch (Exception e) {
ServiceHealthStatus status = new ServiceHealthStatus();
status.setServiceName(serviceName);
status.setHealthy(false);
status.setError(e.getMessage());
status.setLastCheck(LocalDateTime.now());
return status;
}
}
}
@Component
public class CacheHealthIndicator implements HealthIndicator {
@Autowired
private RedisTemplate<String, Object> redisTemplate;
@Override
public Health health() {
try {
// Test Redis connectivity and performance
long startTime = System.currentTimeMillis();
String testKey = "health-check-" + System.currentTimeMillis();
String testValue = "test-value";
// Test SET operation
redisTemplate.opsForValue().set(testKey, testValue, Duration.ofMinutes(1));
// Test GET operation
String retrievedValue = (String) redisTemplate.opsForValue().get(testKey);
// Test DELETE operation
redisTemplate.delete(testKey);
long operationTime = System.currentTimeMillis() - startTime;
if (testValue.equals(retrievedValue)) {
// Get Redis info through a callback so the connection is released after use
Properties redisInfo = redisTemplate.execute((RedisCallback<Properties>) connection -> connection.info());
return Health.up()
.withDetail("cache_type", "Redis")
.withDetail("operation_time_ms", operationTime)
.withDetail("redis_version", redisInfo.getProperty("redis_version"))
.withDetail("connected_clients", redisInfo.getProperty("connected_clients"))
.withDetail("used_memory_human", redisInfo.getProperty("used_memory_human"))
.withDetail("keyspace_hits", redisInfo.getProperty("keyspace_hits"))
.withDetail("keyspace_misses", redisInfo.getProperty("keyspace_misses"))
.withDetail("last_check", LocalDateTime.now())
.build();
} else {
return Health.down()
.withDetail("error", "Cache operation verification failed")
.withDetail("expected", testValue)
.withDetail("actual", retrievedValue)
.withDetail("last_check", LocalDateTime.now())
.build();
}
} catch (Exception e) {
return Health.down()
.withDetail("error", e.getMessage())
.withDetail("exception", e.getClass().getSimpleName())
.withDetail("last_check", LocalDateTime.now())
.build();
}
}
}
@Component
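public class BusinessFunctionHealthIndicator implements HealthIndicator {
// SLF4J logger used by the check methods below
private static final Logger log = LoggerFactory.getLogger(BusinessFunctionHealthIndicator.class);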
@Autowired
private OrderService orderService;
@Autowired
private InventoryService inventoryService;
@Autowired
private PaymentService paymentService;
@Override
public Health health() {
Health.Builder health = Health.up();
Map<String, Object> healthDetails = new HashMap<>();
// Test order creation functionality
boolean orderCreationHealthy = testOrderCreationFunction();
healthDetails.put("order_creation", orderCreationHealthy ? "HEALTHY" : "UNHEALTHY");
// Test inventory check functionality
boolean inventoryHealthy = testInventoryFunction();
healthDetails.put("inventory_check", inventoryHealthy ? "HEALTHY" : "UNHEALTHY");
// Test payment processing functionality
boolean paymentHealthy = testPaymentFunction();
healthDetails.put("payment_processing", paymentHealthy ? "HEALTHY" : "UNHEALTHY");
// Overall business function health
boolean allFunctionsHealthy = orderCreationHealthy && inventoryHealthy && paymentHealthy;
if (allFunctionsHealthy) {
healthDetails.put("business_functions_status", "ALL_HEALTHY");
} else {
health = Health.down();
healthDetails.put("business_functions_status", "SOME_UNHEALTHY");
}
healthDetails.put("last_check", LocalDateTime.now());
return health.withDetails(healthDetails).build();
}
private boolean testOrderCreationFunction() {
try {
// Create a test order validation request
CreateOrderRequest testRequest = createTestOrderRequest();
// Validate the order creation logic (without actually creating the order)
OrderValidationResult result = orderService.validateOrder(testRequest);
return result.isValid();
} catch (Exception e) {
log.warn("Order creation health check failed", e);
return false;
}
}
private boolean testInventoryFunction() {
try {
// Test inventory availability check
List<OrderItem> testItems = createTestOrderItems();
InventoryCheckResult result = inventoryService.checkAvailability(testItems);
return result.isSuccessful();
} catch (Exception e) {
log.warn("Inventory health check failed", e);
return false;
}
}
private boolean testPaymentFunction() {
try {
// Test payment method validation
PaymentMethod testPaymentMethod = createTestPaymentMethod();
PaymentValidationResult result = paymentService.validatePaymentMethod(testPaymentMethod);
return result.isValid();
} catch (Exception e) {
log.warn("Payment health check failed", e);
return false;
}
}
private CreateOrderRequest createTestOrderRequest() {
CreateOrderRequest request = new CreateOrderRequest();
request.setCustomerId("health-check-customer");
request.setItems(createTestOrderItems());
request.setShippingAddress(createTestShippingAddress());
request.setPaymentMethod(createTestPaymentMethod());
return request;
}
private List<OrderItem> createTestOrderItems() {
OrderItem item = new OrderItem();
item.setProductId("health-check-product");
item.setQuantity(1);
return Arrays.asList(item);
}
private ShippingAddress createTestShippingAddress() {
ShippingAddress address = new ShippingAddress();
address.setStreet("123 Health Check St");
address.setCity("Test City");
address.setZipCode("12345");
return address;
}
private PaymentMethod createTestPaymentMethod() {
PaymentMethod method = new PaymentMethod();
method.setType("CREDIT_CARD");
method.setCardNumber("****-****-****-1234");
return method;
}
}
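The indicators above call a database, remote services, and a cache inline, so a single slow dependency can stall the whole health response. One common mitigation is to run each check under a time limit and treat a timeout as a failure. The sketch below shows the idea with plain `java.util.concurrent`; the class `TimeBoundedCheck` is hypothetical and is not part of Spring Boot.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: runs a health check under a time limit. */
final class TimeBoundedCheck {

    // Daemon threads so a hung check never prevents JVM shutdown
    private final ExecutorService executor = Executors.newCachedThreadPool(runnable -> {
        Thread thread = new Thread(runnable, "health-check");
        thread.setDaemon(true);
        return thread;
    });

    /** Returns the check's result, or false if it throws or exceeds the timeout. */
    boolean run(Callable<Boolean> check, long timeoutMillis) {
        Future<Boolean> future = executor.submit(check);
        try {
            return Boolean.TRUE.equals(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true);               // interrupt the slow check
            return false;
        } catch (ExecutionException e) {
            return false;                      // the check itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return false;
        }
    }

    void shutdown() {
        executor.shutdownNow();
    }

    public static void main(String[] args) {
        TimeBoundedCheck bounded = new TimeBoundedCheck();
        boolean fast = bounded.run(() -> true, 500);
        boolean slow = bounded.run(() -> { Thread.sleep(2_000); return true; }, 200);
        System.out.println("fast=" + fast + ", slow=" + slow); // fast=true, slow=false
        bounded.shutdown();
    }
}
```

Inside a `HealthIndicator`, the existing check body would be passed as the `Callable` and a `false` result mapped to `Health.down()`.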
2. Apache Camel Route Health Monitoring
Health checks for Camel integration routes:
@Component
public class CamelHealthCheckRoute extends RouteBuilder {
@Autowired
private HealthCheckManager healthCheckManager;
@Override
public void configure() throws Exception {
// Enable MDC logging and Micrometer-backed message history (diagnostic aids, not health checks themselves)
getContext().setUseMDCLogging(true);
getContext().setMessageHistoryFactory(new MicrometerMessageHistoryFactory());
// Health check endpoint
from("timer://camel-health-check?period=30000")
.routeId("camel-health-check")
.process(exchange -> {
CamelHealthStatus healthStatus = new CamelHealthStatus();
healthStatus.setTimestamp(Instant.now());
// Check route status
List<Route> routes = getContext().getRoutes();
int totalRoutes = routes.size();
int startedRoutes = 0;
List<String> stoppedRoutes = new ArrayList<>();
for (Route route : routes) {
// Route status is read from the RouteController (Camel 3.x and later)
if (getContext().getRouteController().getRouteStatus(route.getId()).isStarted()) {
startedRoutes++;
} else {
stoppedRoutes.add(route.getId());
}
}
healthStatus.setTotalRoutes(totalRoutes);
healthStatus.setStartedRoutes(startedRoutes);
healthStatus.setStoppedRoutes(stoppedRoutes);
// Check component status; component names are resolved through the CamelContext
int totalComponents = 0;
int activeComponents = 0;
List<String> inactiveComponents = new ArrayList<>();
for (String componentName : getContext().getComponentNames()) {
totalComponents++;
Component component = getContext().hasComponent(componentName);
// Only components that implement StatefulService expose a lifecycle status
if (component instanceof StatefulService
&& ((StatefulService) component).getStatus().isStarted()) {
activeComponents++;
} else {
inactiveComponents.add(componentName);
}
}
healthStatus.setTotalComponents(totalComponents);
healthStatus.setActiveComponents(activeComponents);
healthStatus.setInactiveComponents(inactiveComponents);
// Check endpoint connectivity
List<EndpointHealthStatus> endpointStatuses = checkEndpoints();
healthStatus.setEndpointStatuses(endpointStatuses);
// Overall health determination
boolean isHealthy = startedRoutes == totalRoutes &&
activeComponents == totalComponents &&
endpointStatuses.stream().allMatch(EndpointHealthStatus::isHealthy);
healthStatus.setOverallHealthy(isHealthy);
exchange.getIn().setBody(healthStatus);
log.info("Camel health check completed - Healthy: {}, Routes: {}/{}, Components: {}/{}",
isHealthy, startedRoutes, totalRoutes, activeComponents, totalComponents);
})
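// Keep the status object: the alert route replaces the message body with the alert JSON
.setProperty("camelHealthStatus", body())
.choice()
.when(simple("${body.overallHealthy} == false"))
.to("direct:handleCamelHealthIssues")
.otherwise()
.log("All Camel components healthy")
.end()
// Restore the status object before publishing it
.setBody(exchangeProperty("camelHealthStatus"))
.marshal().json(JsonLibrary.Jackson)
.to("kafka:camel-health-status");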
from("direct:handleCamelHealthIssues")
.routeId("handle-camel-health-issues")
.log("Camel health issues detected")
.process(exchange -> {
CamelHealthStatus healthStatus = exchange.getIn().getBody(CamelHealthStatus.class);
CamelHealthAlert alert = new CamelHealthAlert();
alert.setTimestamp(Instant.now());
alert.setSeverity(AlertSeverity.WARNING);
alert.setMessage("Camel health issues detected");
List<String> issues = new ArrayList<>();
if (!healthStatus.getStoppedRoutes().isEmpty()) {
issues.add("Stopped routes: " + String.join(", ", healthStatus.getStoppedRoutes()));
}
if (!healthStatus.getInactiveComponents().isEmpty()) {
issues.add("Inactive components: " + String.join(", ", healthStatus.getInactiveComponents()));
}
List<EndpointHealthStatus> unhealthyEndpoints = healthStatus.getEndpointStatuses()
.stream()
.filter(status -> !status.isHealthy())
.collect(Collectors.toList());
if (!unhealthyEndpoints.isEmpty()) {
issues.add("Unhealthy endpoints: " + unhealthyEndpoints.stream()
.map(EndpointHealthStatus::getEndpointUri)
.collect(Collectors.joining(", ")));
}
alert.setIssues(issues);
exchange.getIn().setBody(alert);
})
.marshal().json(JsonLibrary.Jackson)
.to("kafka:camel-health-alerts")
.log("Camel health alert sent: ${body}");
// Route-specific health checks
from("timer://route-health-check?period=60000")
.routeId("route-health-check")
.process(exchange -> {
List<RouteHealthStatus> routeStatuses = new ArrayList<>();
for (Route route : getContext().getRoutes()) {
RouteHealthStatus status = new RouteHealthStatus();
status.setRouteId(route.getId());
status.setStatus(getContext().getRouteController().getRouteStatus(route.getId()).name());
try {
// Get route statistics
MBeanServer server = ManagementFactory.getPlatformMBeanServer();
ObjectName objectName = new ObjectName(
"org.apache.camel:context=" + getContext().getManagementName() +
",type=routes,name=\"" + route.getId() + "\"");
if (server.isRegistered(objectName)) {
Long exchangesTotal = (Long) server.getAttribute(objectName, "ExchangesTotal");
Long exchangesCompleted = (Long) server.getAttribute(objectName, "ExchangesCompleted");
Long exchangesFailed = (Long) server.getAttribute(objectName, "ExchangesFailed");
Long meanProcessingTime = (Long) server.getAttribute(objectName, "MeanProcessingTime");
status.setExchangesTotal(exchangesTotal);
status.setExchangesCompleted(exchangesCompleted);
status.setExchangesFailed(exchangesFailed);
status.setMeanProcessingTime(meanProcessingTime);
// Calculate error rate
double errorRate = exchangesTotal > 0 ?
(double) exchangesFailed / exchangesTotal : 0.0;
status.setErrorRate(errorRate);
// Determine health based on error rate and processing time
boolean isHealthy = errorRate < 0.05 && meanProcessingTime < 5000;
status.setHealthy(isHealthy);
} else {
status.setHealthy(false);
status.setError("Route statistics not available");
}
} catch (Exception e) {
status.setHealthy(false);
status.setError("Error retrieving route statistics: " + e.getMessage());
}
routeStatuses.add(status);
}
exchange.getIn().setBody(routeStatuses);
})
// Log the route count while the body is still a List; after marshalling it is JSON
.log("Publishing route health status for ${body.size()} routes")
.marshal().json(JsonLibrary.Jackson)
.to("kafka:route-health-status");
// Message queue health check
from("timer://queue-health-check?period=45000")
.routeId("queue-health-check")
.process(exchange -> {
List<QueueHealthStatus> queueStatuses = new ArrayList<>();
// Check Kafka topic health
queueStatuses.add(checkKafkaTopicHealth("order-events"));
queueStatuses.add(checkKafkaTopicHealth("inventory-updates"));
queueStatuses.add(checkKafkaTopicHealth("payment-notifications"));
exchange.getIn().setBody(queueStatuses);
})
.marshal().json(JsonLibrary.Jackson)
.to("kafka:queue-health-status");
// Endpoint connectivity health check
from("timer://endpoint-health-check?period=120000")
.routeId("endpoint-health-check")
.process(exchange -> {
List<EndpointHealthStatus> endpointStatuses = checkEndpoints();
exchange.getIn().setBody(endpointStatuses);
})
.marshal().json(JsonLibrary.Jackson)
.to("kafka:endpoint-health-status");
}
private List<EndpointHealthStatus> checkEndpoints() {
List<EndpointHealthStatus> statuses = new ArrayList<>();
// Check HTTP endpoints
statuses.add(checkHttpEndpoint("customer-service", "http://customer-service:8080/health"));
statuses.add(checkHttpEndpoint("inventory-service", "http://inventory-service:8080/health"));
statuses.add(checkHttpEndpoint("payment-service", "http://payment-service:8080/health"));
// Check database endpoints
statuses.add(checkDatabaseEndpoint("postgresql", "jdbc:postgresql://postgres:5432/orders"));
// Check cache endpoints
statuses.add(checkCacheEndpoint("redis", "redis://redis:6379"));
return statuses;
}
private EndpointHealthStatus checkHttpEndpoint(String serviceName, String url) {
EndpointHealthStatus status = new EndpointHealthStatus();
status.setEndpointName(serviceName);
status.setEndpointUri(url);
status.setEndpointType("HTTP");
try {
long startTime = System.currentTimeMillis();
HttpClient client = HttpClient.newHttpClient();
HttpRequest request = HttpRequest.newBuilder()
.uri(URI.create(url))
.timeout(Duration.ofSeconds(10))
.build();
HttpResponse<String> response = client.send(request,
HttpResponse.BodyHandlers.ofString());
long responseTime = System.currentTimeMillis() - startTime;
status.setHealthy(response.statusCode() >= 200 && response.statusCode() < 300);
status.setResponseTime(responseTime);
status.setStatusCode(response.statusCode());
status.setLastCheck(LocalDateTime.now());
if (!status.isHealthy()) {
status.setError("HTTP " + response.statusCode());
}
} catch (Exception e) {
status.setHealthy(false);
status.setError(e.getMessage());
status.setLastCheck(LocalDateTime.now());
}
return status;
}
private EndpointHealthStatus checkDatabaseEndpoint(String dbName, String jdbcUrl) {
EndpointHealthStatus status = new EndpointHealthStatus();
status.setEndpointName(dbName);
status.setEndpointUri(jdbcUrl);
status.setEndpointType("DATABASE");
try {
long startTime = System.currentTimeMillis();
try (Connection connection = DriverManager.getConnection(jdbcUrl)) {
try (PreparedStatement statement = connection.prepareStatement("SELECT 1")) {
statement.executeQuery();
}
}
long responseTime = System.currentTimeMillis() - startTime;
status.setHealthy(true);
status.setResponseTime(responseTime);
status.setLastCheck(LocalDateTime.now());
} catch (Exception e) {
status.setHealthy(false);
status.setError(e.getMessage());
status.setLastCheck(LocalDateTime.now());
}
return status;
}
private EndpointHealthStatus checkCacheEndpoint(String cacheName, String redisUrl) {
EndpointHealthStatus status = new EndpointHealthStatus();
status.setEndpointName(cacheName);
status.setEndpointUri(redisUrl);
status.setEndpointType("CACHE");
try {
long startTime = System.currentTimeMillis();
// Simple Redis connectivity check; try-with-resources closes the client even if ping fails
String response;
try (Jedis jedis = new Jedis(URI.create(redisUrl))) {
response = jedis.ping();
}
long responseTime = System.currentTimeMillis() - startTime;
status.setHealthy("PONG".equals(response));
status.setResponseTime(responseTime);
status.setLastCheck(LocalDateTime.now());
if (!status.isHealthy()) {
status.setError("Unexpected ping response: " + response);
}
} catch (Exception e) {
status.setHealthy(false);
status.setError(e.getMessage());
status.setLastCheck(LocalDateTime.now());
}
return status;
}
private QueueHealthStatus checkKafkaTopicHealth(String topicName) {
QueueHealthStatus status = new QueueHealthStatus();
status.setQueueName(topicName);
status.setQueueType("KAFKA_TOPIC");
try {
// Check if topic exists and get partition information
Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
try (AdminClient adminClient = AdminClient.create(props)) {
DescribeTopicsResult topicsResult = adminClient.describeTopics(Arrays.asList(topicName));
TopicDescription description = topicsResult.values().get(topicName).get(5, TimeUnit.SECONDS);
status.setHealthy(true);
status.setPartitionCount(description.partitions().size());
status.setLastCheck(LocalDateTime.now());
}
} catch (Exception e) {
status.setHealthy(false);
status.setError(e.getMessage());
status.setLastCheck(LocalDateTime.now());
}
return status;
}
}
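The route check above derives its error rate from counters that accumulate from the moment the route starts, so a long-running route with a clean history can hide a recent burst of failures. A hedged alternative is to compare two successive counter snapshots and compute the rate over the interval between them. The class `RouteErrorRate` and its `Snapshot` type below are hypothetical; the counter values would come from the same JMX attributes read above.

```java
/** Hypothetical sketch: error rate over the interval between two counter snapshots. */
final class RouteErrorRate {

    /** Cumulative counters as read from a route's statistics at one point in time. */
    static final class Snapshot {
        final long exchangesTotal;
        final long exchangesFailed;

        Snapshot(long exchangesTotal, long exchangesFailed) {
            this.exchangesTotal = exchangesTotal;
            this.exchangesFailed = exchangesFailed;
        }
    }

    /** Error rate since the route started. */
    static double cumulativeErrorRate(Snapshot current) {
        return current.exchangesTotal > 0
                ? (double) current.exchangesFailed / current.exchangesTotal
                : 0.0;
    }

    /** Error rate for exchanges processed between the two snapshots. */
    static double intervalErrorRate(Snapshot previous, Snapshot current) {
        long total = current.exchangesTotal - previous.exchangesTotal;
        long failed = Math.max(0, current.exchangesFailed - previous.exchangesFailed);
        if (total <= 0) {
            return 0.0;   // no traffic in the interval, or the counters were reset
        }
        return (double) failed / total;
    }

    public static void main(String[] args) {
        Snapshot previous = new Snapshot(10_000, 10);
        Snapshot current = new Snapshot(10_100, 60);
        System.out.printf("cumulative: %.4f%n", cumulativeErrorRate(current));       // about 0.0059
        System.out.printf("interval:   %.4f%n", intervalErrorRate(previous, current)); // 0.5000
    }
}
```

In this example the cumulative rate stays well under a 5% threshold while half of the most recent exchanges failed, which is the case the interval view is meant to catch.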
3. Health Check Management and Aggregation
Centralized health check management and reporting:
// Health Check Manager
@Service
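public class HealthCheckManager {
// SLF4J logger used by the scheduled tasks below
private static final Logger log = LoggerFactory.getLogger(HealthCheckManager.class);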
@Autowired
private HealthIndicatorRegistry healthIndicatorRegistry;
@Autowired
private HealthAggregator healthAggregator;
private final Map<String, HealthCheckResult> healthCheckHistory = new ConcurrentHashMap<>();
private final ScheduledExecutorService executorService = Executors.newScheduledThreadPool(5);
@PostConstruct
public void initializeHealthChecks() {
// Schedule periodic health checks
executorService.scheduleAtFixedRate(this::performHealthChecks, 0, 30, TimeUnit.SECONDS);
executorService.scheduleAtFixedRate(this::publishHealthStatus, 10, 60, TimeUnit.SECONDS);
executorService.scheduleAtFixedRate(this::cleanupHealthHistory, 0, 1, TimeUnit.HOURS);
}
public SystemHealthStatus getSystemHealth() {
SystemHealthStatus systemHealth = new SystemHealthStatus();
systemHealth.setTimestamp(Instant.now());
Map<String, Health> individualHealths = new HashMap<>();
// Collect health from all registered indicators
for (Map.Entry<String, HealthIndicator> entry : healthIndicatorRegistry.getAll().entrySet()) {
try {
Health health = entry.getValue().health();
individualHealths.put(entry.getKey(), health);
} catch (Exception e) {
Health errorHealth = Health.down()
.withDetail("error", e.getMessage())
.withDetail("exception", e.getClass().getSimpleName())
.build();
individualHealths.put(entry.getKey(), errorHealth);
}
}
// Aggregate overall health
Health overallHealth = healthAggregator.aggregate(individualHealths);
systemHealth.setOverallStatus(overallHealth.getStatus());
systemHealth.setIndividualHealths(individualHealths);
systemHealth.setHealthScore(calculateHealthScore(individualHealths));
// Add trending information
systemHealth.setHealthTrend(calculateHealthTrend());
systemHealth.setIssuesSummary(summarizeHealthIssues(individualHealths));
return systemHealth;
}
public List<HealthAlert> getHealthAlerts() {
List<HealthAlert> alerts = new ArrayList<>();
SystemHealthStatus currentHealth = getSystemHealth();
// Check for critical health issues
for (Map.Entry<String, Health> entry : currentHealth.getIndividualHealths().entrySet()) {
if (Status.DOWN.equals(entry.getValue().getStatus())) {
HealthAlert alert = new HealthAlert();
alert.setAlertId(UUID.randomUUID().toString());
alert.setComponent(entry.getKey());
alert.setSeverity(AlertSeverity.CRITICAL);
alert.setTitle("Component Health Check Failed");
alert.setMessage("Health check failed for component: " + entry.getKey());
alert.setTimestamp(Instant.now());
alert.setDetails(entry.getValue().getDetails());
alerts.add(alert);
}
}
// Check for overall degradation (the health score is the fraction of components reporting UP)
if (currentHealth.getHealthScore() < 0.8) {
HealthAlert alert = new HealthAlert();
alert.setAlertId(UUID.randomUUID().toString());
alert.setComponent("SYSTEM");
alert.setSeverity(AlertSeverity.WARNING);
alert.setTitle("System Health Degradation");
alert.setMessage("Overall system health score has degraded to " +
String.format("%.2f", currentHealth.getHealthScore()));
alert.setTimestamp(Instant.now());
alerts.add(alert);
}
// Check for health trend issues
if ("DECLINING".equals(currentHealth.getHealthTrend())) {
HealthAlert alert = new HealthAlert();
alert.setAlertId(UUID.randomUUID().toString());
alert.setComponent("SYSTEM");
alert.setSeverity(AlertSeverity.WARNING);
alert.setTitle("Declining Health Trend");
alert.setMessage("System health trend is declining over recent checks");
alert.setTimestamp(Instant.now());
alerts.add(alert);
}
return alerts;
}
public HealthCheckReport generateHealthReport(Duration period) {
HealthCheckReport report = new HealthCheckReport();
report.setReportPeriod(period);
report.setGeneratedAt(Instant.now());
Instant cutoff = Instant.now().minus(period);
// Collect health check results from history
Map<String, List<HealthCheckResult>> componentHistory = new HashMap<>();
for (Map.Entry<String, HealthCheckResult> entry : healthCheckHistory.entrySet()) {
if (entry.getValue().getTimestamp().isAfter(cutoff)) {
String component = entry.getValue().getComponent(); // Stored on the result, so hyphenated component names are not truncated
componentHistory.computeIfAbsent(component, k -> new ArrayList<>()).add(entry.getValue());
}
}
// Calculate availability statistics
Map<String, AvailabilityStatistics> availabilityStats = new HashMap<>();
for (Map.Entry<String, List<HealthCheckResult>> entry : componentHistory.entrySet()) {
AvailabilityStatistics stats = calculateAvailabilityStatistics(entry.getValue());
availabilityStats.put(entry.getKey(), stats);
}
report.setAvailabilityStatistics(availabilityStats);
// Calculate overall system availability
double overallAvailability = availabilityStats.values().stream()
.mapToDouble(AvailabilityStatistics::getAvailabilityPercentage)
.average()
.orElse(0.0);
report.setOverallAvailability(overallAvailability);
// Identify top issues
List<HealthIssue> topIssues = identifyTopHealthIssues(componentHistory);
report.setTopIssues(topIssues);
return report;
}
private void performHealthChecks() {
try {
SystemHealthStatus healthStatus = getSystemHealth();
// Store health check results in history
for (Map.Entry<String, Health> entry : healthStatus.getIndividualHealths().entrySet()) {
HealthCheckResult result = new HealthCheckResult();
result.setComponent(entry.getKey());
result.setStatus(entry.getValue().getStatus());
result.setDetails(entry.getValue().getDetails());
result.setTimestamp(Instant.now());
String historyKey = entry.getKey() + "-" + System.currentTimeMillis();
healthCheckHistory.put(historyKey, result);
}
log.debug("Health checks completed - Overall status: {}, Score: {}",
healthStatus.getOverallStatus(), healthStatus.getHealthScore());
} catch (Exception e) {
log.error("Error performing health checks", e);
}
}
private void publishHealthStatus() {
try {
SystemHealthStatus healthStatus = getSystemHealth();
// Publish to monitoring system
publishToMonitoring(healthStatus);
// Check for alerts
List<HealthAlert> alerts = getHealthAlerts();
if (!alerts.isEmpty()) {
publishHealthAlerts(alerts);
}
log.info("Health status published - Status: {}, Alerts: {}",
healthStatus.getOverallStatus(), alerts.size());
} catch (Exception e) {
log.error("Error publishing health status", e);
}
}
private double calculateHealthScore(Map<String, Health> healthMap) {
if (healthMap.isEmpty()) {
return 0.0;
}
int totalComponents = healthMap.size();
// Compare with equals(): Status values are not guaranteed to be the shared constants
long healthyComponents = healthMap.values().stream()
.filter(health -> Status.UP.equals(health.getStatus()))
.count();
return (double) healthyComponents / totalComponents;
}
private String calculateHealthTrend() {
// Analyze recent health scores to determine trend
List<Double> recentScores = getRecentHealthScores(Duration.ofMinutes(30));
if (recentScores.size() < 3) {
return "INSUFFICIENT_DATA";
}
// Simple trend analysis
double firstHalf = recentScores.subList(0, recentScores.size() / 2).stream()
.mapToDouble(Double::doubleValue)
.average()
.orElse(0.0);
double secondHalf = recentScores.subList(recentScores.size() / 2, recentScores.size()).stream()
.mapToDouble(Double::doubleValue)
.average()
.orElse(0.0);
if (secondHalf > firstHalf + 0.1) {
return "IMPROVING";
} else if (secondHalf < firstHalf - 0.1) {
return "DECLINING";
} else {
return "STABLE";
}
}
private List<String> summarizeHealthIssues(Map<String, Health> healthMap) {
return healthMap.entrySet().stream()
.filter(entry -> !Status.UP.equals(entry.getValue().getStatus()))
.map(entry -> entry.getKey() + ": " + entry.getValue().getStatus())
.collect(Collectors.toList());
}
}
// Health Check REST Controller
@RestController
@RequestMapping("/health")
public class HealthCheckController {
@Autowired
private HealthCheckManager healthCheckManager;
@GetMapping
public ResponseEntity<SystemHealthStatus> getSystemHealth() {
SystemHealthStatus health = healthCheckManager.getSystemHealth();
HttpStatus httpStatus = Status.UP.equals(health.getOverallStatus()) ?
HttpStatus.OK : HttpStatus.SERVICE_UNAVAILABLE;
return ResponseEntity.status(httpStatus).body(health);
}
@GetMapping("/detailed")
public ResponseEntity<SystemHealthStatus> getDetailedHealth() {
SystemHealthStatus health = healthCheckManager.getSystemHealth();
return ResponseEntity.ok(health);
}
@GetMapping("/alerts")
public ResponseEntity<List<HealthAlert>> getHealthAlerts() {
List<HealthAlert> alerts = healthCheckManager.getHealthAlerts();
return ResponseEntity.ok(alerts);
}
@GetMapping("/report")
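public ResponseEntity<HealthCheckReport> getHealthReport(
@RequestParam(defaultValue = "PT24H") String period) {
Duration reportPeriod;
try {
reportPeriod = Duration.parse(period); // ISO-8601 duration, e.g. PT24H
} catch (DateTimeParseException e) { // needs import java.time.format.DateTimeParseException
// Reject malformed input with 400 instead of letting it surface as a 500
return ResponseEntity.badRequest().build();
}
HealthCheckReport report = healthCheckManager.generateHealthReport(reportPeriod);
return ResponseEntity.ok(report);
}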
@GetMapping("/readiness")
public ResponseEntity<Map<String, String>> readinessCheck() {
SystemHealthStatus health = healthCheckManager.getSystemHealth();
Map<String, String> response = new HashMap<>();
response.put("status", health.getOverallStatus().toString());
boolean ready = Status.UP.equals(health.getOverallStatus());
response.put("ready", String.valueOf(ready));
response.put("timestamp", health.getTimestamp().toString());
HttpStatus httpStatus = ready ? HttpStatus.OK : HttpStatus.SERVICE_UNAVAILABLE;
return ResponseEntity.status(httpStatus).body(response);
}
@GetMapping("/liveness")
public ResponseEntity<Map<String, String>> livenessCheck() {
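// Liveness only confirms that the process can answer a request. It deliberately skips
// dependency checks so that a dependency outage does not mark the process itself as dead.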
Map<String, String> response = new HashMap<>();
response.put("status", "UP");
response.put("alive", "true");
response.put("timestamp", Instant.now().toString());
return ResponseEntity.ok(response);
}
}
Best Practices
1. Health Check Design and Implementation
- Design health checks to be fast and lightweight to minimize impact on system performance
- Implement different levels of health checks (shallow, medium, deep) for different use cases
- Use meaningful health check names and descriptions for operational clarity
- Implement proper timeout handling for all external dependency checks
- Design health checks to be idempotent and side-effect free
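The timeout guidance above can be sketched as a small wrapper that bounds how long any single probe may run. This is a minimal illustration, not part of the implementation shown earlier: the class name `TimeBoundedCheck`, the `Outcome` enum, and the `run` method are hypothetical names chosen for this example.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper: runs a probe and reports TIMED_OUT if it exceeds its budget.
final class TimeBoundedCheck {

    enum Outcome { UP, DOWN, TIMED_OUT }

    static Outcome run(Supplier<Boolean> probe, Duration timeout) {
        // The probe runs on the common ForkJoinPool; a production system
        // would more likely pass a dedicated executor.
        CompletableFuture<Boolean> future = CompletableFuture.supplyAsync(probe);
        try {
            Boolean ok = future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            return Boolean.TRUE.equals(ok) ? Outcome.UP : Outcome.DOWN;
        } catch (TimeoutException e) {
            // cancel() marks the future as cancelled; it does not interrupt
            // the thread that is still executing the probe.
            future.cancel(true);
            return Outcome.TIMED_OUT;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Outcome.DOWN;
        } catch (ExecutionException e) {
            // The probe itself threw; treat that as a failed check.
            return Outcome.DOWN;
        }
    }
}
```

A caller might invoke `TimeBoundedCheck.run(() -> pingDatabase(), Duration.ofMillis(500))`, where `pingDatabase` stands in for whatever side-effect-free probe the component exposes. Reporting a timeout separately from a plain failure lets operators distinguish a slow dependency from a broken one.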
2. Health Check Coverage and Scope
- Implement health checks for all critical system components and dependencies
- Include business function health checks beyond technical component status
- Design health checks to verify end-to-end functionality when appropriate
- Implement synthetic transaction testing for critical business workflows
- Cover all integration points and external dependencies
3. Performance and Resource Management
- Monitor the performance impact of health checks on system resources
- Cache health check results for a short interval so that frequent polling does not repeatedly hit the underlying dependencies
- Use appropriate health check frequencies based on component criticality
- Implement circuit breakers for health checks that depend on external systems
- Design for graceful degradation when health checks themselves fail
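One way to approach the caching recommendation is a time-to-live wrapper around an expensive check. The sketch below is illustrative only: `CachedHealthCheck` and its constructor parameters are hypothetical, and the clock is injected as a nanosecond supplier purely so that the behavior is easy to test.

```java
import java.time.Duration;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical wrapper: re-runs the delegate check only after the TTL has elapsed.
final class CachedHealthCheck<T> {
    private final Supplier<T> delegate;
    private final long ttlNanos;
    private final LongSupplier nanoClock;

    private T cached;
    private long computedAt;
    private boolean hasValue;

    CachedHealthCheck(Supplier<T> delegate, Duration ttl, LongSupplier nanoClock) {
        this.delegate = delegate;
        this.ttlNanos = ttl.toNanos();
        this.nanoClock = nanoClock;
    }

    // synchronized so that concurrent pollers trigger at most one refresh at a time
    synchronized T get() {
        long now = nanoClock.getAsLong();
        if (!hasValue || now - computedAt >= ttlNanos) {
            cached = delegate.get();
            computedAt = now;
            hasValue = true;
        }
        return cached;
    }
}
```

In production code the clock would typically be `System::nanoTime`. The TTL should stay well below the polling interval of whatever consumes the result, otherwise the consumer keeps seeing a stale status after a component has failed.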
4. Alerting and Response
- Configure appropriate alerting thresholds and escalation policies
- Correlate related health check failures into a single alert to avoid alert storms
- Design automated responses for common health check failures
- Provide clear guidance for manual intervention when automated response fails
- Implement health check suppression during planned maintenance
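The storm-avoidance and maintenance-suppression points above can be combined in a small gate that decides whether an alert should be sent at all. This is a simplified sketch under stated assumptions: `AlertGate`, `scheduleMaintenance`, and `shouldSend` are hypothetical names, only one maintenance window is tracked, and alerts are de-duplicated per component by a fixed cooldown rather than by true root-cause correlation.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical gate: drops alerts during maintenance and repeats within a cooldown.
final class AlertGate {
    private final Duration cooldown;
    private final Map<String, Instant> lastSent = new HashMap<>();
    private Instant maintenanceStart;
    private Instant maintenanceEnd;

    AlertGate(Duration cooldown) {
        this.cooldown = cooldown;
    }

    // Registers a single maintenance window [start, end).
    synchronized void scheduleMaintenance(Instant start, Instant end) {
        this.maintenanceStart = start;
        this.maintenanceEnd = end;
    }

    synchronized boolean shouldSend(String component, Instant now) {
        boolean inMaintenance = maintenanceStart != null
                && !now.isBefore(maintenanceStart)
                && now.isBefore(maintenanceEnd);
        if (inMaintenance) {
            return false;
        }
        Instant previous = lastSent.get(component);
        if (previous != null && now.isBefore(previous.plus(cooldown))) {
            return false; // same component alerted too recently
        }
        lastSent.put(component, now);
        return true;
    }
}
```

A publisher could call `shouldSend(alert.getComponent(), alert.getTimestamp())` before forwarding each alert, assuming the alert type exposes those getters. Suppressed alerts are simply dropped here; a real system might instead record them for later review.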
5. Integration and Automation
- Integrate health checks with load balancers and service discovery systems
- Implement health check integration with deployment and rollback automation
- Use health checks for automated scaling decisions and capacity management
- Integrate with monitoring and observability platforms for comprehensive visibility
- Implement health check data retention and historical analysis
6. Security and Compliance
- Implement appropriate authentication and authorization for health check endpoints
- Avoid exposing sensitive information in health check responses
- Implement audit trails for health check access and modifications
- Ensure compliance with security policies for health check data handling
- Design health checks to detect security-related issues when appropriate
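To illustrate the point about not exposing sensitive information, the following sketch masks detail values whose keys look like credentials before a response is returned. The class `HealthDetailRedactor`, the list of key fragments, and the `***` mask are assumptions made for this example; a real deployment would align them with its own security policy.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: masks values whose keys suggest they hold secrets.
final class HealthDetailRedactor {
    // Illustrative fragments only; extend to match local naming conventions.
    private static final Set<String> SENSITIVE_FRAGMENTS =
            Set.of("password", "secret", "token", "credential");

    static Map<String, Object> redact(Map<String, Object> details) {
        // LinkedHashMap keeps the original key order in the response.
        Map<String, Object> safe = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : details.entrySet()) {
            String key = entry.getKey().toLowerCase(Locale.ROOT);
            boolean sensitive = SENSITIVE_FRAGMENTS.stream().anyMatch(key::contains);
            safe.put(entry.getKey(), sensitive ? "***" : entry.getValue());
        }
        return safe;
    }
}
```

Key-based masking is a coarse filter: it does not catch secrets embedded in values such as connection strings, so it complements rather than replaces access control on the endpoints.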
Health checks are essential for system reliability: they enable proactive issue detection and support automated operational procedures such as load balancing, scaling, and rollback in complex distributed enterprise integration architectures.