Error Tracking
Overview
Error Tracking is an observability pattern for enterprise integration architectures. It captures, categorizes, analyzes, and manages application errors, exceptions, failures, and anomalies across distributed systems so that teams can identify issues quickly, debug them effectively, and improve reliability over time. An error tracker works like a diagnostic and forensic system: it detects that something went wrong and records what happened, why it happened, and what is needed to fix it, giving end-to-end visibility into failures and their impact on business operations. The pattern helps maintain system reliability, reduce mean time to resolution (MTTR), prevent error recurrence, support root cause analysis, and protect the user experience in enterprise environments where fast error detection and resolution are critical to business continuity.
Theoretical Foundation
Error Tracking is grounded in fault tolerance theory, error propagation analysis, incident management principles, and reliability engineering methodologies. It incorporates concepts from exception handling patterns, failure analysis frameworks, observability theory, and continuous improvement processes to provide a comprehensive framework for error management and system reliability. The pattern addresses the fundamental need for systematic error capture, intelligent error analysis, effective error communication, and data-driven reliability improvements in distributed enterprise systems.
Core Principles
1. Comprehensive Error Capture and Classification
Systematic capture and categorization of all types of system errors and failures:
- Exception tracking - detailed capture of application exceptions with full stack traces and context
- System error monitoring - monitoring of system-level errors and infrastructure failures
- Business logic errors - tracking of business rule violations and process failures
- Integration failures - monitoring of external service failures and communication errors
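The four categories above can be sketched as a small classifier. This is a minimal illustration, not a prescribed design: `ErrorCategory`, `BusinessRuleViolation`, and `ErrorClassifier` are hypothetical names introduced here, and the mapping rules are assumptions that a real system would tune to its own exception hierarchy.

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

/** Hypothetical categories mirroring the four kinds of error listed above. */
enum ErrorCategory { EXCEPTION, SYSTEM, BUSINESS, INTEGRATION }

/** Hypothetical marker type for business rule violations. */
class BusinessRuleViolation extends RuntimeException {
    BusinessRuleViolation(String message) {
        super(message);
    }
}

final class ErrorClassifier {
    private ErrorClassifier() {}

    /** Maps a throwable to a coarse category; the most specific checks run first. */
    static ErrorCategory classify(Throwable t) {
        if (t instanceof BusinessRuleViolation) {
            return ErrorCategory.BUSINESS;
        }
        if (t instanceof Error) {
            // JVM-level problems such as OutOfMemoryError or StackOverflowError
            return ErrorCategory.SYSTEM;
        }
        if (t instanceof IOException || t instanceof TimeoutException) {
            // Assumption: I/O failures and timeouts mostly stem from remote calls
            return ErrorCategory.INTEGRATION;
        }
        return ErrorCategory.EXCEPTION;
    }
}
```

Calling `ErrorClassifier.classify(new IOException("connection reset"))` yields `INTEGRATION` under these assumed rules; anything unrecognized falls back to the generic `EXCEPTION` category.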
2. Contextual Error Information and Analysis
Rich contextual information to support effective error analysis and resolution:
- Environmental context - system state, configuration, and environmental conditions at time of error
- User context - user information, session details, and user journey context when errors occur
- Technical context - detailed technical information including stack traces, request details, and system metrics
- Business context - business process context, transaction details, and business impact assessment
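One way to carry these four kinds of context with an error is an immutable snapshot keyed by section name. The sketch below is illustrative only: `ErrorContextSnapshot`, its builder, and the section and key names used with it are hypothetical and do not correspond to the `ErrorContext` type used in the listings later in this document.

```java
import java.time.Instant;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical immutable snapshot of the context captured alongside an error. */
final class ErrorContextSnapshot {
    private final Instant capturedAt;
    private final Map<String, Map<String, String>> sections;

    private ErrorContextSnapshot(Instant capturedAt, Map<String, Map<String, String>> sections) {
        this.capturedAt = capturedAt;
        this.sections = sections;
    }

    static Builder builder() {
        return new Builder();
    }

    Instant capturedAt() {
        return capturedAt;
    }

    /** Returns the named section, or an empty map when nothing was recorded for it. */
    Map<String, String> section(String name) {
        return sections.getOrDefault(name, Collections.emptyMap());
    }

    static final class Builder {
        private final Map<String, Map<String, String>> sections = new LinkedHashMap<>();

        Builder put(String section, String key, String value) {
            sections.computeIfAbsent(section, s -> new LinkedHashMap<>()).put(key, value);
            return this;
        }

        /** Copies every section so later builder changes cannot alter the snapshot. */
        ErrorContextSnapshot build() {
            Map<String, Map<String, String>> copy = new LinkedHashMap<>();
            sections.forEach((name, values) ->
                copy.put(name, Collections.unmodifiableMap(new LinkedHashMap<>(values))));
            return new ErrorContextSnapshot(Instant.now(), Collections.unmodifiableMap(copy));
        }
    }
}
```

A caller might record `put("environment", "region", ...)` and `put("user", "sessionId", ...)` before calling `build()`. Keeping the snapshot immutable means the context attached to an error cannot change while the error is being analyzed or stored.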
3. Intelligent Error Aggregation and Deduplication
Smart grouping and management of related errors to reduce noise and improve efficiency:
- Error fingerprinting - intelligent grouping of similar errors to reduce duplicate noise
- Error correlation - identification of related errors and failure cascades across services
- Impact assessment - evaluation of error frequency, severity, and business impact
- Trend analysis - identification of error patterns and trends over time
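Fingerprinting can be illustrated with a reduced example: strip the volatile parts of a message, hash what remains together with the exception type, and count occurrences per hash. This is a sketch under simplifying assumptions. It ignores stack frames and endpoints, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FingerprintSketch {
    private FingerprintSketch() {}

    /** Replaces volatile values; the specific UUID rule runs before the generic digit rule. */
    static String normalize(String message) {
        if (message == null) {
            return "";
        }
        return message
            .replaceAll("[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}", "{uuid}")
            .replaceAll("\\d+", "{number}");
    }

    /** Hex-encoded SHA-256 over the exception type and the normalized message. */
    static String fingerprint(String exceptionType, String message) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(
                (exceptionType + "|" + normalize(message)).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256
            throw new IllegalStateException(e);
        }
    }

    /** Counts occurrences per fingerprint for messages of one exception type. */
    static Map<String, Integer> countByFingerprint(String exceptionType, List<String> messages) {
        Map<String, Integer> counts = new HashMap<>();
        for (String message : messages) {
            counts.merge(fingerprint(exceptionType, message), 1, Integer::sum);
        }
        return counts;
    }
}
```

With this normalization, "Order 42 failed" and "Order 43 failed" both reduce to "Order {number} failed" and therefore share one fingerprint, while an unrelated message lands in its own group.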
4. Proactive Error Management and Resolution
Automated and guided approaches to error resolution and prevention:
- Automated alerting - intelligent alerting based on error severity, frequency, and business impact
- Resolution tracking - systematic tracking of error resolution status and progress
- Root cause analysis - guided analysis to identify underlying causes of errors
- Prevention strategies - implementation of measures to prevent error recurrence
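Frequency-based alerting is often implemented as a sliding time window. The following sketch keeps recent occurrence timestamps in a deque and reports when a threshold is reached. `FrequencyAlertWindow` is a hypothetical class, and the sketch assumes timestamps arrive in non-decreasing order.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sliding-window counter for one error group. */
final class FrequencyAlertWindow {
    private final int threshold;
    private final Duration window;
    private final Deque<Instant> occurrences = new ArrayDeque<>();

    FrequencyAlertWindow(int threshold, Duration window) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        this.threshold = threshold;
        this.window = window;
    }

    /**
     * Records one occurrence and reports whether the number of occurrences
     * inside the window ending at occurredAt has reached the threshold.
     */
    synchronized boolean recordAndCheck(Instant occurredAt) {
        occurrences.addLast(occurredAt);
        Instant cutoff = occurredAt.minus(window);
        // Drop occurrences that fell out of the window
        while (!occurrences.isEmpty() && occurrences.peekFirst().isBefore(cutoff)) {
            occurrences.removeFirst();
        }
        return occurrences.size() >= threshold;
    }
}
```

With a threshold of 3 and a one-minute window, three errors within 20 seconds trigger the check, while a single error five minutes later does not, because the earlier entries have been evicted by then.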
Why Error Tracking is Essential in Integration Architecture
1. Rapid Issue Detection and Response
In complex distributed systems, error tracking provides:
- Real-time error detection - immediate notification when critical errors occur
- Error impact assessment - understanding of error impact on business operations and users
- Prioritized response - intelligent prioritization of errors based on severity and business impact
- Coordinated incident response - support for coordinated incident management and resolution
2. Effective Debugging and Troubleshooting
Supporting rapid problem diagnosis and resolution:
- Detailed error context - comprehensive information for understanding error causes and conditions
- Error reproduction - sufficient context to reproduce and debug errors in development environments
- Cross-service correlation - understanding of error propagation across distributed services
- Historical analysis - access to historical error patterns for comparative analysis
3. System Reliability Improvement
Using error data to continuously improve system reliability:
- Reliability metrics - measurement of system reliability and error rates over time
- Failure pattern analysis - identification of common failure patterns and root causes
- Preventive measures - implementation of measures to prevent known error patterns
- Quality assurance - support for quality assurance processes and reliability testing
4. Business Continuity and Customer Experience
Minimizing business impact through effective error management:
- Customer impact minimization - rapid resolution of errors that affect customer experience
- Business process continuity - ensuring business processes continue despite technical errors
- SLA compliance - support for meeting service level agreement commitments
- Reputation protection - preventing error-related damage to business reputation
Benefits in Integration Contexts
1. Technical Advantages
- Reduced MTTR - faster error resolution through comprehensive error information and context
- Improved debugging efficiency - more effective debugging through detailed error context and correlation
- Proactive issue prevention - prevention of error recurrence through root cause analysis
- System reliability enhancement - continuous improvement of system reliability through error analysis
2. Operational Benefits
- Operational efficiency - more efficient operations through automated error detection and alerting
- Resource optimization - better resource allocation through understanding of error patterns and impacts
- Incident management - more effective incident management through coordinated error tracking
- Quality improvement - continuous quality improvement through systematic error analysis
3. Integration Enablement
- Service reliability - improved reliability of integration services through comprehensive error tracking
- Dependency monitoring - understanding of external service reliability and failure patterns
- Integration quality - better integration quality through systematic error monitoring and analysis
- Service coordination - better coordination of distributed services through shared error visibility
4. Business Value
- Customer satisfaction - higher satisfaction through rapid error resolution
- Business continuity - fewer and shorter disruptions through proactive error management
- Risk mitigation - reduced business risk through comprehensive error monitoring and management
- Competitive advantage - differentiation through superior system reliability and user experience
Integration Architecture Applications
1. Comprehensive Error Tracking System
Enterprise-grade error tracking with intelligent analysis and management:
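// Error Tracking Configuration
// Note: the services below are also annotated with @Service. If they are component-scanned,
// these @Bean methods register the same bean names a second time; keep only one registration style.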
@Configuration
@EnableConfigurationProperties(ErrorTrackingProperties.class)
public class ErrorTrackingConfiguration {
@Bean
public ErrorCaptureService errorCaptureService() {
return new ErrorCaptureService();
}
@Bean
public ErrorAnalysisService errorAnalysisService() {
return new ErrorAnalysisService();
}
@Bean
public ErrorAggregationService errorAggregationService() {
return new ErrorAggregationService();
}
@Bean
public ErrorAlertService errorAlertService() {
return new ErrorAlertService();
}
@Bean
public ErrorReportingService errorReportingService() {
return new ErrorReportingService();
}
@Bean
public ErrorResolutionTracker errorResolutionTracker() {
return new ErrorResolutionTracker();
}
}
// Error Capture Service
@Service
public class ErrorCaptureService {
@Autowired
private ErrorAnalysisService errorAnalysisService;
@Autowired
private ErrorAggregationService errorAggregationService;
@Autowired
private ContextCollectorService contextCollectorService;
private static final Logger logger = LoggerFactory.getLogger(ErrorCaptureService.class);
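// Note: @Async takes effect only when the call goes through the Spring proxy.
// The other methods in this class call captureError directly, so those calls run synchronously.
@Async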
public void captureError(Throwable throwable, ErrorContext context) {
try {
// Create error entry
ErrorEntry errorEntry = createErrorEntry(throwable, context);
// Enrich with additional context
enrichErrorContext(errorEntry);
// Calculate error fingerprint
String fingerprint = calculateErrorFingerprint(errorEntry);
errorEntry.setFingerprint(fingerprint);
// Determine error severity
ErrorSeverity severity = determineErrorSeverity(errorEntry);
errorEntry.setSeverity(severity);
// Analyze error
ErrorAnalysisResult analysis = errorAnalysisService.analyzeError(errorEntry);
errorEntry.setAnalysisResult(analysis);
// Aggregate with similar errors
ErrorGroup errorGroup = errorAggregationService.aggregateError(errorEntry);
// Store error
storeError(errorEntry, errorGroup);
// Send alerts if necessary
checkAndSendAlerts(errorEntry, errorGroup);
logger.info("Error captured and processed - ErrorId: {}, Type: {}, Severity: {}, Fingerprint: {}",
errorEntry.getId(), errorEntry.getExceptionType(),
errorEntry.getSeverity(), fingerprint);
} catch (Exception e) {
logger.error("Failed to capture error", e);
// Fallback error capture to prevent error capture failures
captureErrorCaptureFallback(throwable, context, e);
}
}
@EventListener
public void handleUncaughtException(UncaughtExceptionEvent event) {
ErrorContext context = ErrorContext.builder()
.source("UNCAUGHT_EXCEPTION")
.timestamp(Instant.now())
.threadName(Thread.currentThread().getName())
.build();
captureError(event.getThrowable(), context);
}
public void captureBusinessError(String errorCode, String errorMessage,
Map<String, Object> businessContext) {
BusinessError businessError = new BusinessError(errorCode, errorMessage);
ErrorContext context = ErrorContext.builder()
.source("BUSINESS_LOGIC")
.businessContext(businessContext)
.timestamp(Instant.now())
.build();
captureError(businessError, context);
}
public void captureIntegrationError(String serviceName, String endpoint,
Throwable throwable, IntegrationContext integrationContext) {
ErrorContext context = ErrorContext.builder()
.source("INTEGRATION")
.serviceName(serviceName)
.endpoint(endpoint)
.integrationContext(integrationContext)
.timestamp(Instant.now())
.build();
captureError(throwable, context);
}
private ErrorEntry createErrorEntry(Throwable throwable, ErrorContext context) {
ErrorEntry entry = new ErrorEntry();
entry.setId(UUID.randomUUID().toString());
entry.setTimestamp(context.getTimestamp());
entry.setExceptionType(throwable.getClass().getName());
entry.setExceptionMessage(throwable.getMessage());
entry.setStackTrace(getStackTraceString(throwable));
entry.setSource(context.getSource());
entry.setServiceName(context.getServiceName());
entry.setEndpoint(context.getEndpoint());
// Add cause chain
if (throwable.getCause() != null) {
entry.setCauseChain(buildCauseChain(throwable));
}
// Add suppressed exceptions
Throwable[] suppressed = throwable.getSuppressed();
if (suppressed.length > 0) {
entry.setSuppressedExceptions(Arrays.stream(suppressed)
.map(this::createSuppressedException)
.collect(Collectors.toList()));
}
return entry;
}
private void enrichErrorContext(ErrorEntry errorEntry) {
try {
// Add system context
SystemContext systemContext = contextCollectorService.collectSystemContext();
errorEntry.setSystemContext(systemContext);
// Add request context if available
RequestContext requestContext = contextCollectorService.collectRequestContext();
if (requestContext != null) {
errorEntry.setRequestContext(requestContext);
}
// Add user context if available
UserContext userContext = contextCollectorService.collectUserContext();
if (userContext != null) {
errorEntry.setUserContext(userContext);
}
// Add application context
ApplicationContext applicationContext = contextCollectorService.collectApplicationContext();
errorEntry.setApplicationContext(applicationContext);
// Add performance context
PerformanceContext performanceContext = contextCollectorService.collectPerformanceContext();
errorEntry.setPerformanceContext(performanceContext);
} catch (Exception e) {
logger.warn("Failed to enrich error context", e);
}
}
private String calculateErrorFingerprint(ErrorEntry errorEntry) {
try {
// Create fingerprint based on exception type, message pattern, and stack trace
StringBuilder fingerprintData = new StringBuilder();
// Add exception type
fingerprintData.append(errorEntry.getExceptionType());
// Add normalized error message (remove dynamic values)
String normalizedMessage = normalizeErrorMessage(errorEntry.getExceptionMessage());
fingerprintData.append("|").append(normalizedMessage);
// Add key stack trace elements (top 3-5 frames from application code)
List<String> keyStackFrames = extractKeyStackFrames(errorEntry.getStackTrace());
fingerprintData.append("|").append(String.join(",", keyStackFrames));
// Add source/endpoint information
if (errorEntry.getEndpoint() != null) {
fingerprintData.append("|").append(errorEntry.getEndpoint());
}
// Calculate hash
MessageDigest digest = MessageDigest.getInstance("SHA-256");
byte[] hash = digest.digest(fingerprintData.toString().getBytes(StandardCharsets.UTF_8));
return Base64.getEncoder().encodeToString(hash);
} catch (Exception e) {
logger.warn("Failed to calculate error fingerprint, using fallback", e);
// The timestamp makes every fallback fingerprint unique, so these errors are not grouped together
return "fallback-" + errorEntry.getExceptionType() + "-" + System.currentTimeMillis();
}
}
private ErrorSeverity determineErrorSeverity(ErrorEntry errorEntry) {
// Determine severity based on exception type, context, and impact
// Critical errors
if (isSecurityException(errorEntry)) {
return ErrorSeverity.CRITICAL;
}
if (isDataCorruptionException(errorEntry)) {
return ErrorSeverity.CRITICAL;
}
if (isSystemFailureException(errorEntry)) {
return ErrorSeverity.CRITICAL;
}
// High severity errors
if (isPaymentRelatedError(errorEntry)) {
return ErrorSeverity.HIGH;
}
if (isCustomerImpactingError(errorEntry)) {
return ErrorSeverity.HIGH;
}
if (isDatabaseException(errorEntry)) {
return ErrorSeverity.HIGH;
}
// Medium severity errors
if (isIntegrationException(errorEntry)) {
return ErrorSeverity.MEDIUM;
}
if (isValidationException(errorEntry)) {
return ErrorSeverity.MEDIUM;
}
// Low severity errors (default)
return ErrorSeverity.LOW;
}
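private String normalizeErrorMessage(String message) {
if (message == null) return "";
// Replace dynamic values with placeholders. Order matters: the specific timestamp and
// UUID patterns must run before the generic digit pattern, otherwise their digits have
// already been rewritten to {number} and the specific patterns can never match.
return message
.replaceAll("\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}", "{timestamp}")
.replaceAll("[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}", "{uuid}")
.replaceAll("'[^']*'", "'{string}'")
.replaceAll("\"[^\"]*\"", "\"{string}\"")
.replaceAll("\\d+", "{number}");
}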
private List<String> extractKeyStackFrames(String stackTrace) {
List<String> keyFrames = new ArrayList<>();
String[] lines = stackTrace.split("\n");
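for (String line : lines) {
String trimmed = line.trim();
// Stack frames start with "at "; matching only the prefix avoids touching "at " elsewhere in the line
if (trimmed.startsWith("at ") && isApplicationCode(trimmed)) {
// Extract method and class information
String frame = trimmed.substring(3)
.replaceAll("\\(.*\\)", "()"); // Remove line numbers and file info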
keyFrames.add(frame);
// Limit to top 5 application frames
if (keyFrames.size() >= 5) {
break;
}
}
}
return keyFrames;
}
private boolean isApplicationCode(String stackFrame) {
// Identify application code vs framework/library code
return stackFrame.contains("org.kallio") || // Application package
stackFrame.contains("com.yourcompany"); // Additional application packages
}
}
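The capture service above calls `getStackTraceString` and `buildCauseChain` without showing them. The following is one possible sketch of those two helpers using only the JDK. The `List<String>` return type of the cause chain and the class name `ThrowableHelpers` are assumptions, since the original does not show what `ErrorEntry.setCauseChain` accepts.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

final class ThrowableHelpers {
    private ThrowableHelpers() {}

    /** Renders the full stack trace, including "Caused by" sections, as a string. */
    static String getStackTraceString(Throwable t) {
        StringWriter buffer = new StringWriter();
        try (PrintWriter writer = new PrintWriter(buffer)) {
            t.printStackTrace(writer);
        }
        return buffer.toString();
    }

    /**
     * Lists the causes of t from outermost to innermost as "type: message".
     * The identity set stops the walk if a cause cycle was created via initCause.
     */
    static List<String> buildCauseChain(Throwable t) {
        List<String> chain = new ArrayList<>();
        Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        seen.add(t);
        Throwable cause = t.getCause();
        while (cause != null && seen.add(cause)) {
            chain.add(cause.getClass().getName() + ": " + cause.getMessage());
            cause = cause.getCause();
        }
        return chain;
    }
}
```

The chain excludes the top-level throwable itself, because its type and message are already stored on the error entry.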
// Error Analysis Service
@Service
public class ErrorAnalysisService {
@Autowired
private ErrorPatternMatcher errorPatternMatcher;
@Autowired
private ErrorImpactAnalyzer errorImpactAnalyzer;
@Autowired
private ErrorCorrelationService errorCorrelationService;
public ErrorAnalysisResult analyzeError(ErrorEntry errorEntry) {
ErrorAnalysisResult result = new ErrorAnalysisResult();
result.setErrorId(errorEntry.getId());
result.setAnalysisTimestamp(Instant.now());
// Pattern matching
List<ErrorPattern> matchedPatterns = errorPatternMatcher.matchPatterns(errorEntry);
result.setMatchedPatterns(matchedPatterns);
// Root cause analysis
RootCauseAnalysis rootCause = performRootCauseAnalysis(errorEntry, matchedPatterns);
result.setRootCauseAnalysis(rootCause);
// Impact analysis
ErrorImpactAssessment impact = errorImpactAnalyzer.assessImpact(errorEntry);
result.setImpactAssessment(impact);
// Correlation analysis
ErrorCorrelationResult correlation = errorCorrelationService.findCorrelatedErrors(errorEntry);
result.setCorrelationResult(correlation);
// Resolution suggestions
List<ResolutionSuggestion> suggestions = generateResolutionSuggestions(errorEntry, matchedPatterns);
result.setResolutionSuggestions(suggestions);
// Classification
ErrorClassification classification = classifyError(errorEntry, result);
result.setClassification(classification);
return result;
}
private RootCauseAnalysis performRootCauseAnalysis(ErrorEntry errorEntry,
List<ErrorPattern> matchedPatterns) {
RootCauseAnalysis analysis = new RootCauseAnalysis();
analysis.setAnalysisMethod("AUTOMATED_PATTERN_MATCHING");
// Analyze based on matched patterns
if (!matchedPatterns.isEmpty()) {
ErrorPattern primaryPattern = matchedPatterns.get(0);
analysis.setPossibleCauses(primaryPattern.getKnownCauses());
analysis.setConfidenceLevel(primaryPattern.getConfidenceLevel());
analysis.setRecommendedActions(primaryPattern.getRecommendedActions());
}
// Additional analysis based on context
analyzeContextualCauses(errorEntry, analysis);
// Analyze error timing and frequency
analyzeErrorTiming(errorEntry, analysis);
return analysis;
}
private void analyzeContextualCauses(ErrorEntry errorEntry, RootCauseAnalysis analysis) {
List<String> contextualCauses = new ArrayList<>();
// Analyze system context
if (errorEntry.getSystemContext() != null) {
SystemContext systemContext = errorEntry.getSystemContext();
if (systemContext.getMemoryUtilization() > 85) {
contextualCauses.add("High memory utilization (" +
systemContext.getMemoryUtilization() + "%)");
}
if (systemContext.getCpuUtilization() > 80) {
contextualCauses.add("High CPU utilization (" +
systemContext.getCpuUtilization() + "%)");
}
if (systemContext.getDiskUtilization() > 90) {
contextualCauses.add("High disk utilization (" +
systemContext.getDiskUtilization() + "%)");
}
}
// Analyze request context
if (errorEntry.getRequestContext() != null) {
RequestContext requestContext = errorEntry.getRequestContext();
if (requestContext.getRequestSize() > 50 * 1024 * 1024) { // 50MB
contextualCauses.add("Large request size (" +
formatBytes(requestContext.getRequestSize()) + ")");
}
if (requestContext.getProcessingTime() > 30000) { // 30 seconds
contextualCauses.add("Long processing time (" +
requestContext.getProcessingTime() + "ms)");
}
}
analysis.setContextualCauses(contextualCauses);
}
private List<ResolutionSuggestion> generateResolutionSuggestions(ErrorEntry errorEntry,
List<ErrorPattern> patterns) {
List<ResolutionSuggestion> suggestions = new ArrayList<>();
// Suggestions based on patterns
for (ErrorPattern pattern : patterns) {
for (String action : pattern.getRecommendedActions()) {
ResolutionSuggestion suggestion = new ResolutionSuggestion();
suggestion.setType(ResolutionSuggestionType.PATTERN_BASED);
suggestion.setAction(action);
suggestion.setConfidence(pattern.getConfidenceLevel());
suggestion.setDescription("Based on error pattern: " + pattern.getName());
suggestions.add(suggestion);
}
}
// Suggestions based on error type
suggestions.addAll(generateTypeBasedSuggestions(errorEntry));
// Suggestions based on context
suggestions.addAll(generateContextBasedSuggestions(errorEntry));
return suggestions.stream()
.sorted((a, b) -> Double.compare(b.getConfidence(), a.getConfidence()))
.limit(5) // Top 5 suggestions
.collect(Collectors.toList());
}
private List<ResolutionSuggestion> generateTypeBasedSuggestions(ErrorEntry errorEntry) {
List<ResolutionSuggestion> suggestions = new ArrayList<>();
String exceptionType = errorEntry.getExceptionType();
if (exceptionType.contains("OutOfMemoryError")) {
suggestions.add(createSuggestion(
ResolutionSuggestionType.INFRASTRUCTURE,
"Increase JVM heap size",
"Add -Xmx parameter to increase maximum heap memory",
0.8
));
suggestions.add(createSuggestion(
ResolutionSuggestionType.CODE,
"Analyze memory usage patterns",
"Review code for memory leaks and optimize data structures",
0.7
));
}
if (exceptionType.contains("TimeoutException")) {
suggestions.add(createSuggestion(
ResolutionSuggestionType.CONFIGURATION,
"Increase timeout configuration",
"Review and increase timeout values for affected operations",
0.7
));
suggestions.add(createSuggestion(
ResolutionSuggestionType.MONITORING,
"Monitor external service performance",
"Check performance of external dependencies",
0.6
));
}
if (exceptionType.contains("SQLException")) {
suggestions.add(createSuggestion(
ResolutionSuggestionType.DATABASE,
"Check database connection pool",
"Review database connection pool configuration and health",
0.8
));
suggestions.add(createSuggestion(
ResolutionSuggestionType.DATABASE,
"Analyze SQL query performance",
"Review SQL query execution plans and optimize if necessary",
0.7
));
}
return suggestions;
}
}
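`analyzeContextualCauses` relies on a `formatBytes` helper that is not shown. A minimal sketch follows, assuming binary units (1 KB = 1024 bytes) and one decimal place; both choices, and the class name `ByteFormat`, are assumptions made for illustration.

```java
import java.util.Locale;

final class ByteFormat {
    private static final String[] UNITS = {"B", "KB", "MB", "GB", "TB"};

    private ByteFormat() {}

    /** Formats a byte count with binary units, e.g. 1536 becomes "1.5 KB". */
    static String formatBytes(long bytes) {
        if (bytes < 0) {
            throw new IllegalArgumentException("bytes must be non-negative");
        }
        double value = bytes;
        int unit = 0;
        // Step up one unit at a time until the value fits or units run out
        while (value >= 1024 && unit < UNITS.length - 1) {
            value /= 1024;
            unit++;
        }
        if (unit == 0) {
            return bytes + " B";
        }
        // Locale.ROOT keeps the decimal separator stable across environments
        return String.format(Locale.ROOT, "%.1f %s", value, UNITS[unit]);
    }
}
```

Under these assumptions the 50 MB request-size threshold used above would be reported as "50.0 MB".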
// Error Aggregation Service
@Service
public class ErrorAggregationService {
@Autowired
private ErrorGroupRepository errorGroupRepository;
@Autowired
private ErrorEntryRepository errorEntryRepository;
private final Map<String, ErrorGroup> activeGroups = new ConcurrentHashMap<>();
public ErrorGroup aggregateError(ErrorEntry errorEntry) {
String fingerprint = errorEntry.getFingerprint();
// Find or create error group
ErrorGroup errorGroup = activeGroups.computeIfAbsent(fingerprint, fp -> {
// Check if group exists in database
Optional<ErrorGroup> existingGroup = errorGroupRepository.findByFingerprint(fp);
return existingGroup.orElseGet(() -> createNewErrorGroup(errorEntry));
});
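// captureError can run on several threads at once, and the group's counter, user set,
// and frequency map are not thread-safe, so update and persist under the group's monitor
synchronized (errorGroup) {
// Update group with new error
updateErrorGroup(errorGroup, errorEntry);
// Check if group status needs updating
updateGroupStatus(errorGroup);
// Persist updates
errorGroupRepository.save(errorGroup);
}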
return errorGroup;
}
private ErrorGroup createNewErrorGroup(ErrorEntry errorEntry) {
ErrorGroup group = new ErrorGroup();
group.setId(UUID.randomUUID().toString());
group.setFingerprint(errorEntry.getFingerprint());
group.setTitle(generateGroupTitle(errorEntry));
group.setDescription(generateGroupDescription(errorEntry));
group.setFirstSeen(errorEntry.getTimestamp());
group.setLastSeen(errorEntry.getTimestamp());
group.setErrorCount(0);
group.setAffectedUsers(new HashSet<>());
group.setStatus(ErrorGroupStatus.OPEN);
group.setSeverity(errorEntry.getSeverity());
group.setSource(errorEntry.getSource());
group.setServiceName(errorEntry.getServiceName());
group.setEnvironment(getEnvironment());
return group;
}
private void updateErrorGroup(ErrorGroup group, ErrorEntry errorEntry) {
// Update occurrence information
group.setLastSeen(errorEntry.getTimestamp());
group.setErrorCount(group.getErrorCount() + 1);
// Update severity if new error is more severe
// (assumes ErrorSeverity constants are declared from least to most severe, e.g. LOW ... CRITICAL)
if (errorEntry.getSeverity().ordinal() > group.getSeverity().ordinal()) {
group.setSeverity(errorEntry.getSeverity());
}
// Track affected users
if (errorEntry.getUserContext() != null &&
errorEntry.getUserContext().getUserId() != null) {
group.getAffectedUsers().add(errorEntry.getUserContext().getUserId());
}
// Update frequency metrics
updateFrequencyMetrics(group, errorEntry);
// Update trend information
updateTrendInformation(group, errorEntry);
}
private void updateGroupStatus(ErrorGroup group) {
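// Auto-resolve if no new errors for a certain period.
// When called from aggregateError, lastSeen has just been refreshed, so this branch can
// only fire if updateGroupStatus is also run separately, e.g. from a periodic sweep.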
Duration timeSinceLastSeen = Duration.between(group.getLastSeen(), Instant.now());
if (group.getStatus() == ErrorGroupStatus.OPEN &&
timeSinceLastSeen.toDays() > 30 &&
group.getErrorCount() < 5) {
group.setStatus(ErrorGroupStatus.AUTO_RESOLVED);
group.setResolvedAt(Instant.now());
group.setResolutionNote("Auto-resolved: No new occurrences for 30 days");
}
// Escalate if error frequency increases significantly
if (group.getStatus() == ErrorGroupStatus.OPEN &&
hasFrequencySpike(group)) {
group.setStatus(ErrorGroupStatus.ESCALATED);
group.setEscalatedAt(Instant.now());
}
}
private void updateFrequencyMetrics(ErrorGroup group, ErrorEntry errorEntry) {
// Update hourly frequency
LocalDateTime errorHour = errorEntry.getTimestamp().atZone(ZoneOffset.UTC)
.truncatedTo(ChronoUnit.HOURS).toLocalDateTime();
Map<LocalDateTime, Integer> hourlyFrequency = group.getHourlyFrequency();
if (hourlyFrequency == null) {
hourlyFrequency = new HashMap<>();
group.setHourlyFrequency(hourlyFrequency);
}
hourlyFrequency.merge(errorHour, 1, Integer::sum);
// Keep only last 48 hours of data
Instant cutoff = Instant.now().minus(Duration.ofHours(48));
hourlyFrequency.entrySet().removeIf(entry ->
entry.getKey().toInstant(ZoneOffset.UTC).isBefore(cutoff));
}
public ErrorGroupSummary getErrorGroupSummary(Duration period) {
Instant startTime = Instant.now().minus(period);
List<ErrorGroup> groups = errorGroupRepository.findByLastSeenAfter(startTime);
ErrorGroupSummary summary = new ErrorGroupSummary();
summary.setPeriod(period);
summary.setTotalGroups(groups.size());
// Calculate statistics
summary.setOpenGroups((int) groups.stream()
.filter(g -> g.getStatus() == ErrorGroupStatus.OPEN)
.count());
summary.setResolvedGroups((int) groups.stream()
.filter(g -> g.getStatus() == ErrorGroupStatus.RESOLVED)
.count());
summary.setEscalatedGroups((int) groups.stream()
.filter(g -> g.getStatus() == ErrorGroupStatus.ESCALATED)
.count());
// Calculate total errors
int totalErrors = groups.stream()
.mapToInt(ErrorGroup::getErrorCount)
.sum();
summary.setTotalErrors(totalErrors);
// Calculate affected users
Set<String> allAffectedUsers = groups.stream()
.flatMap(g -> g.getAffectedUsers().stream())
.collect(Collectors.toSet());
summary.setAffectedUsers(allAffectedUsers.size());
// Top error groups
List<ErrorGroup> topGroups = groups.stream()
.sorted((a, b) -> Integer.compare(b.getErrorCount(), a.getErrorCount()))
.limit(10)
.collect(Collectors.toList());
summary.setTopErrorGroups(topGroups);
return summary;
}
}
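`updateGroupStatus` calls `hasFrequencySpike`, which is not shown. One plausible reading, sketched below, compares the current hour's count against the average of the preceding hours in the hourly frequency map. The baseline length, multiplier, and minimum count are assumed parameters, not values taken from the original.

```java
import java.time.LocalDateTime;
import java.util.Map;

final class FrequencySpikeDetector {
    private FrequencySpikeDetector() {}

    /**
     * Returns true when the count for currentHour is at least minimumCount and
     * exceeds factor times the average of the preceding baselineHours buckets.
     * Hours missing from the map count as zero.
     */
    static boolean hasFrequencySpike(Map<LocalDateTime, Integer> hourlyFrequency,
                                     LocalDateTime currentHour,
                                     int baselineHours,
                                     double factor,
                                     int minimumCount) {
        if (baselineHours < 1) {
            throw new IllegalArgumentException("baselineHours must be at least 1");
        }
        int current = hourlyFrequency.getOrDefault(currentHour, 0);
        if (current < minimumCount) {
            return false; // Too few errors to call it a spike, whatever the ratio
        }
        long baselineTotal = 0;
        for (int i = 1; i <= baselineHours; i++) {
            baselineTotal += hourlyFrequency.getOrDefault(currentHour.minusHours(i), 0);
        }
        double baselineAverage = (double) baselineTotal / baselineHours;
        return current > baselineAverage * factor;
    }
}
```

The minimum count guards against a group that jumps from zero to one or two errors being escalated on ratio alone. The keys are expected to be truncated to the hour, as `updateFrequencyMetrics` does.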
// Error Alert Service
@Service
public class ErrorAlertService {
@Autowired
private NotificationService notificationService;
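@Autowired
private ErrorAlertConfiguration alertConfiguration;
// Used by shouldAlertOnFrequency to count recent occurrences
@Autowired
private ErrorEntryRepository errorEntryRepository;
private static final Logger logger = LoggerFactory.getLogger(ErrorAlertService.class);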
@Value("${error.alerts.slack.webhook-url}")
private String slackWebhookUrl;
@Value("${error.alerts.email.recipients}")
private List<String> emailRecipients;
public void checkAndSendAlerts(ErrorEntry errorEntry, ErrorGroup errorGroup) {
List<ErrorAlert> alerts = new ArrayList<>();
// Check severity-based alerts
if (shouldAlertOnSeverity(errorEntry.getSeverity())) {
alerts.add(createSeverityAlert(errorEntry, errorGroup));
}
// Check frequency-based alerts
if (shouldAlertOnFrequency(errorGroup)) {
alerts.add(createFrequencyAlert(errorGroup));
}
// Check new error type alerts
if (isNewErrorType(errorEntry, errorGroup)) {
alerts.add(createNewErrorTypeAlert(errorEntry, errorGroup));
}
// Check user impact alerts
if (shouldAlertOnUserImpact(errorGroup)) {
alerts.add(createUserImpactAlert(errorGroup));
}
// Send alerts
for (ErrorAlert alert : alerts) {
sendAlert(alert);
}
}
private boolean shouldAlertOnSeverity(ErrorSeverity severity) {
return alertConfiguration.getSeverityAlertThresholds().contains(severity);
}
private boolean shouldAlertOnFrequency(ErrorGroup errorGroup) {
// Alert if error count exceeds threshold within time window
int threshold = alertConfiguration.getFrequencyThreshold();
Duration timeWindow = alertConfiguration.getFrequencyTimeWindow();
Instant cutoff = Instant.now().minus(timeWindow);
// Count recent errors
int recentCount = errorEntryRepository.countByGroupFingerprintAndTimestampAfter(
errorGroup.getFingerprint(), cutoff);
return recentCount >= threshold;
}
private boolean isNewErrorType(ErrorEntry errorEntry, ErrorGroup errorGroup) {
return errorGroup.getErrorCount() == 1; // First occurrence
}
private boolean shouldAlertOnUserImpact(ErrorGroup errorGroup) {
int userThreshold = alertConfiguration.getUserImpactThreshold();
return errorGroup.getAffectedUsers().size() >= userThreshold;
}
private void sendAlert(ErrorAlert alert) {
try {
NotificationMessage message = createNotificationMessage(alert);
// Send based on severity
switch (alert.getSeverity()) {
case CRITICAL:
notificationService.sendEmail(emailRecipients, message);
notificationService.sendSlack(slackWebhookUrl, message);
notificationService.sendSms(getOnCallContacts(), message);
break;
case HIGH:
notificationService.sendEmail(emailRecipients, message);
notificationService.sendSlack(slackWebhookUrl, message);
break;
case MEDIUM:
notificationService.sendSlack(slackWebhookUrl, message);
break;
case LOW:
// Only log for low severity
logger.info("Error alert: {}", alert.getMessage());
break;
}
// Store alert for tracking
storeAlert(alert);
logger.info("Error alert sent - Type: {}, Severity: {}, ErrorGroup: {}",
alert.getType(), alert.getSeverity(), alert.getErrorGroupId());
} catch (Exception e) {
logger.error("Failed to send error alert", e);
}
}
}
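As written, the alert service can send a notification for every qualifying error, which may flood channels during an incident. A common mitigation is a per-group cooldown. The sketch below is a hypothetical addition, not part of the original design, and `AlertCooldown` and its method are illustrative names.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical guard that allows at most one alert per fingerprint per cooldown period. */
final class AlertCooldown {
    private final Duration cooldown;
    private final Map<String, Instant> lastSent = new ConcurrentHashMap<>();

    AlertCooldown(Duration cooldown) {
        this.cooldown = cooldown;
    }

    /** Returns true, and records the send time, if an alert may be sent now. */
    boolean tryAcquire(String fingerprint, Instant now) {
        boolean[] allowed = {false};
        // compute runs atomically per key, so concurrent callers cannot both pass
        lastSent.compute(fingerprint, (key, previous) -> {
            if (previous == null || !now.isBefore(previous.plus(cooldown))) {
                allowed[0] = true;
                return now;
            }
            return previous;
        });
        return allowed[0];
    }
}
```

A caller would check `tryAcquire(errorGroup.getFingerprint(), Instant.now())` before sending and skip the notification when it returns false. Critical alerts might bypass the cooldown entirely; that policy choice is left open here.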
2. Error Resolution Tracking System
Systematic tracking and management of error resolution processes:
// Error Resolution Tracker
@Service
public class ErrorResolutionTracker {
@Autowired
private ErrorGroupRepository errorGroupRepository;
@Autowired
private ResolutionActivityRepository resolutionActivityRepository;
@Autowired
private NotificationService notificationService;
private static final Logger logger = LoggerFactory.getLogger(ErrorResolutionTracker.class);
public ResolutionTicket createResolutionTicket(ErrorGroup errorGroup, String assignee,
ResolutionPriority priority) {
ResolutionTicket ticket = new ResolutionTicket();
ticket.setId(UUID.randomUUID().toString());
ticket.setErrorGroupId(errorGroup.getId());
ticket.setTitle("Resolve: " + errorGroup.getTitle());
ticket.setDescription(generateResolutionDescription(errorGroup));
ticket.setAssignee(assignee);
ticket.setPriority(priority);
ticket.setStatus(ResolutionStatus.OPEN);
ticket.setCreatedAt(Instant.now());
ticket.setDueDate(calculateDueDate(priority));
// Add resolution suggestions
List<ResolutionSuggestion> suggestions = getResolutionSuggestions(errorGroup);
ticket.setSuggestions(suggestions);
// Create initial activity
ResolutionActivity activity = new ResolutionActivity();
activity.setTicketId(ticket.getId());
activity.setType(ResolutionActivityType.CREATED);
activity.setDescription("Resolution ticket created");
activity.setUserId("SYSTEM");
activity.setTimestamp(Instant.now());
resolutionActivityRepository.save(activity);
// Update error group status
errorGroup.setStatus(ErrorGroupStatus.IN_PROGRESS);
errorGroup.setAssignee(assignee);
errorGroupRepository.save(errorGroup);
// Send notification
notifyAssignment(ticket, assignee);
logger.info("Resolution ticket created - TicketId: {}, ErrorGroup: {}, Assignee: {}",
ticket.getId(), errorGroup.getId(), assignee);
return ticket;
}
public void updateResolutionProgress(String ticketId, String userId, String update,
ResolutionProgressType progressType) {
ResolutionTicket ticket = getResolutionTicket(ticketId);
// Create activity record
ResolutionActivity activity = new ResolutionActivity();
activity.setTicketId(ticketId);
activity.setType(ResolutionActivityType.PROGRESS_UPDATE);
activity.setDescription(update);
activity.setUserId(userId);
activity.setTimestamp(Instant.now());
activity.setProgressType(progressType);
resolutionActivityRepository.save(activity);
// Update ticket status if needed
updateTicketStatus(ticket, progressType);
logger.info("Resolution progress updated - TicketId: {}, UserId: {}, Type: {}",
ticketId, userId, progressType);
}
public void markResolved(String ticketId, String userId, String resolutionNote,
ResolutionType resolutionType) {
ResolutionTicket ticket = getResolutionTicket(ticketId);
// Update ticket
ticket.setStatus(ResolutionStatus.RESOLVED);
ticket.setResolvedBy(userId);
ticket.setResolvedAt(Instant.now());
ticket.setResolutionNote(resolutionNote);
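ticket.setResolutionType(resolutionType);
// Persist the resolved state; without this the ticket changes above are never stored
resolutionTicketRepository.save(ticket);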
// Create resolution activity
ResolutionActivity activity = new ResolutionActivity();
activity.setTicketId(ticketId);
activity.setType(ResolutionActivityType.RESOLVED);
activity.setDescription("Ticket resolved: " + resolutionNote);
activity.setUserId(userId);
activity.setTimestamp(Instant.now());
resolutionActivityRepository.save(activity);
// Update error group
ErrorGroup errorGroup = errorGroupRepository.findById(ticket.getErrorGroupId()).orElse(null);
if (errorGroup != null) {
errorGroup.setStatus(ErrorGroupStatus.RESOLVED);
errorGroup.setResolvedAt(Instant.now());
errorGroup.setResolutionNote(resolutionNote);
errorGroupRepository.save(errorGroup);
}
// Calculate resolution metrics
updateResolutionMetrics(ticket);
// Send notifications
notifyResolution(ticket, resolutionType);
logger.info("Resolution ticket marked resolved - TicketId: {}, ResolvedBy: {}, Type: {}",
ticketId, userId, resolutionType);
}
public void markVerified(String ticketId, String userId, boolean verificationSuccess,
String verificationNote) {
ResolutionTicket ticket = getResolutionTicket(ticketId);
if (verificationSuccess) {
ticket.setStatus(ResolutionStatus.VERIFIED);
ticket.setVerifiedBy(userId);
ticket.setVerifiedAt(Instant.now());
ticket.setVerificationNote(verificationNote);
// Create verification activity
ResolutionActivity activity = new ResolutionActivity();
activity.setTicketId(ticketId);
activity.setType(ResolutionActivityType.VERIFIED);
activity.setDescription("Resolution verified: " + verificationNote);
activity.setUserId(userId);
activity.setTimestamp(Instant.now());
resolutionActivityRepository.save(activity);
// Update error group status to closed
ErrorGroup errorGroup = errorGroupRepository.findById(ticket.getErrorGroupId()).orElse(null);
if (errorGroup != null) {
errorGroup.setStatus(ErrorGroupStatus.CLOSED);
errorGroupRepository.save(errorGroup);
}
} else {
// Reopen ticket
ticket.setStatus(ResolutionStatus.REOPENED);
ResolutionActivity activity = new ResolutionActivity();
activity.setTicketId(ticketId);
activity.setType(ResolutionActivityType.REOPENED);
activity.setDescription("Resolution verification failed: " + verificationNote);
activity.setUserId(userId);
activity.setTimestamp(Instant.now());
resolutionActivityRepository.save(activity);
// Update error group status back to in progress
ErrorGroup errorGroup = errorGroupRepository.findById(ticket.getErrorGroupId()).orElse(null);
if (errorGroup != null) {
errorGroup.setStatus(ErrorGroupStatus.IN_PROGRESS);
errorGroupRepository.save(errorGroup);
}
}
// Persist the ticket's new status (VERIFIED or REOPENED)
resolutionTicketRepository.save(ticket);
logger.info("Resolution verification completed - TicketId: {}, Success: {}, VerifiedBy: {}",
ticketId, verificationSuccess, userId);
}
public ResolutionReport generateResolutionReport(Duration period) {
Instant endTime = Instant.now();
Instant startTime = endTime.minus(period);
ResolutionReport report = new ResolutionReport();
report.setPeriod(period);
report.setStartTime(startTime);
report.setEndTime(endTime);
report.setGeneratedAt(Instant.now());
// Get resolution tickets in period
List<ResolutionTicket> tickets = resolutionTicketRepository.findByCreatedAtBetween(startTime, endTime);
// Calculate metrics
ResolutionMetrics metrics = calculateResolutionMetrics(tickets);
report.setMetrics(metrics);
// Resolution time analysis
ResolutionTimeAnalysis timeAnalysis = analyzeResolutionTimes(tickets);
report.setTimeAnalysis(timeAnalysis);
// Assignee performance
List<AssigneePerformance> assigneePerformance = analyzeAssigneePerformance(tickets);
report.setAssigneePerformance(assigneePerformance);
// Resolution type breakdown
Map<ResolutionType, Integer> resolutionTypeBreakdown = tickets.stream()
.filter(t -> t.getResolutionType() != null)
.collect(Collectors.groupingBy(
ResolutionTicket::getResolutionType,
Collectors.collectingAndThen(Collectors.counting(), Long::intValue)
));
report.setResolutionTypeBreakdown(resolutionTypeBreakdown);
return report;
}
}
// Error REST Controller
@RestController
@RequestMapping("/api/errors")
public class ErrorTrackingController {
@Autowired
private ErrorCaptureService errorCaptureService;
@Autowired
private ErrorAggregationService errorAggregationService;
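@Autowired
private ErrorResolutionTracker resolutionTracker;
// Repositories used directly by the query endpoints below
// (type names are assumed to match the repositories defined elsewhere in this example)
@Autowired
private ErrorGroupRepository errorGroupRepository;
@Autowired
private ErrorEntryRepository errorEntryRepository;
@Autowired
private ResolutionTicketRepository resolutionTicketRepository;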
@PostMapping("/capture")
public ResponseEntity<Map<String, String>> captureError(@RequestBody ErrorCaptureRequest request) {
try {
ErrorContext context = ErrorContext.builder()
.source(request.getSource())
.serviceName(request.getServiceName())
.endpoint(request.getEndpoint())
.businessContext(request.getBusinessContext())
.timestamp(Instant.now())
.build();
Exception exception = new Exception(request.getMessage());
exception.setStackTrace(parseStackTrace(request.getStackTrace()));
errorCaptureService.captureError(exception, context);
return ResponseEntity.ok(Map.of("status", "captured"));
} catch (Exception e) {
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
// Map.of rejects null values, so guard against exceptions without a message
.body(Map.of("error", String.valueOf(e.getMessage())));
}
}
@GetMapping("/groups")
public ResponseEntity<PagedResponse<ErrorGroup>> getErrorGroups(
@RequestParam(defaultValue = "0") int page,
@RequestParam(defaultValue = "20") int size,
@RequestParam(required = false) ErrorGroupStatus status,
@RequestParam(required = false) ErrorSeverity severity,
@RequestParam(defaultValue = "lastSeen") String sortBy,
@RequestParam(defaultValue = "desc") String sortDirection) {
Pageable pageable = PageRequest.of(page, size,
Sort.Direction.fromString(sortDirection), sortBy);
Page<ErrorGroup> groups = errorGroupRepository.findWithFilters(
status, severity, pageable);
PagedResponse<ErrorGroup> response = new PagedResponse<>();
response.setContent(groups.getContent());
response.setPageNumber(groups.getNumber());
response.setPageSize(groups.getSize());
response.setTotalElements(groups.getTotalElements());
response.setTotalPages(groups.getTotalPages());
return ResponseEntity.ok(response);
}
@GetMapping("/groups/{groupId}")
public ResponseEntity<ErrorGroupDetail> getErrorGroupDetail(@PathVariable String groupId) {
Optional<ErrorGroup> group = errorGroupRepository.findById(groupId);
if (group.isEmpty()) {
return ResponseEntity.notFound().build();
}
ErrorGroupDetail detail = new ErrorGroupDetail();
detail.setGroup(group.get());
// Get recent errors
List<ErrorEntry> recentErrors = errorEntryRepository
.findByGroupFingerprintOrderByTimestampDesc(group.get().getFingerprint(),
PageRequest.of(0, 10));
detail.setRecentErrors(recentErrors);
// Get resolution ticket if exists
Optional<ResolutionTicket> ticket = resolutionTicketRepository
.findByErrorGroupId(groupId);
detail.setResolutionTicket(ticket.orElse(null));
return ResponseEntity.ok(detail);
}
@GetMapping("/summary")
public ResponseEntity<ErrorSummary> getErrorSummary(
@RequestParam(defaultValue = "P1D") String period) {
Duration summaryPeriod = Duration.parse(period);
ErrorGroupSummary groupSummary = errorAggregationService.getErrorGroupSummary(summaryPeriod);
ResolutionReport resolutionReport = resolutionTracker.generateResolutionReport(summaryPeriod);
ErrorSummary summary = new ErrorSummary();
summary.setGroupSummary(groupSummary);
summary.setResolutionReport(resolutionReport);
summary.setPeriod(summaryPeriod);
summary.setGeneratedAt(Instant.now());
return ResponseEntity.ok(summary);
}
@PostMapping("/groups/{groupId}/resolve")
public ResponseEntity<Map<String, String>> createResolutionTicket(
@PathVariable String groupId,
@RequestBody ResolutionTicketRequest request) {
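try {
Optional<ErrorGroup> group = errorGroupRepository.findById(groupId);
if (group.isEmpty()) {
// A missing group is a client error, so report 404 rather than 500
return ResponseEntity.status(HttpStatus.NOT_FOUND)
.body(Map.of("error", "Error group not found"));
}
ResolutionTicket ticket = resolutionTracker.createResolutionTicket(
group.get(), request.getAssignee(), request.getPriority());
return ResponseEntity.ok(Map.of("ticketId", ticket.getId()));
} catch (Exception e) {
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
// Map.of rejects null values, so guard against exceptions without a message
.body(Map.of("error", String.valueOf(e.getMessage())));
}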
}
@PostMapping("/tickets/{ticketId}/progress")
public ResponseEntity<Map<String, String>> updateProgress(
@PathVariable String ticketId,
@RequestBody ProgressUpdateRequest request) {
try {
resolutionTracker.updateResolutionProgress(
ticketId, request.getUserId(), request.getUpdate(), request.getProgressType());
return ResponseEntity.ok(Map.of("status", "updated"));
} catch (Exception e) {
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
// Map.of rejects null values, so guard against exceptions without a message
.body(Map.of("error", String.valueOf(e.getMessage())));
}
}
@PostMapping("/tickets/{ticketId}/resolve")
public ResponseEntity<Map<String, String>> markResolved(
@PathVariable String ticketId,
@RequestBody ResolveRequest request) {
try {
resolutionTracker.markResolved(
ticketId, request.getUserId(), request.getNote(), request.getResolutionType());
return ResponseEntity.ok(Map.of("status", "resolved"));
} catch (Exception e) {
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
// Map.of rejects null values, so guard against exceptions without a message
.body(Map.of("error", String.valueOf(e.getMessage())));
}
}
}
Best Practices
1. Comprehensive Error Capture Strategy
- Implement error capture at multiple levels (application, infrastructure, integration, business logic)
- Use structured error logging with consistent formats and comprehensive context information
- Implement intelligent error fingerprinting to group related errors and reduce noise
- Capture sufficient context information for effective debugging and root cause analysis
- Implement fallback capture mechanisms so that errors are still recorded when the primary capture path fails
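The fingerprinting practice above can be sketched as follows. This is a minimal illustration, not the fingerprinting used by the services shown earlier: the class name `ErrorFingerprinter`, the five-frame limit, and the message normalization rules are all assumptions chosen for the example. It hashes the exception type, the top stack frames (without line numbers), and a normalized message so that errors differing only in IDs or counters fall into the same group.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: not part of the services shown earlier.
final class ErrorFingerprinter {
    // Assumption: the top few frames are enough to identify the failure site.
    private static final int MAX_FRAMES = 5;

    /** Builds a stable fingerprint from exception type, top frames, and normalized message. */
    static String fingerprint(Throwable error) {
        StringBuilder key = new StringBuilder(error.getClass().getName());
        StackTraceElement[] frames = error.getStackTrace();
        for (int i = 0; i < Math.min(MAX_FRAMES, frames.length); i++) {
            // Line numbers are left out so small code edits do not split a group.
            key.append('|').append(frames[i].getClassName())
               .append('#').append(frames[i].getMethodName());
        }
        key.append('|').append(normalize(error.getMessage()));
        return sha256Hex(key.toString());
    }

    /** Replaces variable parts of a message (UUIDs, numbers) with placeholders. */
    static String normalize(String message) {
        if (message == null) {
            return "";
        }
        return message
            .replaceAll("[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}", "<uuid>")
            .replaceAll("\\d+", "<n>");
    }

    private static String sha256Hex(String input) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }
}
```

Leaving line numbers out of the key trades some precision for stability across releases; a stricter grouping could include them.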
2. Intelligent Error Analysis and Classification
- Use pattern matching and machine learning to identify common error patterns and root causes
- Implement automated severity assessment based on error type, context, and business impact
- Use correlation analysis to identify related errors and failure cascades
- Provide automated root cause analysis suggestions based on error patterns and context
- Implement error trend analysis to identify emerging issues and degradation patterns
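Automated severity assessment can start as a small set of rules before any machine learning is involved. The sketch below is one possible shape, under stated assumptions: the `AssessedSeverity` enum, the `SeverityAssessor` class, the mapping from exception types to base levels, and the threshold of 100 occurrences per hour are all hypothetical and would need tuning for a real system.

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

// Hypothetical severity scale, kept separate from the ErrorSeverity type used earlier.
enum AssessedSeverity { LOW, MEDIUM, HIGH, CRITICAL }

// Hypothetical rule-based assessor combining error type, business impact, and frequency.
final class SeverityAssessor {
    // Assumption: 100 or more occurrences per hour counts as high frequency.
    private static final int HIGH_FREQUENCY_PER_HOUR = 100;

    static AssessedSeverity assess(Throwable error, boolean businessCritical,
                                   int occurrencesLastHour) {
        int level = baseLevel(error);
        if (businessCritical) {
            level++; // errors on a business-critical path are raised one step
        }
        if (occurrencesLastHour >= HIGH_FREQUENCY_PER_HOUR) {
            level++; // frequent errors are raised one more step
        }
        AssessedSeverity[] all = AssessedSeverity.values();
        // Cap at the highest defined severity.
        return all[Math.min(level, all.length - 1)];
    }

    private static int baseLevel(Throwable error) {
        if (error instanceof Error) {
            // JVM-level failures such as OutOfMemoryError
            return AssessedSeverity.HIGH.ordinal();
        }
        if (error instanceof IOException || error instanceof TimeoutException) {
            // Typical integration and communication failures
            return AssessedSeverity.MEDIUM.ordinal();
        }
        if (error instanceof IllegalArgumentException) {
            // Typical input validation failures
            return AssessedSeverity.LOW.ordinal();
        }
        return AssessedSeverity.MEDIUM.ordinal();
    }
}
```

Keeping the rules explicit makes each severity decision easy to explain in an alert or ticket.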
3. Effective Error Aggregation and Management
- Group related errors to cut duplicate noise and improve the signal-to-noise ratio
- Implement dynamic error group management with automatic status updates and lifecycle management
- Use frequency analysis and trend detection to identify significant error patterns
- Implement error impact assessment to prioritize resolution efforts effectively
- Provide comprehensive error group reporting and analytics for continuous improvement
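Frequency analysis for an error group can be as simple as comparing the current window's count against recent history. The following sketch assumes per-window counts are already available; `ErrorSpikeDetector`, its parameters, and the minimum deviation of 1.0 are illustrative choices rather than anything defined earlier. It flags a spike when the current count exceeds the baseline mean by a configurable number of standard deviations.

```java
import java.util.List;

// Hypothetical helper for flagging unusual error frequency in one group.
final class ErrorSpikeDetector {

    /**
     * Returns true when currentCount exceeds the baseline mean by more than
     * sigmaThreshold standard deviations. An empty baseline never flags a spike.
     */
    static boolean isSpike(List<Integer> baselineCounts, int currentCount,
                           double sigmaThreshold) {
        if (baselineCounts.isEmpty()) {
            return false;
        }
        double mean = baselineCounts.stream()
            .mapToInt(Integer::intValue)
            .average()
            .orElse(0.0);
        double variance = baselineCounts.stream()
            .mapToDouble(c -> (c - mean) * (c - mean))
            .average()
            .orElse(0.0);
        // Assumption: floor the deviation at 1.0 so a perfectly flat baseline
        // does not turn every single extra error into a spike.
        double stdDev = Math.max(Math.sqrt(variance), 1.0);
        return currentCount > mean + sigmaThreshold * stdDev;
    }
}
```

For example, `ErrorSpikeDetector.isSpike(List.of(10, 12, 11, 9, 10), 30, 3.0)` reports a spike, while a current count of 12 against the same baseline does not.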
4. Proactive Error Alerting and Communication
- Implement intelligent alerting based on error severity, frequency, and business impact
- Use escalation policies and notification channels appropriate for different error types
- Avoid alert fatigue through intelligent alert grouping and rate limiting
- Provide context-rich alerts with actionable information for resolution teams
- Implement alert acknowledgment and resolution tracking for accountability
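One common way to limit alert fatigue is a per-group cooldown: after an alert fires for an error group, further alerts for the same group are suppressed for a fixed period. The sketch below assumes groups are identified by a fingerprint string; `AlertRateLimiter` and its method names are hypothetical and not part of the services shown earlier.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-fingerprint cooldown for outgoing alerts.
final class AlertRateLimiter {
    private final Duration cooldown;
    private final Map<String, Instant> lastAlertAt = new ConcurrentHashMap<>();

    AlertRateLimiter(Duration cooldown) {
        this.cooldown = cooldown;
    }

    /**
     * Returns true if an alert for this fingerprint may be sent at the given time,
     * and records that time as the latest alert. Returns false while the
     * cooldown since the last sent alert has not yet elapsed.
     */
    boolean shouldAlert(String fingerprint, Instant now) {
        boolean[] allowed = {false};
        // compute() runs atomically per key, so concurrent callers cannot both pass.
        lastAlertAt.compute(fingerprint, (key, previous) -> {
            if (previous == null || !now.isBefore(previous.plus(cooldown))) {
                allowed[0] = true;
                return now;
            }
            return previous;
        });
        return allowed[0];
    }
}
```

Passing the current time in as a parameter keeps the limiter easy to test; a production version might also evict old entries to bound memory use.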
5. Systematic Error Resolution and Prevention
- Implement systematic error resolution tracking with clear ownership and accountability
- Use resolution workflow management to guide error investigation and resolution processes
- Provide automated resolution suggestions based on error patterns and historical resolutions
- Implement error resolution verification and validation processes
- Use error analysis for proactive prevention and system reliability improvements
6. Integration and Automation
- Integrate error tracking with incident management and support systems
- Implement automated error escalation based on business impact and resolution time
- Use error tracking data for system reliability metrics and SLA monitoring
- Integrate with deployment and release management for error tracking during deployments
- Implement automated remediation for known error patterns when appropriate
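Escalation based on resolution time can be derived from a per-priority resolution target. The sketch below is only an illustration: the `TicketPriority` enum, its target durations, and the 75% warning threshold are assumptions, and the `calculateDueDate` helper referenced in the tracker above may work differently.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical priorities with illustrative resolution targets.
enum TicketPriority {
    CRITICAL(Duration.ofHours(4)),
    HIGH(Duration.ofHours(24)),
    MEDIUM(Duration.ofDays(3)),
    LOW(Duration.ofDays(7));

    final Duration resolutionTarget;

    TicketPriority(Duration resolutionTarget) {
        this.resolutionTarget = resolutionTarget;
    }
}

// Hypothetical policy deriving due dates and escalation levels from the targets.
final class EscalationPolicy {

    /** Due date is the creation time plus the priority's resolution target. */
    static Instant dueDate(TicketPriority priority, Instant createdAt) {
        return createdAt.plus(priority.resolutionTarget);
    }

    /**
     * Returns 0 while the ticket is on track, 1 once 75% of the target has
     * elapsed (warn the assignee), and 2 once the target is reached (escalate).
     */
    static int escalationLevel(TicketPriority priority, Instant createdAt, Instant now) {
        Duration elapsed = Duration.between(createdAt, now);
        Duration target = priority.resolutionTarget;
        if (elapsed.compareTo(target) >= 0) {
            return 2;
        }
        // Integer comparison of elapsed/target >= 3/4 without floating point.
        if (elapsed.toMillis() * 4 >= target.toMillis() * 3) {
            return 1;
        }
        return 0;
    }
}
```

A scheduled job could evaluate `escalationLevel` for open tickets and notify the next tier whenever the level rises.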
Error Tracking is essential for maintaining system reliability, resolving issues quickly, and supporting effective debugging and troubleshooting in complex enterprise integration architectures. It also drives continuous improvement, providing the foundation for reliable, high-quality systems that meet business and customer expectations.