Observability

Seeing Inside Distributed Systems

The Challenge

In a Monolith:

  • One log file
  • One database connection pool to monitor
  • Stack traces show the full picture

In Microservices:

  • Logs scattered across 20+ services
  • Requests span multiple services
  • "Where did my request fail?"

You can't debug what you can't see.

The Three Pillars of Observability

1. Logs
Individual events: "User logged in", "Payment failed"

2. Metrics
Aggregated numbers: CPU%, Request count, Error rate

3. Traces
Request journeys: "Request entered Gateway → Patient Service → DB → Response"

Together, they answer: What happened? Why? Where?

Observability vs. Monitoring

Monitoring:
"Is the system up? Are we within SLA?"
Known unknowns (predefined dashboards).

Observability:
"Why is Checkout slow for users in Europe at 3pm?"
Unknown unknowns (exploratory investigation).

Microservices need observability because failures are emergent (cascading, timing-based).

The Practice Manager Stack

We'll implement observability using:

  • Logging: Structured JSON logs
  • Metrics: Prometheus + Grafana
  • Tracing: OpenTelemetry + Jaeger
  • Aggregation: Loki (optional)

All integrated with our Quarkus microservices.

Part 1: Structured Logging

The Problem with Traditional Logs

2025-12-06 10:23:45 INFO Patient created
2025-12-06 10:23:46 ERROR Failed to save

Questions you can't answer:

  • Which patient? Which service instance?
  • Was this part of a consultation flow?
  • How long did it take?

Structured Logging with JSON

{
  "timestamp": "2025-12-06T10:23:45Z",
  "level": "INFO",
  "service": "practicemanager",
  "trace_id": "550e8400-e29b-41d4-a716",
  "patient_id": "P001",
  "action": "patient_created",
  "duration_ms": 45
}

Now you can query:

  • All logs for trace_id=550e8400...
  • Average duration_ms for action=patient_created

Exercise 1: Enable JSON Logging in Quarkus

Structured Logs for practicemanager

Objective: Convert plain text logs to structured JSON.

Prerequisites:

  • practicemanager service
  • Quarkus 3.x

Exercise 1: Step 1 - Add Dependencies

Edit practicemanager/pom.xml:

<dependency>
  <groupId>io.quarkus</groupId>
  <artifactId>quarkus-logging-json</artifactId>
</dependency>

Exercise 1: Step 2 - Configure JSON Format

Edit practicemanager/src/main/resources/application.properties:

# Enable JSON logging
quarkus.log.console.json=true

# Include useful fields
quarkus.log.console.json.pretty-print=false
quarkus.log.console.json.fields.timestamp.enabled=true
quarkus.log.console.json.fields.service.enabled=true
quarkus.log.console.json.additional-field."service.name".value=practicemanager

Exercise 1: Step 3 - Add Context to Logs

Update PatientService.java:

import org.jboss.logging.Logger;
import org.jboss.logging.MDC;

@ApplicationScoped
public class PatientService {
    private static final Logger LOG = Logger.getLogger(PatientService.class);

    public Patient registerNewPatient(Patient patient) {
        MDC.put("patient_ssn", patient.getSsn());
        MDC.put("action", "register_patient");

        try {
            LOG.infof("Registering new patient: %s %s",
                      patient.getFirstName(), patient.getLastName());

            Patient saved = patientPort.save(patient);

            // Add the generated patient ID once the entity has been saved
            MDC.put("patient_id", saved.getId());
            LOG.infof("Patient registered successfully with ID: %s", saved.getId());

            return saved;
        } finally {
            // Always clear the MDC so these values don't leak into unrelated log lines
            MDC.clear();
        }
    }
}

Note: Use patient.getSsn() as identifier before save, then add patient_id after.

Exercise 1: Step 4 - Test

cd practicemanager
./mvnw quarkus:dev

# In another terminal
curl -X POST http://localhost:8080/api/patients \
  -H "Content-Type: application/json" \
  -d '{
    "firstName": "Test",
    "lastName": "User",
    "ssn": "123-45-6789"
  }'

Check logs: You should see JSON output with patient_id and action fields.

Part 2: Metrics with Prometheus

What to Measure?

RED Method (Request-focused):

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Latency (p50, p95, p99)

USE Method (Resource-focused):

  • Utilization: % of resource used (CPU, memory)
  • Saturation: Queue length, waiting threads
  • Errors: Error count

Exercise 2: Expose Prometheus Metrics

Instrument practicemanager

Objective: Export application metrics to Prometheus.

Exercise 2: Step 1 - Add Dependencies

Edit practicemanager/pom.xml:

<dependency>
  <groupId>io.quarkus</groupId>
  <artifactId>quarkus-micrometer-registry-prometheus</artifactId>
</dependency>

Exercise 2: Step 2 - Enable Metrics

Edit application.properties:

# Prometheus metrics endpoint
quarkus.micrometer.export.prometheus.enabled=true
quarkus.micrometer.export.prometheus.path=/q/metrics

# Enable HTTP metrics
quarkus.micrometer.binder.http-server.enabled=true
quarkus.micrometer.binder.http-client.enabled=true

# Enable JVM metrics
quarkus.micrometer.binder.jvm.enabled=true

Exercise 2: Step 3 - Add Custom Metrics

Create PatientMetrics.java:

package com.isen.practicemanager.core.service;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;

import java.util.concurrent.TimeUnit;

@ApplicationScoped
public class PatientMetrics {
  @Inject
  MeterRegistry registry;

  public void recordPatientRegistration(long durationMs) {
    registry.counter("patients.registered.total").increment();

    Timer.builder("patients.registration.duration")
            .description("Time to register a patient")
            .register(registry)
            .record(durationMs, TimeUnit.MILLISECONDS);
  }

  public void recordDocumentUpload(String type, long sizeBytes) {
    registry.counter("documents.uploaded.total",
            "type", type).increment();

    // A gauge registered from a primitive value never updates after registration;
    // a distribution summary records each upload size instead.
    registry.summary("documents.size.bytes", "type", type)
            .record(sizeBytes);
  }
}

Exercise 2: Step 4 - Use Metrics

Update PatientService.java:

@Inject
PatientMetrics metrics;

public Patient registerNewPatient(Patient patient) {
    long start = System.currentTimeMillis();

    Patient saved = patientPort.save(patient);

    long duration = System.currentTimeMillis() - start;
    metrics.recordPatientRegistration(duration);

    return saved;
}
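
If you prefer declarative instrumentation, the Quarkus Micrometer extension can also intercept Micrometer's @Timed and @Counted annotations on CDI bean methods. A minimal sketch of the same measurement (verify annotation support against your Quarkus version):

import io.micrometer.core.annotation.Counted;
import io.micrometer.core.annotation.Timed;

@ApplicationScoped
public class PatientService {

    // Micrometer times and counts the method call; no manual stopwatch needed.
    @Timed(value = "patients.registration.duration",
           description = "Time to register a patient")
    @Counted(value = "patients.registered.total")
    public Patient registerNewPatient(Patient patient) {
        return patientPort.save(patient);
    }
}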

Exercise 2: Step 5 - Verify Metrics

curl http://localhost:8080/q/metrics

You should see:

# HELP patients_registered_total
# TYPE patients_registered_total counter
patients_registered_total 5.0

# HELP patients_registration_duration_seconds
# TYPE patients_registration_duration_seconds summary
patients_registration_duration_seconds_count 5.0
patients_registration_duration_seconds_sum 0.245

Exercise 3: Deploy Prometheus

Collect and Store Metrics

Objective: Run Prometheus to scrape metrics from services.

Exercise 3: Step 1 - Create prometheus.yml

Create observability/prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'practicemanager'
    metrics_path: '/q/metrics'
    static_configs:
      - targets: ['practicemanager:8080']
        labels:
          service: 'practicemanager'
          environment: 'dev'

  - job_name: 'consultations'
    metrics_path: '/q/metrics'
    static_configs:
      - targets: ['consultations:8084']
        labels:
          service: 'consultations'
          environment: 'dev'

  - job_name: 'apisix'
    metrics_path: '/apisix/prometheus/metrics'
    static_configs:
      - targets: ['microservices-apisix:9091']
        labels:
          service: 'apisix'
          environment: 'dev'

Exercise 3: Step 2 - Add to docker-compose.yml

Add Prometheus to main docker-compose.yml:

  prometheus:
    image: prom/prometheus:v2.48.0
    container_name: microservices-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./observability/prometheus.yml:/etc/prometheus/prometheus.yml:Z
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    networks:
      - microservices-network

volumes:
  prometheus-data:
    name: prometheus-data

Important: the :Z suffix on the bind mount is needed for SELinux/Podman, and the file must be readable by the container: chmod 644 observability/prometheus.yml

Exercise 3: Step 3 - Start Prometheus

# Start Prometheus
docker compose up prometheus -d

# Verify it's running
docker ps | grep prometheus
curl http://localhost:9090/-/healthy

Open Prometheus UI: http://localhost:9090

Exercise 3: Step 4 - Query Metrics

In Prometheus UI, try these queries:

Request rate (per second):

rate(http_server_requests_seconds_count[5m])

Error rate:

rate(http_server_requests_seconds_count{status=~"5.."}[5m])

95th percentile latency:

histogram_quantile(0.95,
  rate(http_server_requests_seconds_bucket[5m]))

Exercise 4: Visualize with Grafana

Beautiful Dashboards

Objective: Create real-time dashboards for microservices.

Exercise 4: Step 1 - Add Grafana to Docker Compose

Edit observability/docker-compose.yml:

  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
      - ./observability/grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus
    networks:
      - microservices-network

volumes:
  grafana-data:

Exercise 4: Step 2 - Auto-configure Prometheus Datasource

Create observability/grafana/provisioning/datasources/prometheus.yml:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

Exercise 4: Step 3 - Start Grafana

docker-compose up -d grafana

# Open Grafana
open http://localhost:3000

Login: admin / admin

Exercise 4: Step 4 - Create Dashboard

Import pre-built dashboard:

  1. Click "+" → "Import"
  2. Enter ID: 4701 (JVM Micrometer)
  3. Select Prometheus datasource
  4. Click "Import"

You now have:

  • JVM memory usage
  • Thread count
  • Garbage collection metrics

Exercise 4: Step 5 - Custom Dashboard

Create a new dashboard with panels:

Panel 1: Request Rate

sum(rate(http_server_requests_seconds_count[5m]))
  by (service)

Panel 2: Error Rate

sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
  by (service)

Panel 3: Response Time

histogram_quantile(0.95,
  sum(rate(http_server_requests_seconds_bucket[5m]))
    by (le, service))

Exercise 4: Step 6 - Set Alerts

Create alert for high error rate:

  1. Edit Panel 2 (Error Rate)
  2. Go to "Alert" tab
  3. Set condition: WHEN avg() OF query(A, 5m) IS ABOVE 0.05
  4. Configure notification channel (Email, Slack)

The alert fires when errors exceed 0.05 per second. For a true 5% rule, divide the error rate by the total request rate.

Part 3: Distributed Tracing

The Problem: Where is the Bottleneck?

Request: Create consultation
Total time: 850ms

Which service is slow?

  • Gateway: 50ms
  • Consultations: 200ms
  • → REST call to practicemanager: ???
  • practicemanager: 600ms
  • → Database query: ???

You need a trace.

Distributed Tracing Concepts

Trace: The complete journey of a request
Span: A single operation within a trace (e.g., "DB query")
Trace ID: Unique identifier passed across services

Trace: Create Consultation (trace_id: abc123)
  └─ Span: Gateway routing (850ms)
      └─ Span: Consultations.create (800ms)
          └─ Span: HTTP GET /patients/P001 (620ms)
              └─ Span: PracticeManager.getPatient (600ms)
                  └─ Span: Database query (580ms) ← THE PROBLEM

OpenTelemetry + Jaeger

OpenTelemetry (OTel):
Standard for instrumenting applications (vendor-neutral).

Jaeger:
Open-source tracing backend for visualizing traces.

Quarkus has built-in support!
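
With quarkus-opentelemetry on both sides, incoming JAX-RS requests and outgoing MicroProfile REST client calls are instrumented automatically, and the W3C traceparent header is propagated for you. For example, a client like the following (interface and config key names are illustrative, not taken from the project) needs no tracing code at all:

import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.PathParam;
import org.eclipse.microprofile.rest.client.inject.RegisterRestClient;

// Hypothetical REST client in the consultations service calling practicemanager.
// Each call produces a CLIENT span linked to the current trace automatically.
@RegisterRestClient(configKey = "practicemanager-api")
@Path("/api/patients")
public interface PatientClient {

    @GET
    @Path("/{id}")
    Patient getPatient(@PathParam("id") String id);
}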

Exercise 5: Enable Distributed Tracing

OpenTelemetry in practicemanager

Objective: Export traces to Jaeger.

Exercise 5: Step 1 - Add Dependencies

Edit practicemanager/pom.xml:

<dependency>
  <groupId>io.quarkus</groupId>
  <artifactId>quarkus-opentelemetry</artifactId>
</dependency>

Do the same for consultations/pom.xml.

Exercise 5: Step 2 - Configure OTel

Edit practicemanager/src/main/resources/application.properties:

# OpenTelemetry configuration
quarkus.otel.enabled=true
quarkus.otel.service.name=practicemanager
quarkus.otel.exporter.otlp.endpoint=http://localhost:4317

# Export traces to Jaeger
quarkus.otel.exporter.otlp.traces.endpoint=http://localhost:4317

# Sample all requests (dev only!)
quarkus.otel.traces.sampler=always_on

Repeat for consultations (change service name).

Exercise 5: Step 3 - Deploy Jaeger

Add to observability/docker-compose.yml:

  jaeger:
    image: jaegertracing/all-in-one:1.52
    ports:
      - "16686:16686"  # Jaeger UI
      - "4317:4317"    # OTLP gRPC receiver
      - "4318:4318"    # OTLP HTTP receiver
    environment:
      - COLLECTOR_OTLP_ENABLED=true

Exercise 5: Step 4 - Start Services

cd observability
docker-compose up -d jaeger

# Restart microservices to enable tracing (run each in its own terminal)
cd ../practicemanager && ./mvnw quarkus:dev
cd ../consultations && ./mvnw quarkus:dev

Exercise 5: Step 5 - Generate Traffic

# Create a patient
curl -X POST http://localhost:8080/api/patients \
  -H "Content-Type: application/json" \
  -d '{"firstName":"Alice","lastName":"Test","ssn":"999"}'

# Create a consultation (this calls practicemanager!)
curl -X POST http://localhost:8084/consultations \
  -H "Content-Type: application/json" \
  -d '{"patientId":"P001","doctorId":"D001",
       "scheduledAt":"2025-12-10T10:00:00","reason":"Checkup"}'

Exercise 5: Step 6 - View Traces

Open Jaeger UI: http://localhost:16686

  1. Select Service: consultations
  2. Click "Find Traces"
  3. Click on a trace

You should see:

  • Consultation service span
  • HTTP call to practicemanager
  • practicemanager service span
  • Database operations

Exercise 5: Step 7 - Add Custom Spans

For fine-grained tracing, add custom spans:

import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.api.trace.Span;

@Inject
Tracer tracer;

public Patient registerNewPatient(Patient patient) {
    Span span = tracer.spanBuilder("validate-patient-data")
                      .startSpan();
    try {
        // Validation logic
        validatePatientData(patient);
        // Avoid putting real PII such as SSNs in span attributes in production
        span.setAttribute("patient.ssn", patient.getSsn());
        span.addEvent("validation.completed");
    } finally {
        span.end();
    }

    return patientPort.save(patient);
}
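
If validation can fail, it is worth marking that on the span so failed requests stand out in Jaeger. A small sketch (validatePatientData is the same assumed helper as above):

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;

Span span = tracer.spanBuilder("validate-patient-data").startSpan();
try (var scope = span.makeCurrent()) {
    validatePatientData(patient);
} catch (Exception e) {
    // Flag the span as failed and attach the exception as a span event
    span.setStatus(StatusCode.ERROR, "validation failed");
    span.recordException(e);
    throw e;
} finally {
    span.end();
}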

Exercise 6: Trace Context Propagation

Ensure trace_id flows through messaging

Challenge: When using RabbitMQ, trace context must be propagated manually.

Exercise 6: Step 1 - Add Trace ID to Events

Update ConsultationCompletedEvent.java:

public class ConsultationCompletedEvent {
    private String traceId;
    private String consultationId;
    private String patientId;
    // ... other fields

    // Add getter/setter for traceId
}
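
The accessors are standard bean methods, for example:

public String getTraceId() {
    return traceId;
}

public void setTraceId(String traceId) {
    this.traceId = traceId;
}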

Exercise 6: Step 2 - Inject Trace ID When Publishing

Update RabbitMQEventPublisher.java:

import io.opentelemetry.api.trace.Span;
import org.eclipse.microprofile.reactive.messaging.Channel;
import org.eclipse.microprofile.reactive.messaging.Emitter;
import org.eclipse.microprofile.reactive.messaging.Message;

@ApplicationScoped
public class RabbitMQEventPublisher implements EventPublisher {

    // Outgoing channel (the channel name here is an assumption; use your own)
    @Channel("consultation-completed")
    Emitter<ConsultationCompletedEvent> emitter;

    @Override
    public void publishConsultationCompleted(ConsultationCompletedEvent event) {
        // Inject the current trace ID so consumers can correlate their work
        Span currentSpan = Span.current();
        event.setTraceId(currentSpan.getSpanContext()
                                   .getTraceId());

        emitter.send(Message.of(event)
                           .withMetadata(/* ... */));
    }
}

Exercise 6: Step 3 - Resume Trace in Consumer

Update ConsultationEventConsumer.java:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.Tracer;
import org.eclipse.microprofile.reactive.messaging.Incoming;

@Inject
Tracer tracer;

@Incoming("consultation-completed")
public CompletionStage<Void> consume(ConsultationCompletedEvent event) {
    // Start a consumer span and record the upstream trace ID as an attribute
    // (full context propagation would carry the W3C traceparent; see below)
    Span span = tracer.spanBuilder("process-consultation-event")
                      .setSpanKind(SpanKind.CONSUMER)
                      .setAttribute("trace.id", event.getTraceId())
                      .setAttribute("consultation.id", event.getConsultationId())
                      .startSpan();

    try (var scope = span.makeCurrent()) {
        LOG.infof("Processing consultation completed: %s",
                  event.getConsultationId());
        // Process event
        return CompletableFuture.completedFuture(null);
    } finally {
        span.end();
    }
}
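
Recording the trace ID as an attribute gives you a value to search on, but Jaeger will still show producer and consumer as separate traces. To stitch them together, the full W3C context has to travel across the broker. A sketch using the OpenTelemetry propagation API, assuming the event or the message headers can carry a small map of strings (helper class and names are illustrative):

import java.util.HashMap;
import java.util.Map;

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapGetter;

public final class TraceContextSupport {

    // Producer side: capture the current trace context into a header map
    // (store this map on the event or as message headers).
    public static Map<String, String> currentContextHeaders() {
        Map<String, String> headers = new HashMap<>();
        GlobalOpenTelemetry.getPropagators().getTextMapPropagator()
                .inject(Context.current(), headers, Map::put);
        return headers;
    }

    // Consumer side: restore the context and start a child span under it,
    // so the consumer span joins the producer's trace.
    public static Span startConsumerSpan(Tracer tracer, String name,
                                         Map<String, String> headers) {
        TextMapGetter<Map<String, String>> getter = new TextMapGetter<>() {
            @Override
            public Iterable<String> keys(Map<String, String> carrier) {
                return carrier.keySet();
            }

            @Override
            public String get(Map<String, String> carrier, String key) {
                return carrier.get(key);
            }
        };
        Context extracted = GlobalOpenTelemetry.getPropagators().getTextMapPropagator()
                .extract(Context.current(), headers, getter);
        return tracer.spanBuilder(name)
                .setSpanKind(SpanKind.CONSUMER)
                .setParent(extracted)
                .startSpan();
    }
}

Depending on the SmallRye Reactive Messaging connector version, some of this propagation may already happen automatically; check the connector documentation before hand-rolling it.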

Exercise 7: End-to-End Observability

Full Stack Integration

Objective: Connect all three pillars.

Exercise 7: Scenario - Debug Slow Consultation

Step 1: User reports: "Creating consultations is slow"

Step 2: Check Grafana dashboard

  • Response time p95 for /consultations POST is 1200ms (normal: 200ms)

Step 3: Query Prometheus

histogram_quantile(0.95,
  rate(http_server_requests_seconds_bucket{uri="/consultations"}[5m]))

Result: Spike at 14:30

Exercise 7: Debug (Continued)

Step 4: Open Jaeger, filter by time: 14:30-14:35

Step 5: Find slow traces (>1s)

Step 6: Drill into trace

  • Consultations span: 50ms
  • HTTP call to practicemanager: 1150ms ← PROBLEM
  • practicemanager span: 1100ms
    • Database query: 1080ms ← ROOT CAUSE

Step 7: Check practicemanager logs (filter by trace_id)

{
  "trace_id": "abc123",
  "message": "Slow query detected",
  "query": "SELECT * FROM patients WHERE ssn=?",
  "duration_ms": 1080
}

Diagnosis: Missing database index on ssn column.
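
The remediation is a schema change. With Hibernate ORM the index can be declared directly on the entity so it is created with the schema; a sketch (entity and index names are assumptions, not taken from the project):

import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import jakarta.persistence.Index;
import jakarta.persistence.Table;

// Hypothetical practicemanager entity: the index lets
// "SELECT * FROM patients WHERE ssn = ?" avoid a full table scan.
@Entity
@Table(name = "patients",
       indexes = @Index(name = "idx_patients_ssn", columnList = "ssn"))
public class PatientEntity {

    @Id
    private String id;

    private String ssn;

    // ... remaining fields and accessors
}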

Exercise 8: Alerting Rules

Proactive Problem Detection

Objective: Get notified before users complain.

Exercise 8: Step 1 - Prometheus Alert Rules

Create observability/prometheus-alerts.yml, mount it into the Prometheus container, and reference it from prometheus.yml under rule_files so Prometheus actually evaluates the rules:

groups:
  - name: microservices
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_server_requests_seconds_count{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "{{ $value }} errors/sec"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_server_requests_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.service }}"

Exercise 8: Step 2 - Configure Alertmanager

Add to docker-compose.yml (and point Prometheus at it via the alerting section of prometheus.yml so fired alerts are actually routed):

  alertmanager:
    image: prom/alertmanager:v0.26.0
    ports:
      - "9093:9093"
    volumes:
      - ./observability/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'

Exercise 8: Step 3 - Alertmanager Config

Create observability/alertmanager.yml:

route:
  receiver: 'team-email'
  group_by: ['alertname', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'team@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alerts@example.com'
        auth_password: 'password'

Exercise 8: Step 4 - Test Alerts

Trigger high error rate:

Add a failing endpoint to practicemanager:

// Temporary test endpoint (e.g. in PatientResource); remove it afterwards
@GET
@Path("/fail")
public Response fail() {
    throw new RuntimeException("Intentional failure");
}

# Generate errors
for i in {1..100}; do
  curl http://localhost:8080/api/fail
done

Check Alertmanager: http://localhost:9093

Best Practices

  1. Use Correlation IDs: Every request gets a unique ID (trace_id); see the filter sketch after this list
  2. Log at Boundaries: Entry/exit of services, external calls
  3. Don't Log PII: Avoid logging SSNs, credit cards
  4. Sample Traces in Production: 1-10% to reduce overhead
  5. Set SLOs: "p95 latency < 200ms" not "make it faster"
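
For item 1, a correlation ID can be attached to every incoming request with a small JAX-RS filter; a minimal sketch (the header name and MDC key are conventions, not requirements):

import java.util.UUID;

import jakarta.ws.rs.container.ContainerRequestContext;
import jakarta.ws.rs.container.ContainerRequestFilter;
import jakarta.ws.rs.ext.Provider;
import org.jboss.logging.MDC;

@Provider
public class CorrelationIdFilter implements ContainerRequestFilter {

    @Override
    public void filter(ContainerRequestContext requestContext) {
        // Reuse the caller's ID if present, otherwise generate one
        String correlationId = requestContext.getHeaderString("X-Correlation-ID");
        if (correlationId == null || correlationId.isBlank()) {
            correlationId = UUID.randomUUID().toString();
        }
        // Every log line emitted while handling this request now carries the ID
        MDC.put("correlation_id", correlationId);
    }
}

In practice you would also clear the MDC in a matching response filter and echo the ID back in a response header.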

SLO Example

Service Level Objective (SLO):

  • Availability: 99.9% uptime (43 minutes downtime/month)
  • Latency: p95 < 200ms
  • Error Rate: < 0.1%

Measure with Prometheus:

# Availability
sum(rate(http_server_requests_seconds_count{status!~"5.."}[30d]))
/
sum(rate(http_server_requests_seconds_count[30d]))

Observability Maturity Model

Level 1: Basic logs, manual debugging
Level 2: Centralized logs, basic metrics
Level 3: Distributed tracing, alerts
Level 4: Automated root cause analysis
Level 5: Predictive anomaly detection (ML)

Most teams should aim for Level 3.

Summary

  • Logs: Structured JSON with context (trace_id, service)
  • Metrics: Prometheus + Grafana for dashboards
  • Traces: OpenTelemetry + Jaeger for request flows
  • Alerts: Proactive detection via Alertmanager
  • Goal: Mean Time To Recovery (MTTR) < 15 minutes

Observability is not optional in microservices.

Next: Containers & Orchestration for production deployment.