Debug distributed systems with traces

rca-distributed-systems · skill · setup · L364
Tibsfox/gsd-skill-creator
What it does

Analyze distributed-system incidents using trace causality and service graphs

Best for

Production incidents in microservice meshes (Kubernetes, Istio, Linkerd) where the fault could originate in any of 10-100 services and the causal path runs through network hops, retries, and timeouts.

Inputs
  • · Distributed system topology (service mesh, microservices)
  • · OpenTelemetry traces with trace_id propagation across services
  • · Metrics (latency, error rate, CPU, memory) with trace_id correlation
  • · Logs with trace_id and baggage attributes
  • · Service dependency graph (from APM tool or manual)
Outputs
  • · Causal chain of failed service requests (trace waterfall)
  • · Root service identified by span latency and error signals
  • · Service dependency graph with anomaly highlighting
  • · Anomaly correlation across logs/metrics/traces
  • · Hypotheses ranked by blast radius and temporal alignment
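One way to read "root service identified by span latency and error signals" is to walk the trace waterfall and look for the deepest spans that failed while none of their children did. The sketch below is a minimal illustration under that assumption, not the skill's actual algorithm. The `Span` fields, the function name `deepest_error_spans`, and the sample trace are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical, simplified span record (real spans carry many more fields).
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: float
    error: bool


def deepest_error_spans(spans):
    """Return error spans whose direct children did not error.

    Errors usually propagate upward to callers, so a failing span with
    no failing child is a reasonable first root-cause candidate.
    """
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)
    return [
        span
        for span in spans
        if span.error
        and not any(child.error for child in children.get(span.span_id, []))
    ]


# Hypothetical trace: gateway -> checkout -> (payments -> payments-db, inventory)
trace = [
    Span("a", None, "gateway", 950.0, True),
    Span("b", "a", "checkout", 900.0, True),
    Span("c", "b", "payments", 870.0, True),
    Span("d", "c", "payments-db", 12.0, False),
    Span("e", "b", "inventory", 20.0, False),
]

root_candidates = [span.service for span in deepest_error_spans(trace)]
print(root_candidates)  # ['payments']
```

The heuristic only narrows the search. It assumes every hop is traced, so an untraced queue or a retry that masks the original error can still point it at the wrong service.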
Requires
  • · OpenTelemetry instrumentation (auto or manual)
  • · Trace backend (Jaeger, Tempo, Honeycomb, Datadog, New Relic)
  • · Metrics backend (Prometheus, Datadog, Honeycomb)
  • · Log backend (ELK, Loki, Datadog, Honeycomb)
  • · APM tool with service graph visualization
  • · Tail-based sampling infrastructure
Preconditions
  • · All services emit traces with trace_id propagated via context
  • · trace_id present in logs and linked to metrics (as labels or exemplars) across all systems
  • · Tail-based sampling enabled (head-based sampling misses rare errors)
  • · Service owners know their immediate upstream/downstream dependencies
  • · Observability budget allows for high-cardinality attribute retention
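The first two preconditions can be sketched with the standard library alone: read the trace_id from an incoming W3C `traceparent` header, hold it in a context variable, and stamp it on every log line. In a real service the OpenTelemetry SDK and its logging integration would normally do this. The helper `parse_traceparent`, the `TraceIdFilter` class, and the logger name are illustrative assumptions, and the parser only handles the version-00 header layout.

```python
import contextvars
import io
import logging
import re

# Holds the trace_id for the request currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")

# Simplified version-00 layout: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"[0-9a-f]{2}-([0-9a-f]{32})-[0-9a-f]{16}-[0-9a-f]{2}")


def parse_traceparent(header):
    """Return the trace_id from a traceparent header, or None if malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None or set(match.group(1)) == {"0"}:  # all-zero id is invalid
        return None
    return match.group(1)


class TraceIdFilter(logging.Filter):
    """Attach the current trace_id to each record so the formatter can print it."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


stream = io.StringIO()  # stands in for the real log sink
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
current_trace_id.set(parse_traceparent(header) or "-")
logger.info("payment authorized")

log_line = stream.getvalue().strip()
print(log_line)
```

With the trace_id on every log line, a slow span in the trace backend can be matched to its log lines by a plain text search on the id.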
Failure modes
  • · Missing trace_id in logs → cannot pivot between pillars
  • · Head-based sampling → the keep/drop decision is made before the outcome is known, so most of the traces you need (rare errors) are discarded
  • · Service dependency graph out of date → misses recently added services
  • · Async message queues untraced → causal chain appears broken
  • · Circuit breakers + retries create non-linear dynamics → simple causal chain reasoning fails
  • · Multi-tenant noise → the same slowdown appears in two tenants but stems from different causes
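The head-based versus tail-based failure mode above can be made concrete with a toy comparison. This is a minimal sketch, assuming spans arrive as plain dicts and that a whole trace fits in memory before the decision is made. The function names, the 1% rate, and the latency threshold are hypothetical and do not reflect any particular collector's configuration.

```python
import random
from collections import defaultdict


def head_sample(trace_ids, rate, rng):
    """Decide at trace start, before any error or latency is known."""
    return {trace_id for trace_id in trace_ids if rng.random() < rate}


def tail_sample(spans, rate, latency_ms, rng):
    """Decide after the trace completes.

    Keep every trace containing an error or a slow span; sample the
    remaining (uninteresting) traces at `rate`.
    """
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span["trace_id"]].append(span)

    kept = set()
    for trace_id, trace_spans in by_trace.items():
        interesting = any(
            s["error"] or s["duration_ms"] > latency_ms for s in trace_spans
        )
        if interesting or rng.random() < rate:
            kept.add(trace_id)
    return kept


# 1000 hypothetical single-span traces; only t500 contains an error.
spans = [
    {"trace_id": f"t{i}", "error": i == 500, "duration_ms": 20.0}
    for i in range(1000)
]

rng = random.Random(7)
kept_head = head_sample([s["trace_id"] for s in spans], 0.01, rng)
kept_tail = tail_sample(spans, 0.01, 1000.0, rng)

# Tail sampling always retains the failing trace; head sampling keeps it
# only with probability equal to the sampling rate.
print("t500" in kept_tail, len(kept_head), len(kept_tail))
```

The trade-off is that tail-based sampling has to buffer every span until the trace completes, which is why it appears under Requires as separate infrastructure.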
Trust signals
  • · OpenTelemetry is a CNCF project and the de facto open standard for distributed tracing (supported by Google, Amazon, Microsoft)
  • · Three-pillar approach (logs/metrics/traces) validated across Datadog, Honeycomb, New Relic case studies
  • · Tail-based sampling critical insight confirmed by Dogetti et al. 2023 survey
  • · Service graph analysis grounded in network centrality measures from graph theory
  • · DynaCausal 2024 framework adds automated anomaly localization to manual process
  • · Microsoft AgentRx framework specifically targets multi-service failure diagnosis
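The service graph analysis and the blast-radius ranking listed under Outputs can be illustrated with a simple reachability count: for each suspect service, how many other services call it directly or transitively. This is a hedged sketch, not the card's actual scoring. The `calls` graph and the `blast_radius` function are hypothetical, and temporal alignment is left out.

```python
from collections import deque


def blast_radius(calls, suspect):
    """Return the services that directly or transitively call `suspect`.

    `calls` maps each service to the services it calls (caller -> callees).
    """
    # Invert the edges so we can walk from a callee back to its callers.
    callers = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    seen = set()
    queue = deque([suspect])
    while queue:
        node = queue.popleft()
        for caller in callers.get(node, ()):
            if caller not in seen and caller != suspect:
                seen.add(caller)
                queue.append(caller)
    return seen


# Hypothetical dependency graph.
calls = {
    "gateway": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "search": ["inventory"],
    "payments": ["payments-db"],
    "inventory": [],
    "payments-db": [],
}

# Rank suspects by how many upstream services a fault would reach,
# breaking ties alphabetically.
ranked = sorted(calls, key=lambda s: (-len(blast_radius(calls, s)), s))
print(ranked)
# ['inventory', 'payments-db', 'payments', 'checkout', 'search', 'gateway']
```

A leaf dependency such as a shared database ranks high here because everything above it inherits its failures, which matches the intuition that widely shared services deserve early scrutiny.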