Debug distributed systems with traces
rca-distributed-systems · skill · setup L3 · ★64
Tibsfox/gsd-skill-creator
What it does
Analyze distributed-system incidents using trace causality and service graphs
Best for
Production incidents in microservice meshes (Kubernetes, Istio, Linkerd) where the fault could originate in any of 10-100 services and the causal path runs through network hops, retries, and timeouts.
Inputs
- · Distributed system topology (service mesh, microservices)
- · OpenTelemetry traces with trace_id propagation across services
- · Metrics (latency, error rate, CPU, memory) linked to traces through exemplars that carry a trace_id
- · Logs with trace_id and baggage attributes
- · Service dependency graph (from APM tool or manual)
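The trace_id propagation these inputs depend on can be sketched with the W3C `traceparent` header, the default propagation format in OpenTelemetry. This is a minimal standard-library illustration, not the OpenTelemetry SDK: the helper names `parse_traceparent`, `child_traceparent`, and `log_line` are hypothetical, and a real service would normally rely on the SDK's propagators and logging integration instead.

```python
import json
import re
import secrets
from typing import Optional

# W3C Trace Context, version 00: version-trace_id-parent_id-flags (lowercase hex).
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[dict]:
    """Extract the trace context from an incoming traceparent header."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    # All-zero IDs are invalid per the specification.
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(ctx: dict) -> str:
    """Build the header for an outgoing call: same trace_id, fresh span id."""
    span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{span_id}-{flags}"


def log_line(ctx: dict, message: str) -> str:
    """Emit a structured log line that can be joined to traces by trace_id."""
    return json.dumps({"message": message, "trace_id": ctx["trace_id"]})


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
print(log_line(ctx, "charge declined"))
print(child_traceparent(ctx))
```

Every hop that forwards the header with the same trace_id, and every log line that records it, keeps the pivot between logs and traces intact.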
Outputs
- · Causal chain of failed service requests (trace waterfall)
- · Root service identified by span latency and error signals
- · Service dependency graph with anomaly highlighting
- · Anomaly correlation across logs/metrics/traces
- · Hypotheses ranked by blast radius and temporal alignment
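One simple way to derive the first two outputs from a single trace is to look for the deepest errored span: an errored span none of whose children errored. The sketch below assumes a simplified, hypothetical `Span` record (real spans carry far more fields) and ignores retries and latency, so it is a starting heuristic rather than the skill's actual method.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    error: bool


def deepest_error_spans(spans: list[Span]) -> list[Span]:
    """Errored spans with no errored child: likely origins of the failure."""
    children = defaultdict(list)
    for span in spans:
        if span.parent_id is not None:
            children[span.parent_id].append(span)
    return [
        span for span in spans
        if span.error and not any(c.error for c in children[span.span_id])
    ]


def causal_chain(spans: list[Span], leaf: Span) -> list[str]:
    """Walk parent links from a span up to the root; return services root-first."""
    by_id = {span.span_id: span for span in spans}
    chain = [leaf]
    while chain[-1].parent_id in by_id:
        chain.append(by_id[chain[-1].parent_id])
    return [span.service for span in reversed(chain)]


trace = [
    Span("a", None, "gateway", error=True),
    Span("b", "a", "checkout", error=True),
    Span("c", "b", "payments", error=True),
    Span("d", "b", "inventory", error=False),
]
for origin in deepest_error_spans(trace):
    print(" -> ".join(causal_chain(trace, origin)))
# gateway -> checkout -> payments
```

Here the error surfaces at the gateway, but `payments` is the only errored span with no errored child, so it is the first service to inspect.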
Requires
- · OpenTelemetry instrumentation (auto or manual)
- · Trace backend (Jaeger, Tempo, Honeycomb, Datadog, New Relic)
- · Metrics backend (Prometheus, Datadog, Honeycomb)
- · Log backend (ELK, Loki, Datadog, Honeycomb)
- · APM tool with service graph visualization
- · Tail-based sampling infrastructure
Preconditions
- · All services emit traces with trace_id propagated via context
- · trace_id present in logs and attached to metrics as exemplars across all systems (not as a metric label, which creates unbounded cardinality)
- · Tail-based sampling enabled (head-based sampling misses rare errors)
- · Service owners know their immediate upstream/downstream dependencies
- · Observability budget allows for high-cardinality attribute retention
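The sampling precondition comes down to when the keep-or-drop decision is made. The sketch below contrasts the two approaches using plain dictionaries; `head_sample`, `tail_sample`, and the `slow_ms` threshold are illustrative assumptions, and in practice the tail decision is usually made by a collector that buffers spans until the trace completes.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at trace start from the trace_id alone, before any outcome is known."""
    digest = hashlib.sha256(trace_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # deterministic value in [0, 1)
    return bucket < rate


def tail_sample(trace_id: str, spans: list[dict], rate: float,
                slow_ms: float = 1000.0) -> bool:
    """Decide after the trace has finished, so errors and slow traces are always kept."""
    if any(span["error"] for span in spans):
        return True
    if max((span["duration_ms"] for span in spans), default=0.0) >= slow_ms:
        return True
    # Healthy traces fall back to a probabilistic baseline.
    return head_sample(trace_id, rate)


failed = [{"error": True, "duration_ms": 12.0}]
healthy = [{"error": False, "duration_ms": 12.0}]

print(tail_sample("trace-1", failed, rate=0.0))   # True: kept because it errored
print(tail_sample("trace-2", healthy, rate=0.0))  # False: nothing interesting, rate is 0
print(head_sample("trace-1", rate=0.0))           # False: the error is never seen
```

With head-based sampling at a 1% rate, a failure that occurs once in 10,000 requests is retained only about once per million requests, which is why the tail-based policy matters for rare errors.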
Failure modes
- · Missing trace_id in logs → cannot pivot between pillars
- · Head-based sampling → the keep/drop decision is made before the outcome is known, so most of the traces you need (rare errors) are discarded
- · Service dependency graph out of date → misses recently added services
- · Async message queues untraced → causal chain appears broken
- · Circuit breakers and retries create non-linear dynamics → simple causal-chain reasoning fails
- · Multi-tenant noise → the same slowdown in two tenants can have different causes
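Two of these failure modes, an out-of-date dependency graph and untraced async hops, can be checked mechanically by comparing the declared graph against what the traces actually show. The following is a rough sketch under assumed data shapes: spans are plain dictionaries, and `observed_edges` and `graph_drift` are hypothetical helper names.

```python
def observed_edges(spans: list[dict]) -> set[tuple[str, str]]:
    """Derive caller -> callee service edges from parent/child span links."""
    by_id = {span["span_id"]: span for span in spans}
    edges = set()
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        if parent is not None and parent["service"] != span["service"]:
            edges.add((parent["service"], span["service"]))
    return edges


def graph_drift(declared: set[tuple[str, str]], spans: list[dict]) -> dict:
    """Compare the declared dependency graph with edges seen in traces."""
    by_id = {span["span_id"]: span for span in spans}
    seen = observed_edges(spans)
    return {
        # Edges in traces but not in the graph: the graph is stale.
        "undeclared": seen - declared,
        # Declared edges never seen: unused, unsampled, or untraced.
        "unobserved": declared - seen,
        # Spans whose parent is missing: context was likely lost, e.g. at a queue.
        "orphans": [
            span["service"] for span in spans
            if span.get("parent_id") is not None and span["parent_id"] not in by_id
        ],
    }


spans = [
    {"span_id": "1", "parent_id": None, "service": "gateway"},
    {"span_id": "2", "parent_id": "1", "service": "checkout"},
    {"span_id": "3", "parent_id": "2", "service": "fraud-check"},
    {"span_id": "9", "parent_id": "7", "service": "email-worker"},
]
declared = {("gateway", "checkout"), ("checkout", "payments")}
print(graph_drift(declared, spans))
```

In this example `fraud-check` shows up as an undeclared dependency of `checkout`, and `email-worker` appears as an orphan because its parent span never arrived, the typical signature of a message queue that does not propagate trace context.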
Trust signals
- · OpenTelemetry is a CNCF project and the de facto standard for distributed tracing (supported by Google, Amazon, Microsoft)
- · Three-pillar approach (logs/metrics/traces) validated across Datadog, Honeycomb, New Relic case studies
- · Importance of tail-based sampling confirmed by the Dogetti et al. 2023 survey
- · Service graph analysis grounded in network centrality measures from graph theory
- · The DynaCausal 2024 framework adds automated anomaly localization to the otherwise manual process
- · Microsoft AgentRx framework specifically targets multi-service failure diagnosis