cyberneticlibrary

Debug scraper-to-analyst pipeline

debug-pipelinecommandsetup L1136
colossus-lab/openarg_backend
What it does

Diagnose data pipeline failures across scraper-embedding-collector-analyst

Best for

Rapid triage of multi-stage data pipeline failures in production with clear diagnostic separation per pipeline stage.

Inputs
  • · Problem description ($ARGUMENTS)
  • · Current system state (logs, database, Redis)
Outputs
  • · Diagnostics checklist results
  • · Identified failure point (Scraper/Embedding/Collector/Analyst)
  • · Remediation steps
Requires
  • · PostgreSQL
  • · Redis
  • · Flower (Celery monitoring)
  • · OpenAI embedding API
  • · HTTPX (for timeouts)
  • · Pandas (file parsing)
  • · GPT-4o-mini / GPT-4o (LLM analysis)
Preconditions
  • · PostgreSQL running and migrations applied
  • · Redis on port 6381
  • · Celery workers active
  • · Environment variables set (OPENAI_API_KEY, DATABASE_URL, etc.)
Failure modes
  • · PostgreSQL down → pipeline stalls at dataset storage
  • · Redis unavailable → Celery broker fails
  • · Missing OpenAI key → embedding task fails
  • · CKAN portal unresponsive → scraper finds no catalogs
  • · No cached datasets → analyst cannot analyze
Trust signals
  • · Dedicated subsection per stage with specific failure checklist
  • · SQL query examples to verify data presence
  • · Flower URL provided for real-time task monitoring
  • · Explicit step counts: vector search → data gather → analysis