cyberneticlibrary

Load bulk contract data into BigQuery

cf-parallel-harvester-rawload · workflow · setup · L30
chrisns/uk-tenders-mcp
What it does

Parallel, append-only bulk load into BigQuery, one 2-month shard at a time

Best for

Append-loading UK Contracts Finder historical data (2016-2026) into BigQuery in resumable 2-month shards.

Inputs
  • · GCP project, BQ location, Python venv
  • · Shard date ranges (from-to, 2-month windows)
  • · Bulk harvester data source
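
The shard date ranges could be generated rather than typed by hand. This is a minimal sketch, not the workflow's own code: the function name `two_month_shards` is hypothetical, and it assumes windows start on the first of a month and that the end date is exclusive.

```python
from datetime import date


def two_month_shards(start: date, end: date) -> list[tuple[date, date]]:
    """Return (from, to) windows of two calendar months covering [start, end).

    Assumes `start` and `end` are first-of-month dates; `end` is exclusive.
    """
    shards = []
    cur = start
    while cur < end:
        # Advance two calendar months, rolling the year over when needed.
        month_index = cur.year * 12 + (cur.month - 1) + 2
        nxt = date(month_index // 12, month_index % 12 + 1, 1)
        # The final window is clipped so it never runs past `end`.
        shards.append((cur, min(nxt, end)))
        cur = nxt
    return shards


# Example: cover 2016-11 through 2026-05 (exclusive end of 2026-06-01).
shards = two_month_shards(date(2016, 11, 1), date(2026, 6, 1))
```

Under these assumptions the last window covers only May 2026; the exact shard count depends on how the first and last windows are aligned.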
Outputs
  • · RAWLOAD report (seen=N, appended=M per shard)
  • · Error text if failed/timed out
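
The per-shard counts can be pulled out of the report text for later aggregation. The sketch below assumes the report contains a fragment shaped like `seen=N, appended=M`; the real RAWLOAD report layout may differ, and `parse_rawload_report` is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumed fragment shape; adjust the pattern to the real report format.
REPORT_RE = re.compile(r"seen=(\d+),\s*appended=(\d+)")


def parse_rawload_report(text: str) -> Optional[tuple[int, int]]:
    """Return (seen, appended), or None when the text holds no counts
    (for example, when it is error text from a failed or timed-out shard)."""
    match = REPORT_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

A `None` result is a reasonable signal to treat the shard as failed and queue it for a retry.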
Requires
  • · BigQuery (WRITE_APPEND, no compile/DML)
  • · Contracts Finder bulk harvester API
  • · Python script runner (600s timeout per shard)
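
The 600s per-shard limit can be enforced with the standard library alone. This is a sketch of one possible runner, not the workflow's actual one: `run_shard` is a hypothetical name, and the command passed in stands for whatever script loads a single shard.

```python
import subprocess
import sys

SHARD_TIMEOUT_S = 600  # per-shard limit stated above


def run_shard(cmd: list[str], timeout_s: float = SHARD_TIMEOUT_S) -> tuple[bool, str]:
    """Run one shard command; return (ok, report_or_error_text)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return False, f"timed out after {timeout_s}s"
    if proc.returncode != 0:
        return False, proc.stderr.strip() or f"exit code {proc.returncode}"
    return True, proc.stdout.strip()


# Stand-in command for illustration; a real call would invoke the shard loader.
ok, text = run_shard([sys.executable, "-c", "print('seen=3, appended=3')"])
```

Returning the error text instead of raising keeps one slow shard from aborting the rest of a parallel run.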
Preconditions
  • · BigQuery credentials (GCP_PROJECT env var)
  • · PYTHONPATH set to ingestion/src
  • · Multi-phase pipeline; this step is the raw-load phase (parallel 2-month shards)
  • · 2016-11 through 2026-05 (~59 shards)
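
The environment preconditions are cheap to check before any shard is launched. A minimal sketch, assuming only the two variables named above; `missing_preconditions` is a hypothetical helper, and it does not verify that the BigQuery credentials themselves are valid.

```python
import os
from typing import Mapping


def missing_preconditions(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    if not env.get("GCP_PROJECT"):
        problems.append("GCP_PROJECT is not set")
    # PYTHONPATH may hold several entries; accept any that ends in ingestion/src.
    paths = env.get("PYTHONPATH", "").split(os.pathsep)
    if not any(p.replace("\\", "/").rstrip("/").endswith("ingestion/src") for p in paths):
        problems.append("PYTHONPATH does not include ingestion/src")
    return problems


# Typical use: check the real environment before starting the run.
problems = missing_preconditions(os.environ)
```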
Failure modes
  • · Partial appends OK (resumable on retry)
  • · The 600s timeout may cut a shard off partway through
  • · No compile/dedup in this phase (separate step)
  • · Concurrency-safe via WRITE_APPEND (no conflicts)
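
The resumable pattern can be sketched as a loop that runs shards in parallel and re-runs only the ones that failed. Everything here is illustrative: `run_until_done` and its parameters are hypothetical, and `load_one` stands for any callable that loads a single shard and returns a truthy value on success. Re-running a partially appended shard can append the same rows again, which is presumably why deduplication is left to a separate step.

```python
from concurrent.futures import ThreadPoolExecutor


def _safe(load_one, shard) -> bool:
    """Treat any exception from the loader as a failed shard."""
    try:
        return bool(load_one(shard))
    except Exception:
        return False


def run_until_done(shards, load_one, max_workers=4, max_rounds=3):
    """Run shards in parallel; retry only failures. Return shards still failing."""
    pending = list(shards)
    for _ in range(max_rounds):
        if not pending:
            break
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(lambda s: _safe(load_one, s), pending))
        # Keep only the shards whose load did not succeed this round.
        pending = [s for s, ok in zip(pending, results) if not ok]
    return pending
```

Anything still in the returned list after `max_rounds` needs manual attention, for example a shard that consistently exceeds the timeout.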
Trust signals
  • · Concurrency-safe WRITE_APPEND strategy
  • · 2-month sharding (manageable context per shard)
  • · Timeout + resumability pattern
  • · Per-shard reporting (seen/appended counts)