cyberneticlibrary

Load bulk contract data into BigQuery

cf-parallel-harvester-rawloadworkflowsetup L3★0

chrisns/uk-tenders-mcp

What it does

Parallel, append-only bulk load into BigQuery from 2-month shards

Best for

Append-loading UK Contracts Finder historical data (2016-2026) into BigQuery in resumable 2-month shards.

Inputs

· GCP project, BQ location, Python venv
· Shard date ranges (from-to, 2-month windows)
· Bulk harvester data source
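
One way to produce the shard date ranges is sketched below. It assumes inclusive windows aligned to month starts and uses the 2016-11 through 2026-05 range listed under Preconditions. The function name `two_month_shards` is hypothetical, and the harvester may draw its boundaries differently.

```python
from datetime import date, timedelta


def two_month_shards(start: date, end: date) -> list[tuple[date, date]]:
    """Cover start..end with consecutive windows of at most two calendar months."""
    shards = []
    # Assumption: windows are aligned to the first day of a month.
    cursor = start.replace(day=1)
    while cursor <= end:
        # First day of the month two months after the cursor.
        month_index = cursor.year * 12 + (cursor.month - 1) + 2
        next_start = date(month_index // 12, month_index % 12 + 1, 1)
        # Windows are inclusive, so each ends the day before the next one starts.
        shard_end = min(next_start - timedelta(days=1), end)
        shards.append((cursor, shard_end))
        cursor = next_start
    return shards


shards = two_month_shards(date(2016, 11, 1), date(2026, 5, 31))
print(len(shards), shards[0], shards[-1])
```

With these assumed boundaries the sketch yields 58 windows, close to the roughly 59 quoted on this card; the exact count depends on where the real workflow starts and ends its first and last shard.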

Outputs

· RAWLOAD report (seen=N, appended=M per shard)
· Error text if failed/timed out
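
The per-shard report could be parsed along these lines. The card only states that each shard reports `seen=N` and `appended=M`; the surrounding line layout, the regular expression, and the name `parse_rawload_report` are assumptions for illustration.

```python
import re
from typing import Optional

# Assumed shape: "seen=<int>" followed later on the same line by "appended=<int>".
REPORT_RE = re.compile(r"seen=(\d+).*?appended=(\d+)")


def parse_rawload_report(text: str) -> Optional[dict]:
    """Return the seen/appended counts, or None if the text is not a report."""
    match = REPORT_RE.search(text)
    if match is None:
        return None
    return {"seen": int(match.group(1)), "appended": int(match.group(2))}


# Hypothetical report line; the real prefix and date format may differ.
print(parse_rawload_report("RAWLOAD 2016-11-01..2016-12-31 seen=1200 appended=1180"))
```

A `None` result can then be treated as the error-text case (failed or timed out).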

Requires

· BigQuery with the WRITE_APPEND write disposition (no compile step or DML in this phase)
· Contracts Finder bulk harvester API
· Python script runner (600s timeout per shard)
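
A minimal sketch of a script runner that enforces the 600s per-shard timeout, using only the standard library. The command shown is a stand-in: the real harvester entry point and its arguments are not given on this card, and `run_shard` is a hypothetical name.

```python
import subprocess
import sys

SHARD_TIMEOUT_S = 600  # per-shard limit stated on this card


def run_shard(cmd: list[str], timeout_s: float = SHARD_TIMEOUT_S) -> tuple[bool, str]:
    """Run one shard command; return (ok, report text or error text)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return False, f"timed out after {timeout_s}s"
    if proc.returncode != 0:
        return False, proc.stderr.strip() or f"exit code {proc.returncode}"
    return True, proc.stdout.strip()


# Stand-in command that prints a report-like line instead of calling the harvester.
ok, text = run_shard([sys.executable, "-c", "print('seen=3 appended=3')"])
print(ok, text)
```

Returning the error text rather than raising keeps one failed or timed-out shard from stopping the others.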

Preconditions

· BigQuery credentials (GCP_PROJECT env var)
· PYTHONPATH set to ingestion/src
· Part of a multi-phase workflow; this card covers the raw-load phase (parallel 2-month shards)
· 2016-11 through 2026-05 (~59 shards)
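
The environment preconditions above could be checked before any shard is launched. This is a sketch only: `missing_preconditions` is a hypothetical helper, and it checks that the variables are present, not that the credentials actually work.

```python
import os


def missing_preconditions(env: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    if not env.get("GCP_PROJECT"):
        problems.append("GCP_PROJECT is not set")
    # PYTHONPATH is a list of entries separated by os.pathsep.
    entries = env.get("PYTHONPATH", "").split(os.pathsep)
    normalized = [e.rstrip("/\\").replace("\\", "/") for e in entries]
    if not any(e.endswith("ingestion/src") for e in normalized):
        problems.append("PYTHONPATH does not include ingestion/src")
    return problems


print(missing_preconditions(dict(os.environ)))
```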

Failure modes

· Partial appends are acceptable (a shard can be re-run on retry)
· A shard that hits the 600s timeout may be only partially loaded
· No compile or dedup in this phase (handled in a separate step)
· Concurrency-safe via WRITE_APPEND (parallel shards append without conflicts)
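
The parallel, resumable pattern described above might be orchestrated as follows. This is a sketch under assumptions: `load_all` and `fake_load` are hypothetical, the worker count and number of retry rounds are arbitrary, and the real loader is replaced by a stub that fails once.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def load_all(shards: list[str], load_one: Callable[[str], bool],
             max_workers: int = 4, max_rounds: int = 3) -> list[str]:
    """Run load_one over shards in parallel, re-running failures.

    Returns the shards that still fail after max_rounds.
    """
    pending = list(shards)
    for _ in range(max_rounds):
        if not pending:
            break
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(load_one, pending))
        # Keep only the shards whose loader reported failure.
        pending = [s for s, ok in zip(pending, results) if not ok]
    return pending


attempts: dict[str, int] = {}


def fake_load(shard: str) -> bool:
    """Stub loader: one shard fails on its first attempt, as a timeout would."""
    attempts[shard] = attempts.get(shard, 0) + 1
    return not (shard == "2017-01" and attempts[shard] == 1)


still_failing = load_all(["2016-11", "2017-01", "2017-03"], fake_load)
print(still_failing, attempts)
```

Whether re-running a partially loaded shard appends rows a second time depends on the loader itself; this card defers deduplication to a separate step.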

Trust signals

· Concurrency-safe WRITE_APPEND strategy
· 2-month sharding (manageable context per shard)
· Timeout + resumability pattern
· Per-shard reporting (seen/appended counts)