cyberneticlibrary

Scale pandas workflows beyond memory

dask · skill · setup · L227,559
K-Dense-AI/scientific-agent-skills
What it does

Parallelizes pandas and NumPy workloads so they can run on larger-than-RAM datasets.

Best for

Scaling existing pandas code to multi-GB datasets without a Spark rewrite.

Inputs
  • CSV/Parquet files larger than RAM
  • pandas DataFrame operations
  • NumPy array chunks
Outputs
  • Lazy task graph
  • Computed results in DataFrame/Array format
Requires
  • Dask 2025.1+
  • pandas 2+
  • PyArrow 16+
  • s3fs/gcsfs (for reading from S3 or Google Cloud Storage)
Preconditions

Python 3.10+, Dask installed, and enough free disk space for spilling data that does not fit in memory.

Failure modes

Chunks that are too large cause out-of-memory (OOM) errors. Shuffle operations (for example, set_index or large joins) are slow on a single machine.
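The chunk-size failure mode can be sketched with a Dask array, assuming Dask and NumPy are installed. The array shape and chunk sizes below are arbitrary illustrative values, not recommendations.

```python
import dask.array as da

# 2000 x 2000 float64 array split into 1000 x 1000 chunks (4 chunks in total).
x = da.ones((2000, 2000), chunks=(1000, 1000))

# Bytes per chunk = elements per chunk * bytes per element.
chunk_bytes = x.chunksize[0] * x.chunksize[1] * x.dtype.itemsize

# If chunks are too large for the available memory, rechunk into smaller blocks.
smaller = x.rechunk((500, 500))

# The reduction is evaluated chunk by chunk when .compute() is called.
total = smaller.sum().compute()
```

Smaller chunks lower peak memory per task but add scheduling overhead, so the right size depends on the workload and the machine.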

Trust signals
  • Lazy, chunked evaluation keeps only part of the dataset in memory at a time
  • Distributed scheduler scales the same code to multiple machines
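A minimal sketch of the distributed scheduler, assuming the `distributed` package is installed alongside Dask. It starts a local, in-process cluster for illustration. A multi-machine deployment would instead pass the address of a running scheduler to `Client`, which is not shown here.

```python
import dask.array as da
from dask.distributed import Client

# Local, thread-based cluster for illustration; the dashboard is disabled.
client = Client(
    processes=False,
    n_workers=1,
    threads_per_worker=2,
    dashboard_address=None,
)

# Once a Client exists, .compute() uses its scheduler by default.
x = da.arange(1_000, chunks=100)
total = x.sum().compute()

# Shut down the client and its local cluster.
client.close()
```

The computation code is the same as with the default scheduler. Only the `Client` setup changes when moving to a cluster.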