cyberneticlibrary

Process ML datasets at scale

ray-data skill setup L39,423
Orchestra-Research/AI-Research-SKILLs
What it does

Process distributed ML data at scale

Best for

Batch inference and preprocessing on 100GB+ datasets across multi-node clusters.

Inputs
  • Parquet, CSV, or JSON files (S3, GCS, or local disk)
  • Ray cluster config
  • Batch size and transform functions
Outputs
  • Processed dataset (Parquet/CSV/JSON)
  • PyTorch/TensorFlow DataLoader
  • Streaming batches
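
A minimal sketch of how the inputs and outputs listed above could fit together, assuming Ray Data's `read_parquet`, `map_batches`, and `write_parquet` calls. The `value` column, the function names `add_log_column` and `run_pipeline`, and the example paths are hypothetical and only illustrate the shape of such a pipeline.

```python
import numpy as np


def add_log_column(batch: dict) -> dict:
    """Transform applied to one batch (a dict of column name -> NumPy array).

    "value" is a hypothetical input column used only for illustration.
    """
    batch["value_log"] = np.log1p(batch["value"])
    return batch


def run_pipeline(input_path: str, output_path: str, batch_size: int = 4096) -> None:
    """Read Parquet files, transform them batch by batch, write Parquet back."""
    # Imported here so the transform above can be tested without Ray installed.
    import ray

    ds = ray.data.read_parquet(input_path)  # read_csv / read_json for other formats
    ds = ds.map_batches(
        add_log_column,
        batch_size=batch_size,   # rows per batch; lower it if workers run out of memory
        batch_format="numpy",    # each batch arrives as a dict of NumPy arrays
    )
    ds.write_parquet(output_path)  # write_csv / write_json for other formats


# Example call (hypothetical paths):
# run_pipeline("s3://example-bucket/raw/", "s3://example-bucket/processed/")
```

For the streaming and training-loop outputs, the write step could be swapped for `ds.iter_batches(batch_size=...)` or `ds.iter_torch_batches(batch_size=...)`, which yield batches without materializing the whole dataset.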
Requires
  • ray[data] (Ray installed with the Data extra)
  • PyArrow
  • Pandas
Preconditions
  • Ray cluster running
  • Input data readable from every worker node
  • For GPU workloads: an H100/A100 cluster
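
The preconditions above can be checked before a job is submitted. The sketch below assumes `ray.init(address="auto")` is used to attach to an already running cluster and that `ray.cluster_resources()` reports `CPU` and `GPU` totals. The helper names `missing_preconditions` and `check_running_cluster` are hypothetical.

```python
def missing_preconditions(resources: dict, need_gpu: bool = False, min_cpus: int = 1) -> list:
    """Return human-readable problems found in a cluster resource summary.

    `resources` has the shape returned by ray.cluster_resources(),
    for example {"CPU": 64.0, "GPU": 8.0}.
    """
    problems = []
    cpus = resources.get("CPU", 0)
    if cpus < min_cpus:
        problems.append(f"need at least {min_cpus} CPU(s), found {cpus:g}")
    if need_gpu and resources.get("GPU", 0) < 1:
        problems.append("no GPUs registered with the cluster")
    return problems


def check_running_cluster(need_gpu: bool = False) -> list:
    """Attach to a running cluster and report unmet preconditions."""
    # Imported here so the pure check above works without Ray installed.
    import ray

    # address="auto" attaches to an existing cluster rather than starting a
    # local one; Ray is expected to raise ConnectionError if none is found.
    ray.init(address="auto")
    return missing_preconditions(ray.cluster_resources(), need_gpu=need_gpu)
```

This only counts GPUs. It does not verify the GPU model, so an H100/A100 requirement would still need a separate check, and data accessibility is best confirmed by reading a small sample from the actual input location.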
Failure modes
  • Missing S3 credentials
  • Out-of-memory errors when the batch size is too large
  • Incompatible Arrow schema
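
Each failure mode above can be caught with a cheap check before the job starts. The helpers below are a hypothetical sketch, not part of Ray or of this skill. They assume credentials are supplied through the standard AWS environment variables (instance roles or config files also work and would make this check too strict), that a rough bytes-per-row estimate is available, and that schemas have been summarized as plain column-to-type-name dicts.

```python
import os


def missing_aws_env(env=None) -> list:
    """List the standard AWS credential variables that are unset or empty."""
    env = os.environ if env is None else env
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    return [name for name in required if not env.get(name)]


def safe_batch_size(bytes_per_row: int, memory_budget_bytes: int, max_rows: int = 4096) -> int:
    """Largest row count whose estimated size fits the budget.

    The result is capped at max_rows and never drops below 1.
    """
    if bytes_per_row <= 0:
        raise ValueError("bytes_per_row must be positive")
    return max(1, min(max_rows, memory_budget_bytes // bytes_per_row))


def schema_mismatches(expected: dict, actual: dict) -> dict:
    """Map each differing column to (expected type, actual type).

    None marks a column that is missing on one side.
    """
    columns = sorted(set(expected) | set(actual))
    return {
        col: (expected.get(col), actual.get(col))
        for col in columns
        if expected.get(col) != actual.get(col)
    }
```

For example, `safe_batch_size(1_000_000, 256_000_000)` suggests 256 rows per batch for 1 MB rows under a 256 MB budget. The value could then be passed as `batch_size` to `map_batches`.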
Trust signals
  • Orchestra Research official skill
  • Production Ray ecosystem
  • PyTorch/TensorFlow integration