cyberneticlibrary

Process ML datasets at scale

ray-data · skill · setup · L3 · ★ 9,423

Orchestra-Research/AI-Research-SKILLs

What it does

Reads, transforms, and writes ML datasets in batches, distributing the work across a Ray cluster.

Best for

Batch inference and preprocessing on 100GB+ datasets across multi-node clusters.

Inputs

· Parquet/CSV/JSON files (S3, GCS, local)
· Ray cluster config
· Batch size and transform functions
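
As a rough illustration of how these inputs fit together, the sketch below reads Parquet files with Ray Data and applies a transform function batch by batch. The column name `value`, the function names `log_transform` and `preprocess`, and the default batch size are assumptions made for this example, not part of the skill itself.

```python
from typing import Dict

import numpy as np


def log_transform(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    """Transform function: add a log1p copy of a hypothetical 'value' column."""
    values = np.asarray(batch["value"], dtype=np.float64)
    # Return a new dict so the incoming batch is left untouched.
    return {**batch, "value_log": np.log1p(values)}


def preprocess(input_path: str, batch_size: int = 4096):
    """Read Parquet from a local path or s3:// / gs:// URI and map the transform.

    Ray is imported here so the transform above can be tested without a cluster.
    """
    import ray

    ds = ray.data.read_parquet(input_path)
    # batch_format="numpy" hands each batch to the function as a dict of arrays.
    return ds.map_batches(log_transform, batch_size=batch_size, batch_format="numpy")
```

A call such as `preprocess("s3://my-bucket/events/", batch_size=2048)` (the bucket name is hypothetical) would return a lazily evaluated dataset; nothing is read until the result is consumed or written.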

Outputs

· Processed dataset (Parquet/CSV/JSON)
· PyTorch/TensorFlow DataLoader
· Streaming batches
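
The three output forms map onto Ray Data calls roughly as sketched below. The wrapper names and default batch sizes are assumptions for illustration; `ds` is expected to be a Ray Data dataset, and `iter_torch_batches` additionally requires PyTorch to be installed.

```python
from typing import Any, Dict, Iterable


def write_processed(ds: Any, output_path: str) -> None:
    """Persist the processed dataset as Parquet files under output_path."""
    ds.write_parquet(output_path)


def stream_batches(ds: Any, batch_size: int = 1024) -> Iterable[Dict[str, Any]]:
    """Stream batches as dicts of NumPy arrays without materializing the dataset."""
    return ds.iter_batches(batch_size=batch_size, batch_format="numpy")


def torch_batches(ds: Any, batch_size: int = 256) -> Iterable[Dict[str, Any]]:
    """Iterate over dicts of torch tensors, for use in place of a DataLoader loop."""
    return ds.iter_torch_batches(batch_size=batch_size)
```

In a training loop this might be consumed as `for batch in torch_batches(ds): ...`, where each `batch` is keyed by column name.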

Requires

· Ray[data]
· PyArrow
· Pandas

Preconditions

· Ray cluster running
· Data accessible
· For GPU workloads: H100/A100 cluster
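
One way to check these preconditions before submitting work is to attach to the running cluster and inspect its resources. This is a minimal sketch: the helper names are hypothetical, and it assumes the script runs on a machine that can reach the cluster.

```python
from typing import Dict


def gpu_count(resources: Dict[str, float]) -> int:
    """Number of GPUs in a resource dict shaped like ray.cluster_resources()."""
    return int(resources.get("GPU", 0))


def connect_and_report() -> Dict[str, float]:
    """Attach to an already-running Ray cluster and return its total resources."""
    import ray  # deferred so gpu_count can be used without Ray installed

    # address="auto" connects to an existing cluster instead of starting a new
    # local one; it raises ConnectionError if no cluster is found.
    ray.init(address="auto")
    return ray.cluster_resources()
```

A GPU job could then be gated with something like `if gpu_count(connect_and_report()) == 0: raise RuntimeError("no GPUs available")`.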

Failure modes

· S3 credentials missing
· Out-of-memory errors with large batch sizes
· Arrow schema incompatible
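
Two of these failures can be partly anticipated with simple pre-flight checks, sketched below. Both helpers are hypothetical. The credential check only looks at environment variables, so it will report missing keys even when credentials come from a profile file or an instance role, and the batch-size cap relies on a caller-supplied estimate of bytes per row.

```python
from typing import List, Mapping


def missing_s3_env_vars(env: Mapping[str, str]) -> List[str]:
    """Return the standard AWS credential variables that are unset or empty."""
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    return [name for name in required if not env.get(name)]


def safe_batch_size(approx_row_bytes: int, memory_budget_bytes: int,
                    requested: int) -> int:
    """Cap the requested batch size so one batch fits within the memory budget."""
    if approx_row_bytes <= 0:
        raise ValueError("approx_row_bytes must be positive")
    rows_that_fit = memory_budget_bytes // approx_row_bytes
    # Never return less than one row, even if a single row exceeds the budget.
    return max(1, min(requested, rows_that_fit))
```

For example, `missing_s3_env_vars(os.environ)` can be called before reading from S3, and the result of `safe_batch_size` can be passed as the `batch_size` for a batch transform.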

Trust signals

· Orchestra Research official skill
· Production Ray ecosystem
· PyTorch/TensorFlow integration