Process ML datasets at scale
ray-data · skill · setup L3 · ★9,423
Orchestra-Research/AI-Research-SKILLs
What it does
Process distributed ML data at scale
Best for
Batch inference and preprocessing on 100GB+ datasets across multi-node clusters.
Inputs
- · Parquet/CSV/JSON files (S3, GCS, local)
- · Ray cluster config
- · Batch size and transform functions
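A minimal sketch of how these three inputs might fit together, assuming Ray Data's `read_parquet`/`read_csv`/`read_json` readers and `map_batches`. The `value` column, the `scale_values` transform, and the `load_and_transform` helper are hypothetical names chosen for illustration; they are not part of the skill itself.

```python
from typing import Dict

import numpy as np


def scale_values(batch: Dict[str, np.ndarray], factor: float = 0.5) -> Dict[str, np.ndarray]:
    """Hypothetical transform: add a scaled copy of an assumed 'value' column."""
    batch["value_scaled"] = batch["value"].astype(np.float64) * factor
    return batch


def load_and_transform(path: str, fmt: str = "parquet", batch_size: int = 1024):
    """Read a dataset in the given format and apply the transform batch by batch."""
    import ray  # imported lazily so the transform can be tested without Ray installed

    readers = {
        "parquet": ray.data.read_parquet,
        "csv": ray.data.read_csv,
        "json": ray.data.read_json,
    }
    if fmt not in readers:
        raise ValueError(f"unsupported input format: {fmt!r}")

    ds = readers[fmt](path)  # path may be local, s3://..., or gs://...
    return ds.map_batches(
        scale_values,
        batch_size=batch_size,
        batch_format="numpy",  # batches arrive as dict[str, np.ndarray]
        fn_kwargs={"factor": 0.5},
    )
```

Keeping the transform a plain function over a dictionary of arrays means it can be checked locally on a small batch before it is run on a cluster.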
Outputs
- · Processed dataset (Parquet/CSV/JSON)
- · PyTorch/TensorFlow DataLoader
- · Streaming batches
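The output options above could be produced roughly as follows. This is a sketch under assumptions: `write_dataset` and `stream_batches` are hypothetical helpers, and `ds` is assumed to be a Ray Dataset. Note that Ray Data yields framework batches through its own iterators (for example `iter_torch_batches`, or `to_tf` for TensorFlow) rather than through a `torch.utils.data.DataLoader` object.

```python
# Map from output format to the Ray Dataset writer method of the same purpose.
_WRITERS = {"parquet": "write_parquet", "csv": "write_csv", "json": "write_json"}


def write_dataset(ds, path: str, fmt: str = "parquet") -> None:
    """Persist a processed dataset to `path` in the requested format."""
    if fmt not in _WRITERS:
        raise ValueError(f"unsupported output format: {fmt!r}")
    getattr(ds, _WRITERS[fmt])(path)


def stream_batches(ds, batch_size: int = 256, framework: str = "numpy"):
    """Return an iterator of batches, either as torch tensors or NumPy arrays."""
    if framework == "torch":
        # Yields dict[str, torch.Tensor]; requires PyTorch to be installed.
        return ds.iter_torch_batches(batch_size=batch_size)
    return ds.iter_batches(batch_size=batch_size, batch_format="numpy")


# Example usage (assumes `ds` is an existing Ray Dataset):
#   write_dataset(ds, "s3://my-bucket/processed/", fmt="parquet")  # hypothetical path
#   for batch in stream_batches(ds, batch_size=512, framework="torch"):
#       ...
```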
Requires
- · Ray[data]
- · PyArrow
- · Pandas
Preconditions
- · Ray cluster running
- · Input data readable from the cluster
- · For GPU workloads: cluster with H100 or A100 GPUs
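The preconditions above can be checked before a job starts. The sketch below assumes an already running cluster reachable via `ray.init(address="auto")` and uses `ray.cluster_resources()` to inspect it; `missing_resources` and `check_cluster` are hypothetical helper names. It only counts GPUs and does not verify the GPU model.

```python
from typing import List, Mapping


def missing_resources(
    resources: Mapping[str, float], min_cpus: int = 1, min_gpus: int = 0
) -> List[str]:
    """Compare a resource mapping against minimum requirements."""
    problems = []
    for name, needed in (("CPU", min_cpus), ("GPU", min_gpus)):
        found = resources.get(name, 0)
        if found < needed:
            problems.append(f"{name}: need {needed}, found {found}")
    return problems


def check_cluster(min_cpus: int = 1, min_gpus: int = 0) -> List[str]:
    """Connect to a running Ray cluster and report unmet requirements."""
    import ray  # imported lazily; raises ConnectionError if no cluster is found

    ray.init(address="auto", ignore_reinit_error=True)
    return missing_resources(ray.cluster_resources(), min_cpus, min_gpus)
```

An empty list from `check_cluster` suggests the cluster is up and has at least the requested CPU and GPU counts.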
Failure modes
- · S3 credentials missing
- · Out-of-memory on large batch
- · Arrow schema incompatible
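Each of these failure modes can be probed cheaply before launching a large job. The helpers below are an illustrative sketch, not part of the skill: the names are hypothetical, the credential check only covers the standard AWS environment variables (instance profiles and credential files are not detected), and the batch-size estimate assumes a rough per-row size supplied by the caller.

```python
import os
from typing import Dict, Mapping, Optional, Tuple


def has_aws_env_credentials(env: Optional[Mapping[str, str]] = None) -> bool:
    """True if both standard AWS credential variables are set and non-empty."""
    env = os.environ if env is None else env
    return bool(env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"))


def safe_batch_size(
    approx_row_bytes: int, memory_budget_bytes: int, headroom: float = 0.25
) -> int:
    """Largest batch size that fits the budget after reserving `headroom`."""
    usable = memory_budget_bytes * (1.0 - headroom)
    return max(1, int(usable // approx_row_bytes))


def schema_mismatches(
    expected: Mapping[str, str], actual: Mapping[str, str]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Columns whose type differs from, or is missing in, the actual schema.

    Both arguments map column name to a type string, for example built from
    a pyarrow schema with {f.name: str(f.type) for f in schema}.
    """
    return {
        name: (dtype, actual.get(name))
        for name, dtype in expected.items()
        if actual.get(name) != dtype
    }
```

The value returned by `safe_batch_size` could then be passed as the `batch_size` argument of a batch transform, reducing it further if out-of-memory errors persist.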
Trust signals
- · Orchestra Research official skill
- · Production Ray ecosystem
- · PyTorch/TensorFlow integration