Deploy ML models on serverless GPUs

modal-serverless-gpu · skill · setup · L29,423
Orchestra-Research/AI-Research-SKILLs
What it does

Deploy ML workloads to Modal serverless GPU endpoints

Best for

Deploying inference-only ML models without managing containers or servers, billed only for the compute time each invocation uses.

Inputs
  • Python function (trained model, inference logic)
  • Required GPU type
Outputs
  • HTTPS endpoint URL
  • Invocation method (REST/webhook)
Requires
  • Modal SDK
  • Containerization (handled implicitly by Modal)
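The REST invocation listed under Outputs can be sketched with the Python standard library alone. The URL below is a placeholder (the real one is printed at deploy time), and the JSON payload shape is a hypothetical one chosen for illustration; the endpoint defines what it actually accepts.

```python
import json
import urllib.request

# Placeholder URL; replace with the one printed when the app is deployed.
ENDPOINT_URL = "https://your-workspace--your-app-predict.modal.run"


def build_request(url: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for the endpoint without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_endpoint(url: str, payload: dict, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response."""
    request = build_request(url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call would then look like `call_endpoint(ENDPOINT_URL, {"values": [1, 2, 3]})`. The generous default timeout is a deliberate choice, since the first request after an idle period may have to wait for a container to start.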
Preconditions

A Modal account with GPU quota; the function packaged as a Python module

Failure modes
  • cold start latency (spin-up time)
  • timeout on long inference
  • GPU memory exceeded
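On the client side, the first two failure modes are often handled by retrying with exponential backoff. The helper below is a minimal sketch of that idea rather than anything prescribed by Modal; the function name, defaults, and the choice of which exceptions to retry are assumptions made for this example.

```python
import time
import urllib.error
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on timeouts and connection errors.

    The wait doubles after each failed attempt: base_delay, 2x, 4x, ...
    The last failure is re-raised so the caller can see it.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except (TimeoutError, urllib.error.URLError):
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

Retrying does not usually help when GPU memory is exceeded, since the same input tends to fail the same way; that case typically calls for a larger GPU, a smaller batch, or a smaller model.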
Trust signals
  • serverless cost model explained
  • cold-start time quantified
  • GPU endpoint example provided