Deploy high-throughput LLM APIs

serving-llms-vllm · skill · setup · L49,423
Orchestra-Research/AI-Research-SKILLs
What it does

Serve LLMs at scale with PagedAttention and continuous (dynamic) batching.

Best for

Production serving of open models on NVIDIA hardware when high throughput and predictable latency are required.

Inputs
  • · model checkpoint
  • · batch size
  • · tensor parallelism config
  • · quantization method
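As a rough sketch, the four inputs above could be turned into a `vllm serve` command line along these lines. The helper name `build_serve_command` and the checkpoint name `my-org/my-7b-awq` are hypothetical. Reading "batch size" as `--max-num-seqs` (the cap on concurrently scheduled sequences) is an assumption, and the flag names should be checked against the installed vLLM version.

```python
from typing import List, Optional


def build_serve_command(
    model: str,
    tensor_parallel_size: int = 1,
    max_num_seqs: Optional[int] = None,
    quantization: Optional[str] = None,
    port: int = 8000,
) -> List[str]:
    """Assemble an argv list for `vllm serve` (hypothetical helper, not part of vLLM)."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be >= 1")
    cmd = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--port", str(port),
    ]
    # Assumption: the card's "batch size" input maps to the scheduler's sequence cap.
    if max_num_seqs is not None:
        cmd += ["--max-num-seqs", str(max_num_seqs)]
    # Only pass a quantization method when the checkpoint was quantized that way.
    if quantization is not None:
        cmd += ["--quantization", quantization]
    return cmd


if __name__ == "__main__":
    # "my-org/my-7b-awq" is a placeholder checkpoint name.
    argv = build_serve_command(
        "my-org/my-7b-awq",
        tensor_parallel_size=2,
        max_num_seqs=64,
        quantization="awq",
    )
    print(" ".join(argv))
```

Returning an argv list rather than a shell string keeps the command safe to hand to `subprocess.run` without quoting issues.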
Outputs
  • · OpenAI-compatible API endpoint
  • · throughput metrics (tok/s)
  • · latency percentiles
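The throughput and latency outputs listed above can be derived from per-request measurements. The sketch below assumes a load test has already recorded each request's end-to-end latency and completion token count. The function names and the nearest-rank percentile definition are illustrative choices, not something vLLM prescribes.

```python
from typing import Dict, Sequence


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(values)
    # Integer form of ceil(p * n / 100), avoiding float rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(
    latencies_s: Sequence[float],
    completion_tokens: Sequence[int],
    wall_time_s: float,
) -> Dict[str, float]:
    """Aggregate one load-test run into throughput and latency percentiles."""
    if wall_time_s <= 0:
        raise ValueError("wall_time_s must be positive")
    return {
        "throughput_tok_s": sum(completion_tokens) / wall_time_s,
        "p50_s": percentile(latencies_s, 50),
        "p90_s": percentile(latencies_s, 90),
        "p99_s": percentile(latencies_s, 99),
    }


if __name__ == "__main__":
    # Made-up sample numbers, purely to show the call shape.
    stats = summarize([0.4, 0.5, 0.7, 1.2], [128, 128, 256, 256], wall_time_s=1.5)
    print(stats)
```

Throughput is computed against wall-clock time for the whole run rather than the sum of latencies, since requests overlap under batching.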
Requires
  • · vLLM framework
  • · NVIDIA CUDA GPUs
  • · transformers library
Preconditions

NVIDIA GPU required (16 GB+ VRAM for 7B models); CUDA 11.8+; enough GPUs to cover the configured tensor-parallel degree
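The 16 GB figure for 7B models follows from simple arithmetic on weight size. A back-of-the-envelope sketch is below. It assumes 2 bytes per parameter for FP16/BF16 and roughly 0.5 bytes for 4-bit quantization, takes 1 GB as 10^9 bytes, assumes tensor parallelism splits weights evenly, and ignores the KV cache, activations, and CUDA context. The function name is hypothetical.

```python
def weight_memory_gb(
    params_billion: float,
    bytes_per_param: float = 2.0,
    tensor_parallel_size: int = 1,
) -> float:
    """Approximate per-GPU weight memory in GB (weights only, assumed even split)."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be >= 1")
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes-per-GB
    return params_billion * bytes_per_param / tensor_parallel_size


if __name__ == "__main__":
    print(weight_memory_gb(7))             # 7B model in FP16 on one GPU
    print(weight_memory_gb(7, 0.5))        # same model, assumed ~4-bit weights
    print(weight_memory_gb(70, 2.0, 4))    # 70B model in FP16 across 4 GPUs
```

Under these assumptions a 7B FP16 model needs about 14 GB for weights alone, which leaves only around 2 GB on a 16 GB card for the KV cache and everything else.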

Failure modes
  • · Out-of-memory on large batch
  • · Long time-to-first-token (TTFT)
  • · Context window exhausted
  • · Tensor parallelism overhead
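The out-of-memory and context-window failure modes both come down to KV cache size, which grows with batch size times sequence length. The sketch below estimates that footprint. The example shape (32 layers, 32 KV heads, head dimension 128, FP16) is an assumed Llama-2-7B-like configuration, and both function names are hypothetical.

```python
def kv_bytes_per_token(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,
) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Leading 2 = one key tensor plus one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_full_length_sequences(
    kv_budget_bytes: int,
    max_seq_len: int,
    per_token_bytes: int,
) -> int:
    """How many sequences fit if every one grows to max_seq_len (worst case)."""
    return kv_budget_bytes // (max_seq_len * per_token_bytes)


if __name__ == "__main__":
    per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
    print(per_token / 1024**2, "MiB per token")
    # Assume 2 GiB is left for KV cache after loading weights.
    print(max_full_length_sequences(2 * 1024**3, 2048, per_token), "sequences")
```

With these assumed numbers each token costs 0.5 MiB, so a 2 GiB budget holds only two sequences at 2048 tokens each. That is why lowering the maximum context length or the concurrent-sequence cap is a common first response to OOM errors.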
Trust signals
  • · PagedAttention innovation (memory efficiency)
  • · Dynamic batching for latency
  • · Industry standard at scale
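To make the memory-efficiency claim for PagedAttention concrete, here is a toy block-table allocator. It only illustrates the idea of handing out fixed-size KV blocks on demand instead of reserving a full context window per request. It is not vLLM's implementation, and the class name, method names, and 16-token block size are all illustrative assumptions.

```python
from typing import Dict, List


class ToyBlockAllocator:
    """Toy model of paged KV-cache bookkeeping (not vLLM's actual code)."""

    def __init__(self, num_blocks: int, block_size: int = 16) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}  # seq id -> physical block ids
        self.lengths: Dict[str, int] = {}              # seq id -> tokens stored

    def append_tokens(self, seq_id: str, num_tokens: int) -> None:
        """Grow a sequence, allocating new blocks only when the last one is full."""
        table = self.block_tables.get(seq_id, [])
        new_len = self.lengths.get(seq_id, 0) + num_tokens
        blocks_needed = -(-new_len // self.block_size) - len(table)  # ceil division
        if blocks_needed > len(self.free_blocks):
            raise MemoryError("out of KV blocks")
        new_blocks = [self.free_blocks.pop() for _ in range(blocks_needed)]
        self.block_tables[seq_id] = table + new_blocks
        self.lengths[seq_id] = new_len

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


if __name__ == "__main__":
    alloc = ToyBlockAllocator(num_blocks=4)
    alloc.append_tokens("req-a", 17)   # prompt spills into a second block
    alloc.append_tokens("req-b", 1)
    print(alloc.block_tables, len(alloc.free_blocks))
    alloc.free("req-a")
    print(len(alloc.free_blocks))
```

Because blocks need not be contiguous, at most one partially filled block per sequence is wasted, and blocks freed by a finished request are immediately reusable by others.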