Deploy high-throughput LLM APIs
serving-llms-vllm · skill · setup · L4 · ★ 9,423
Orchestra-Research/AI-Research-SKILLs
What it does
Serve LLMs with PagedAttention and dynamic batching at scale
Best for
Production serving of open models on NVIDIA hardware when high throughput and predictable latency are required.
Inputs
- · model checkpoint
- · batch size
- · tensor parallelism config
- · quantization method
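The four inputs map roughly onto launch flags of the `vllm serve` command. The sketch below only assembles the argument list and does not start a server. The helper name `build_serve_command` and the model ID in the usage line are hypothetical, and treating "batch size" as `--max-num-seqs` (the cap on sequences scheduled together) is an assumption, since vLLM batches requests continuously rather than with a fixed batch size.

```python
from typing import List, Optional


def build_serve_command(
    model: str,
    max_num_seqs: int = 256,
    tensor_parallel_size: int = 1,
    quantization: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for `vllm serve` (hypothetical helper, not part of vLLM)."""
    if max_num_seqs < 1 or tensor_parallel_size < 1:
        raise ValueError("max_num_seqs and tensor_parallel_size must be >= 1")
    cmd = [
        "vllm", "serve", model,                      # model checkpoint
        "--max-num-seqs", str(max_num_seqs),         # assumed stand-in for "batch size"
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    if quantization is not None:
        # The checkpoint must already be quantized with the named method.
        cmd += ["--quantization", quantization]
    return cmd


# Usage: "my-org/my-7b-awq" is a placeholder model ID, not a real checkpoint.
print(" ".join(build_serve_command("my-org/my-7b-awq", 64, 2, "awq")))
```

The returned list can be passed to `subprocess.run` on a machine where vLLM is installed.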
Outputs
- · OpenAI-compatible API endpoint
- · throughput metrics (tok/s)
- · latency percentiles
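The throughput and latency outputs can be derived from per-request measurements. The following is a minimal sketch that assumes latencies were recorded client-side in seconds; the function names and the sample numbers are illustrative, and the nearest-rank method is one of several common percentile definitions.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def throughput_tok_per_s(total_output_tokens: int, wall_clock_s: float) -> float:
    """Aggregate generation throughput over a measurement window."""
    if wall_clock_s <= 0:
        raise ValueError("wall_clock_s must be positive")
    return total_output_tokens / wall_clock_s


# Made-up sample data: ten end-to-end request latencies in seconds.
latencies = [0.8, 1.1, 0.9, 2.4, 1.0, 1.3, 0.7, 1.2, 1.6, 0.95]
for p in (50, 90, 99):
    print(f"p{p}: {percentile(latencies, p)} s")
print(f"throughput: {throughput_tok_per_s(1280, 4.0)} tok/s")
```

Tail percentiles such as p99 are only meaningful with many more samples than the ten used here.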
Requires
- · vLLM framework
- · NVIDIA CUDA GPUs
- · transformers library
Preconditions
NVIDIA GPU required (16 GB+ VRAM for 7B models); CUDA 11.8+; enough GPUs to match the tensor-parallel size
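The 16 GB figure for 7B models follows from simple arithmetic: 7 billion parameters at 2 bytes each (FP16/BF16) take about 14 GB for weights alone, and the remainder holds the KV cache. The sketch below works through that estimate. The layer and head counts are an assumed Llama-2-7B-like shape used for illustration, and the calculation ignores activations and runtime overhead, so real headroom is smaller.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory for model weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """KV cache size for one token: a key and a value vector in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(free_bytes: float, per_token_bytes: int) -> int:
    """How many tokens of KV cache fit in the remaining memory."""
    return int(free_bytes // per_token_bytes)


weights_gb = weight_memory_gb(7e9, 2)              # 7B params in FP16
per_token = kv_cache_bytes_per_token(32, 32, 128)  # assumed 7B-class shape
free_bytes = (16 - weights_gb) * 1e9               # what is left on a 16 GB card

print(f"weights: {weights_gb} GB")
print(f"KV cache per token: {per_token} bytes")
print(f"tokens that fit: {max_cached_tokens(free_bytes, per_token)}")
```

Under these assumptions a 16 GB card leaves room for only a few thousand cached tokens across all concurrent requests, which is why larger cards or quantized weights help at higher concurrency.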
Failure modes
- · Out-of-memory errors on large batches
- · Long time-to-first-token (TTFT)
- · Context window exhausted
- · Tensor-parallel communication overhead across GPUs
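Of these failure modes, context window exhaustion is the easiest to catch before a request is sent: the prompt length plus the completion budget must fit within the server's maximum model length (set with `--max-model-len` in vLLM). A minimal client-side sketch, assuming the caller already knows the prompt's token count; `clamp_max_tokens` is a hypothetical helper, not a vLLM function.

```python
def clamp_max_tokens(
    prompt_tokens: int, requested_max_tokens: int, max_model_len: int
) -> int:
    """Return the largest completion budget that keeps prompt + completion inside the context window."""
    remaining = max_model_len - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt alone fills or exceeds the context window")
    return min(requested_max_tokens, remaining)


# A 3000-token prompt on a 4096-token window leaves 1096 tokens for the reply.
print(clamp_max_tokens(3000, 2048, 4096))
```

The other failure modes are usually addressed through server settings rather than client checks, for example by lowering `--max-num-seqs` when large batches run out of memory.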
Trust signals
- · PagedAttention (memory-efficient KV cache management)
- · Dynamic (continuous) batching for throughput and latency
- · Widely used for production serving at scale