cyberneticlibrary

Deploy high-throughput LLM APIs

serving-llms-vllmskillsetup L4★9,423

Orchestra-Research/AI-Research-SKILLs ↗

What it does

Serve LLMs with PagedAttention and dynamic batching at scale

Best for

Production serving of open models on NVIDIA hardware when high throughput and predictable latency required.

Inputs

· model checkpoint
· batch size
· tensor parallelism config
· quantization method

Outputs

· OpenAI-compatible API endpoint
· throughput metrics (tok/s)
· latency percentiles

Requires

· vLLM framework
· NVIDIA CUDA GPUs
· transformers library

Preconditions

NVIDIA GPU required (16GB+ VRAM for 7B models); CUDA 11.8+; sufficient GPU count for parallelism

Failure modes

· Out-of-memory on large batch
· Long time-to-first-token (TTFT)
· Context window exhausted
· Tensor parallelism overhead

Trust signals

· PagedAttention innovation (memory efficiency)
· Dynamic batching for latency
· Industry standard at scale