Deploy LLMs in One Click
In this guide, we’ll cover how to deploy different LLMs, choose the right serving stack, and optimize for latency, cost, scaling, and reliability.
Self-hosting an LLM shouldn’t mean fighting CUDA versions, broken wheels, flash-attn builds, OOMs, and “works locally, fails on the server”.
Deploy Llama 3.1 8B in One Click (Production-Ready)
This page lets you deploy Llama 3.1 8B on our GPU servers with a single click and get a production-ready, OpenAI-compatible API endpoint (with auth, logs, metrics, and sane defaults).
What you get
- OpenAI-compatible endpoint (/v1/chat/completions, streaming supported)
- Dedicated deployment URL + API key
- Presets for context length, quantization, batching, concurrency
- Reliability defaults: health checks, auto-restart, timeouts, retries
GPU requirements (recommended)
Llama 3.1 8B is inference-friendly, but your experience depends on VRAM, context length, and precision.
- Recommended minimum: 16 GB VRAM
- Good baseline: 24 GB VRAM
- High throughput / heavy batching: 40–80 GB VRAM
Exact VRAM needs vary by runtime (vLLM/TGI), KV cache size, context length, and concurrency. That’s why our presets are designed to be safe by default.
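To see why context length and concurrency matter so much, here is a back-of-the-envelope VRAM sketch. The layer/head counts are Llama 3.1 8B's published architecture (32 layers, 8 KV heads with GQA, head dim 128); real runtimes add activation and workspace overhead on top, so treat the output as a lower bound, not a sizing guarantee.

```python
# Rough VRAM estimate for Llama 3.1 8B serving. Illustrative only:
# runtimes (vLLM/TGI) add activation + workspace overhead on top.

def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, batch,
                 bytes_per_elem=2):
    # 2x for keys and values, FP16 elements by default
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len * batch
    return elems * bytes_per_elem / 1024**3

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128
weights_gib = 8e9 * 2 / 1024**3  # FP16 weights: ~14.9 GiB
kv = kv_cache_gib(32, 8, 128, context_len=8192, batch=4)
print(f"weights ~{weights_gib:.1f} GiB, KV cache ~{kv:.1f} GiB")
```

Note how the KV cache scales linearly with both context length and batch size, which is why the 32k presets demand bigger cards.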
Expected performance (tokens/sec)
Performance depends on GPU class, quantization, batching, and concurrency. Typical single-stream ranges:
| GPU tier | Expected tokens/sec (approx.) | Best for |
|---|---|---|
| L4 / A10 class | ~35–90 tok/s | MVPs, prototypes, early SaaS |
| A100 40GB | ~90–180 tok/s | higher traffic, lower latency |
| A100 80GB / H100 | ~140–300+ tok/s | heavy throughput + batching |
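To translate those throughput numbers into user-facing latency, a simple model is total time ≈ TTFT + output tokens ÷ decode speed. The figures below are illustrative placeholders, not measurements:

```python
# Rough end-to-end latency from single-stream decode speed:
# total ~ TTFT + output_tokens / tokens_per_second (illustrative only).
def response_latency_s(output_tokens, toks_per_s, ttft_s=0.5):
    return ttft_s + output_tokens / toks_per_s

# A 300-token reply at ~60 tok/s (L4/A10-class) with 0.5 s TTFT:
print(round(response_latency_s(300, 60), 2))  # → 5.5
```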
Deploy in 3 steps:
Step 1 — Pick your presets
Choose:
- Precision: FP16 / INT8 / 4-bit
- Context length: 8k / 16k / 32k (VRAM dependent)
Step 2 — Click Deploy
We provision the GPU server, pull a tested runtime, and start the model server.
Step 3 — Copy your endpoint + API key
You get:
- Base URL
- API key
- Logs + metrics dashboard
- Health endpoint
OpenAI-compatible endpoint snippet
Use your deployed Llama endpoint with OpenAI-style clients.
curl https://YOUR_ENDPOINT/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.1-8b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a concise product description for my app."}
],
"temperature": 0.7,
"stream": true
}'
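Because the endpoint is OpenAI-compatible, any OpenAI-style client works. As a sketch, here is the same request using only the Python standard library (BASE_URL and API_KEY are placeholders for your deployment's values):

```python
# Minimal stdlib client for the deployed endpoint (no SDK needed).
# BASE_URL and API_KEY are placeholders for your deployment's values.
import json
import urllib.request

BASE_URL = "https://YOUR_ENDPOINT"
API_KEY = "YOUR_API_KEY"

def build_request(prompt, model="llama-3.1-8b", stream=False):
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )

# resp = urllib.request.urlopen(build_request("Hello!"))  # uncomment to call
```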
Deploy Qwen 2.5 16B in One Click (Production-Ready)
Self-hosting an LLM shouldn’t mean fighting CUDA versions, broken wheels, flash-attn builds, OOMs, and “works locally, fails on the server”.
This page lets you deploy Qwen 2.5 16B on our GPU servers with a single click and get a production-ready, OpenAI-compatible API endpoint (with auth, logs, metrics, and sane defaults).
Cold starts vary by GPU tier and model size; for consistent latency, keep at least one instance warm.
What you get
- OpenAI-compatible endpoint (/v1/chat/completions, streaming supported)
- Dedicated deployment URL + API key
- Presets for context length, quantization, batching, concurrency
- Observability: latency, TTFT (time-to-first-token), tokens/sec, GPU memory, error rate
- Reliability defaults: health checks, auto-restart, timeouts, retries
- Scale controls: concurrency caps + optional warm instances (“keep one hot”)
GPU requirements (recommended)
Qwen 2.5 16B is inference-friendly, but your experience depends on VRAM, context length, and precision.
- Recommended minimum: 16 GB VRAM
- Good baseline: 24 GB VRAM
- High throughput / heavy batching: 40–80 GB VRAM
Exact VRAM needs vary by runtime (vLLM/TGI), KV cache size, context length, and concurrency. That’s why our presets are designed to be safe by default.
| Mode | Min VRAM | Recommended VRAM | Notes |
|---|---|---|---|
| FP16 / BF16 | ~18–22 GB | 24 GB+ | Best quality, heavier memory |
| INT8 | ~10–14 GB | 16 GB | Great balance for most apps |
| 4-bit (AWQ/GPTQ) | ~6–10 GB | 12–16 GB | Lowest cost, slight quality tradeoff |
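The table above can be collapsed into a quick rule of thumb for choosing a precision preset. The thresholds mirror the table's approximate recommended values; they are guidance, not hard limits:

```python
# Pick a precision preset from available VRAM, mirroring the table above
# (thresholds are approximate guidance, not hard limits).
def pick_precision(vram_gib: float) -> str:
    if vram_gib >= 24:
        return "FP16/BF16"         # best quality, heaviest memory
    if vram_gib >= 16:
        return "INT8"              # good balance for most apps
    return "4-bit (AWQ/GPTQ)"      # lowest cost, slight quality tradeoff
```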
Expected performance (tokens/sec)
Performance depends on GPU class, quantization, batching, and concurrency. Typical single-stream ranges:
| GPU tier | Expected tokens/sec (approx.) | Best for |
|---|---|---|
| L4 / A10 class | ~35–90 tok/s | MVPs, prototypes, early SaaS |
| A100 40GB | ~90–180 tok/s | higher traffic, lower latency |
| A100 80GB / H100 | ~140–300+ tok/s | heavy throughput + batching |
Tip: For real UX, track TTFT (time-to-first-token) and p95 latency—not just tokens/sec.
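If you are logging per-request latencies yourself, p95 is cheap to compute. A minimal nearest-rank sketch (sample numbers are made up):

```python
# Minimal p95 helper over recorded request latencies (nearest-rank method).
def p95(latencies_ms):
    s = sorted(latencies_ms)
    # index of the 95th-percentile sample
    idx = max(0, int(0.95 * len(s) + 0.5) - 1)
    return s[idx]

samples = [210, 180, 950, 240, 200, 1900, 230, 220, 260, 205]
print(p95(samples))  # → 1900
```

A handful of slow requests dominates p95 even when the mean looks healthy, which is exactly why it matters for UX.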
Cold starts
Cold starts usually include image pulls (first time), weight download (first time per node), kernel compilation/warmup, and server initialization.
Typical ranges:
- Warm deployment: ~0.2–2s TTFT
- Cold deployment: ~20–120s (depends on GPU + cache + runtime)
How we reduce cold start
- prebuilt runtime images with pinned CUDA/PyTorch stacks
- weight caching on the node
- optional warm instances (“keep one hot”)
- readiness checks that only mark the service ready when it can actually serve
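The readiness-check idea above can be sketched as a simple polling probe against the deployment's health endpoint (the URL below is a placeholder):

```python
# Sketch of a readiness probe: poll a health endpoint until the server
# can actually serve. The URL is a placeholder for your deployment's.
import time
import urllib.request

def wait_until_ready(health_url, timeout_s=120.0, interval_s=2.0):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(health_url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # not up yet; retry
        time.sleep(interval_s)
    return False
```

Gating traffic on a probe like this is what prevents the "instance is up but the model is still loading" class of 5xx errors during cold starts.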
What it costs
Your cost is basically:
GPU $/hour + storage + bandwidth + (optional) warm instances
Example scenarios:
- Solo MVP: 1 GPU, no warm pool, low concurrency
- Early startup: 1 GPU + warm instance + dashboards/logs
- Production traffic: multiple replicas + autoscaling + tuned concurrency
If you share expected traffic (requests/day, tokens/request, target latency), we can recommend:
- a preset (precision, context, concurrency)
- expected tokens/sec and TTFT
- an estimated $ / 1M tokens
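To estimate $ / 1M tokens yourself, divide the GPU's hourly rate by the tokens it actually serves per hour. Every number below is an illustrative placeholder, not a price quote:

```python
# Back-of-the-envelope $/1M tokens from GPU rate and sustained throughput.
# All numbers here are illustrative placeholders, not a price quote.
def dollars_per_million_tokens(gpu_per_hour, tokens_per_sec, utilization=0.5):
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return gpu_per_hour / tokens_per_hour * 1_000_000

# e.g. a $1.20/h GPU sustaining 100 tok/s at 50% utilization:
print(round(dollars_per_million_tokens(1.20, 100), 2))  # → 6.67
```

Utilization is the lever: batching more concurrent requests raises effective tokens/hour and drives the per-token cost down.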
Deploy in 3 steps:
Step 1 — Pick your preset
Choose:
- Precision: FP16 / INT8 / 4-bit
- Context length: 8k / 16k / 32k (VRAM dependent)
- Concurrency: 1 / 4 / 8 / 16
- Warm pool: Off / 1 / 2
Step 2 — Click Deploy
We provision the GPU server, pull a tested runtime, and start the model server.
Step 3 — Copy your endpoint + API key
You get:
- Base URL
- API key
- Logs + metrics dashboard
- Health endpoint
OpenAI-compatible endpoint snippet
Use your deployed Qwen endpoint with OpenAI-style clients.
curl https://YOUR_ENDPOINT/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-2.5-16b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a concise product description for my app."}
],
"temperature": 0.7,
"stream": true
}'
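With "stream": true, the endpoint returns OpenAI-style server-sent events. A sketch of turning those lines into text, assuming the standard "data: {json}" / "data: [DONE]" framing:

```python
# Parse OpenAI-style streaming (SSE) lines into text deltas. Sketch that
# assumes the standard "data: {json}" / "data: [DONE]" framing.
import json

def extract_deltas(sse_lines):
    parts = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip keep-alives / blank lines
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"].get("content")
        if delta:
            parts.append(delta)
    return "".join(parts)
```

The first event usually carries only the role, so the .get("content") guard matters; flushing each delta to the UI as it arrives is what makes TTFT, not total latency, the perceived speed.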