Running Llama 3 70B on Kubernetes: AWQ Quantization and Tensor Parallelism (2026)

Llama 3 70B in full BF16 precision requires ~140 GB of VRAM. The largest single GPU you can provision on AWS without a multi-month waitlist — an H100 80GB on p5 — holds 80 GB. You can't fit a 70B model on one GPU at full precision. This is not a Kubernetes problem or a scheduling problem; it's a physics problem.

The two tools for dealing with this are quantization (reduce model precision to shrink the weight size) and tensor parallelism (split the model across multiple GPUs). Neither is magic: quantization costs some quality, tensor parallelism adds communication overhead. But together they make 70B inference practical on hardware you can actually get, at costs that aren't absurd.

This post covers the VRAM math, the quantization options that actually matter, the tensor parallelism constraints specific to Llama 3 70B's architecture, and the Kubernetes configuration that holds it together.

The VRAM math

Model weight size at different precisions:

Precision	Bytes per param	Llama 3 70B weight size
BF16 / FP16	2 bytes	~140 GB
INT8	1 byte	~70 GB
AWQ / GPTQ 4-bit	~0.5 bytes	~35–40 GB
FP8	1 byte	~70 GB (H100 native)
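The table is just parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming a round 70 billion parameters and ignoring the per-group metadata that 4-bit formats add (the dictionary keys are illustrative names, not vLLM options):

```python
# Bytes per parameter for each precision in the table above.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "fp8": 1.0, "4bit": 0.5}

def weight_size_gb(num_params: float, precision: str) -> float:
    """Approximate weight size in GB (10^9 bytes) for a given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for name in BYTES_PER_PARAM:
    print(f"{name}: ~{weight_size_gb(70e9, name):.0f} GB")
```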

The weight size is the floor. On top of that, vLLM needs VRAM for:

KV cache — the attention key/value state for all active requests in the batch
Activations — intermediate computation tensors
CUDA kernels and PyTorch overhead — a few GB regardless of model size

The KV cache budget is what --gpu-memory-utilization controls:

kv_cache_budget = (total_vram × gpu_memory_utilization) − model_weight_size

For 4×A10G (96 GB total) running AWQ 4-bit (40 GB weights), at --gpu-memory-utilization 0.90:

kv_cache_budget = (96 × 0.90) − 40 = 86.4 − 40 = ~46 GB
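The same arithmetic as a small function, so other utilization values are easy to try. This is only the simplified formula above: it treats the node's VRAM as one pool and ignores activations and CUDA overhead, so real budgets will be somewhat smaller.

```python
def kv_cache_budget_gb(total_vram_gb: float,
                       gpu_memory_utilization: float,
                       weight_gb: float) -> float:
    """VRAM left for KV cache, per the simplified formula in the text."""
    return total_vram_gb * gpu_memory_utilization - weight_gb

# 4x A10G (96 GB total) with ~40 GB of AWQ 4-bit weights
for util in (0.90, 0.85, 0.80):
    print(f"utilization {util:.2f}: {kv_cache_budget_gb(96, util, 40):.1f} GB")
```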

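46 GB of KV cache supports a meaningful number of concurrent requests at 8K context. Drop to 0.85 utilization and you're at ~42 GB, which is still workable, and even 0.80 leaves ~37 GB on this configuration. The failure to watch for is a budget too small to hold a single sequence at --max-model-len: vLLM then refuses to initialize. That happens when the weights fill nearly all of the reserved memory, so lower the utilization setting cautiously with larger models or higher precisions.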
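To turn a KV cache budget into a rough concurrency number, you need the per-token KV footprint. The sketch below assumes an FP16/BF16 KV cache and a head dimension of 128 (hidden size 8192 over 64 attention heads, taken from the published Llama 3 70B config rather than from this post); the layer and KV head counts are the ones used elsewhere in this article.

```python
NUM_LAYERS = 80        # transformer layers in Llama 3 70B
NUM_KV_HEADS = 8       # GQA key/value heads
HEAD_DIM = 128         # assumption: 8192 hidden size / 64 attention heads
KV_DTYPE_BYTES = 2     # assumption: FP16/BF16 KV cache

def kv_bytes_per_token() -> int:
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * KV_DTYPE_BYTES

def max_full_length_sequences(kv_budget_gb: float, context_len: int) -> int:
    """How many sequences fit if every one uses the full context length."""
    per_sequence = kv_bytes_per_token() * context_len
    return int(kv_budget_gb * 1e9 // per_sequence)

print(kv_bytes_per_token())                   # bytes of KV cache per token
print(max_full_length_sequences(46, 8192))    # worst case: all requests at 8K
```

Under these assumptions each token costs 320 KiB, so a 46 GB budget holds about 17 sequences that each fill all 8,192 tokens. Real traffic with shorter sequences fits more, because vLLM allocates KV cache in pages as sequences grow.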

GPU sizing: what instance buys you what

Instance	GPUs	Total VRAM	Practical use	Approx. on-demand price
g5.xlarge	1× A10G	24 GB	7B BF16, 13B INT8 (70B does not fit, even at 4-bit)	~$1/hr
g5.12xlarge	4× A10G	96 GB	70B 4-bit (TP=4), 34B BF16	~$5–6/hr
g5.48xlarge	8× A10G	192 GB	70B BF16 (TP=8), 70B INT8 (TP=4)	~$16–17/hr
p4d.24xlarge	8× A100 40GB	320 GB	70B BF16 (TP=8), high throughput	~$32/hr
p5.48xlarge	8× H100 80GB	640 GB	70B FP8 (TP=2), 405B 4-bit (TP=8)	~$98/hr
p6-b300.48xlarge	8× B300	~2,100 GB	405B BF16, very large models	TBD (2026)

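Note: there is no g5 instance with exactly 2 A10G GPUs. The g5.12xlarge with 4 GPUs is the smallest multi-GPU option in the g5 family. The p3 and p4 families don't fill the gap either (p3 sizes carry 1, 4, or 8 V100s; p4d carries 8 A100s), so a TP=2 deployment on these families means using two GPUs of a larger instance.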

For most 70B serving use cases, g5.12xlarge is the right starting point. It costs roughly a third as much as the g5.48xlarge and delivers sufficient throughput for moderate traffic behind a request queue. The g5.48xlarge makes sense when you need the extra headroom (higher-precision weights or a larger KV cache), but start with the smaller instance and measure first.

The p6-b300 (Blackwell) instances, announced by AWS in early 2026, use B300 GPUs with FP4 support. At 2,100 GB aggregate VRAM, they're aimed at frontier model training and very large inference. For 70B serving they're significant overkill unless you're running dozens of concurrent instances of the model.

Quantization formats: what to actually use

AWQ (Activation-Aware Weight Quantization)

AWQ (arxiv 2306.00978) identifies which weight channels are most sensitive to quantization by looking at activation magnitudes, then protects those channels by scaling them up before quantization (the inverse scale is folded into the preceding operation) rather than storing them at higher precision. The result is 4-bit quantization that degrades quality less than naive round-to-nearest.

To use AWQ with vLLM:

bash

# ./llama3-70b-awq is a pre-quantized AWQ checkpoint (see below for producing one)
vllm serve ./llama3-70b-awq \
  --quantization awq \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --dtype half

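You need a pre-quantized AWQ checkpoint. vLLM's awq path doesn't quantize on-the-fly from a BF16 model. Use a community AWQ checkpoint from the Hugging Face Hub (confirm the repo exists for Llama 3 and that you trust its provenance) or run AutoAWQ yourself: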

bash

pip install autoawq
python -c "
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Load the full-precision model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-70B-Instruct')
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-70B-Instruct')

# 4-bit weights, one scale and zero point per group of 128 weights
quant_config = {'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'}
model.quantize(tokenizer, quant_config=quant_config)

# Save weights and tokenizer together so the directory can be served directly
model.save_quantized('./llama3-70b-awq')
tokenizer.save_pretrained('./llama3-70b-awq')
"
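The quant_config above also explains why 4-bit checkpoints land at ~35–40 GB rather than exactly 35 GB: every group of weights carries its own scale and zero point. A rough sketch of that overhead, assuming a 16-bit scale and a 4-bit zero point per group (typical for this kind of format, but an assumption here, as is the round 70 billion parameter count):

```python
def effective_bits_per_weight(w_bit: int = 4, group_size: int = 128,
                              scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Stored bits per weight: the quantized value plus its share of group metadata."""
    return w_bit + (scale_bits + zero_bits) / group_size

def quantized_size_gb(num_params: float, bits_per_weight: float) -> float:
    return num_params * bits_per_weight / 8 / 1e9

bits = effective_bits_per_weight()
print(f"{bits:.3f} bits/weight, ~{quantized_size_gb(70e9, bits):.1f} GB")
```

Layers that are typically left unquantized, such as the embeddings and output head, push the real checkpoint size a little higher still.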

Newer vLLM versions also support --quantization awq_marlin, which uses the Marlin kernel for optimized AWQ inference on Ampere and later. Check your vLLM version — if awq_marlin is listed in the quantization options, it's worth benchmarking against awq.

GPTQ

GPTQ (arxiv 2210.17323) uses second-order Hessian information to minimize quantization error layer by layer. Quality is comparable to AWQ at the same bit width, with more quantization compute required upfront.

bash

vllm serve ./llama3-70b-gptq \
  --quantization gptq \
  --tensor-parallel-size 4

GPTQ checkpoints are widely available on HuggingFace. Both AWQ and GPTQ at 4-bit are reasonable choices for 70B serving — the quality difference in practice is small and workload-dependent. Run your own eval on your target task before committing.

INT8

INT8 is a simpler 8-bit linear quantization. It halves the model size relative to BF16 (~140 GB → ~70 GB for 70B) without the complexity of 4-bit methods. Quality degradation is usually minimal. Useful when you have enough VRAM for INT8 but want to avoid the quality risk of 4-bit.

bash

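# Assumes a pre-quantized INT8 (W8A8) checkpoint at a hypothetical local path;
# vLLM reads the quantization scheme from the checkpoint's config.
# Note: --quantization bitsandbytes on a BF16 checkpoint quantizes in-flight to 4-bit, not INT8.
vllm serve ./llama3-70b-int8 \
  --tensor-parallel-size 4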

On p4d.24xlarge (8×A100 40GB, 320 GB total), INT8 70B fits comfortably at TP=4 with room for a large KV cache.

FP8

FP8 has native hardware support on H100 (Hopper) and newer. On H100, FP8 gives roughly the same model size as INT8 with better throughput because the hardware tensor cores run FP8 natively.

Do not use FP8 on A10G or A100 (Ampere). Ampere lacks native FP8 tensor cores, so vLLM will either error or fall back to a weight-only path that saves memory but gains no FP8 compute speedup. Among the instances covered here, FP8 is only meaningful on p5 (H100) or p6 (Blackwell).

bash

# H100 only
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 2

Tensor parallelism: constraints and configuration

Tensor parallelism splits individual weight matrices across GPUs. Each GPU holds a shard of every attention layer and every FFN layer, and they communicate during the forward pass via AllReduce operations over NCCL.

The hard constraint for Llama 3 70B: tensor-parallel-size must evenly divide the number of KV heads. Llama 3 70B uses Grouped Query Attention (GQA) with 8 KV heads. Valid TP sizes: 1, 2, 4, 8.

Setting --tensor-parallel-size 3 or --tensor-parallel-size 6 will fail at model load time with a heads divisibility error.
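The divisibility rule is easy to check before you launch anything. A minimal sketch, assuming 64 attention heads (from the published Llama 3 70B config) alongside the 8 KV heads mentioned above; this is a simplified stand-in for the check, not vLLM's own code:

```python
def tp_size_is_valid(tp: int, num_attention_heads: int = 64,
                     num_kv_heads: int = 8) -> bool:
    """True if every GPU gets a whole number of attention and KV heads."""
    if tp < 1:
        return False
    return num_attention_heads % tp == 0 and num_kv_heads % tp == 0

valid = [tp for tp in range(1, 9) if tp_size_is_valid(tp)]
print(valid)
```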

For g5.12xlarge (4 GPUs), use --tensor-parallel-size 4. For p5.48xlarge (8 GPUs), use --tensor-parallel-size 8, or run two independent replicas at TP=4 each.

Kubernetes Deployment for 4-GPU inference

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama3-70b-vllm
  namespace: inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama3-70b-vllm
  template:
    metadata:
      labels:
        app: llama3-70b-vllm
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      nodeSelector:
        node.kubernetes.io/instance-type: g5.12xlarge
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest  # pin a specific tag in production
        args:
        - "--model"
        - "$(MODEL_ID)"
        - "--quantization"
        - "awq"
        - "--tensor-parallel-size"
        - "4"
        - "--gpu-memory-utilization"
        - "0.90"
        - "--max-model-len"
        - "8192"
        - "--dtype"
        - "half"
        - "--host"
        - "0.0.0.0"
        - "--port"
        - "8000"
        env:
        - name: MODEL_ID
          # Placeholder: set to an AWQ checkpoint repo you have verified, or a path on the model-cache PVC
          value: "your-org/Llama-3-70B-Instruct-AWQ"
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        - name: NCCL_SOCKET_IFNAME
          value: "eth0"
        - name: NCCL_IB_DISABLE
          value: "1"
        resources:
          limits:
            nvidia.com/gpu: "4"
            memory: "120Gi"
            cpu: "16"
          requests:
            nvidia.com/gpu: "4"
            memory: "80Gi"
            cpu: "8"
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: shm
          mountPath: /dev/shm
        - name: model-cache
          mountPath: /root/.cache/huggingface
        startupProbe:              # holds off the liveness probe until the server first answers
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 10
          failureThreshold: 90     # up to 15 minutes for a cold start
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
          failureThreshold: 30
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 180
          periodSeconds: 30
          failureThreshold: 3
      volumes:
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 16Gi
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc

The shm volume is not optional. NCCL uses /dev/shm for inter-process shared memory during GPU communication. The default Docker /dev/shm size is 64 MB — far too small for multi-GPU NCCL operations. Setting it to 16 Gi as a memory-backed emptyDir gives NCCL room to work without hitting NCCL error: Out of memory mid-inference.

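The readinessProbe's 120-second initial delay exists because model loading takes time: downloading the checkpoint (if not cached), loading weights into VRAM, and initializing NCCL across GPUs. A failing readiness probe only keeps the pod out of Service endpoints; it is the livenessProbe that restarts the container, so its initial delay and failure threshold together must outlast the slowest cold start. If they don't, Kubernetes kills the pod before it finishes loading. A startupProbe is the cleaner way to express that budget.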
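A quick way to sanity-check probe settings is to compute how long a container that never becomes healthy is tolerated. The sketch below uses the approximation initialDelaySeconds + periodSeconds × failureThreshold, which is a rough upper bound rather than the kubelet's exact timing, applied to the readiness and liveness values from the manifest above.

```python
def probe_window_seconds(initial_delay: int, period: int,
                         failure_threshold: int) -> int:
    """Approximate time before a probe gives up on a never-healthy container."""
    return initial_delay + period * failure_threshold

readiness = probe_window_seconds(initial_delay=120, period=10, failure_threshold=30)
liveness = probe_window_seconds(initial_delay=180, period=30, failure_threshold=3)
print(f"readiness window: {readiness}s, liveness window: {liveness}s")
```

With these values the liveness window (270 s) is shorter than a 5-minute model load, which is exactly the case a startupProbe or a longer liveness delay is meant to cover.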

Karpenter NodePool for GPU nodes

If you're using Karpenter for node provisioning, a dedicated NodePool for GPU inference:

yaml

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    metadata:
      labels:
        node-role: gpu-inference
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-inference
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["g5.12xlarge", "g5.48xlarge"]
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
  limits:
    nvidia.com/gpu: 32
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-inference
spec:
  amiFamily: AL2
  amiSelectorTerms:        # required in the v1 API; pin a version instead of latest in production
  - alias: al2@latest
  role: KarpenterNodeRole
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster
  instanceStorePolicy: RAID0

consolidationPolicy: WhenEmpty (not WhenEmptyOrUnderutilized) is intentional. GPU nodes should only be reclaimed when they have zero pods — a partially-loaded GPU is still worth keeping because reloading a 40 GB model checkpoint takes several minutes.


NCCL tuning for multi-GPU inference

Beyond the /dev/shm fix, a few NCCL environment variables reduce startup noise on EC2:

yaml

env:
- name: NCCL_SOCKET_IFNAME
  value: "eth0"          # Bind to primary interface; avoids NCCL probing all interfaces
- name: NCCL_IB_DISABLE
  value: "1"             # g5 uses PCIe, not InfiniBand; disable IB to skip failed probes
- name: NCCL_P2P_DISABLE
  value: "0"             # Keep P2P enabled (the default); confirm it works on your nodes with `nvidia-smi topo -p2p r`
- name: NCCL_DEBUG
  value: "WARN"          # INFO is verbose; use WARN in production, INFO when debugging

On g5.12xlarge, the four A10G GPUs communicate via PCIe — there is no NVLink. PCIe bandwidth (~64 GB/s bidirectional) is lower than NVLink but sufficient for TP=4 with a well-batched workload. If you're seeing high AllReduce latency, check that NCCL_P2P_DISABLE is not set to 1.
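For a feel of how much data those AllReduce operations move, here is a rough sketch. It assumes the usual Megatron-style layout (two AllReduces per transformer layer), a ring AllReduce, FP16 activations, and a hidden size of 8192 from the published Llama 3 70B config. None of these are stated in this post, and vLLM's actual kernels may differ.

```python
def allreduce_bytes_per_gpu(tokens: int, hidden_size: int = 8192,
                            num_layers: int = 80, tp: int = 4,
                            dtype_bytes: int = 2) -> float:
    """Approximate bytes each GPU sends during one forward pass over `tokens` tokens."""
    payload = tokens * hidden_size * dtype_bytes      # one activation tensor
    per_allreduce = 2 * (tp - 1) / tp * payload       # ring AllReduce send volume per GPU
    return 2 * num_layers * per_allreduce             # two AllReduces per layer

prefill = allreduce_bytes_per_gpu(tokens=512)
print(f"512-token prefill: ~{prefill / 1e9:.2f} GB sent per GPU")
```

For a 512-token prefill this comes to about 2 GB per GPU, which is tens of milliseconds at PCIe Gen4 speeds if nothing overlaps with compute. Decode steps move far less data per step and tend to be bound by latency rather than bandwidth.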

Multi-node with pipeline parallelism

When a single node doesn't have enough VRAM and you need to span multiple nodes, use pipeline parallelism. PP splits the model's layers across nodes — Node 1 handles the first N layers, Node 2 handles the next N layers, with activations passed between nodes as each forward pass moves through the pipeline.

Llama 3 70B has 80 transformer layers. With --pipeline-parallel-size 2, each node handles 40 layers. Combined with --tensor-parallel-size 4 on each node, this gives a 2-node, 8-GPU deployment:

bash

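# On the head node (also runs rank 0).
# Multi-node serving assumes a Ray cluster already spans both nodes.
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --distributed-executor-backend ray \
  --dtype bfloat16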
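The layer split described above can be sketched as a simple even partition. This is an illustration of the idea only; vLLM's own partitioning logic may assign layers differently, and the function name is hypothetical.

```python
from typing import List

def split_layers(num_layers: int, pp_size: int) -> List[range]:
    """Assign contiguous layer ranges to pipeline stages as evenly as possible."""
    base, extra = divmod(num_layers, pp_size)
    stages, start = [], 0
    for stage in range(pp_size):
        count = base + (1 if stage < extra else 0)  # earlier stages absorb any remainder
        stages.append(range(start, start + count))
        start += count
    return stages

for i, layers in enumerate(split_layers(80, 2)):
    print(f"stage {i}: layers {layers.start}-{layers.stop - 1}")

tp_size, pp_size = 4, 2
print(f"total GPUs: {tp_size * pp_size}")
```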

Pipeline parallelism adds latency per request. Each stage must wait for activations from the stage before it, so GPUs sit idle whenever the pipeline isn't full (the pipeline bubble), and every token crosses the inter-node network. It's suited for throughput-optimized batch workloads, not low-latency interactive use. If TTFT matters, prefer fitting the model on a single node with quantization rather than going multi-node.

Benchmarking: TTFT, TPOT, throughput

Three metrics matter for LLM serving performance:

TTFT (Time to First Token): latency from request arrival to first generated token. Drives perceived responsiveness for chat interfaces.
TPOT (Time Per Output Token): average time between tokens once generation starts. Drives streaming smoothness.
Throughput: total tokens generated per second across all concurrent requests. Drives cost per request.
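If you log a timestamp per streamed token, the three metrics fall out directly. A minimal sketch, with hypothetical function names and timestamps in seconds:

```python
from typing import List, Tuple

def ttft_and_tpot(request_sent: float, token_times: List[float]) -> Tuple[float, float]:
    """TTFT and mean TPOT for one request, from its token arrival timestamps."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_sent
    gaps = len(token_times) - 1
    # TPOT averages the gaps between tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / gaps if gaps else 0.0
    return ttft, tpot

def throughput_tokens_per_s(total_tokens: int, wall_seconds: float) -> float:
    """Aggregate generated tokens per second across all requests."""
    return total_tokens / wall_seconds

print(ttft_and_tpot(10.0, [11.0, 11.5, 12.0]))
```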

Run vLLM's built-in benchmark against a live server:

bash

# Clone the vLLM repo to get the benchmark script.
# Recent releases also expose it as `vllm bench serve`; use that if the script is missing from your checkout.
git clone --depth 1 https://github.com/vllm-project/vllm.git /tmp/vllm

# --model must match the model name the server was started with
python3 /tmp/vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --host localhost \
  --port 8000 \
  --dataset-name random \
  --num-prompts 200 \
  --request-rate 10 \
  --random-input-len 512 \
  --random-output-len 256

The benchmark reports P50/P99 TTFT and TPOT, plus overall throughput. Run at multiple --request-rate values to find the saturation point — the rate at which TTFT starts climbing sharply. That's your serving capacity ceiling at current configuration.
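One way to make "starts climbing sharply" concrete is to pick the first request rate whose P99 TTFT exceeds some multiple of the lightest-load value. The sketch below is a hypothetical helper, not part of vLLM, and the 2× threshold is an arbitrary assumption; the sample numbers are made up purely to exercise the function.

```python
from typing import List, Optional, Tuple

def saturation_rate(sweep: List[Tuple[float, float]],
                    factor: float = 2.0) -> Optional[float]:
    """First request rate whose P99 TTFT exceeds `factor` x the lowest-rate TTFT.

    sweep holds (request_rate, p99_ttft_seconds) pairs from separate benchmark runs.
    Returns None if TTFT never crosses the threshold.
    """
    ordered = sorted(sweep)
    baseline = ordered[0][1]
    for rate, ttft in ordered:
        if ttft > factor * baseline:
            return rate
    return None

# Made-up illustrative numbers, not measurements
example = [(1, 1.0), (2, 1.1), (4, 1.3), (8, 4.0)]
print(saturation_rate(example))
```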

A rough guide for g5.12xlarge with AWQ 4-bit Llama 3 70B:

TTFT under light load (1–5 req/s): 1–3 seconds (prompt prefill is the bottleneck; model loading happens once at startup and does not affect per-request latency)
Throughput saturation: ~5–10 req/s at 512 input / 256 output
Beyond saturation: TTFT climbs steeply; add replicas rather than trying to push a single instance harder

Production readiness checklist

Before serving 70B in production:

Model downloaded and cached on a PVC — don't download from HuggingFace Hub on every pod start
readinessProbe with initialDelaySeconds ≥ 120 — model load takes 2–5 minutes from cache
/dev/shm emptyDir with sizeLimit: 16Gi — NCCL will OOM without it
--tensor-parallel-size divides num_kv_heads — for 70B: valid values are 1, 2, 4, 8
GPU resource limits = requests: GPUs are an extended resource that can't be requested fractionally or overcommitted, so if both are set they must be equal or the API server rejects the pod
NCCL_IB_DISABLE=1 on g5 — avoids slow startup probing for IB devices that don't exist
Karpenter NodePool consolidationPolicy: WhenEmpty — prevents premature GPU node reclamation
Load test with benchmark_serving.py before production traffic — find your saturation point first

For the infrastructure layer below vLLM — GPU Operator, MIG, device plugin: NVIDIA GPU Operator: Running GPU Workloads on Kubernetes

For KEDA-based autoscaling of inference deployments based on queue depth: KEDA: Event-Driven Autoscaling for Kubernetes

For Lambda Managed Instances as a serverless alternative for CPU-based inference (embedding models, small GGUF models): AWS Lambda Managed Instances: What Actually Changed and When to Use It
