Llama 3.1 GPU Comparison

8B

Benchmark question: 7,000-token prompt, 214-token completion.

| GPU | vRAM (GB) | Answer speed (seconds) | 4 parallel questions (seconds) | Purchase price | AWS* ($/hr) | Runpod* ($/hr) | Limits |
|---|---|---|---|---|---|---|---|
| Tesla L4 | 24 | 16 | 37 | | G6 $0.80 | $0.43 | Up to 20k tokens |
| RTX 4090 | 24 | 5 | 8 | $2,000 | NA | $0.69 | Up to 20k tokens |
| Tesla L40S | 48 | 6.5 | 9 | $9,800 | G6e $1.861 | $1.03 | Full 128k token window |
| H100 SXM | 80 | 2.4 | 3.4 | $31,000 | NA | $2.99 | Full 128k token window |

* Discounts available for different usage patterns
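For reference, here is a minimal sketch of how the two timing columns can be measured against a vLLM OpenAI-compatible server. The URL, model name, and prompt are placeholder assumptions, not the original test harness:

```python
# Minimal timing sketch, assuming a vLLM OpenAI-compatible server is
# already running on localhost:8000. Prompt text is a placeholder.
import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/completions"  # assumed vLLM endpoint
PROMPT = "<~7,000-token benchmark prompt goes here>"  # placeholder

def ask():
    """Send one completion request capped at 214 output tokens."""
    payload = {
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "prompt": PROMPT,
        "max_tokens": 214,
        "temperature": 0,
    }
    r = requests.post(URL, json=payload, timeout=600)
    r.raise_for_status()
    return r.json()

# "Answer speed": one question at a time.
start = time.time()
ask()
print(f"single question: {time.time() - start:.1f}s")

# "4 parallel questions": four requests sent concurrently.
start = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(lambda _: ask(), range(4)))
print(f"4 parallel questions: {time.time() - start:.1f}s")
```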

 

70B

Option A

Model: neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8

GPU: 4 x RTX 4090 (96 GB total VRAM, 84 GB used)

72 GB of disk space used by the model

Tested on Runpod

python -m vllm.entrypoints.openai.api_server --model neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 --gpu-memory-utilization 0.95 --tensor-parallel-size 4 --max-model-len 8000
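The same configuration can be expressed through vLLM's offline Python API. This is a sketch mirroring the server flags above, with a placeholder prompt:

```python
# Sketch: load the FP8 70B model across 4 GPUs with vLLM's offline API,
# mirroring the server flags above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8",
    tensor_parallel_size=4,       # shard weights across the 4 x RTX 4090
    gpu_memory_utilization=0.95,  # leave ~5% VRAM headroom per GPU
    max_model_len=8000,           # 8k context keeps the KV cache in VRAM
)

params = SamplingParams(max_tokens=214, temperature=0)
outputs = llm.generate(["<benchmark prompt goes here>"], params)
print(outputs[0].outputs[0].text)
```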

Other 70B Options

- 8x 4090: full context window works at 90% VRAM; 20k window at 87% (64 GB system RAM)
- 4x 4090, 20k window: works (32 GB system RAM)
- 2x 4090, 20k window: doesn't work
- 2x 4090, 8k window: doesn't work
- 4x 4090, full window: doesn't work
- 1x H100 NVL, 20k window:
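These pass/fail results are roughly what a back-of-the-envelope VRAM estimate predicts. The sketch below assumes a ~70 GB FP8 weight footprint (in line with the 72 GB on-disk size noted above) and an FP16 KV cache, and ignores activation memory and CUDA overhead, so it approximates the minimum requirement rather than vLLM's actual accounting:

```python
# Back-of-the-envelope sketch of why the configurations above pass or
# fail. Architecture numbers are from Llama 3.1 70B's published config;
# the ~70 GB FP8 weight footprint is an assumption.
WEIGHTS_GB = 70   # assumed FP8 checkpoint size (~matches 72 GB on disk)
LAYERS = 80       # Llama 3.1 70B decoder layers
KV_HEADS = 8      # grouped-query attention KV heads
HEAD_DIM = 128
KV_BYTES = 2      # FP16 KV cache entries

def kv_cache_gb(context_tokens: int) -> float:
    """KV cache size: K and V, per KV head, per layer, per token."""
    per_token = 2 * KV_HEADS * HEAD_DIM * KV_BYTES * LAYERS
    return context_tokens * per_token / 1e9

for gpus, context in [(2, 8_000), (2, 20_000), (4, 20_000),
                      (4, 128_000), (8, 128_000)]:
    vram = gpus * 24        # each RTX 4090 has 24 GB
    usable = vram * 0.95    # --gpu-memory-utilization 0.95
    need = WEIGHTS_GB + kv_cache_gb(context)
    verdict = "fits" if need <= usable else "does not fit"
    print(f"{gpus}x4090, {context // 1000}k window: "
          f"need ~{need:.0f} GB of {usable:.0f} GB usable -> {verdict}")
```

Under these assumptions, the 2x 4090 configurations fail because the FP8 weights alone exceed the 48 GB of combined VRAM before any KV cache is allocated, while the full 128k window needs roughly 40 GB of KV cache on top of the weights, which only the 8x 4090 setup can hold.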
