8B
Benchmark Question:
Prompt: 7000 tokens
Completion: 214 tokens
GPU | vRAM (GB) | Answer Speed (seconds) | 4 parallel questions (seconds) | Price | AWS* (per hour) | Runpod* (per hour) | Limits |
---|---|---|---|---|---|---|---|
Tesla L4 | 24 | 16 | 37 | | G6 $0.80 | $0.43 | Up to 20k tokens |
RTX 4090 | 24 | 5 | 8 | $2,000 | NA | $0.69 | Up to 20k tokens |
Tesla L40s | 48 | 6.5 | 9 | $9,800 | G6e $1.861 | $1.03 | Full 128k Token window |
H100 SXM | 80 | 2.4 | 3.4 | $31,000 | NA | $2.99 | Full 128k Token window |
* Discounts available for different usage patterns
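To compare the rows on a common basis, the timings can be turned into rough throughput and cost-per-answer figures. The sketch below copies the numbers from the table and assumes the Runpod prices are USD per GPU-hour; the function names are illustrative only. Throughput here counts prompt plus completion tokens end to end, so it is not a pure generation speed.

```python
# Figures copied from the 8B table; Runpod prices are ASSUMED to be USD per GPU-hour.
PROMPT_TOKENS = 7000
COMPLETION_TOKENS = 214

# gpu -> (single answer seconds, 4 parallel answers seconds, Runpod USD per hour)
RESULTS = {
    "Tesla L4": (16.0, 37.0, 0.43),
    "RTX 4090": (5.0, 8.0, 0.69),
    "Tesla L40s": (6.5, 9.0, 1.03),
    "H100 SXM": (2.4, 3.4, 2.99),
}


def tokens_per_second(seconds, requests=1):
    """End-to-end tokens (prompt + completion) handled per second."""
    return requests * (PROMPT_TOKENS + COMPLETION_TOKENS) / seconds


def cost_per_answer(seconds, usd_per_hour, requests=1):
    """Rental cost in USD attributed to one answer."""
    return usd_per_hour / 3600 * seconds / requests


for gpu, (single_s, batch_s, price) in RESULTS.items():
    print(
        f"{gpu}: {tokens_per_second(single_s):.0f} tok/s single, "
        f"{tokens_per_second(batch_s, 4):.0f} tok/s with 4 parallel, "
        f"${cost_per_answer(single_s, price):.5f} per answer single, "
        f"${cost_per_answer(batch_s, price, 4):.5f} per answer with 4 parallel"
    )
```

Under these assumptions the RTX 4090 has the lowest rental cost per answer, both for a single question and for four in parallel, even though the H100 SXM is the fastest.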
70B
Option A
Model: neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8
GPU: 4x RTX 4090, 96 GB VRAM total (84 GB used)
Model uses 72 GB of disk space
Tested on Runpod
--model meta-llama/Meta-Llama-3.1-70B-Instruct --gpu-memory-utilization 0.95 --tensor-parallel-size 4 --max-model-len 8000
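When trying several GPU counts and context lengths, it can help to build the flag string programmatically so every flag stays separated by whitespace. This is a minimal sketch: `vllm_flags` is a hypothetical helper, and it only reproduces the four flags shown in the line above, using the model ID as written there.

```python
import shlex


def vllm_flags(model, tensor_parallel_size, max_model_len, gpu_memory_utilization=0.95):
    """Return the flag string for one configuration (hypothetical helper)."""
    args = [
        "--model", model,
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--max-model-len", str(max_model_len),
    ]
    # shlex.join inserts spaces and quotes any value that needs it.
    return shlex.join(args)


# The 4x RTX 4090 configuration from Option A.
print(vllm_flags("meta-llama/Meta-Llama-3.1-70B-Instruct", 4, 8000))
```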
Other 70B Options
8x 4090 full window: works at 90% VRAM; 20k window at 87%.
4x 4090 20k window: works
2x 4090 20k window: doesn’t work
2x 4090 8k window: doesn’t work
4x 4090 full window: doesn’t work
1x H100 NVL: 20k window:
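The 4090 results above are roughly what a back-of-the-envelope memory estimate predicts: the weights plus the KV cache for one full-length sequence have to fit in the usable VRAM. The sketch below is only an approximation. It assumes the published Llama 3.1 70B shape (80 layers, 8 KV heads, head dimension 128), a 16-bit KV cache, a 131072-token full window, 24 GB per RTX 4090, and the 72 GB weight size and 0.95 utilization noted earlier; it ignores activations and other runtime overhead.

```python
# ASSUMED model shape for Llama 3.1 70B (not stated in these notes).
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128
KV_BYTES_PER_VALUE = 2  # assumes a 16-bit KV cache

WEIGHTS_GB = 72                 # disk size of the FP8 model, from the notes
GPU_VRAM_GB = 24                # per RTX 4090
GPU_MEMORY_UTILIZATION = 0.95   # same value as the flags above
FULL_WINDOW = 131072            # assumed token count for the "full 128k window"


def kv_cache_gb(tokens):
    """KV cache size in GB for one sequence of the given length."""
    # Factor 2 covers keys and values, stored per layer and per KV head.
    bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * KV_BYTES_PER_VALUE
    return tokens * bytes_per_token / 1e9


def fits(num_gpus, tokens):
    """Rough check: weights + one full-length KV cache within usable VRAM."""
    budget_gb = num_gpus * GPU_VRAM_GB * GPU_MEMORY_UTILIZATION
    return WEIGHTS_GB + kv_cache_gb(tokens) <= budget_gb


for gpus, tokens in [(2, 8000), (2, 20000), (4, 20000), (4, FULL_WINDOW), (8, FULL_WINDOW)]:
    verdict = "fits" if fits(gpus, tokens) else "does not fit"
    print(f"{gpus}x 4090, {tokens} tokens: ~{WEIGHTS_GB + kv_cache_gb(tokens):.1f} GB needed, {verdict}")
```

With these assumptions, 2x 4090 fails because the weights alone exceed 48 GB, 4x 4090 has room for a 20k window but not the full window, and 8x 4090 has room for the full window.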