8B

Benchmark question: 7,000-token prompt, 214-token completion.
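
A rough sketch of how these timings could be reproduced with vLLM's offline Python API. The model name, prompt, and the 20k context cap are assumptions, not the original benchmark script:

```python
# Hedged sketch: time one ~7,000-token question (214-token completion),
# then the same question submitted 4 at a time as a batch.
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # assumed 8B model
    max_model_len=20000,  # the 20k cap used for the 24 GB cards below
)
params = SamplingParams(max_tokens=214, temperature=0)
prompt = "..."  # the ~7,000-token benchmark prompt goes here

start = time.time()
llm.generate([prompt], params)
print(f"answer speed: {time.time() - start:.1f}s")

start = time.time()
llm.generate([prompt] * 4, params)  # 4 parallel questions
print(f"4 parallel questions: {time.time() - start:.1f}s")
```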

| GPU | vRAM (GB) | Answer speed (s) | 4 parallel questions (s) | Purchase price | AWS* ($/hr) | Runpod* ($/hr) | Limits |
|---|---|---|---|---|---|---|---|
| Tesla L4 | 24 | 16 | 37 | | G6 $0.80 | $0.43 | Up to 20k tokens |
| RTX 4090 | 24 | 5 | 8 | $2,000 | NA | $0.69 | Up to 20k tokens |
| Tesla L40S | 48 | 6.5 | 9 | $9,800 | G6e $1.861 | $1.03 | Full 128k token window |
| H100 SXM | 80 | 2.4 | 3.4 | $31,000 | NA | $2.99 | Full 128k token window |

* Discounts available for different usage patterns
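
One way to read the hourly columns is cost per answer: the hourly rental rate times the answer time. A quick check using numbers from the table above:

```python
# Cost per answer = hourly rental price x answer time converted to hours.
def cost_per_answer(hourly_usd: float, seconds: float) -> float:
    return hourly_usd * seconds / 3600

# Runpod RTX 4090: $0.69/hr at 5 s per answer
print(f"RTX 4090: ${cost_per_answer(0.69, 5):.5f}/answer")    # ~$0.00096
# Runpod H100 SXM: $2.99/hr at 2.4 s per answer
print(f"H100 SXM: ${cost_per_answer(2.99, 2.4):.5f}/answer")  # ~$0.00199
```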

70B

Option A

Model: neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8

GPU: 4x RTX 4090 (96 GB total VRAM, 84 GB used)

Disk: 72 GB used by the model weights

Tested on Runpod

vLLM launch flags: --model neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 --gpu-memory-utilization 0.95 --tensor-parallel-size 4 --max-model-len 8000
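
The same configuration through vLLM's offline Python API would look roughly like this (a sketch using vLLM's standard constructor arguments):

```python
from vllm import LLM

# 70B FP8 weights sharded 4 ways across the RTX 4090s, context capped at 8k.
llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8",
    gpu_memory_utilization=0.95,
    tensor_parallel_size=4,
    max_model_len=8000,
)
```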

Other 70B Options

These results follow largely from weight size versus total VRAM; see the rough estimate after this list.

- 8x 4090: full context window runs at 90% VRAM, a 20k window at 87% (64 GB RAM)
- 4x 4090, 20k window: works (32 GB RAM)
- 2x 4090, 20k window: doesn't work
- 2x 4090, 8k window: doesn't work
- 4x 4090, full window: doesn't work
- 1x H100 NVL, 20k window:
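
A back-of-envelope VRAM estimate makes the pattern above plausible. Assumed numbers: ~70 GB of FP8 weights (consistent with the ~72 GB on disk); real overhead from the KV cache, activations, and CUDA context varies:

```python
# Rough feasibility check: total VRAM minus approximate FP8 weight size.
WEIGHTS_GB = 70  # ~70B parameters at 1 byte/param (FP8); an assumption

def headroom(n_gpus: int, gb_per_gpu: int = 24) -> None:
    total = n_gpus * gb_per_gpu
    print(f"{n_gpus}x {gb_per_gpu} GB = {total} GB, "
          f"{total - WEIGHTS_GB} GB left for KV cache and activations")

headroom(2)  # 48 GB < 70 GB of weights alone -> fails at any window size
headroom(4)  # 96 GB -> ~26 GB headroom: 20k window fits, full 128k does not
headroom(8)  # 192 GB -> enough headroom for the full 128k window
```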