...
--model meta-llama/Meta-Llama-3.1-70B-Instruct --gpu-memory-utilization 0.95 --tensor-parallel-size 4 --max-model-len 8000
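Put together as a full command, this might look like the following (a sketch assuming vLLM's OpenAI-compatible API server entrypoint, which takes these flags; adjust to however you launch vLLM):

```shell
# Hypothetical full invocation; flag values are the ones listed above.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 4 \
  --max-model-len 8000
```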
Other 70B Options
8x 4090, full context window: works at 90% VRAM utilization (20k window: 87%)
4x 4090, 20k window: works
2x 4090, 20k window: doesn't work
2x 4090, 8k window: doesn't work
4x 4090, full window: doesn't work
1x H100 NVL: 20k window:
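A rough way to sanity-check these results is to budget VRAM by hand: the fp16 weights of a 70B model are ~132 GiB on their own, and the KV cache grows linearly with the context window. A back-of-envelope sketch, assuming the published Llama 3.1 70B architecture (80 layers, 8 GQA KV heads, head dim 128):

```python
# Back-of-envelope VRAM budget for Llama 3.1 70B under tensor parallelism.
# Architecture numbers are from the public model config (80 layers, 8 KV heads
# via GQA, head dim 128); the parameter count is approximate.

LAYERS = 80
KV_HEADS = 8
HEAD_DIM = 128
PARAMS = 70.6e9          # ~70B parameters

def weight_gib(bytes_per_param: float) -> float:
    """Total weight memory in GiB for a given precision."""
    return PARAMS * bytes_per_param / 2**30

def kv_cache_gib(context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache for one sequence at the given context length, fp16 by default."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem  # K and V
    return context_tokens * per_token / 2**30

print(f"fp16 weights:    {weight_gib(2):.0f} GiB")   # ~132 GiB
print(f"KV cache @ 20k:  {kv_cache_gib(20_000):.1f} GiB")
print(f"KV cache @ 8k:   {kv_cache_gib(8_000):.1f} GiB")
```

At fp16 the weights alone exceed the 96 GiB of 4x 4090, which is consistent with 2x and 4x failing at full precision and suggests the working 4x setup used a lower-precision or quantized checkpoint; only 8x 4090 (192 GiB total) comfortably fits fp16 weights plus cache.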