Number of Required Servers
POC Requirements
Purpose | Option 1 (GPU embedding) | Option 2 (CPU embedding) |
---|---|---|
Gateway (Linux) | t3a.xlarge | t3a.2xlarge |
GPU Embedder (Linux) | g4dn.xlarge | None (embedding uses the Gateway CPU) |
LLM (Linux) | g6.xlarge | g6.xlarge |
Dashboard | t3a.large | t3a.large |
1x Linux Instance (Gateway)
t3a.2xlarge - 100 GB SSD disk
1x Linux GPU Instance (LLM)
...
The customer may provide an SSL certificate to secure access to the dashboard website.
Embedding
Embedding converts data ingested into the system into a format that can be stored and retrieved efficiently to answer questions.
Embedding can be performed on a CPU or a GPU. It is a one-time process for each piece of data, carried out during ingestion. If the bulk of the customer’s data is ingested during product onboarding, it may be optimal to use a GPU for the initial embedding and then let a CPU handle later embedding.
If significant amounts of data need to be embedded on a continual basis, a dedicated GPU for embedding may be desirable in addition to the GPU required for the LLM.
Here are some benchmarks for comparison.
Document | GPU time (Nvidia T4 16 GB, g4dn.xlarge, one task at a time) | CPU time (AMD EPYC 7571 2.5 GHz, t3a.2xlarge, two tasks at a time) |
---|---|---|
JSON call transcript, text document (103,000 words) | 19 seconds | 587 seconds |
Microsoft Word document with graphics (13,000 words, 85,000 characters) | 18 seconds | 805 seconds |
Large text-only Microsoft Word document (630,000 words, 3,273,000 characters) | 20 minutes 26 seconds (1,226 seconds) | |
1,000 Word/PDF documents, average size 500 KB | | |
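To make the comparison easier to read, the timings above can be turned into a GPU-over-CPU speedup and a rough GPU throughput. The following is a minimal sketch using only the figures from the table; the function and variable names are illustrative, and the ratios compare the two setups exactly as benchmarked (one task at a time on the GPU, two at a time on the CPU), not a like-for-like hardware test.

```python
# Timings copied from the benchmark table (seconds).
# Each entry: (document, word count, GPU seconds, CPU seconds or None if not measured).
benchmarks = [
    ("JSON call transcript", 103_000, 19, 587),
    ("Word document with graphics", 13_000, 18, 805),
    ("Large text-only Word document", 630_000, 1226, None),
]


def speedup(gpu_seconds, cpu_seconds):
    """Return CPU time divided by GPU time, or None when no CPU figure exists."""
    if cpu_seconds is None:
        return None
    return cpu_seconds / gpu_seconds


for name, words, gpu_s, cpu_s in benchmarks:
    ratio = speedup(gpu_s, cpu_s)
    label = f"{ratio:.1f}x" if ratio is not None else "n/a"
    print(f"{name}: {words / gpu_s:,.0f} words/s on GPU, speedup vs CPU: {label}")
```

On these two measured documents the GPU finishes roughly 30 to 45 times sooner than the CPU setup. Words per second varies widely between documents, so it should be treated as a rough indicator rather than a sizing constant.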
Machine Types
EC2 Linux Instances
...
E.g. g6.xlarge
$0.803/hour (Llama 3), 15.8 tokens/sec. Around $670 (after tax) for a full month running 24x7 for 30 days.
1x required
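The monthly figure can be checked with simple arithmetic. This is a minimal sketch: the hourly rate and the 30-day month come from the text above, while the function name and the `tax_rate` parameter are illustrative. The tax rate is not stated in this document, so the sketch only derives what rate the quoted $670 would imply.

```python
HOURLY_RATE_USD = 0.803  # g6.xlarge hourly rate quoted above
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30      # "full month" as used in this document


def monthly_cost(hourly_rate, instances=1, tax_rate=0.0):
    """Cost of running `instances` machines 24x7 for a 30-day month."""
    return hourly_rate * HOURS_PER_DAY * DAYS_PER_MONTH * instances * (1 + tax_rate)


pre_tax = monthly_cost(HOURLY_RATE_USD)

# Assumption: the quoted ~$670 is the pre-tax cost plus a flat tax.
# The rate below is derived from that figure, not taken from the source.
implied_tax = 670 / pre_tax - 1

print(f"Pre-tax monthly cost: ${pre_tax:.2f}")
print(f"Tax rate implied by $670: {implied_tax:.1%}")
```

The pre-tax cost works out to about $578 per month, so the quoted $670 corresponds to a tax of roughly 16%. Substitute the actual local rate through `tax_rate` when budgeting.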
EC2 Windows Instances
...