Number of Required Servers

POC Requirements

| Purpose | Option 1 (GPU embedding) | Option 2 (CPU embedding) |
|---|---|---|
| Gateway (Linux) | t3a.xlarge | t3a.2xlarge |
| GPU Embedder (Linux) | g4dn.xlarge | None. Embedding uses the Gateway's CPU |
| LLM (Linux) | g6.xlarge | g6.xlarge |
| Dashboard | t3a.large | t3a.large |

1x Linux Instance (Gateway)

t3a.2xlarge - 100 GB SSD disk

1x Linux GPU Instance (LLM)

...

Customer may provide an SSL certificate to secure access to the dashboard website.

Embedding

Embedding is the process of converting ingested data into a format that can be stored and retrieved efficiently to answer questions.

Embedding can be performed on a CPU or a GPU. It is a one-time process for each piece of data, carried out during ingestion. If the bulk of the customer's data is ingested during product onboarding, it may be optimal to use a GPU for embedding at the beginning and then switch to a CPU for further embedding.

If significant amounts of data need to be embedded on a continual basis, a dedicated GPU for embedding may be desirable in addition to the GPU required for the LLM.

Here are some benchmarks for comparison.

| Document | GPU time (Nvidia T4 16 GB, g4dn.xlarge, one task at a time) | CPU time (AMD EPYC 7571 2.5 GHz, t3a.2xlarge, two tasks at a time) |
|---|---|---|
| JSON call transcript (text document). Characters: 767,000. Words: 103,000 | 19 seconds | 587 seconds |
| Microsoft Word document with graphics. Words: 13,000. Characters: 85,000 | 18 seconds | 805 seconds |

Large text-only Microsoft Word document. Words: 630,000. Characters: 3,273,000. Time: 20 minutes 26 seconds (1,226 seconds)

1,000 documents (Word/PDF). Average size: 500 KB
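
To make the benchmark figures easier to compare, the following minimal sketch derives characters per second and the GPU-to-CPU speedup for the two documents that have both a GPU and a CPU timing. The function and variable names are illustrative only and are not part of the product. Note that the GPU runs processed one task at a time and the CPU runs two at a time, so the ratio reflects wall-clock time under those conditions rather than a like-for-like hardware comparison.

```python
# Figures copied from the benchmarks above (only rows with both timings).
BENCHMARKS = [
    # (document, characters, gpu_seconds, cpu_seconds)
    ("JSON call transcript", 767_000, 19, 587),
    ("Word document with graphics", 85_000, 18, 805),
]


def throughput(characters: int, seconds: float) -> float:
    """Characters embedded per second of wall-clock time."""
    return characters / seconds


def speedup(gpu_seconds: float, cpu_seconds: float) -> float:
    """How many times shorter the GPU run was than the CPU run."""
    return cpu_seconds / gpu_seconds


def summarize(rows=BENCHMARKS):
    """Return one summary dict per benchmark row."""
    return [
        {
            "document": name,
            "gpu_chars_per_sec": throughput(chars, gpu),
            "cpu_chars_per_sec": throughput(chars, cpu),
            "speedup": speedup(gpu, cpu),
        }
        for name, chars, gpu, cpu in rows
    ]


if __name__ == "__main__":
    for row in summarize():
        print(
            f"{row['document']}: "
            f"GPU {row['gpu_chars_per_sec']:,.0f} chars/s, "
            f"CPU {row['cpu_chars_per_sec']:,.0f} chars/s, "
            f"speedup {row['speedup']:.1f}x"
        )
```

On these two samples the GPU run finishes roughly 31 to 45 times sooner. The Word document with graphics takes about as long on the GPU as the much larger transcript, which suggests that factors other than character count contribute to the total time.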

Machine Types

EC2 Linux Instances

...

E.g. g6.xlarge

$0.803/hour (Llama 3), 15.8 tokens/sec. Around $670 per month (after tax) when running 24x7 for 30 days.

1x required
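
The monthly figure can be checked with simple arithmetic. The sketch below multiplies the quoted hourly rate by 24 hours and 30 days, then compares the result with the quoted after-tax amount. The source does not state a tax rate, so the ratio computed here is only what the two quoted numbers imply, on the assumption that the whole difference is tax. The names used are hypothetical.

```python
HOURLY_RATE_USD = 0.803    # g6.xlarge hourly rate quoted above
QUOTED_MONTHLY_USD = 670   # approximate after-tax monthly figure quoted above


def monthly_cost(hourly_rate: float, instances: int = 1,
                 hours_per_day: int = 24, days: int = 30) -> float:
    """Pre-tax cost of running `instances` machines continuously."""
    return hourly_rate * hours_per_day * days * instances


pre_tax = monthly_cost(HOURLY_RATE_USD)

# Assumption: the gap between the pre-tax cost and the quoted figure is tax.
implied_tax_ratio = QUOTED_MONTHLY_USD / pre_tax

if __name__ == "__main__":
    print(f"Pre-tax monthly cost: ${pre_tax:.2f}")
    print(f"Implied after-tax multiplier: {implied_tax_ratio:.3f}")
```

The pre-tax cost comes to about $578 per month, so the quoted figure of around $670 corresponds to a surcharge of roughly 16%.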

EC2 Windows Instances

...