Deployment system sizing

Gateway Vector DB

The gateway stores the text of the uploaded content.
Assuming an average of five characters per word and 500 words per page, one page of content is estimated to take about 5 KB.
In addition to the original text, the system stores a vector representation (chunk) of the content, which takes approximately 2 KB per page.

This gives a total of 7 KB per page; in other words, 1 MB can hold around 140 pages of content.
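The per-page arithmetic above can be sketched in Python. This is a minimal illustration that uses the figures from this page (5 KB of text plus 2 KB of vector data per page). The function name is hypothetical, and 1 MB is taken as 1,000 KB, which is a simplifying assumption.

```python
# Figures taken from the estimate above; treat them as rough averages.
TEXT_KB_PER_PAGE = 5      # original text stored per page
VECTOR_KB_PER_PAGE = 2    # vector representation stored per page


def pages_per_mb(text_kb: float = TEXT_KB_PER_PAGE,
                 vector_kb: float = VECTOR_KB_PER_PAGE) -> float:
    """Return how many pages fit in 1 MB (assumed here to be 1,000 KB)."""
    return 1000 / (text_kb + vector_kb)


if __name__ == "__main__":
    # Rounds to the "around 140 pages" figure quoted above.
    print(f"{pages_per_mb():.0f} pages per MB")
```

Changing either per-page figure shows how sensitive the estimate is to the assumed page size.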

In AGAT's experience, a set of 10K real AGAT documents produced 35K chunks.

So a document averages between 3.5 chunks (according to AGAT) and 10 chunks (according to ChatGPT).

According to ChatGPT, the average document has 10 pages, so 1 MB holds about 14 documents.

	Based on ChatGPT	Based on the AGAT experiment
Number of chunks per document	10	3.5
Average size of one document	70 KB
Size on disk for 100,000 documents	7.2 GB	2.8 GB

Based on the number of pages of content, you can calculate the estimated DB size needed for the product.
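As a rough illustration of that calculation, the sketch below multiplies the number of documents by the pages per document and the 7 KB per page estimated earlier. The function name and the default of 10 pages per document (the ChatGPT-based average quoted on this page) are assumptions, and 1 GB is taken as 1,000,000 KB.

```python
def estimate_db_size_gb(documents: int,
                        pages_per_document: float = 10,
                        kb_per_page: float = 7) -> float:
    """Estimate the vector DB disk size in GB.

    Assumes 1 GB = 1,000,000 KB and uses the per-page figure
    (text plus vector representation) estimated on this page.
    """
    total_kb = documents * pages_per_document * kb_per_page
    return total_kb / 1_000_000


if __name__ == "__main__":
    # 100,000 documents of 10 pages each: roughly 7 GB.
    print(f"{estimate_db_size_gb(100_000):.1f} GB")
```

For a site with shorter documents, pass a smaller `pages_per_document` value; the result scales linearly.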

Here is a sample sizing from our company:

File type	Site	No. of files	Size	Average file size
Word	Site A	1214	240 MB	200 KB
Word	Site B	480	400 MB	800 KB
Excel	Site A	283	37 MB
Excel	Site B	281	24 MB
PowerPoint	Site A	472	2368 MB
PowerPoint	Site B	27	74 MB
PDF	Site A	1372	2262 MB	1600 KB
PDF	Site B	975	407 MB	400 KB

Number of document chunks

Each chunk is 500 tokens.

100 tokens is around 75 words.

So each chunk is around 375 words.

At 500 words per page, each chunk covers 375/500 = 0.75 pages.

If the site has 150,000 chunks, it contains 150,000 * 0.75 = 112,500 pages, which is around 10K documents (assuming roughly 10 pages per document).
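The conversion from chunks to pages and documents can be written as a short Python sketch. The constants come from this page; the function names are hypothetical, and the 10 pages per document is the ChatGPT-based average quoted earlier, used here as an assumption.

```python
TOKENS_PER_CHUNK = 500
WORDS_PER_TOKEN = 75 / 100   # 100 tokens is around 75 words
WORDS_PER_PAGE = 500
PAGES_PER_DOCUMENT = 10      # assumption: average quoted earlier on this page


def chunks_to_pages(chunks: int) -> float:
    """Convert a chunk count to an approximate page count."""
    words = chunks * TOKENS_PER_CHUNK * WORDS_PER_TOKEN
    return words / WORDS_PER_PAGE


def chunks_to_documents(chunks: int) -> float:
    """Convert a chunk count to an approximate document count."""
    return chunks_to_pages(chunks) / PAGES_PER_DOCUMENT


if __name__ == "__main__":
    print(chunks_to_pages(150_000))      # pages for 150,000 chunks
    print(chunks_to_documents(150_000))  # documents for 150,000 chunks
```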

Memory usage for the Postgres DB

It is recommended to run Postgres with at least 8 GB of RAM. Even when empty, it uses 4 GB.

1M chunks (around 350,000 documents according to AGAT) take around 8 GB of RAM.
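For chunk counts other than the two data points above, a rough planning figure can be interpolated. The sketch below assumes memory grows linearly between about 4 GB (empty) and about 8 GB (1M chunks); that linear model is an assumption for illustration only, not a measured behavior, and the function name is hypothetical.

```python
EMPTY_DB_RAM_GB = 4          # usage reported for an empty database
RAM_GB_AT_1M_CHUNKS = 8      # usage reported for 1M chunks
RECOMMENDED_MIN_RAM_GB = 8   # recommended minimum from this page


def estimate_postgres_ram_gb(chunks: int) -> float:
    """Estimate RAM in GB, ASSUMING linear growth between the two
    reported data points; never returns less than the recommended minimum."""
    per_million = RAM_GB_AT_1M_CHUNKS - EMPTY_DB_RAM_GB
    estimate = EMPTY_DB_RAM_GB + per_million * chunks / 1_000_000
    return max(RECOMMENDED_MIN_RAM_GB, estimate)


if __name__ == "__main__":
    for n in (150_000, 1_000_000, 2_000_000):
        print(n, estimate_postgres_ram_gb(n))
```

Real memory use depends on the Postgres configuration and workload, so the result should be treated as a starting point and checked against monitoring.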

Gateway SQL Server DB sizing

It is recommended that the SQL Server run with at least 4GB.

Cost estimations for using OpenAI services

Embeddings

The following breakdown estimates the cost of embedding a typical text page of about 500 words.

Estimate the number of tokens per page:
- Assume that 100 tokens are around 75 words.
- A typical page is around 500 to 600 words, which translates to approximately 700 tokens.
Determine the cost per token:
- The cost for embeddings (text-embedding-3-small) is $0.02 per 1 million tokens.
Calculate the cost for one page:
- With 700 tokens on a typical page, the cost is:
\[
\text{Cost per page} = \frac{700 \text{ tokens}}{1{,}000{,}000 \text{ tokens}} \times 0.02 \text{ USD} = 0.000014 \text{ USD}
\]

So, the cost for embedding a typical page (approximately 700 tokens) is about $0.000014.

Or in other words, $1 covers the embedding of around 70K pages.
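The same calculation can be expressed as a small Python sketch. The price and the tokens-per-page figure are the ones quoted on this page and may change over time; the function names are hypothetical.

```python
PRICE_PER_MILLION_TOKENS_USD = 0.02   # embedding price quoted on this page
TOKENS_PER_PAGE = 700                 # typical page, as estimated above


def embedding_cost_usd(pages: int,
                       tokens_per_page: int = TOKENS_PER_PAGE,
                       price_per_million: float = PRICE_PER_MILLION_TOKENS_USD) -> float:
    """Return the estimated cost in USD of embedding the given number of pages."""
    return pages * tokens_per_page / 1_000_000 * price_per_million


def pages_per_dollar() -> float:
    """Return roughly how many pages can be embedded for 1 USD."""
    return 1 / embedding_cost_usd(1)


if __name__ == "__main__":
    print(f"Cost per page: ${embedding_cost_usd(1):.6f}")
    print(f"Pages per dollar: {pages_per_dollar():,.0f}")
```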

Questions

The average price per question is $0.0059, meaning $1 covers about 170 questions.
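To budget for a given question volume, the average price above can be applied directly. This is a minimal sketch; the function names are hypothetical, and the real price per question varies with the model and the length of each prompt and answer.

```python
AVG_PRICE_PER_QUESTION_USD = 0.0059   # average quoted on this page


def questions_for_budget(budget_usd: float,
                         price: float = AVG_PRICE_PER_QUESTION_USD) -> int:
    """Return the number of whole questions the budget covers."""
    return int(budget_usd / price)


def cost_of_questions(questions: int,
                      price: float = AVG_PRICE_PER_QUESTION_USD) -> float:
    """Return the estimated cost in USD of the given number of questions."""
    return questions * price


if __name__ == "__main__":
    print(questions_for_budget(1.0))            # whole questions covered by $1
    print(f"${cost_of_questions(1000):.2f}")    # estimated cost of 1,000 questions
```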

1 million chunks take around 8 GB of RAM.
