Gateway Vector DB
The gateway stores the text of the uploaded content.
Assuming an average of five characters per word and 500 words per page, one page of content is estimated to take about 5 KB.
The system stores the original text and a vector representation (chunk) of the content, approximately 2 KB per page.
...
| File type | Metric | Site A | Site B |
| Word | No. of files | 1214 | 480 |
| Word | Size | 240 MB | 400 MB |
| Word | Average file size | 200 KB | 800 KB |
| Excel | No. of files | 283 | 281 |
| Excel | Size | 37 MB | 24 MB |
| PowerPoint | No. of files | 472 | 27 |
| PowerPoint | Size | 2368 MB | 74 MB |
| PDF | No. of files | 1372 | 975 |
| PDF | Size | 2262 MB | 407 MB |
| PDF | Average file size | 1600 KB | 400 KB |
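The average file sizes in the table can be cross-checked from the file counts and total sizes. The following is a minimal sketch; the function name is illustrative, and it assumes 1 MB = 1000 KB, which the source does not specify. The results land close to the table values, which appear to be rounded.

```python
# Cross-check of the "average file size" figures in the table above.
# Assumption: 1 MB = 1000 KB (the source does not say which convention it uses).

def average_file_size_kb(num_files: int, total_size_mb: float) -> float:
    """Return the average file size in KB for a set of files."""
    return total_size_mb * 1000 / num_files

# (number of files, total size in MB), copied from the table
word_files = {"Site A": (1214, 240), "Site B": (480, 400)}
pdf_files = {"Site A": (1372, 2262), "Site B": (975, 407)}

for label, data in (("Word", word_files), ("PDF", pdf_files)):
    for site, (count, size_mb) in data.items():
        avg_kb = average_file_size_kb(count, size_mb)
        print(f"{label} {site}: about {avg_kb:.0f} KB per file")
```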
Number of document chunks
Each chunk is 500 tokens.
...
If a site has 150,000 chunks, it contains 150,000 × 0.75 = 112,500 pages, which is around 10K documents.
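The conversion from chunks to pages and documents can be sketched as below. The 0.75 pages-per-chunk factor is taken from the text; it is consistent with a 500-token chunk holding about 375 words at 75 words per 100 tokens, on a 500-word page. The pages-per-document value is an assumption derived only from this one example (112,500 pages for around 10K documents) and will differ between sites.

```python
# Rough conversion from chunk count to pages and documents.
PAGES_PER_CHUNK = 0.75       # factor used in the text
PAGES_PER_DOCUMENT = 11.25   # assumption: implied by 112,500 pages ~ 10K documents

def chunks_to_pages(chunks: int) -> float:
    """Estimate the number of pages represented by a number of chunks."""
    return chunks * PAGES_PER_CHUNK

def pages_to_documents(pages: float) -> float:
    """Estimate the number of documents, given an average document length."""
    return pages / PAGES_PER_DOCUMENT

pages = chunks_to_pages(150_000)
print(f"{pages:,.0f} pages, about {pages_to_documents(pages):,.0f} documents")
```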
Memory usage for the Postgres DB
It is recommended to run Postgres with at least 8 GB of RAM. Even when empty, it uses 4 GB.
For 1M chunks, it takes around 8 GB of RAM (around 350,000 documents according to AGAT).
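The two data points above (about 4 GB when empty, about 8 GB at 1M chunks) can be turned into a rough planning rule. The sketch below assumes memory grows linearly between and beyond those points, which the source does not state; the function names and the linear model are illustrative assumptions, not measured behavior.

```python
# Rough RAM planning rule for the Postgres DB.
# Assumption: usage grows linearly with the number of chunks (not stated in the source).
BASELINE_GB = 4.0           # usage when empty, from the text
USAGE_AT_1M_CHUNKS_GB = 8.0  # usage at 1M chunks, from the text
RECOMMENDED_MINIMUM_GB = 8.0  # recommended minimum, from the text

def estimated_usage_gb(chunks: int) -> float:
    """Linear interpolation through the two data points in the text."""
    per_million = USAGE_AT_1M_CHUNKS_GB - BASELINE_GB
    return BASELINE_GB + per_million * chunks / 1_000_000

def recommended_ram_gb(chunks: int) -> float:
    """Never go below the recommended minimum."""
    return max(RECOMMENDED_MINIMUM_GB, estimated_usage_gb(chunks))

print(recommended_ram_gb(150_000))
print(recommended_ram_gb(2_000_000))
```

Treat the output as a starting point for capacity planning and verify it against actual memory usage on the deployed system.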
Gateway SQL Server DB sizing
It is recommended that the SQL Server run with at least 4 GB.
Cost estimations for using OpenAI services
Embeddings
Assuming a typical text page is 500 words, the cost of embedding a typical page can be broken down as follows:
Estimate the number of tokens per page:
100 tokens are around 75 words.
A typical page is around 500 to 600 words, translating to approximately 700 tokens.
Determine the cost per token:
The cost for embeddings (text-embedding-3-small) is $0.02 per 1 million tokens.
Calculate the cost for one page:
Since 700 tokens are on a typical page, we can calculate the cost as follows:
\[
\text{Cost per page} = \frac{700 \text{ tokens}}{1{,}000{,}000 \text{ tokens}} \times 0.02 \text{ USD} = 0.000014 \text{ USD}
\]
So, the cost for embedding a typical page (approximately 700 tokens) would be about $0.000014 per page.
In other words, $1 can embed around 70K pages.
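The calculation above can be reproduced for any number of pages with a short script. This is a minimal sketch using the figures from the text (75 words per 100 tokens, about 700 tokens per page, $0.02 per 1 million tokens); the function names are illustrative, and the price should be checked against current OpenAI pricing before budgeting.

```python
# Embedding cost estimate based on the figures in the text.
WORDS_PER_TOKEN = 0.75                 # 100 tokens are around 75 words
TOKENS_PER_PAGE = 700                  # typical page, as estimated in the text
PRICE_PER_MILLION_TOKENS_USD = 0.02    # text-embedding-3-small price quoted in the text

def tokens_for_words(words: int) -> float:
    """Estimate the token count for a given number of words."""
    return words / WORDS_PER_TOKEN

def embedding_cost_usd(pages: int, tokens_per_page: int = TOKENS_PER_PAGE) -> float:
    """Estimate the embedding cost in USD for a number of pages."""
    total_tokens = pages * tokens_per_page
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD

def pages_per_dollar(tokens_per_page: int = TOKENS_PER_PAGE) -> float:
    """Estimate how many pages can be embedded for 1 USD."""
    return 1.0 / embedding_cost_usd(1, tokens_per_page)

print(f"500 to 600 words: {tokens_for_words(500):.0f} to {tokens_for_words(600):.0f} tokens")
print(f"Cost per page: ${embedding_cost_usd(1):.6f}")
print(f"Cost for 112,500 pages: ${embedding_cost_usd(112_500):.2f}")
print(f"Pages per dollar: about {pages_per_dollar():,.0f}")
```

The 112,500-page figure reuses the example site from the chunk estimate earlier in this document; at these rates, embedding it costs under two dollars.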