The gateway stores the text of the uploaded content.
Assuming an average of five characters per word and 500 words per page, one page of content is estimated to take about 5 KB.
The system stores the original text and a vector representation (chunk) of the content; the vector representation takes approximately 2 KB per page.
That gives a total of 7 KB per page; in other words, 1 MB can hold around 140 pages of content.
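The per-page arithmetic above can be sketched in a few lines of Python. This is only an illustration of the estimate, not part of the product: the constant and function names are hypothetical, and it assumes decimal units (1 MB = 1000 KB).

```python
# Estimates taken from the text above (not measured values).
TEXT_KB_PER_PAGE = 5      # original text of one page
VECTOR_KB_PER_PAGE = 2    # vector representation (chunk) of one page
KB_PER_MB = 1000          # assumption: decimal units, 1 MB = 1000 KB


def kb_per_page() -> int:
    """Total storage per page: original text plus vector representation."""
    return TEXT_KB_PER_PAGE + VECTOR_KB_PER_PAGE


def pages_per_mb() -> int:
    """Whole pages that fit in 1 MB under the assumptions above."""
    return KB_PER_MB // kb_per_page()


print(kb_per_page())   # 7
print(pages_per_mb())  # 142, i.e. "around 140 pages"
```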
According to AGAT's experience with 10K real AGAT documents, ingestion produced 35K chunks.
So a document, on average, is between 3.5 chunks (according to AGAT) and 10 chunks (according to ChatGPT).
According to ChatGPT, the average document has 10 pages, so you can ingest 14 documents in 1 MB.
If you have 100,000 documents, you would need about 7,200 MB (7.2 GB).

| | Based on ChatGPT | Based on the AGAT experiment |
|---|---|---|
| Number of chunks per document | 10 | 3.5 |
| Average size of one document | 70 KB | |
| Size on disk for 100,000 documents (GB) | 7.2 | 2.8 |

Based on the number of pages of content, you can calculate the estimated size of the DB needed for the product.
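As a rough aid for that calculation, here is a minimal Python sketch. It assumes the 7 KB per page estimate from the text and decimal units (1 GB = 1,000,000 KB); the function name `estimate_db_size_gb` and its parameters are hypothetical and not part of the product.

```python
KB_PER_PAGE = 7          # from the text: 5 KB of text + 2 KB of vectors
KB_PER_GB = 1_000_000    # assumption: decimal units


def estimate_db_size_gb(documents: int, pages_per_document: float) -> float:
    """Estimate on-disk DB size in GB for a given number of documents."""
    if documents < 0 or pages_per_document < 0:
        raise ValueError("documents and pages_per_document must be non-negative")
    total_kb = documents * pages_per_document * KB_PER_PAGE
    return total_kb / KB_PER_GB


# 100,000 documents of 10 pages each (the ChatGPT-based average).
print(estimate_db_size_gb(100_000, 10))  # 7.0
```

The result for 100,000 ten-page documents is 7.0 GB, close to the rounded 7.2 GB figure quoted in the text; the small gap comes from rounding in the intermediate steps. Treat the output as an order-of-magnitude estimate.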
Here is a sample sizing from our company:
...
If the site has 150,000 chunks, it contains 150,000 * 0.75 = 112,500 pages, which is around 10K documents.
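The same conversion can be written as a short Python sketch. The 0.75 pages-per-chunk factor is taken from the text; the 10 pages per document is the ChatGPT-based average quoted earlier and is assumed here to be how the text gets from pages to documents. The names are hypothetical.

```python
PAGES_PER_CHUNK = 0.75     # factor used in the text
PAGES_PER_DOCUMENT = 10    # assumption: ChatGPT-based average document length


def pages_from_chunks(chunks: int) -> float:
    """Convert a chunk count into an estimated page count."""
    return chunks * PAGES_PER_CHUNK


def documents_from_pages(pages: float) -> float:
    """Convert a page count into an estimated document count."""
    return pages / PAGES_PER_DOCUMENT


pages = pages_from_chunks(150_000)
print(pages)                        # 112500.0
print(documents_from_pages(pages))  # 11250.0, i.e. roughly 10K documents
```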
Memory usage for the Postgres DB
It is recommended to run Postgres with at least 8 GB of RAM. Even when empty, it uses 4 GB.
For 1M chunks, it takes around 8 GB of RAM (around 350,000 documents according to AGAT).
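For planning purposes, the two data points above (4 GB when empty, about 8 GB at 1M chunks) can be turned into a rough estimate. The sketch below assumes memory grows linearly with the number of chunks. That is an extrapolation, not a measured behavior, so treat anything beyond 1M chunks with caution. The function and constant names are hypothetical.

```python
BASELINE_GB = 4.0          # from the text: usage of an empty DB
GB_AT_1M_CHUNKS = 8.0      # from the text: usage at 1M chunks
RECOMMENDED_MIN_GB = 8.0   # from the text: recommended minimum


def estimate_postgres_ram_gb(chunks: int) -> float:
    """Rough RAM estimate in GB, ASSUMING linear growth between the two data points."""
    if chunks < 0:
        raise ValueError("chunks must be non-negative")
    estimated = BASELINE_GB + (GB_AT_1M_CHUNKS - BASELINE_GB) * chunks / 1_000_000
    # Never go below the recommended minimum.
    return max(RECOMMENDED_MIN_GB, estimated)


print(estimate_postgres_ram_gb(0))          # 8.0 (recommended minimum applies)
print(estimate_postgres_ram_gb(1_000_000))  # 8.0
print(estimate_postgres_ram_gb(2_000_000))  # 12.0 (extrapolated, unverified)
```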