
Generative AI (GenAI) applications consume data differently than traditional online transaction processing (OLTP) workloads, which operate on small chunks of data, typically rows in tables with relatively simple data structures. Large language model (LLM) training, on the other hand, requires rich document structures or binary objects that can be multiple kilobytes or even megabytes in size.
Retrieval-augmented generation (RAG), a data retrieval technique used widely in GenAI applications, leverages similarly complex data structures in real time to build prompts and generate responses. And more than any other modern workload, AI-driven applications require the underlying database to process queries against this kind of data quickly and efficiently.
Traditional relational database management system (RDBMS) platforms were never designed for this type of workload. The storage engines they are built on assume consistency in row size, and they operate most efficiently with narrow rows of typed attributes.
Best practices for working with an RDBMS include minimizing the number of attributes on a table, using vertical partitioning and keeping rows small enough to avoid off-row storage. The same guidance applies to JSON attribute types in an RDBMS, which let developers extend the rows of their relational tables with flexible-schema objects that can be referenced in complex queries. JSON support has helped RDBMS platforms remain relevant as a new category of document- and object-based NoSQL alternatives has evolved to address this need.
In the open source RDBMS PostgreSQL, the standard guidance is to minimize the number of values JSON or JSONB attributes might contain or to use sparsely populated objects where most values are not included for any given row.
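To make the pattern concrete, here is a minimal sketch of a relational row extended with a JSONB attribute that can be referenced in a query. It assumes a local PostgreSQL instance and the psycopg2 driver; the table, column names and connection string are hypothetical.

```python
import json

import psycopg2

# Assumes a local PostgreSQL instance; connection details are placeholders.
conn = psycopg2.connect("dbname=demo user=postgres")
cur = conn.cursor()

# A narrow relational row extended with a flexible-schema JSONB attribute.
cur.execute("""
    CREATE TABLE IF NOT EXISTS products (
        id      serial PRIMARY KEY,
        name    text NOT NULL,
        details jsonb
    )
""")

# The JSONB value travels to the server as JSON text.
cur.execute(
    "INSERT INTO products (name, details) VALUES (%s, %s)",
    ("widget", json.dumps({"color": "blue", "dims": {"w": 10, "h": 4}})),
)

# The flexible-schema attribute can be referenced in queries, here with ->>.
cur.execute("SELECT name FROM products WHERE details->>'color' = %s", ("blue",))
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()
```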
Data model differences aside, one of the biggest advantages a true document database like MongoDB holds over RDBMS platforms such as PostgreSQL for GenAI workloads is under the hood. MongoDB is built on top of a storage engine that is designed to handle rich, complex documents of variable sizes, from a few bytes to multiple megabytes.
Modern OLTP workloads are more frequently causing RDBMS platforms designed for small, consistent rows to resort to off-row storage for large data objects. Off-row storage techniques like PostgreSQL’s TOAST may introduce performance bottlenecks as queries can no longer access the data they need from the rows on the table, causing secondary retrievals from a large object storage layer.
Additionally, MongoDB leverages a binary document structure known as BSON to store and transmit strongly typed attribute data, which is processed by the server and stored on disk in the same format, eliminating the need for server-side parsing.
RDBMS platforms such as PostgreSQL transmit complex JSON as text. This requires server-side parsing on writes: at minimum to validate the JSON formatting, and in the case of JSONB to also deserialize the attribute values into a binary representation so they can be used efficiently in server-side queries that compute on those values. On the way out, the data must be serialized back into JSON text before being transmitted to the client.
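As a rough illustration of the difference, the sketch below writes the same document through both drivers. It assumes local MongoDB and PostgreSQL instances with pymongo and psycopg2; the database, collection and table names are hypothetical.

```python
import json

import psycopg2
from pymongo import MongoClient

doc = {"user": "alice", "scores": [1, 2, 3], "profile": {"tier": "gold"}}

# MongoDB: the driver encodes the Python dict straight to BSON; the server
# stores and later returns that same binary format without re-parsing it.
# (A copy is inserted so the generated _id is not added to the original dict.)
mongo = MongoClient("mongodb://localhost:27017")
mongo["demo"]["events"].insert_one(dict(doc))

# PostgreSQL: the document travels as JSON text; the server must at least
# validate it (JSON) or fully parse it into a binary form (JSONB) on write,
# and serialize it back to text on read.
pg = psycopg2.connect("dbname=demo user=postgres")
with pg, pg.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS events (payload jsonb)")
    cur.execute("INSERT INTO events (payload) VALUES (%s)", (json.dumps(doc),))
```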
The overhead of this process on large documents is significant and measurable, as we will demonstrate in the following benchmark test.
Comparing MongoDB to PostgreSQL
GenAI presents new challenges as the average size of the objects the database needs to store and process increases dramatically. To evaluate the performance of PostgreSQL JSON/JSONB versus MongoDB BSON, we ran both servers on the same hardware platform:
- Windows 10 Pro
- 32GB RAM
- Intel Core i7-8700K CPU @ 3.70GHz
- PostgreSQL v16.2
- MongoDB v7.0.8
We measured single-threaded performance to get a better idea of the protocol overhead between the client and server. As mentioned earlier, MongoDB uses BSON specifically to reduce this overhead by eliminating the need to serialize and deserialize data. Adding threads or running multiple clients would mask this overhead unless the overall resource utilization of the entire system were also measured.
For the write workloads, we inserted 10,000 documents into a single table or collection with the following configurations and payload sizes:
- Single attribute with 10-, 200-, 1000-, 2000- and 4000-byte payloads.
- Multi-attribute with 10 1-byte, 50 4-byte, 100 10-byte, 100 20-byte and 100 40-byte attributes.
The PostgreSQL server was configured using the following settings:
- All tables created as unlogged tables.
- Set shared_buffers=8GB.
- Set synchronous_commit=off.
- Set wal_buffers=48MB.
The MongoDB server was a default installation with no tuning applied.
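For reference, the insert runs follow a simple pattern. The sketch below shows roughly what a single-attribute run looks like, assuming pymongo and psycopg2; the connection strings, table/collection names and per-insert commit behavior are assumptions, since the original benchmark code is not shown here.

```python
import json
import time

import psycopg2
from pymongo import MongoClient

N_DOCS = 10_000
PAYLOAD = "x" * 1000  # single attribute, ~1KB payload in this example run

# MongoDB: insert 10,000 single-attribute documents one at a time.
coll = MongoClient("mongodb://localhost:27017")["bench"]["items"]
start = time.perf_counter()
for i in range(N_DOCS):
    coll.insert_one({"_id": i, "payload": PAYLOAD})
print("MongoDB:", round((time.perf_counter() - start) * 1000), "ms")

# PostgreSQL: the same workload against an unlogged table with a JSONB column,
# matching the configuration listed above. Each statement commits on its own
# (an assumption), which synchronous_commit=off keeps inexpensive.
pg = psycopg2.connect("dbname=bench user=postgres")
pg.autocommit = True
cur = pg.cursor()
cur.execute("CREATE UNLOGGED TABLE IF NOT EXISTS items (id int PRIMARY KEY, doc jsonb)")
start = time.perf_counter()
for i in range(N_DOCS):
    cur.execute(
        "INSERT INTO items (id, doc) VALUES (%s, %s)",
        (i, json.dumps({"payload": PAYLOAD})),
    )
print("PostgreSQL:", round((time.perf_counter() - start) * 1000), "ms")
```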
The first test we ran was to insert 10,000 single-attribute documents while measuring the total time taken to complete in milliseconds as the payload increased from 10B to ~4KB. The table below shows the raw results.
10K Item Insert (Single Attribute Payload)

| Payload | MongoDB | Postgres (JSONB) | Postgres (JSON) |
| --- | --- | --- | --- |
| n=1, s=10 | 773 | 399 | 331 |
| n=1, s=200 | 789 | 2184 | 969 |
| n=1, s=1000 | 750 | 8393 | 4071 |
| n=1, s=2000 | 850 | 16387 | 7944 |
| n=1, s=4000 | 829 | 31705 | 15767 |
The data is pretty clear when it comes to large documents: PostgreSQL performance is on par with MongoDB until the document size starts to grow past a few hundred bytes, and then it takes a hard turn for the worse once TOAST kicks in at around 2KB.
At the other end, it is no surprise to see that PostgreSQL is quite competitive at processing the kind of workload it was designed to handle, working with small chunks of data. As the size of the payload increases, however, processing overhead quickly becomes noticeable. PostgreSQL only validates formatting for JSON attributes whereas JSONB also parses the attribute values.
The additional overhead incurred by JSONB can be seen in this test. MongoDB BSON outperforms both JSON and JSONB by a wide margin with a very flat curve across all payloads.
Testing More Attributes and Bigger Payloads
The next test was to insert 10,000 multi-attribute documents while measuring the total time taken to complete the test in milliseconds as the payload spreads out across multiple attributes and increases in size from 10B to ~4KB. The table below shows the raw results.
10K Item Insert (Multi-Attribute Payload)

| Payload | MongoDB | Postgres (JSONB) | Postgres (JSON) |
| --- | --- | --- | --- |
| n=10, s=10 | 531 | 415 | 322 |
| n=10, s=200 | 569 | 2040 | 793 |
| n=50, s=1000 | 641 | 9504 | 4419 |
| n=100, s=2000 | 812 | 19213 | 9181 |
| n=200, s=4000 | 1085 | 37278 | 17460 |
The primary takeaway is that JSONB incurs higher parsing overhead as the number of attributes increases, which makes sense because JSONB must parse each attribute value individually. Both MongoDB BSON and PostgreSQL JSON maintain relatively flat performance between the single- and multi-attribute tests.
BSON has zero parsing overhead and JSON is simply validating formats, so neither has significantly more work to do when it comes to inserting complex documents. MongoDB BSON remains the clear winner, with JSONB falling further behind.
MongoDB vs. PostgreSQL Read Test
The final test was a read test against the multi-attribute document sets. Indexes were created on an array attribute containing 10 integer values randomly selected from the 10,000 IDs of the documents inserted during the previous benchmark.
In PostgreSQL, this attribute was created on the row itself, not in the JSON/JSONB document. This was done to eliminate any possible overhead related to indexing document attributes, which PostgreSQL was not originally designed for. There would be no reason not to project indexed attributes out to the row in a real workload, so it represents a more equal comparison to configure the test this way.
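A rough sketch of that setup is shown below, continuing the hypothetical schema from the earlier insert sketch; the index names and population step are assumptions.

```python
import random

import psycopg2
from pymongo import MongoClient

N_DOCS = 10_000

# MongoDB: a multikey index on the array field inside each document.
coll = MongoClient("mongodb://localhost:27017")["bench"]["items"]
coll.create_index("ref_ids")

# PostgreSQL: the array is projected out to a plain integer[] column on the
# row (not inside the JSONB document) and indexed with GIN for containment
# queries.
pg = psycopg2.connect("dbname=bench user=postgres")
with pg, pg.cursor() as cur:
    cur.execute("ALTER TABLE items ADD COLUMN IF NOT EXISTS ref_ids int[]")
    cur.execute(
        "CREATE INDEX IF NOT EXISTS items_ref_ids_gin ON items USING gin (ref_ids)"
    )

# Each document/row gets 10 IDs randomly sampled from the 10,000 inserted IDs.
for i in range(N_DOCS):
    refs = random.sample(range(N_DOCS), 10)
    coll.update_one({"_id": i}, {"$set": {"ref_ids": refs}})
    with pg, pg.cursor() as cur:
        cur.execute("UPDATE items SET ref_ids = %s WHERE id = %s", (refs, i))
```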
The query test iterates across the 10,000 unique integer ID values and, for each one, selects all documents in the table or collection containing that ID in the indexed array. The result is that about 100,000 documents are retrieved during the query test.
In both cases, the entire result set is iterated, but no work is done on the actual data. This results in a benchmark that is skewed somewhat in favor of PostgreSQL, as the JSON text is not actually parsed into a usable object by the client, whereas the BSON documents do not require parsing.
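A minimal sketch of that query loop, again using the hypothetical schema above, might look like the following; each iteration fetches every document whose indexed array contains the current ID and simply walks the result set.

```python
import time

import psycopg2
from pymongo import MongoClient

N_DOCS = 10_000

coll = MongoClient("mongodb://localhost:27017")["bench"]["items"]
pg = psycopg2.connect("dbname=bench user=postgres")
cur = pg.cursor()

# MongoDB: an equality match against an array field uses the multikey index
# and returns every document whose array contains the value.
start = time.perf_counter()
for i in range(N_DOCS):
    for _ in coll.find({"ref_ids": i}):
        pass  # iterate the cursor, but do no work on the data
print("MongoDB:", round((time.perf_counter() - start) * 1000), "ms")

# PostgreSQL: array containment (@>) uses the GIN index. Casting the document
# to text means the client receives raw JSON and never parses it, mirroring
# the benchmark's behavior of discarding the text.
start = time.perf_counter()
for i in range(N_DOCS):
    cur.execute("SELECT doc::text FROM items WHERE ref_ids @> ARRAY[%s]", (i,))
    for _ in cur:
        pass
print("PostgreSQL:", round((time.perf_counter() - start) * 1000), "ms")
```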
Despite this advantage, the results again speak loudly in favor of MongoDB.
Array Index Query Test

| Payload | MongoDB | Postgres (JSONB) | Postgres (JSON) |
| --- | --- | --- | --- |
| n=10, s=10 | 3587 | 19933 | 18749 |
| n=10, s=200 | 3810 | 23619 | 23946 |
| n=50, s=1000 | 4741 | 27760 | 21311 |
| n=100, s=2000 | 6023 | 36701 | 23264 |
| n=200, s=4000 | 8352 | 53808 | 27789 |
Overhead related to serialization is high for both JSON and JSONB. MongoDB clearly benefits from the fact that this is not required for BSON. Overhead related to deserialization of JSON text is not even measured for PostgreSQL as the text is just thrown away in the code.
Even with small documents, MongoDB wins by a wide margin. One interesting takeaway is that JSONB incurs a hit on the read as well, since there appears to be more cost associated with serialization of the typed attribute values.
In other words, although JSONB is supposed to be faster for reads, it appears that if the data values stored in the document are not actually referenced in the query or processed server side, it is probably better to use JSON instead.
Summarizing the Results
Considering the well-known performance limitations of RDBMS when it comes to wide rows and large data attributes, it is no surprise that these tests indicate that a platform like PostgreSQL will struggle with the kind of rich, complex document data required by generative AI workloads.
Using a document database for this kind of document data delivers better performance than using a tool that simply wasn’t designed for these workloads.