Data modelling techniques in MongoDB
We grew up learning about normal forms: 1st, 2nd and 3rd normal form. Storage was not cheap in those days, and hence the objective was to save storage. But as we grew up, the world around us also changed. Applications moved from an N-tier architecture to a distributed architecture. Data volume increased exponentially, and so did data velocity. On top of that, the structure of the data is no longer fixed and predictable. It is polymorphic and does not follow a strict schema. This change saw the birth of NoSQL databases. They took advantage of cheaper storage and became bold enough to trade off non-redundancy of data for performance and scalability. They are a completely different breed that does not enforce referential integrity. Joins are sacrificed for the ability to scale horizontally. The long-known ACID properties of databases are no longer relevant for such databases; they follow what we now call BASE (Basically Available, Soft state, Eventually consistent).

In such a scenario, the data modelling approach also changed radically. We need to unlearn the RDBMS concepts and approach the problem from an application perspective. The data model is driven by how the application looks at data, that is, by the data ingestion and retrieval patterns of the application. The approach will also differ slightly based on which NoSQL database you are working with. In this article, I will be talking about the data modelling techniques used in MongoDB.

In MongoDB, the data modelling exercise primarily comprises modelling the collections and identifying the sharding and indexing strategy. Collections are related using either the embedding or the linking technique. Embedding is more appropriate when the child document does not have an independent existence and is always retrieved with the parent document. Embedding is also a better choice when we are looking for good read performance. On the other hand, linking will be a better choice when we are looking for better write performance or when we know that the child documents are going to keep growing. Remember, MongoDB has a restriction of 16 MB on the document size. However, as the old adage goes, life is not a bed of roses. Real-world scenarios are rarely straightforward enough to choose one technique over the other, and sometimes we need to go with a hybrid approach. There are certain patterns that we can look for while designing the data model. Let me elaborate on these patterns through some examples, starting with a quick sketch of the two techniques below.
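To make the two techniques concrete, here is a minimal mongosh sketch. The customers and orders collections and all their fields are hypothetical, chosen only for illustration.

// Embedding: the child documents live inside the parent document
// and are always read and written together with it.
db.customers.insertOne({
  _id: 1,
  name: "Asha",
  addresses: [                                  // embedded sub-documents
    { type: "home", city: "Pune" },
    { type: "office", city: "Mumbai" }
  ]
})

// Linking: the child documents live in their own collection and
// carry a reference to the parent's _id.
db.orders.insertOne({ _id: 101, customerId: 1, amount: 2500 })

// With linking, retrieving a customer's orders needs a second
// query (or a $lookup stage in an aggregation pipeline).
db.orders.find({ customerId: 1 })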

A loyalty application was built on MongoDB. The application stored the customer attributes and each transaction as an array of sub-documents in a single document. The problem started when these transactions became more frequent and the embedded array started growing rapidly. The write performance began to suffer. Remember, the WiredTiger storage engine follows MVCC, which means each update rewrites the entire document. So what we recommended was to decouple the transactions from the parent document and move them to their own collection. We kept only the last 5 transactions in the parent document, as the customers often wanted to see their last 5 transactions. This particular pattern is called the subset pattern.
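Here is a minimal sketch of how such a write could look. The collection and field names (customers, transactions, recentTransactions) are hypothetical, not taken from the actual application.

// Subset pattern: the parent document keeps only the 5 most recent
// transactions; the full history lives in its own collection.
db.customers.updateOne(
  { _id: 1 },
  {
    $push: {
      recentTransactions: {
        $each: [{ txnId: 987, amount: 450, ts: new Date() }],
        $slice: -5                  // retain only the last 5 entries
      }
    }
  }
)

// The same transaction also goes into the full transactions collection.
db.transactions.insertOne({ txnId: 987, customerId: 1, amount: 450, ts: new Date() })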

Then there was another application that stored sensor data from devices, one reading every minute, in a MongoDB collection. Each minute's reading was stored as a separate document. What happened was that the index data grew larger than the actual data, and there was also a need to aggregate the readings over certain intervals of time. The read performance was impacted heavily. In this case, we followed a pattern called the bucket pattern, where we stored a set of readings as a single document. Each set is a bucket, and the bucket size was derived from the need to aggregate the data: if the aggregation is required every 10 minutes, then each bucket will store 10 minutes of data.
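A minimal sketch of such a bucketed write in mongosh; the sensorReadings collection and its fields are hypothetical, assuming a 10-minute bucket keyed by device and bucket start time.

// Bucket pattern: upsert one document per device per 10-minute
// window and push each reading into its bucket.
db.sensorReadings.updateOne(
  { deviceId: "dev-42", bucketStart: ISODate("2023-05-01T10:00:00Z") },
  {
    $push: { readings: { ts: ISODate("2023-05-01T10:03:00Z"), temp: 27.4 } },
    $inc: { count: 1, tempSum: 27.4 }   // running totals make interval aggregation cheap
  },
  { upsert: true }
)

With this shape, the index on { deviceId, bucketStart } holds one entry per 10-minute bucket instead of one entry per reading, which is exactly what keeps the index from outgrowing the data.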

There are other patterns available too, such as the attribute, computed and approximation patterns. The application use cases and the data access patterns will determine the selection of a particular pattern. So when it comes to NoSQL data modelling, it becomes extremely important to understand the application requirements and how the application will ingest and retrieve the data. If the data model is not designed correctly, it will have a significant negative impact when the application goes live and hits production.
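As a quick taste of one of these, here is a minimal sketch of the computed pattern with hypothetical movies and screenings collections: a running total is maintained at write time so that reads do not have to aggregate.

// Computed pattern: increment precomputed totals on every write.
db.screenings.insertOne({ movieId: 7, viewers: 320 })
db.movies.updateOne(
  { _id: 7 },
  { $inc: { totalViewers: 320, screeningCount: 1 } }
)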
