Data modelling techniques in MongoDB
We grew up learning about normal forms: 1st, 2nd and 3rd normal form. Storage was not cheap in those days, and hence the objective was to save storage. But as we grew up, the world around us also changed. Applications transformed from an N-tier architecture to a distributed architecture. Data volume increased exponentially, and so did the data velocity. On top of that, the structure of the data is no longer fixed and predictable; it is polymorphic and does not follow a strict schema. This change saw the birth of NoSQL databases. They took advantage of cheaper storage and became bold enough to trade off non-redundancy of data for performance and scalability. They are a completely different breed that does not enforce referential integrity. Joins are sacrificed in order to scale horizontally. The long-known ACID properties of relational databases are no longer the goal for such databases; instead they follow what we now call BASE (Basically Available, Soft state, Eventually consistent).
In such a scenario, the data modelling approach has also changed radically. We need to unlearn the RDBMS concepts and approach the problem from an application perspective. The data model is driven by how the application looks at the data, that is, by the data ingestion and retrieval patterns of the application. The approach will also differ slightly based on which NoSQL database you are working with. In this article, I will be talking about the data modelling techniques used in MongoDB.
In MongoDB, the data modelling exercise primarily comprises modelling the collections and identifying the sharding and indexing strategy. Collections are related using either the embedding or the linking technique. Embedding is more appropriate when the child document does not have an independent existence and is always retrieved with the parent document; it is also a better choice when we are looking for good read performance. On the other hand, linking is a better choice when we are looking for better write performance or when we know that the child document is going to grow exponentially. Remember, MongoDB has a restriction of 16 MB on the document size.
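To make the two options concrete, here is a minimal sketch in Python with PyMongo. It assumes a local mongod, and the collection and field names (customers, addresses, customer_id) are purely illustrative, not taken from any particular application.

from pymongo import MongoClient

# Assumes a local mongod; database and field names are illustrative.
db = MongoClient("mongodb://localhost:27017")["shop"]

# Embedding: the address lives inside the customer document and is
# always retrieved together with it -- one read, no join needed.
db.customers.insert_one({
    "name": "Asha",
    "address": {"city": "Pune", "pin": "411001"}
})

# Linking: the address is a separate document that references its
# parent -- writes stay small and the child can grow independently.
ravi_id = db.customers.insert_one({"name": "Ravi"}).inserted_id
db.addresses.insert_one({"customer_id": ravi_id, "city": "Mumbai", "pin": "400001"})

# With linking, reads need a second query (or a $lookup aggregation).
ravi = db.customers.find_one({"_id": ravi_id})
ravi_address = db.addresses.find_one({"customer_id": ravi["_id"]})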
However, as the old adage goes, life is not a bed of roses. Real-world scenarios are rarely straightforward enough to choose one technique over the other, and sometimes we need to go with a hybrid approach. There are certain patterns that we can look for while designing the data model. Let me elaborate on these patterns through some examples.
Consider a loyalty application built on MongoDB. The application stored the customer attributes and each transaction as an array of sub-documents inside a single document. The problem started when the transactions became more frequent and the embedded array started growing rapidly; write performance began to suffer. Remember, the WiredTiger storage engine follows MVCC, which means each update rewrites the entire document. What we recommended was to decouple the transactions from the parent document and move them into their own collection. We kept only the last 5 transactions in the parent document, as customers most often wanted to see their last 5 transactions. This particular pattern is called the subset pattern.
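Below is a sketch of how such a write could look in Python with PyMongo, under the same assumptions as before; the field names (recent_transactions, customer_id) are my own, not from the original application. The full history goes to its own collection, while the parent keeps only the five most recent entries.

from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["loyalty"]

def record_transaction(customer_id, txn):
    # Full history lives in its own collection, so the parent
    # document stays small no matter how many transactions arrive.
    txn["customer_id"] = customer_id
    db.transactions.insert_one(txn)

    # The customer document keeps only the 5 most recent transactions:
    # $position 0 pushes to the front, $slice 5 trims the array.
    db.customers.update_one(
        {"_id": customer_id},
        {"$push": {"recent_transactions": {
            "$each": [txn], "$position": 0, "$slice": 5}}}
    )

record_transaction(101, {"amount": 250, "ts": datetime.now(timezone.utc)})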
Then there was another application which stored per-minute sensor readings from devices in a MongoDB collection. The readings for each minute were stored as a separate document. What happened was that the index data grew larger than the actual data, and there was also a need to aggregate the readings over certain intervals of time, so read performance was impacted heavily. In this case, we followed the bucket pattern, where a set of readings is stored as a single document. Each set is a bucket, and the bucket size was derived from the aggregation requirement: if the aggregation is required every 10 minutes, then each bucket stores 10 minutes of data.
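A minimal sketch of that bucketing logic, again in Python with PyMongo and with illustrative names (sensor_id, bucket_start, measurements), assuming 10-minute buckets:

from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["telemetry"]

def store_reading(sensor_id, value, ts):
    # All readings in the same 10-minute window share one bucket
    # document, so the index holds one entry per bucket instead of
    # one entry per reading.
    bucket_start = ts.replace(minute=(ts.minute // 10) * 10,
                              second=0, microsecond=0)
    db.readings.update_one(
        {"sensor_id": sensor_id, "bucket_start": bucket_start},
        {"$push": {"measurements": {"ts": ts, "value": value}},
         "$inc": {"count": 1}},
        upsert=True
    )

store_reading("sensor-42", 21.7, datetime.now(timezone.utc))

The upsert creates the bucket the first time a reading arrives in a window, and every later reading in that window is appended to the same document, which keeps the index an order of magnitude smaller and makes interval aggregations read far fewer documents.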
There are other patterns available as well, such as the attribute, computed and approximation patterns. The application's use cases and its data attributes determine which pattern to select. So when it comes to NoSQL data modelling, it becomes extremely important to understand the application requirements and how the application will ingest and retrieve the data. If the data model is not designed correctly, it will have a significant negative impact once the application goes live in production.