There is an ideal way to design databases in relational databases — 3rd normal form. In MongoDB
, it’s important to keep data in way that’s conducive to the application using the data. You think about
In contrast, in relational DBMS — the data is organized in such a way that is agnostic to the application.
MongoDB
supports rich documents. We can store an array of items, a value for a certain key can be an entire other document. This is going to allow us to pre-join/embed data for fast access. And that’s important because **MongoDB**
doesn’t support joins directly inside the kernel. Instead if we need to join, we need to join in the application itself. The reason being joins are very hard to scale. This forces us to think ahead of time about what data you want to use together with other data. We might wish to embed the data directly within the document. There’re no constraints. In case of MongoDB
, it’s as important as we think because of embedding. MongoDB
considers atomicity in a way. Also it doesn’t support transactions, however atomic operations are supported within one document. The data needs to be designed in such a way that it supports atomic operations.
There’s no declared schema but there’s a good chance that an application is going to have a schema. By having a schema, we mean that every single document in a particular collection is probably going to have a pretty similar structure. There might be small changes to that structure depending on the different versions of your application. Even though it’s not declared ahead of time, it’s important to think about the data structure so that the data schema itself supports all the different features of your application
Let’s look at the below denormalized table for a blog posts project. It’s not the 3rd normal form, it’s broken. Let’s say there are multiple posts with same author, we may update a few rows and leave others un-updated. Leaving the table data inconsistent.
Hence this violates normalization because it violates a common way to describing normalized tables in 3rd normal form, which is that every non-key attribute in the table must provide a fact about the key, the whole key and nothing but the key. And that’s of a play on words for what you say in a US courtroom, telling the truth, the whole truth and nothing but the truth. The key in this case, is the Post Id
and there is a non-key attribute Author Email
which does not follow that. Because it does, in fact tell something about the author. And so it violates that 3rd normal form.
The above table can be represented in MongoDB
as:
{id: 'some id',title: 'some title',body: 'some content here',author: {name: 'author name',email: 'author email id'}}
MongoDB
, it looks like embedding data would mostly cause this. And in fact, we should try to avoid embedding data in documents in MongoDB
which possibly create these anomalies. Occasionally, we might need to duplicate data in the documents for performance reasons. However that’s not the default approach. The default is to avoid it.MongoDB
is flexible enough because it allows addition of keys without re-designing all the documentsMongoDB
. And one of the ideas behind the MongoDB
is to tune up your database to the applications that we’re trying to write and the problem we’re trying to solve.One of the great things about relational database is that it is really good at keeping the data consistent within the database. One of the ways it does that is by using foreign keys. A foreign key constraint is that let’s say there’s a table with some column which will have a foreign key column with values from another table’s column. In MongoDB
, there’s no guarantee that foreign keys will be preserved. It’s upto the programmer to make sure that the data is consistent in that manner. This maybe possible in future versions of MongoDB
but today, there’s no such option. The alternative for foreign key constraints is embedding data.
Transactions support ACID properties but although there are no transactions in MongoDB
, we do have atomic operations. Well, atomic operations means that when you work on a single document that that work will be completed before anyone else sees the document. They’ll see all the changes we made or none of them. And using atomic operations, you can often accomplish the same thing we would have accomplished using transactions in a relational database. And the reason is that, in a relational database, we need to make changes across multiple tables. Usually tables that need to be joined and so we want to do that all at once. And to do it, since there are multiple tables, we’ll have to begin a transaction and do all those updates and then end the transaction. But with MongoDB
, we’re going to embed the data, since we’re going to pre-join it in documents and they’re these rich documents that have hierarchy. We can often accomplish the same thing. For instance, in the blog example, if we wanted to make sure that we updated a blog post atomically, we can do that because we can update the entire blog post at once. Where as if it were a bunch of relational tables, we’d probably have to open a transaction so that we can update the post collection and comments collection.
So what are our approaches that we can take in MongoDB
to overcome a lack of transactions?
Update
, findAndModify
, $addToSet
(within an update) & $push
(within an update) operations operate atomically within a single document.
1 to 1 relations are relations where each item corresponds to exactly one other item. e.g.:
//employee{_id : '25',name: 'john doe',resume: 30}
//resume{_id : '30',jobs: [....],education: [...],employee: 25}
We can model the employee-resume relation by having a collection of employees and a collection of resumes and having the employee point to the resume through linking, where we have an ID
that corresponds to an ID
in th resume collection. Or if we prefer, we can link in another direction, where we have an employee key inside the resume collection, and it may point to the employee itself. Or if we want, we can embed. So we could take this entire resume document and we could embed it right inside the employee collection or vice versa.
This embedding depends upon how the data is being accessed by the application and how frequently the data is being accessed. We need to consider:
MongoDB
, there’re atomic operations on individual documents. So if we knew that we couldn’t withstand any inconsistency and that we wanted to be able to update the entire employee plus the resume all the time, we may decide to put them into the same document and embed them one way or the other so that we can update it all at once.In this relationship, there is many, many entities or many entities that map to the one entity. e.g.:
Let’s assume the below data model:
//city{_id: 1,name: 'NYC',area: 30,people: [{_id: 1,name: 'name',gender: 'gender'.....},....8 million people data inside this array....]}
This won’t work because that’s going to be REALLY HUGE. Let’s try to flip the head.
//people{_id: 1,name: 'John Doe',gender: gender,city: {_id: 1,name: 'NYC',area: '30'.....}}
Now the problem with this design is that if there are obviously multiple people living in NYC, so we’ve done a lot of duplication for city data.
Probably, the best way to model this data is to use true linking.
//people{_id: 1,name: 'John Doe',gender: gender,city: 'NYC'}
//city{_id: 'NYC',...}
In this case, people
collection can be linked to the city
collection. Knowing we don’t have foreign key constraints, we’ve to be consistent about it. So, this is a one to many relation. It requires 2 collections. For small one to few (which is also one to many), relations like blog post to comments. Comments can be embedded inside post documents as an array.
So, if it’s truly one to many, 2 collections works best with linking. But for one to few, one single collection is generally enough.
e.g.:
The books to authors is a few to few relationship, so we can have either an array of books or authors inside another’s document. Same goes for students to teachers. We could also embed at the risk of duplication. However this will required that each student has a teacher in the system before insertion and vice versa. The application logic may always not allow it. In other words, the parent object must exist for the child object to exist.
Multikey indexes is the feature because of which linking and embedding works so well. Assume that we’ve 2 schemas for student and teachers.
//students{_id: 1,name: 'John Doe',teachers: [1,7,10,23]}
//teachers{_id: '10',name:'Tony Stark'}
Now there are 2 obvious queries.
set
operators. And order for that we need to be efficient, we need use Multikey indexes.To create an index on teachers
column on students
collection, use db.students.ensureIndex({ teachers : 1 })
.
Now to find all the students whom have had a particular teacher? use the query db.students.find( { 'teachers': {$all: [0,1]}} )
. Now, if we append .explain()
to the above query - it shows the internal working by showing where the keys were applied etc.
The main reason for embedding documents in MongoDB
is performance. The main performance benefit comes from improved read performance. Now, why do we get read performance. The reason is the nature of the way computer systems are built, which is they often have spinning disks and those spinning disks have a very high latency, which means they take a very long time (upto 1ms) to get to the first byte. But, when the first byte is access, each additional byte comes very quickly. So they tend to be pretty high bandwidth. So the idea is if we can co-locate the data to be used together in the same document, embed it and then we’re going to spin the disk, find the sector where we need this information and then we’re gonna start reading it. And we’re going to get all the information we need in one go. Also it means if we have 2 pieces of data that would normally be in 2 collections or in several relational database tables. Instead, they’re in one document, that we avoid round trips to the database.
One classic problem from the world of schema design is how do you represent a tree inside the database? Let’s look at the example problem for representing the e-commerce categories in a e-commerce site, such as Amazon. Where we have home, outdoors, winter, snow. And the idea is that we’ve the products
. Also we’ve a category
, where we can lookup category 7 and see the categoryName
, some of the properties for the category
.
//products{category: 7,productName: 'ABC'}
//category{_id: '7',categoryName:'outdoors',parent: 6}
One way to do it is to keep a parent
id - this might be something we can do in a simple relational database. But this doesn’t make it easy to find all the parent
s to this category
. We’ve to iteratively query, find the parent of this and the parent of that until we get all the way to the top.
So an alternative way to do it in MongoDB
is to be able to list ancestors or children. So, let’s think about that and how that would work. So, we could decide to list all the children of this category
:
//category{_id: '7',categoryName:'outdoors',children: [3,6,7,9]}
That’s also fairly limiting, if we’re looking to be able to look and find the entire sub-tree that is above a certain piece of tree. Instead, what works pretty well and again, enable by the ability to put arrays inside MongoDB
is to list the ancestors from the top in order.
//category{_id: '7',categoryName:'outdoors',ancestors: [3,7,5,8,9]}
Again the ability to structure and express rich data is one of the things that makes MongoDB
so interesting. This would be very difficult to do in a relational database. Now, in terms of how you represent the data for something like a product
category
hierarchy, again it all depends upon the access patterns. It depends on how we believe we’re going to need to show the data, access the data for the user. And then based on that, we know how to model it.
One of the reasons for normalization in relational database is to avoid modification anomalies that come with the duplication of data. And when we look at MongoDB
and how it’s structured, allowing these rich documents, it’s easy to assume that what we’re doing is we’re denormalizing the data. And to certain extent, that’s true. As long as we don’t duplicate data, we don’t open ourselves for modification anomalies.
Generally it’s good to embed that data in case of one to one relationships. In case of one to many relationships, embedding can also work well without duplication of data as long as we embed from many to the one. Now, if we go from the one to the many, then linking would avoid the duplication of data. Now, if we want to embed something, even if it causes duplication of data for performance reasons to match the access patterns of the applications. That could make sense, especially if the data is rarely changing or being updated. But we can avoid it often enough, even in this relationship, if we go from the many to one. In many to many relation, which we looked at with students
and teachers
and authors
and books
, there if you want to avoid the modification anomalies that come with denormalization, all we need to do is link through the arrays of object ids
in the documents.
These are all guidelines. For a real world application, we may need to embed data for performance reasons, to match the data access patterns — — it might be needed to embed the data.
Originally published at xameeramir.github.io.