In this article series I will describe some essential techniques for querying geospatial data in MongoDB, which can be useful if you want your app or API to provide access to information ordered by distance from some specific location. For example:
The techniques will allow your service to scale and remain efficient as they enable constant time access to the data (regardless of the amount of data in your database) and minimise caching required on the client side.
In the following sections I will show you how to:
You should be able to follow this tutorial even if you haven’t used MongoDB before. On the other hand, if the first 2 points sound familiar, you might want to skip straight to section 3.
We’ll be using Node.js and the official JavaScript MongoDB driver. Code snippets will be in CoffeeScript 2.
If you want to run the code examples locally, clone the accompanying repo, and follow the instructions in the README.md.
For longitude, latitude pairs we need to use the GeoJSON Point object format.
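As an illustration, such a document could look something like this ( field names other than the GeoJSON-mandated `type` and `coordinates`, and the coordinate values, are just examples ):

```coffeescript
doc =
  name: 'some place'                      # any other fields you like
  location:                               # field holding the GeoJSON Point
    type: 'Point'
    coordinates: [ -0.1275, 51.507222 ]   # [ longitude, latitude ]
```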
The doc above is what you'd insert into a mongo collection. It has a field called `location` whose value is a GeoJSON Point object. The name of the field is immaterial; we could have used any other valid key name instead of `location`. We could also nest the GeoJSON deeper into the object structure, however putting `type` and `coordinates` at the top level of `doc` wouldn't work. It's also possible to store multiple GeoJSON objects in one document, for example:
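For instance ( the field names and values here are again just examples ):

```coffeescript
doc =
  name: 'some route'
  start_location:
    type: 'Point'
    coordinates: [ -0.1275, 51.507222 ]
  end_location:
    type: 'Point'
    coordinates: [ 2.3522, 48.8566 ]
```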
Note that:
Now that you know the basics, let's generate some data to work with.
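Here's a sketch of how such documents could be generated and inserted ( the 15° spacing is an assumption, and `collection` comes from boilerplate omitted here, as explained below ):

```coffeescript
# 6 points along the 0 meridian, spaced 15 degrees apart ( spacing assumed )
make_point_doc = (latitude) ->
  location:
    type: 'Point'
    coordinates: [ 0, latitude ]

point_docs = (make_point_doc latitude for latitude in [ 0..75 ] by 15)

insert_test_points = (collection) ->
  await collection.insertMany point_docs
```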
We’ve created 6 documents, with locations starting at the equator, increasing the latitude in equal intervals, while keeping longitude fixed at 0. Here are the points plotted on a sphere:
See code used to generate this on JSFiddle.
Note that in the following code examples we will omit the boilerplate required to obtain a MongoDB collection object and insert documents into it. In the accompanying repo, this boilerplate is contained in the [with_collection](https://github.com/adrian-gierakowski/paging_geospatial_data_code/tree/f7014c4418c185a06dc3a83d2da0c72ce158e193/src/helpers/with_collection.coffee) and [with_collection_with_points](https://github.com/adrian-gierakowski/paging_geospatial_data_code/tree/f7014c4418c185a06dc3a83d2da0c72ce158e193/src/helpers/with_collection_with_points) helper functions.
To query documents based on their distance from a specific point we are going to use the [$near](https://docs.mongodb.com/manual/reference/operator/query/near/index.html) query operator. But before we do this, we need to create a 2dsphere index on the field containing the GeoJSON Point objects.
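A minimal sketch of creating such an index ( built in the background; see the repo for the actual code ):

```coffeescript
# create a 2dsphere index on the `location` field, without blocking other operations
create_location_index = (collection) ->
  await collection.createIndex { location: '2dsphere' }, { background: true }
```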
runnable example on github
Setting the `background` option is very important when creating an index on a live database, since by default `createIndex` will block all other operations on the database while the index is being created ( which might take a while if the collection is big ).
Now the basic query, which returns all documents sorted by distance ( great-circle distance to be precise ) from point `[ 0, 0 ]`, based on the data in the `location` field of the documents:
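A sketch of such a query ( `collection` comes from the omitted boilerplate, and the function name is illustrative ):

```coffeescript
find_all_sorted_by_distance = (collection, query_point) ->
  query =
    location:
      $near:
        $geometry:
          type: 'Point'
          coordinates: query_point

  # $near returns results ordered from nearest to furthest
  await collection.find(query).toArray()

# e.g. find_all_sorted_by_distance collection, [ 0, 0 ]
```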
runnable example on github
Unless your intention is to process all documents in the collection ( in which case you'll probably call [.stream](http://mongodb.github.io/node-mongodb-native/3.1/api/Cursor.html#stream) instead of [.toArray](http://mongodb.github.io/node-mongodb-native/3.1/api/Cursor.html#toArray) ), you'll want to limit the number of returned documents. Here's how it's done:
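Roughly along these lines ( again, the function name is illustrative ):

```coffeescript
find_closest = (collection, query_point, limit) ->
  query =
    location:
      $near:
        $geometry:
          type: 'Point'
          coordinates: query_point

  await collection.find(query).limit(limit).toArray()

# e.g. the 2 documents closest to [ 0, 0 ]:
# find_closest collection, [ 0, 0 ], 2
```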
runnable example on github
On the illustration below, the white point marks the location used in the above query ( `[ 0, 0 ]` ) and the points circled in green represent the results of the query with `limit` set to `2`.
see code used to generate this on JSFiddle
Note: the technique discussed in this section has been previously described by A. Jesse Jiryu Davis, who implemented the MongoDB feature which makes this technique possible. His article goes into detail about why this method is performant, so it's worth a read; however its code examples are in Python, so we’ll go through it step by step here for the benefit of the Node.js community.
Building on the query we defined in the previous section, the simplest way to implement paging would be to use `limit` to control the page/batch size, and the [skip](https://docs.mongodb.com/manual/reference/method/cursor.skip/index.html) method to set the desired page offset. For example:
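A sketch of skip-based paging ( names are illustrative ):

```coffeescript
fetch_page_with_skip = (collection, query_point, page_size, page_index) ->
  query =
    location:
      $near:
        $geometry:
          type: 'Point'
          coordinates: query_point

  await collection
    .find(query)
    .skip(page_size * page_index)
    .limit(page_size)
    .toArray()

# 2nd page of 2 documents: fetch_page_with_skip collection, [ 0, 0 ], 2, 1
```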
runnable example on github
The above query will return the 2nd page of results when querying our test data, as illustrated below.
see code used to generate this on JSFiddle
However, the performance of `skip` decreases linearly as the offset increases ( as demonstrated in the article mentioned above ), since the MongoDB server needs to scan through all the query results from the beginning until the offset is reached.
A constant time alternative involves using the [$minDistance](https://docs.mongodb.com/manual/reference/operator/query/minDistance/index.html) query operator to exclude results which lie within a given radius of the query point.
Assuming that we know the distance between the query point and the furthest document from a given page of results, we could query for the next page as follows:
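As a sketch, with `last_distance` being the distance ( in metres ) of the furthest document on the previous page:

```coffeescript
fetch_next_page = (collection, query_point, page_size, last_distance) ->
  query =
    location:
      $near:
        $geometry:
          type: 'Point'
          coordinates: query_point
        $minDistance: last_distance   # exclude everything closer than this

  await collection.find(query).limit(page_size).toArray()
```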
But where do we get the distance from? We could try to calculate it using a hand-rolled implementation of a formula for the distance between two points on a sphere, or use something like [getDistance](https://github.com/manuelbieh/Geolib/blob/8273a52d86f7dfbd3b0e2aa2b7473ef5149c5374/src/geolib.js#L237) from Geolib. However, we'd have to make sure that our chosen implementation matches the one used by MongoDB. Fortunately we don't have to go through all that hassle, since we can ask MongoDB to attach a dynamically calculated distance to each document in the query results. We just have to convert our `find` query into an equivalent aggregation pipeline, using the `$geoNear` stage with its `distanceField` option.
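Here's a sketch of the query converted to an aggregation, wrapped in a `fetch_page` function ( the exact shape of the code in the repo may differ; the distance is written to a `calculated_distance` field ):

```coffeescript
fetch_page = (collection, query_point, page_size, min_distance = 0) ->
  geo_near =
    near:
      type: 'Point'
      coordinates: query_point
    distanceField: 'calculated_distance'  # distance ( in metres ) added to each result
    minDistance: min_distance
    spherical: true

  pipeline = [
    { $geoNear: geo_near }   # must be the first stage of the pipeline
    { $limit: page_size }
  ]

  await collection.aggregate(pipeline).toArray()
```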
Now we can use the `calculated_distance` property of the last document from the query results ( since they are sorted by distance, in ascending order ) to fetch the next page. Let's use the above function to fetch the first two pages.
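Something along these lines ( the `query_point` and `page_size` values are just examples ):

```coffeescript
fetch_first_two_pages = (collection) ->
  page_1 = await fetch_page collection, [ 0, 0 ], 2

  # distance of the furthest document on page 1 ( results are sorted ascending )
  last_distance = page_1[page_1.length - 1].calculated_distance

  page_2 = await fetch_page collection, [ 0, 0 ], 2, last_distance
  [ page_1, page_2 ]
```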
runnable example on github
And here is a visualisation of the results of the second call to `fetch_page`, with the yellow circle enclosing the area excluded from the query by using `minDistance`.
see code used to generate this on JSFiddle
However, this is not exactly what we want: the last document of the previous page is included in the next page, since `minDistance` only excludes documents whose distance is smaller than the given value. To prevent this, we need to add the following query to our `$geoNear` aggregation stage.
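As a sketch, `fetch_page` can accept the ids to exclude and pass them to `$geoNear` via its `query` option ( the extra parameter is an assumption about the final signature ):

```coffeescript
fetch_page = (collection, query_point, page_size, min_distance = 0, exclude_ids = []) ->
  geo_near =
    near:
      type: 'Point'
      coordinates: query_point
    distanceField: 'calculated_distance'
    minDistance: min_distance
    spherical: true
    # additional filter applied to the documents considered by $geoNear
    query:
      _id:
        $nin: exclude_ids

  pipeline = [
    { $geoNear: geo_near }
    { $limit: page_size }
  ]

  await collection.aggregate(pipeline).toArray()
```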
runnable example on github
The query uses the [$nin](https://docs.mongodb.com/manual/reference/operator/query/nin/) operator to skip documents with the given `_id`s when gathering the results. Are we done? Nearly! Consider the following set of points.
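For example, a set along these lines ( the exact coordinates used in the original illustration may differ ):

```coffeescript
points = [
  [ 0, 0 ]
  [ 0, 15 ]
  [ 0, -15 ]
  [ 0, 30 ]
  [ 0, 45 ]
  [ 0, 60 ]
]
```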
Notice that the 2nd and 3rd points lie at exactly the same distance from point `[ 0, 0 ]`. Now, if we set the `query_point` to `[ 0, 0 ]` and `page_size` to `3`, what would the result of fetching the second page be? Assuming that the points returned by the first query were ordered as in the array above ( `[ 0, -15 ]` last ), here is how it would look.
see code used to generate this on JSFiddle
This is because neither `minDistance` nor `$nin` excludes the document with coordinates `[ 0, 15 ]`. Therefore, instead of just using the `_id` of the last document, we need to collect the `_id`s of all documents whose distance is equal to the distance of `last_doc`. Putting it all together:
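A sketch of the two helpers ( see the repo for the actual implementation ):

```coffeescript
# distance of the furthest ( i.e. last ) document of a page
get_last_distance = (docs) ->
  docs[docs.length - 1].calculated_distance

# _ids of all documents which lie as far from the query point as the last one
get_ids_to_exclude = (docs) ->
  last_distance = get_last_distance docs
  (doc._id for doc in docs when doc.calculated_distance is last_distance)
```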
Given `current_page`, here's how you would fetch the next one:
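Roughly along these lines, using the helpers and `fetch_page` sketched above:

```coffeescript
fetch_page_after = (collection, query_point, page_size, current_page) ->
  last_distance = get_last_distance current_page
  exclude_ids = get_ids_to_exclude current_page

  await fetch_page collection, query_point, page_size, last_distance, exclude_ids
```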
runnable example on github
Note that the logic in `get_last_distance` and `get_ids_to_exclude` will most likely be executed on the client, hence it was not included in the `fetch_page` function, which expects precomputed `exclude_ids` and `last_distance` to be passed in. This is to minimise the amount of data sent over the wire. Alternatively the server could include an HTTP Link header with all the information necessary to fetch the next page, in which case all of the above code would be executed on the server.
Finally, we need to handle a case where there are so many documents with the same distance that the last document in a page ends up having the same distance as the `last_distance` used to fetch that page. For example, when fetching the 3rd page from the following set of points ( with `query_point = [ 0, 0 ]` and `page_size = 2` ):
Using the above implementation we would get points `[ [ 0, -15 ], [ 0, 30 ] ]` instead of the desired `[ [ 0, 30 ], [ 0, 45 ] ]`, since the `_id` of the last point from page 1 ( `[ 0, -15 ]` ) is not excluded, even though its distance is the same as `minDistance`. Therefore in such cases we need to carry `exclude_ids` over from the previous query to the next.
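One way to sketch this: if the new page's boundary distance hasn't moved, merge the previous `exclude_ids` into the new ones.

```coffeescript
# _ids to exclude when fetching the page after `page`, carrying over the
# previous ids if the boundary distance is unchanged
get_next_exclude_ids = (page, previous_last_distance, previous_exclude_ids) ->
  ids = get_ids_to_exclude page
  if get_last_distance(page) is previous_last_distance
    previous_exclude_ids.concat ids
  else
    ids
```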
runnable example on github
It is worth noting that in the extreme case of all documents having the same distance, the size of the `exclude_ids` array, and hence the amount of data sent over the wire on each request, would increase linearly as we progress through the pages. On top of this, query performance itself would degrade in a similar fashion ( and my bet is that it would be worse than the naive implementation relying solely on `skip` ). To mitigate this, instead of accumulating `exclude_ids`, we could keep a count of documents to skip ( and use it in conjunction with `minDistance` ). This would keep the size of the request constant, and the query performance within the bounds of the simple `skip` solution. However, keeping track of `_id`s to exclude has some advantages over using `skip` in cases where the queried data is highly dynamic ( when changes are expected to happen in between fetches ), which we might discuss in future articles.
Finally, I’d like to draw your attention to the fact that the above method doesn’t facilitate jumping directly to a specific page ( without fetching all pages in between ). However, this can be achieved by adding an appropriate `skip` ( equal to `page_size * pages_to_skip_count` ) to the query. Using `skip` will obviously incur a performance penalty proportional to the number of documents skipped, but there is really no way around it ( unless, instead of specifying the number of pages/documents to skip, your use case could do with using `minDistance` to skip an unknown number of documents ). Fortunately we only need to pay this price once per page jump, since once the desired page is fetched using `skip`, the next one can be queried for using only `minDistance` and `exclude_ids`.
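As a sketch, such a jump could be implemented by slotting a `$skip` stage between `$geoNear` and `$limit` ( note that on older MongoDB versions `$geoNear` caps its result set by default, so its own limit may need raising for deep jumps ):

```coffeescript
fetch_page_at_offset = (collection, query_point, page_size, pages_to_skip_count) ->
  geo_near =
    near:
      type: 'Point'
      coordinates: query_point
    distanceField: 'calculated_distance'
    spherical: true

  pipeline = [
    { $geoNear: geo_near }
    { $skip: page_size * pages_to_skip_count }   # jump straight to the desired page
    { $limit: page_size }
  ]

  await collection.aggregate(pipeline).toArray()
```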
We have demonstrated how to store and query geospatial data in MongoDB, and discussed how to efficiently page through large amounts of such data ( from closest location to furthest ). In Part 2 of this series, you will learn how to use a nifty trick to page through locations in reverse order.
Don’t forget to clone the accompanying repo and play with the code examples, which can easily be run from the command line.