Speeding Up Queries: Re-Imagining Your Data

NoSQL isn't just about new tools, it's also about new [and old] modeling and data access techniques. Since traditional modeling approaches are still relevant, it would be foolish to forget everything that you know about data modeling. Equally foolish would be to fail to learn, experiment and consider different techniques; especially in the face of poor performing queries.

TL;DR

Consider doing more work during inserts and updates (or background tasks), often in the form of denormalization, so that reads are as straightforward as possible.

For most applications, reads far outnumber writes. Yet most code is written so that reads are often complex while writes are dead simple. When performance is an issue, going from complex queries (involving a lot of fields, aggregation or map reduce) to simple queries can have devastatingly awesome consequences.

The Sample Scenario

We want to build a leaderboard system for groups of friends. When a user goes to his or her score page, he doesn't only see his best-score per level, but also the best score out of his friends.

lvl	your score	best score	best by
1	500	900	Leto
2	200	350	Jessica
3	1423	1445	Paul

A First Attempt

Our first approach might be to store our data like so:

db.users.find()
  {_id: 'Gurney', friends: ['Leto', 'Paul', 'Duncan', 'Jessica', 'Glossu']}
  ...
db.scores.find()
 {_id: ObjectId("..."), level: 1, score: 500, user: 'Gurney'},
 {_id: ObjectId("..."), level: 2, score: 200, user: 'Gurney'},
 {_id: ObjectId("..."), level: 1, score: 900, user: 'Leto'},
 {_id: ObjectId("..."), level: 2, score: 150, user: 'Leto'},
 {_id: ObjectId("..."), level: 1, score: 800, user: 'Jessica'},
 {_id: ObjectId("..."), level: 2, score: 350, user: 'Jessica'},
 ...

This approach fits with how we've been taught to build systems. You have well defined and distinct entities which map to real world objects.

With this model, saving scores is simple. We just need to pull our results...It turns out that, at scale, there isn't a great way to do that. The two most likely solutions, I see, are to either do some type of aggregation (GROUP BY in SQL, Map Reduce in MongoDB) or hit the database multiple times (linear to the number of levels we are showing). Neither of which work particularly well with a lot of levels, a lot of friends, or with a sharded architecture.

// The non-critical-path code is optimized...yay

// do this as an upsert, 500 is the new score
db.scores.update({
  user: 'Paul',
  level: 4,
  score: {$lt: 500}},
  {$set: {score: 500}
}, true)

// The frequently accessed code is slow and won't scale
var m = function() {
  emit(this.level, {user: this.user, score: this.score})
}

var r = function(name, scores) {
  var max = scores[0];
  for(var i = 1; i &lt; scores.length; ++i) {
    if (scores[i].score > max.score) { max = scores[i]; }
  }
  return max;
}
var friendsList = db.users.findOne({_id: 'Gurney'}, {friends:1}).friends;
db.scores.mapReduce(m, r, {query: {user: {$in: friendsList}}, out: {inline: 1}})

A Better Approach

Rather than looking at your data and wondering how to query it, consider thinking about an ideal query and figuring out how to model it. Our idea query would be:

db.scores.find({user: 'Gurney'})

This query is simple, it requires a single index on user and it can scale (we'd shard on user). In order to work, a score document would need to look like:

db.scores.find()
{_id: ObjectId("..."), level: 1, score: 500,
  user: 'Gurney', best: {user: Leto, score: 900}
},
...

In order to reach this state, whenever a user scores a new personal best, we potentially have to update anyone who has that user as a friend:

// same as before
db.scores.update({user: 'Paul', level: 4, score: {$lt: 500}},
  {$set: {score: 500}}, true)

// n returns the number of updated scores
var updated = db.runCommand( "getlasterror" ).n

// if this was a new personal best for Paul
if (updated) {
  //find everyone who has Paul as a friend
  var cursor = db.users.find({friends: 'Paul'}, {_id:1})
  var friends = cursor.map(function(doc) { return doc._id; })

  // update their best entry if Paul's new score is better than their
  // existing best
  db.scores.update({user: {$in: friends}, 'best.score': {$lt: 500}},
                   {$set: {'best.user': 'Paul', 'best.score': 500}},
                   false, true);
}

Saving a score has gotten more complicated. However, writes often aren't the bottleneck. In this case, the complexity only happens when the score is a personal best. The rest of the time, it's the same code. We could further optimize the code by having Paul know who has him as a friend, but that isn't the type of denormalization we're focusing on today.

To The Purists

We could argue with purists that, in the end, pragmatism must win. A perfectly designed system that can't handle the load isn't perfect. I go a step further though. To me, the denormalized view of a score (the one that not only includes your score, but also the best score) is the proper one. Yes, we've duplicated data, but by doing so, our design reflects what a score, in our system, is.

Conclusion

If a read query is slow, and you've done what you can with indexes, then the solution may be to push the workload to inserts, updates or a background task. If you have difficulty visualizing it, think of it first in terms of an ideal query, then the structure which would satisfy the query, and finally how to maintain that structure.

This approach can be abused. But more often than not, people are not, whatsoever, thinking about the problem from this perspective.

When a query is slow, don't only ask how can I make this faster? Ask how can I make this simpler? If another part must pay a price, at least that's an option to evaluate.