Boosting Performance and Productivity

Not too long ago, I was working on an ASP.NET project that had a traditional SQL Server 2008 R2 database tier and was suffering from performance problems.

Using DynaTrace and SQL Server Profiler, it was easy enough to see the problem. The amount of data wasn’t overwhelming: a few hundred thousand rows across 15-20 tables, all laid out in a friendly-looking, well-understood normalized model. However, because the client wanted this data presented to end users in personalized, near-real-time lists, queries were constantly running joins across multiple tables.

A lot of time was spent optimizing stored procedures and indexes, carefully examining execution plans and adjusting every parameter. Over the course of several months, some central stored procedures were rewritten multiple times by SQL Server experts, both internal and external, yet no one seemed to be able to get below a threshold of around 250-400 CPU-bound milliseconds per hit. That’s bad: at best, a single processor core could serve only a few such requests per second, which set a very low ceiling on the number of possible concurrent users.

It was clear that something had to be done differently.

The Problem With Normalization

Due to confidentiality issues, I can’t show the actual model, so instead I have constructed this example which illustrates the complexity of the problem nicely.

Imagine you have a blog with a hundred thousand posts, with hundreds more being added each day. Each post typically has around 20 comments and is tagged with around 10 tags. The tags are hierarchical in nature, so for instance, a post tagged with “Ford” would match a query for posts tagged with “Car”. There are about 2,000 such tags and the hierarchy is 5 levels deep. Each post also has a category, of which there are 50 in total. Now imagine users each have a list of tags they are interested in (from anywhere in the hierarchy) and only want to see posts with those tags. And the blog is published in 10 languages, but not all posts are translated into every language and not all tags are available in every language.

Not the easiest scenario in the world, but certainly not impossible to grasp either. If you are used to data modelling, it’s not hard to get a mental picture of what this would look like. You can imagine posts having foreign-key relationships to comments, categories, users and tags, and tags could somehow use the SQL Server HierarchyId feature. Comments, in turn, could have an FK relationship with users. And so on.

The problem with all of that is that it makes data easy to write and maintain but hard to read back. And since most websites, even the interaction-heavy ones, are subject to the 90-9-1 principle, the vast majority of users only ever read, so the site needs to be able to read data fast. That’s why caching is usually so important. But in this particular scenario, all queries are personalized and the data is too volatile for the results to be cached in the normal sense.

The NoSQL Alternative

The only proof-of-concept solutions that really worked were all based on denormalization; that is, at write time, data is baked into the structures where it will be needed at read time. This causes a lot of redundancy and makes write operations slower and harder, but the speed gains when reading can be well worth the effort.

Once you start to think in terms of performance optimization with denormalization, you realize that a lot of the things you normally factor out into separate tables as a knee-jerk reaction may really belong together. Take comments on a blog post: if there are typically only around 20 of them, and they are only really useful in the context of the post they were written on, why store them separately? If only there were a way to embed them into the post itself, yet make it possible to search across them efficiently. You might think that concepts such as “materialized” or “indexed” views would solve this problem, but in reality the prerequisites and caveats attached to these technologies make them unfeasible in many scenarios.

This problem is one that distributed, document-oriented NoSQL databases like Couchbase and MongoDB solve. They offer a high degree of flexibility, extreme performance and horizontal scalability, and they power many of the largest modern social sites and applications. They do this by removing normally crucial concepts such as ACID transactions and referential integrity from the mix and allowing each individual piece of data to have its own schema (often referred to as “schemaless”), yet share whatever bits it has in common with other data.

Using document-oriented databases is not exactly a new idea. NoSQL technologies have been around for many years, perhaps most prominently in the form of the once highly successful IBM Lotus Notes, which to this day holds a special place in my heart for being so incredibly reliable and flexible. But in the last few years, giant leaps forward in terms of accessibility and support have been taken by vendors such as Amazon, MongoDB Inc. (formerly 10gen) and Couchbase Inc. In particular, MongoDB has grown into a formidable and competitive product with an absolutely amazing support team behind it.

Imagine each post in the above example being stored as JSON (well, BSON to be precise) in the database using the following (simplified) structure:

{
    title: "My Mustang 1967",
    slug: "/posts/my-mustang-1967",
    body: "This post is about the classic muscle car Ford Mustang 1967 ...",
    tags: [ "/cars/ford", "/cars/misc/classic", "/special/outstanding" ],
    comments: [
        {
            datetime: "2013-09-15T12:05:12.361Z",
            text: "That's a real classic!",
            username: "Claus Topholt",
            user_id: "507f1f77bcf86cd799439011"
        },
        {
            datetime: "2013-10-10T06:01:17.171Z",
            text: "Agreed, what a wonderful looking car.",
            username: "John Smith",
            user_id: "507f1f77bcf86cd799439012"
        }
    ]
}

The tags are embedded into the post as an array, using a path-like structure. The comments are also embedded but as an array of objects, each holding fields such as time, text and information about the user who wrote the comment. If the comment had been edited or moderated, that could be embedded as well. And so on.
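To make the write side concrete, here is a minimal sketch in mongo shell syntax (modern shell API; the field values are made up for illustration) of what adding a comment to the post above looks like. A single atomic update on the post document replaces what would otherwise be an INSERT into a separate comments table:

// Append a new comment to the embedded array on the post document.
db.posts.updateOne(
    { slug: "/posts/my-mustang-1967" },
    { $push: {
        comments: {
            datetime: new Date().toISOString(),
            text: "Nice write-up!",
            username: "Jane Doe",
            user_id: "507f1f77bcf86cd799439013"
        }
    } }
)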

Using MongoDB for instance, it’s possible to index even nested fields, and so an index could be put on the datetime and user_id fields inside the array of comments, making it fast and easy to query for the latest comments per user across all posts. It’s also possible to run very fast, index-backed regex queries against the tags array, as long as the expression is anchored to the start of the string. Here’s an example of how to find all posts tagged with “cars”, including child tags like “ford” (note the escaped leading slash, since the tag paths start with “/”):

    /^\/cars/
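As a slightly fuller sketch (again in modern mongo shell syntax, with the 10-result limit chosen arbitrarily), the indexes and queries described above might look like this:

// Multikey index on the embedded tags array; anchored regexes can use it.
db.posts.createIndex({ tags: 1 })

// Compound index on fields nested inside the comments array.
db.posts.createIndex({ "comments.user_id": 1, "comments.datetime": -1 })

// All posts tagged "cars" or any of its child tags, such as "/cars/ford".
db.posts.find({ tags: /^\/cars/ })

// The ten latest comments by one user across all posts.
db.posts.aggregate([
    { $match: { "comments.user_id": "507f1f77bcf86cd799439011" } },
    { $unwind: "$comments" },
    { $match: { "comments.user_id": "507f1f77bcf86cd799439011" } },
    { $sort: { "comments.datetime": -1 } },
    { $limit: 10 }
])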

These are just simple examples. In the real world, things are typically a lot more complicated, yet this does give us a glimpse of the power of a database engine such as MongoDB. But looking at the above post structure in JSON, you might say that the lack of referential integrity looks like a serious problem. For instance, what happens when a user changes their username?

The answer to that is: yes, it can be a problem. Without referential integrity and joins, you have to handle cascading updates yourself by going into all the relevant places and replacing values. But here’s where it gets a bit interesting: some applications and websites, particularly those that deliver personalized content like in the example above, have very few use-cases where those things matter. Looking at statistics, for instance, we were surprised to find how few times any user had changed their username in the previous version of the site: fewer than a handful. And when walking through all of our T-SQL, we found only 4 places out of maybe 200 stored procedures where ACID transactions were necessary, all of which could be worked around once everything was re-modelled in MongoDB.
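For illustration, a hand-rolled “cascade” for a username change might look like the following sketch. It uses arrayFilters, which assumes MongoDB 3.6 or later, and a hypothetical users collection; the user id and new name are made up:

// Update the user's own document first.
db.users.updateOne(
    { _id: "507f1f77bcf86cd799439011" },
    { $set: { username: "Claus T." } }
)

// Then manually cascade the change into every embedded comment.
// arrayFilters restricts the $set to the matching array elements only.
db.posts.updateMany(
    { "comments.user_id": "507f1f77bcf86cd799439011" },
    { $set: { "comments.$[c].username": "Claus T." } },
    { arrayFilters: [ { "c.user_id": "507f1f77bcf86cd799439011" } ] }
)

Without transactions, code reading comments has to tolerate a brief window where the two writes disagree, which is exactly the kind of defensive coding discussed below.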

I’m not saying that the solution is to just remove features such as “change username”, but rather that doing cascading updates or handling “transactions” in code might not be as big a deal as you initially think. Sure, it will be slower to process, and the code has to be written more defensively than in a relational model, but if it only happens rarely, the impact might be very low and well worth it.

Productivity and Performance Benefits

Using MongoDB and the official C# driver, I was able to implement a proof-of-concept over the course of a few days which imported all of the client’s existing data and stored everything in structures such as the one above, with appropriate indexes. As part of the solution, I designed a “queryable” layer which took advantage of MongoDB’s ability to do fast regex matching on indexes in order to solve the entire post/tag/hierarchy/comment/language query problem.
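I can’t share the actual implementation, but the core idea can be sketched in mongo shell terms: turn each user’s subscribed tag paths into anchored regexes and match them with $in. The language field and the subscription list below are hypothetical, added purely for illustration:

// Hypothetical user profile: subscribed tag paths and preferred language.
var subscriptions = [ "/cars", "/special/outstanding" ];
var language = "en";

// Anchor each path so child tags ("/cars/ford", ...) match as well.
var patterns = subscriptions.map(function (path) {
    return new RegExp("^" + path);
});

// One index-friendly query answers the personalized list for this user.
db.posts.find({ language: language, tags: { $in: patterns } })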

The results were very convincing: Queries now consistently executed in 10-15ms instead of 250-400ms.

Perhaps even more convincing than the performance boost was the ease with which I, and subsequently the rest of the team, were able to transition from T-SQL and stored procedures to writing high-performance models and queries using only C# and JSON. The productivity increase was tremendous and continued far beyond the first few months of switching from SQL Server to MongoDB.

Moving to a NoSQL database engine like MongoDB was the right decision for this particular client and project, but there are a few key points to remember before throwing all your existing RDBMS solutions away:

  • NoSQL is not a silver bullet. It still takes a lot of hard work to write solid applications using these “new” database engines, and although the transition can be very rewarding, it takes real effort. So if your current solution isn’t broken, perhaps it isn’t worth trying to fix it.
  • If you rely on ACID transactions and cascading updates as a core part of your application, for instance when managing inventory in e-commerce, NoSQL may be the wrong choice. There are ideas like polyglot persistence that are worth understanding, but in most scenarios it will cause more problems than it solves, in my opinion.
  • The horizontal scalability aspects of NoSQL are truly amazing, but if the concept of eventual consistency is in direct opposition to your business requirements (again, inventory in e-commerce springs to mind), NoSQL is probably not for you.
  • If you are looking at NoSQL but really could just as well have implemented a proper caching solution, it may not be worth the effort — at least not in the short term.