Sunday 20 January 2013

Query Optimising by Reducing Database Round Trips

We'd been suffering from a poorly constructed query that resulted in multiple trips to our database. I can say it was poorly constructed because I wrote it, and I knew exactly how inefficient it was. It worked, which was enough to let us release early. But the knowledge of a slack query in one of our core systems was enough to make me take a few days to tighten it up. This is how I went about improving that query.

The Problem
There are two entities involved in our query: a Pixel and a UrlTagContainer. A Pixel holds a small amount of html that needs to be returned along with the response for a requested Url; think of the system as a micro CMS for invisible content. Pixels contain the tiny (pixel-sized) piece of content in one property and a string collection of Tags in another. UrlTagContainers contain a Url property and also have a string collection of Tags. Pixels are linked to a Url by these Tags. The query needs to retrieve the correct Pixels for a given Url, where one or more of the Url's Tags intersects with the Tags of the Pixel. In addition, if a Pixel has no Tags then it should always be returned by the query. The benefit of this relationship is that it makes it possible to group Pixels together: the application of Pixels to a Url can be managed through a Tag, breaking the linear relationship between Pixels and Urls.
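For reference, the two entities look roughly like this. This is only a sketch; any property names beyond the Tags, the Url and the html content described above are assumptions for illustration.

using System.Collections.Generic;

// Sketch of the two entities described above.
public class Pixel
{
    public string Id { get; set; }
    public string Html { get; set; }           // the tiny piece of content returned with the page
    public List<string> Tags { get; set; }     // links this Pixel to Urls
}

public class UrlTagContainer
{
    public string Id { get; set; }
    public string Url { get; set; }
    public List<string> Tags { get; set; }     // links this Url to Pixels
}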

As stated, the query was inefficient because it made two calls to the database for every attempt to get the results. The first step was to retrieve the Tags applied to the Url. The second step was to find all the Pixels with Tags that intersected the results of the first step (or that had no Tags at all). A rough sketch of that original query is below, and the graph that follows it shows typical throughput over a 24-hour period.
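A minimal sketch of that two-round-trip approach, assuming a document store and the requested Url are in scope and the entities are shaped as above. The exact LINQ predicates the RavenDB provider accepts may differ; the point is the two separate calls.

// Original approach (sketch): two round trips to the database per request.
using (var session = store.OpenSession())
{
    // Trip 1: fetch the Tags applied to the requested Url.
    var urlTags = session.Query<UrlTagContainer>()
        .Where(c => c.Url == requestedUrl)
        .ToList()                                   // first database call
        .SelectMany(c => c.Tags)
        .ToList();

    // Trip 2: fetch Pixels whose Tags intersect those Tags, or that have no Tags at all.
    var pixels = session.Query<Pixel>()
        .Where(p => p.Tags.Any(t => urlTags.Contains(t)) || p.Tags.Count == 0)
        .ToList();                                  // second database call
}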


The Database
Our pixel system uses RavenDB. Instead of managing our own instance we use RavenHQ. We license a database adequate for the volume of data we hold in production, but not for the throughput we were generating. It was RavenHQ who first alerted us to the high traffic and suggested we use Aggressive Caching. We did, but it always felt like papering over the cracks, because we knew where the real issue was.
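For context, aggressive caching in the RavenDB client is switched on by wrapping calls in a scope like the one below. The five-minute duration here is just an example value, not necessarily what we used.

// Aggressive caching (sketch): within this scope the client answers repeat
// requests from its local cache for the given duration instead of hitting the server.
using (store.AggressivelyCacheFor(TimeSpan.FromMinutes(5)))
using (var session = store.OpenSession())
{
    // queries and loads issued here can be served from the client-side cache
}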

The Solution
The goal was to reduce the number of calls to the database per request. I had to query the database in such a way that the results would contain the Pixel as well as the Url(s) whose Tags intersected with that Pixel's Tags. This would allow me to filter on a projected 'MatchedUrls' property.

Sounds straightforward. However, the challenge lay in thinking in terms of RavenDB Indexes, as opposed to the old SQL statements. Getting the results in the shape I needed required the index to perform several steps. Roughly, the index flattens out the data, then pulls it back together and returns the result. To help me understand what was happening at each step, I wrote out the exploded data. Below are the individual parts of the index, with an explanation of what is happening and a view of the results at each step.

First off, the raw data. We begin with a collection of Pixels and a collection of UrlTagContainers. (I recognise that RavenDB doesn't store the data in a tabular fashion, but this is simply the easiest way to think about it...)

Pixel Id | Tags
1        | abc
2        | def
3        | abc, ghi
4        | (no tags)
5        | def

Url     | Tags
jim.com | ghi
bob.com | abc, def
tom.com | def
ann.com | def
pam.com | ghi

Every RavenDB index requires a Map function. The Map function works over a particular document type (of which there is a collection in the database) and produces some output from that data. The output of the Map can be the end result of the query, or it can feed into subsequent steps of the index. We will be making further use of the output of our Map.

In fact, because we are working with two object collections here (Pixels and UrlTagContainers), the index needs to have multiple Maps. To accomplish this, our index needs to be an instance of AbstractMultiMapIndexCreationTask. The objective of the Map steps in our index is to project each Tag from inside the Pixel or UrlTagContainer entity into its own object. In other words, we need to find a way to aggregate two different types into the same object type: one object for each Tag across the two collections of objects.

The object that we would like needs to contain our three key elements: the Tag, the Pixel Id and the UrlTagContainer Url. When mapping Pixels, we know that only two of these properties will be populated, the Tag and the Pixel Id; we can leave the Url blank at this stage. For the UrlTagContainer we can take the Tag and the Url, and leave the Pixel Id empty. After the two Map steps have taken place, the data is pulled together into one set of results of the single projected type. The Maps of the index are shown below, followed by how the data will look after they have been applied:
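Here is a sketch of those two Map steps. The index and result-type names are illustrative assumptions; the "_notag_" placeholder is the one that appears in the table that follows.

using System.Linq;
using Raven.Client.Indexes;

// Sketch of the multi-map index: two Maps projecting each Tag into a common shape.
public class Pixels_ByTag : AbstractMultiMapIndexCreationTask<Pixels_ByTag.ReduceResult>
{
    public class ReduceResult
    {
        public string Tag { get; set; }
        public string[] PixelIds { get; set; }
        public string[] Urls { get; set; }
    }

    public Pixels_ByTag()
    {
        // One row per Tag on each Pixel. A Pixel with no Tags gets the
        // placeholder "_notag_" so it is never lost. Urls is left empty here.
        AddMap<Pixel>(pixels =>
            from pixel in pixels
            from tag in pixel.Tags.DefaultIfEmpty("_notag_")
            select new
            {
                Tag = tag,
                PixelIds = new[] { pixel.Id },
                Urls = new string[0]
            });

        // One row per Tag on each UrlTagContainer. PixelIds is left empty here.
        AddMap<UrlTagContainer>(containers =>
            from container in containers
            from tag in container.Tags
            select new
            {
                Tag = tag,
                PixelIds = new string[0],
                Urls = new[] { container.Url }
            });

        // The Reduce and TransformResults steps are added further down the post.
    }
}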

Tag     | Pixel Ids | Urls
abc     | 1         |
def     | 2         |
abc     | 3         |
ghi     | 3         |
_notag_ | 4         |
def     | 5         |
ghi     |           | jim.com
abc     |           | bob.com
def     |           | bob.com
def     |           | tom.com
def     |           | ann.com
ghi     |           | pam.com

This collection of data is perfect for performing a grouping function on, and that is exactly what we do in the Reduce step of the index. The group key is, of course, the Tag. We group on the Tag and then 'SelectMany' over both the Pixel Id collection and the Url collection of the projected object, flattening them into a single list of Pixel Ids and a single list of Urls for each Tag. The result of the Reduce is therefore a collection of all the Tags, where each Tag appears only once and is accompanied by the Pixel Ids and Urls to which it is applied. This is best described by the table below, with a sketch of the Reduce function after it:

Tag     | Pixel Ids | Urls
abc     | 1, 3      | bob.com
def     | 2, 5      | bob.com, tom.com, ann.com
ghi     | 3         | jim.com, pam.com
_notag_ | 4         |
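Continuing the index sketch from the Map step, the Reduce that produces the table above might look like this (added inside the index constructor):

// Reduce (sketch): group the mapped rows by Tag, then flatten the Pixel Ids
// and Urls of each group into single lists per Tag.
Reduce = results =>
    from result in results
    group result by result.Tag into g
    select new
    {
        Tag = g.Key,
        PixelIds = g.SelectMany(x => x.PixelIds).ToArray(),
        Urls = g.SelectMany(x => x.Urls).ToArray()
    };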

A Map and a Reduce function are usually the end of a RavenDB index. However, the result in its current format is not enough to query on yet. We have to make use of the TransformResults function of RavenDB Indexes. This is useful for two reasons. Firstly, it gives us a final opportunity to reshape the data into something we can use. Secondly, it provides an argument to the function that lets us re-query the database. As you can see, we are working with the Ids of the Pixels, not the Pixels themselves. We need to return the full Pixel in our results so we can pull out the required piece of html content to return with the page. In addition, we may need to query on other properties of the Pixel.

The aim of the TransformResults function is to produce the final output of the query from the output of the Reduce. In simple terms, we select every Pixel Id from each element of the Reduce output into one flat list of Pixel Ids. For each Pixel Id we then select the corresponding Urls from the same Reduce results into a new projected object. We populate that projected object with the actual Pixel, loaded via the database argument using the Pixel Id, alongside the collection of Urls the Pixel applies to. This meets the demands of our query.
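A sketch of that TransformResults step, again added inside the index constructor. The 'database' parameter is the argument that allows the re-query, and 'MatchedUrls' is the projected property mentioned earlier; the exact LINQ is an assumption on my part.

// TransformResults (sketch): flatten out every Pixel Id from the reduce results,
// load the full Pixel document for each, and pair it with all the Urls whose
// Tags it shares.
TransformResults = (database, results) =>
    from pixelId in results.SelectMany(r => r.PixelIds).Distinct()
    select new
    {
        Pixel = database.Load<Pixel>(pixelId),
        MatchedUrls = results
            .Where(r => r.PixelIds.Contains(pixelId))
            .SelectMany(r => r.Urls)
            .Distinct()
    };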

Pixel     | Urls
{ Id: 1 } | bob.com
{ Id: 2 } | bob.com, tom.com, ann.com
{ Id: 3 } | bob.com, jim.com, pam.com
{ Id: 4 } |
{ Id: 5 } | bob.com, tom.com, ann.com


The Result
Now, with a query that no longer hits the database twice, we have seen a sharp decrease in the traffic we send. We still make use of caching, so the dramatic change in throughput in the graph below is not purely down to the query, but it was certainly a factor. We now have the room to lower the cache duration and still perform better than we were.




The Cost
Additionally, I brought the Tag querying into code, so that all we ask the index for is the full set of results (which is very small). This means there are fewer unique queries hitting the database, so the result is easier to cache. However, it does mean the CPU has to work harder querying our cached data in memory. On the other hand, the application's memory footprint has reduced, presumably as a result of not having to hold the results of so many distinct queries in memory (the graphs below show the change).
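In code, the request path now looks something like the sketch below. The projection type, the method shape and the use of .As<T>() to pick up the transformed results are assumptions on my part rather than the original code.

// Hypothetical projection type matching the index's TransformResults output.
public class PixelWithMatchedUrls
{
    public Pixel Pixel { get; set; }
    public string[] MatchedUrls { get; set; }
}

// One cache-friendly call for the whole (small) result set, then filter in memory.
public IList<Pixel> PixelsFor(IDocumentStore store, string requestedUrl)
{
    using (var session = store.OpenSession())
    {
        var everything = session.Query<Pixels_ByTag.ReduceResult, Pixels_ByTag>()
            .As<PixelWithMatchedUrls>()     // apply the TransformResults projection
            .ToList();

        // Untagged Pixels always apply; tagged Pixels must share a Tag with the Url.
        return everything
            .Where(r => r.MatchedUrls.Contains(requestedUrl) || r.Pixel.Tags.Count == 0)
            .Select(r => r.Pixel)
            .ToList();
    }
}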



The Index
Here is the actual index in its final form:
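Assembled from the sketches above (the class, property and index names remain illustrative assumptions, so the real index may differ in its details):

using System.Linq;
using Raven.Client.Indexes;

public class Pixels_ByTag : AbstractMultiMapIndexCreationTask<Pixels_ByTag.ReduceResult>
{
    public class ReduceResult
    {
        public string Tag { get; set; }
        public string[] PixelIds { get; set; }
        public string[] Urls { get; set; }
    }

    public Pixels_ByTag()
    {
        // Map 1: one row per Tag on each Pixel ("_notag_" for untagged Pixels).
        AddMap<Pixel>(pixels =>
            from pixel in pixels
            from tag in pixel.Tags.DefaultIfEmpty("_notag_")
            select new { Tag = tag, PixelIds = new[] { pixel.Id }, Urls = new string[0] });

        // Map 2: one row per Tag on each UrlTagContainer.
        AddMap<UrlTagContainer>(containers =>
            from container in containers
            from tag in container.Tags
            select new { Tag = tag, PixelIds = new string[0], Urls = new[] { container.Url } });

        // Reduce: group by Tag and flatten the Pixel Ids and Urls for each group.
        Reduce = results =>
            from result in results
            group result by result.Tag into g
            select new
            {
                Tag = g.Key,
                PixelIds = g.SelectMany(x => x.PixelIds).ToArray(),
                Urls = g.SelectMany(x => x.Urls).ToArray()
            };

        // Transform: load each Pixel and pair it with every Url it shares a Tag with.
        TransformResults = (database, results) =>
            from pixelId in results.SelectMany(r => r.PixelIds).Distinct()
            select new
            {
                Pixel = database.Load<Pixel>(pixelId),
                MatchedUrls = results
                    .Where(r => r.PixelIds.Contains(pixelId))
                    .SelectMany(r => r.Urls)
                    .Distinct()
            };
    }
}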


