We’re building a new data service for Parliament, and an important component of that service is our triple store, also known as a graph database. “Triple” here refers to the underlying structure of the data: a subject, a predicate, and an object.
You might have heard of graph databases in coverage of the Panama Papers. In this blog post, I'm going to explain the fundamental concepts behind graph data modelling and querying. And then take a look at how we tested the performance before moving to production.
Welcome to the Semantic Web!
Querying graph databases
I expect I lost some of you to Wikipedia for a moment there. But fear not. A graph is a much more natural and flexible way of describing and storing data than trying to fit it into tables.
This flexibility allows us to describe very complex domain models, such as the model of Parliamentary business. In turn, we can retrieve information using a graph pattern matching query language. The W3C has a spec describing such a language. It's called SPARQL and it's awesome.
- data is described with statements (known as triples)
- how to describe a House?
- “House of Commons” a “House”
- “House of Lords” a “House”
- now we've got 2 statements
- how to get the Houses?
- SELECT ?x WHERE { ?x a “House” }
- result: “House of Commons” and “House of Lords”
That’s it. Easy, isn’t it?
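The pattern matching above can be sketched in a few lines of Python. This is only a toy illustration of the idea (a real triple store like GraphDB does far more), and the extra "location" triple is made up for the example:

```python
# A toy in-memory triple store to illustrate graph pattern matching.
triples = [
    ("House of Commons", "a", "House"),
    ("House of Lords", "a", "House"),
    ("House of Commons", "location", "Westminster"),  # invented for illustration
]

def match(subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard (like ?x)."""
    return [
        (s, p, o) for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# Equivalent of: SELECT ?x WHERE { ?x a "House" }
houses = [s for (s, p, o) in match(predicate="a", obj="House")]
print(houses)  # ['House of Commons', 'House of Lords']
```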
This example is slightly simplified but there's not much more to it. If you're curious, you can find alternate links in the source of our beta website Houses page. They will take you to the statements describing our Houses (available in various serialisation formats).
A graph database for beta.parliament.uk
The new website for Parliament relies on our graph database and we expose all our data through an API.
The reasons for an API
We could let the website run SPARQL queries against our database but we built an API because:
- serialisation formats. We want our users to focus on using our data, not on converting between JSON, XML, or CSV. We support 13 formats and plan to grow with user needs. That’s what we call good user experience in the world of open data.
- caching makes everything so fast. No matter how fast the database is, serving a known answer will always be faster than processing a question. A good caching system in front of our API helps to focus processing power where it’s needed.
- control. First, writing and deleting statements is not available to everyone. Second, we want to provide a good level of service to all our readers. So maybe we’ll need a few rules (throttling) when we become famous.
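The caching point above boils down to this: serving a known answer is a lookup, not a computation. A minimal sketch of the idea in Python, where `run_sparql_query` is a hypothetical stand-in for an expensive round trip to the triple store:

```python
from functools import lru_cache

# Serving a known answer (a cache lookup) beats re-running the query.
# run_sparql_query is a hypothetical stand-in for a real database call.
@lru_cache(maxsize=None)
def run_sparql_query(endpoint: str, house_id: str) -> str:
    # Imagine an expensive round trip to the triple store here.
    return f"results for {house_id} from {endpoint}"

first = run_sparql_query("house_by_id", "House of Commons")   # computed
second = run_sparql_query("house_by_id", "House of Commons")  # from cache
print(first == second)  # True
```

Our real caching sits in front of the API (in the API management service), but the principle is the same.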
So how does our API work?
We expose a set of 127 SPARQL queries (and growing) to Parliament’s new website. Those queries can take parameters and the statements matching the queries can be served in any of 13 formats (and growing).
For the graph API curious:
- how to get the House of Commons?
- graph pattern: ?house_id a :House
- replace ?house_id with the relevant identifier, for example, “House of Commons”
- run this graph pattern (aka SPARQL query) against our database
- how does it look in our API?
- endpoint: house_by_id
- SPARQL query parameter: house_id = “House of Commons”
- format: JSON
- URL: https://parliament-api.com/house_by_id?house_id=”House of Commons”&format=json
This example is another simplified version. If you're curious, you can find alternate links in the source on the new website's House of Commons page. They'll take you to the API endpoint serving the House of Commons statements.
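In practice, parameters like “House of Commons” need URL encoding. Here's how such a request URL could be built; the base URL and parameter names come from the simplified example above, so treat them as illustrative rather than a real endpoint:

```python
from urllib.parse import urlencode

# Building a request URL for the (simplified, illustrative) house_by_id endpoint.
base = "https://parliament-api.com/house_by_id"
params = {"house_id": "House of Commons", "format": "json"}
url = f"{base}?{urlencode(params)}"  # urlencode escapes the space for us
print(url)  # https://parliament-api.com/house_by_id?house_id=House+of+Commons&format=json
```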
How does our API perform?
Our API served 40,000 queries in the past month, with 89% of requests completing in under a second (98.5% < 10s, 94% < 3s, 73% < 500ms, 38% < 250ms). 1,300 queries a day can hardly be considered stress testing a database. Even if you generously assume that all the traffic happens during the eight-hour working day, that would still be one request every 22 seconds. And yet, 1.5% of requests take over 10 seconds to resolve.
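The arithmetic behind that "one request every 22 seconds" is easy to check (assuming a 30-day month):

```python
# Sanity-checking the traffic figures above.
monthly_requests = 40_000
daily_requests = monthly_requests / 30        # ~1,333, rounded to 1,300 in the post
working_day_seconds = 8 * 60 * 60             # 28,800 seconds in an 8-hour day
interval = working_day_seconds / 1_300        # one request every ~22 seconds
print(round(daily_requests), round(interval))
```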
The day with the most page requests (2,663) got the lowest average response time of 350ms.
This raises some questions from a practical perspective, including:
- is our database able to withstand production user loads with satisfactory performance?
- where do the slowest 10% of requests come from?
Satisfactory performance is a relative concept. For example, displaying a webpage in under 300ms is widely considered optimal. You can find good articles on performance such as “Measure Performance with the RAIL Model” or “Powers of 10: Time Scales in User Experience.”
Our database will have a more complex range of use cases than serving content for a webpage, but I consider serving 1,000 statements in under 100ms to be a good objective. It's a large amount of data and a reasonable time budget for building and serving a webpage with an optimal user experience.
Part one: the existing infrastructure
Here's an overview of our infrastructure:
- our graph database runs on a cluster of virtual machines (we use Ontotext GraphDB EE)
- our API runs in an Azure App Service
- the API points to an Azure API management service
- the API management service does all the routing
- the API management service has a caching policy
So, how do we test our running infrastructure? There's good news here.
We have analytics, so we can create realistic tests based on the 40,000 queries that hit our API in the past month. We have a staging environment, so we won't crash the live database or pollute our tests with other traffic. And we have Visual Studio load tests, which are a convenient way of simulating large user loads and collecting metrics like “average requests per second” and “average response time”.
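The idea behind such a load test can be sketched in a few lines of Python. This is a toy stand-in, not the Visual Studio tooling we actually used, and `fake_request` just sleeps instead of making a real HTTP call:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

# A minimal load-test sketch: concurrent "virtual users" hit an endpoint
# while we collect response times.
def fake_request(_):
    start = time.perf_counter()
    time.sleep(0.01)  # pretend the API took about 10ms to respond
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=20) as pool:  # 20 virtual users
    timings = list(pool.map(fake_request, range(100)))

print(f"avg response time: {statistics.mean(timings) * 1000:.0f}ms")
```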
To cut a long story short, I ran many tests on our staging infrastructure. I used Visual Studio Team Services to run them in the cloud. I had 100 virtual users hitting our API. I tried clusters with 1 to 8 workers. I provisioned virtual machines with different processing power. I played with the memory configuration. And, pleasingly, the results were very consistent.
Average response time:
- 40ms with API management service caching
- 3s with no caching
Average requests per second:
- 100 with the API management service caching
- 30 with no caching
There were four issues with this round of testing.
- scope - we should target efficiency in our testing, which starts with better questions: what do we treat as a variable? My main mistake here was testing different memory configurations. Had I read the documentation more thoroughly, I would have found that Ontotext recommends 6GB of memory for 100 million statements. With our 2.3 million statements, it seems clear that testing configurations with more than 4GB of memory should not affect performance. And it didn’t.
- dependencies - we should test each component of our infrastructure independently. Otherwise, how would we know where the bottleneck lies? We have an API management service, an API, a virtual network, and a cluster of virtual machines. That's hardly a good configuration for understanding how our graph database performs on different hardware.
- data - some of our SPARQL queries might perform much slower than others and skew the global averages. And VSTS cloud tests do not provide detailed statistics per request.
- convenience - a major drawback lies in the complex and lengthy process of provisioning a cluster of virtual machines segregated in a virtual network. It makes testing slow and inconvenient.
There's a silver lining to this round of tests. The current infrastructure can sustain 100 requests per second with a 40ms response time, thanks to caching. This is very good performance. It's also worth mentioning that, with enough budget, Azure API Management Service comes in a high-traffic, large-cache flavour. In summary, we'll happily welcome more users to our API.
But how could we test the performance of our graph database alone?
Part two: less is more
The answer: run the database on Azure App Services instead of virtual machines, provisioned from a template by a PowerShell script. This means I can start the script, let it run for about 15 minutes, check my Azure subscription, and find a running cluster of graph databases with 2.3 million statements worth of data, ready to query. Not bad.
For the curious:
- Azure App Services are simple containers that let you run applications written in a variety of languages, including Java, which makes them suitable for our database. App Services run on App Service Plans. A plan defines a set of compute resources and can scale up (better hardware) or out (more instances).
- PowerShell is a scripting language built on the .NET Framework, with many functions dedicated to administering cloud infrastructure. Specifically, you can use it to create Azure resources from a template. Declarative automation - powerful stuff.
Where are we now in terms of testing?
- scope: I can scale my App Service Plans (better hardware) and change the number of workers in my cluster
- dependencies: I got rid of the API management service, the API, the virtual network, and the virtual machines, and changed my tests to raw SPARQL queries. With SPARQL, I can query the database directly instead of going through our API
- data: I will run the load tests locally in Visual Studio. Local tests have detailed statistics per query (unlike cloud run tests)
- convenience: I can deploy a different configuration of my database cluster in under 15 minutes
The setup for this round of tests:
- 1000+ SPARQL queries based on real traffic
- three different App Service plans (variation on hardware):
- large standard (4 cores, 7GB RAM, 50GB storage; about £218/month)
- medium premium v2 (2 faster cores, 7GB RAM, 250GB SSD storage; about £314/month)
- large premium v2 (4 faster cores, 14GB RAM, 250GB SSD storage; about £627/month)
- number of workers (3 or 6)
- test running for 1 hour with 50 virtual users
The results of this round of testing are good news. We can serve 186 requests per second with an average response time of 270ms.
Let’s play with the numbers for perspective. Responses have an average content length of 40KB. Statements are about 130 bytes each, if you exclude large literals (like constituency borders). So an average query returns about 300 statements worth of content. Imagine a user presented with 1,500 statements worth of data. That seems to me like a reasonable corpus, and I'd imagine an interested user spending a few minutes going through that much content. Let’s say 2m30s. We could satisfy 5,550 of those hypothetical users. Without caching.
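Those back-of-the-envelope numbers can be reproduced directly (all figures are the rounded ones quoted above):

```python
# Reproducing the back-of-the-envelope estimate above.
avg_response_bytes = 40_000                       # ~40KB average content length
bytes_per_statement = 130
print(avg_response_bytes // bytes_per_statement)  # ~300 statements per response

requests_per_second = 186
responses_per_user = 1_500 / 300                  # 1,500 statements = 5 responses
reading_time_s = 150                              # 2m30s spent reading

# Users we could serve while each one reads their 1,500 statements:
users = requests_per_second * reading_time_s / responses_per_user
print(round(users))  # 5,580 -- rounded to about 5,550 in the text
```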
We can also compare price and performance gains.
We can see here that performance increases almost proportionally with hardware price. Let’s compare the baseline to the best configuration (the last and first lines in the table). By spending 2.75 times more, we can process 2.45 times more requests and divide our average response time by 2.44. It's comforting to see logic in the performance gains.
It's equally interesting to look at worker scaling (the second and third lines). Doubling the number of Premium v2 Medium workers lets us process 1.55 times more requests per second, at 1.75 times the cost. It might not be the most cost-effective performance gain. It is, however, comforting to know that scaling our cluster increases performance.
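Another way to read those ratios is as cost per unit of throughput, relative to the baseline:

```python
# Comparing price and performance ratios from the results above.
price_ratio = 2.75        # best configuration vs baseline
throughput_ratio = 2.45   # requests per second gained

# Relative cost of each unit of throughput in the best configuration:
cost_per_request_ratio = price_ratio / throughput_ratio
print(f"{cost_per_request_ratio:.2f}x the baseline cost per request")
```

So the top-end hardware costs only about 12% more per request served, which supports the "logic in the performance gains" point above.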
In brief, processor speed and the number of cores have a very clear impact on database performance. We can only look forward to beefier App Service Plans becoming available in Azure. Given the per-core licensing model Ontotext chose for GraphDB, faster cores would be a huge benefit for us.
Performance would allow us to cope with much more traffic than we currently have without caching. But what about the details? For now, we only looked at general averages. It's likely that the test results are skewed by the slowest queries.
We use 41 SPARQL query templates in our tests, and the average response content length ranges from 94 bytes to 10MB. Diving into the details of each query will finally give us a clear idea of where we stand.
Plotting response time against content length, we can see that response time is almost consistently proportional to content length. There is, however, some variation, which can be explained by how complex those queries are.
The slower requests with under 1MB of content length could be singled out to find out what makes them slow. Among them, we can spot a lot of queries with the suffix “by initial”. This suffix indicates that the SPARQL query uses a filter function of type STRSTARTS, which is probably computationally intensive.
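STRSTARTS behaves like Python's `str.startswith`, and a prefix filter typically forces the engine to check every candidate value. A minimal illustration of the “by initial” pattern, with invented member names:

```python
# Illustrating the "by initial" filter. SPARQL's STRSTARTS is analogous to
# Python's str.startswith; the filter scans every candidate name.
members = ["Abbott", "Adams", "Baker", "Benn", "Clark"]  # invented examples

# Equivalent of: FILTER STRSTARTS(?name, "B")
by_initial = [name for name in members if name.startswith("B")]
print(by_initial)  # ['Baker', 'Benn']
```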
Optimising our queries is not the purpose of this round of testing, but this kind of insight is interesting and a good prospect for performance optimisation.
The slower half of requests averages 1.8MB of content length, with response times in the seconds. It is interesting to notice that increasing the number of workers seems to provide a significant performance gain for large amounts of content.
The faster half of requests averages 16kB of content length. That is about 120 statements at 130 bytes per statement. The average response time of this faster half is 271ms across all configurations.
The gap between the slower and faster half of requests is huge both in terms of content length and in terms of response time. Despite optimisations on request complexity, the content length seems to be our largest toll in terms of performance.
The four requests with the largest content length average 7MB and 13.5 seconds of response time across all configurations. The next five largest average 1.3MB and 3.3 seconds.
It's worth giving that some perspective. As a user, if I need a corpus as large as 10MB (around 80,000 statements), would I be happy to wait six to 45 seconds to get it? In my opinion, the answer is yes.
Finally, it's worth remembering a few of our limitations:
- all those tests were run without caching. Azure’s API management service Premium tier would allow us to cache 5GB of data and has a throughput of 4,000 requests per second per unit
- all those tests were run within the limits of our licensing (12 cores). Out of curiosity, I ran a cluster with 9 P2V2 workers; performance went up to 290 average requests per second.
The hardest limit here is budget. As far as I am concerned, our graph database is ready for production. And we have room to grow.
That’s all folks?
Well, not really.
We have many challenges ahead, including the exciting prospect of testing our performance with much more than 2.3 million statements. The documentation on hardware requirements recommends 14GB of RAM for 250 million statements, so we should have room to grow, even on App Services.
There's another exciting prospect beyond the performance of our database system: we uncovered the potential of basing our infrastructure on App Services instead of virtual machines. App Services are easier and faster to deploy and require less maintenance. Migrating our infrastructure to those standard containers is yet another story to tell, and a journey towards spending more of our time making the data platform better for our users (rather than on mundane maintenance).
Data and Search for Parliament has interesting days ahead. We can look forward with reasonable confidence and add many more important statements to our database.
For those who want more...
Alternate links in the source of the beta pages lead to a variety of statements available in various serialisation formats. Some people have built cool visualisations based on our previous go at an open data platform. Nothing really stops you from hacking something together with some Subject, Predicate, Object magic today!
Tweet our boss Dan and let us know if you’re using our data. We’d love to hear about it.