Can it really scale?

In this post, we’ll look at how Dgraph performs as we vary the number of nodes in the cluster, the machine specs, and the load on the server, to answer the ultimate question: Can it really scale?

The Dataset

Freebase is an online collection of structured data drawn from many sources, including individual, user-generated contributions. Currently, it has 1.9 billion RDF N-Triples worth 250GB of uncompressed data. On top of that, this dataset is over 95% accurate, with a complex and rich real-world schema. That makes it an ideal data set for testing the performance of Dgraph. We decided not to use the entire data set, as it wasn’t necessary for our goal here.

Given our love for movies, we narrowed it down to the film data. We ran some scripts and kept only the movie data. All the data and scripts are present in our benchmarks repository. There are two million nodes, which represent directors, actors, films and all the other objects in the database, and 21 million edges (including 4M edges for names) representing the relationships between them.


Some interesting information about this data:

# film.film --{film.film.starring}--> [mediator] --{film.performance.actor}--> film.actor
# Film --> Mediator
$ zgrep "<film.film.starring>" rdf-films.gz | wc -l
1397647

# Mediator --> Actor
$ zgrep "<film.performance.actor>" rdf-films.gz | wc -l
1396420

# Film --> Director
$ zgrep "<film.film.directed_by>" rdf-films.gz | wc -l
242212

# Director --> Film
$ zgrep "<film.director.film>" rdf-films.gz | wc -l
245274

# Film --> Initial Release Date
$ zgrep "<film.film.initial_release_date>" rdf-films.gz | wc -l
240858

# Film --> Genre
$ zgrep "<film.film.genre>" rdf-films.gz | wc -l
548152

# Genre --> Film
$ zgrep "<film.film_genre.films_in_this_genre>" rdf-films.gz | wc -l
546698

# Generated language names from names freebase rdf data.
$ zcat langnames.gz | awk '{print $1}' | uniq | sort | uniq | wc -l
55

# Total number of countries.
$ zgrep "<film.film.country>" rdf-films.gz | awk '{print $3}' | uniq | sort | uniq | wc -l
304

This data set contains information about ~480K actors, ~90K directors and ~240K films. Some examples of entries in the dataset are:

<m.0102j2vq> <film.actor.film> <m.011kyqsq> .
<m.0102xz6t> <film.performance.film> <m.0kv00q> .
<m.050llt> <type.object.name> "Aishwarya Rai Bachchan"@hr .
<m.0bxtg> <type.object.name> "Tom Hanks"@es .

Terminology

  • Throughput: Number of queries per second served by the server and received by the client
  • Latency: Difference between the time when the server received a request and the time it finished processing that request
  • 95th percentile latency: The worst-case latency faced by 95 percent of the users querying the database
  • 50th percentile latency: The worst-case latency faced by half of the users querying the database

Setup

All the testing was done on GCE instances. Each machine had 30GB of SSD and at least 7.5 GB of RAM. The number of cores varied depending on the experiments performed.

The tests were run for 1-minute intervals during which all the parallel connections made requests to the database. This was repeated ten times and throughput, mean latency, 95th percentile latency, 50th percentile latency were measured. Note that for user-facing systems, measuring percentile latency is better than mean latency as the average can be skewed by outliers.
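As an illustration, here is a minimal sketch in Go of how the 50th and 95th percentile figures could be computed from per-request latencies recorded during a run; the helper below is ours, not part of the benchmark code, and the sample latencies are made up.

package main

import (
	"fmt"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0-100) of the recorded latencies.
func percentile(latencies []time.Duration, p float64) time.Duration {
	sorted := append([]time.Duration(nil), latencies...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(float64(len(sorted)-1) * p / 100.0)
	return sorted[idx]
}

func main() {
	// Hypothetical latencies from one 1-minute run.
	latencies := []time.Duration{
		8 * time.Millisecond, 12 * time.Millisecond, 9 * time.Millisecond,
		40 * time.Millisecond, 11 * time.Millisecond,
	}
	fmt.Println("50th percentile:", percentile(latencies, 50))
	fmt.Println("95th percentile:", percentile(latencies, 95))
}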

In a multi-node cluster setup, the queries were distributed among the nodes in a round-robin fashion. Note that in a multi-node cluster, no single machine contains all the data needed to answer these queries; the nodes still have to communicate with each other to respond.
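For illustration, here is a minimal sketch in Go of round-robin dispatch, assuming hypothetical node addresses; this is not the Dgraph client API, just the scheme used to spread queries across the cluster.

package main

import (
	"fmt"
	"sync/atomic"
)

type roundRobin struct {
	addrs []string
	next  uint64
}

// pick returns the next node address in round-robin order; safe for
// concurrent use by many client goroutines.
func (r *roundRobin) pick() string {
	n := atomic.AddUint64(&r.next, 1)
	return r.addrs[(n-1)%uint64(len(r.addrs))]
}

func main() {
	// Hypothetical cluster addresses.
	rr := &roundRobin{addrs: []string{"node1:8080", "node2:8080", "node3:8080"}}
	for i := 0; i < 6; i++ {
		fmt.Println(rr.pick())
	}
}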

Variables

The parameters that were varied were:

  1. Number of parallel connections to the database. In Go, this equated to the number of goroutines the client would have. Each goroutine would run in an infinite loop, querying the database via a blocking function (see the sketch below).
  2. Number of cores per server
  3. Number of servers in the cluster

This gave us an idea of what to expect from the system and would help in predicting the configuration required to handle a given load.
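To make the client setup in point 1 concrete, here is a minimal sketch of the load generator, assuming a hypothetical queryDgraph function that issues one blocking request; the real benchmark code lives in the benchmarks repository.

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// queryDgraph is a stand-in for the blocking client call; the real
// benchmark sends a query to the cluster and waits for the response.
func queryDgraph(q string) error {
	// ... send the query and wait for the response ...
	return nil
}

func main() {
	const connections = 64           // number of parallel connections (varied per run)
	const duration = 1 * time.Minute // each test ran for 1-minute intervals

	var completed int64
	stop := make(chan struct{})

	// Each connection is a goroutine querying in an infinite loop.
	for i := 0; i < connections; i++ {
		go func() {
			for {
				select {
				case <-stop:
					return
				default:
					queryDgraph("{ me ( _xid_ : XID ) { ... } }")
					atomic.AddInt64(&completed, 1)
				}
			}
		}()
	}

	time.Sleep(duration)
	close(stop)
	fmt.Printf("throughput: %.1f queries/sec\n",
		float64(atomic.LoadInt64(&completed))/duration.Seconds())
}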

Queries

We ran two broad categories of queries.

  • For each actor (478,936 actors), get their name, the films they acted in, and those films’ names.
{
  me ( _xid_ : XID ) {
    type.object.name.en
    film.actor.film {
      film.performance.film {
        type.object.name.en
      }
    }
  }
}
  • For each director (90,063 directors), get their name, the films they directed, and names of all the genres of those films.
{
  me ( _xid_ : XID ) {
    type.object.name.en
    film.director.film {
      film.film.genre {
        type.object.name.en
      }
    }
  }
}

During each iteration, either the actor or the director category was chosen at random. Then, within that category, an actor or director was chosen at random, and their XID was filled into the query template, as in the sketch below.
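A rough sketch of that selection step follows; the helper names and the director XID are hypothetical, while m.0bxtg is the Tom Hanks XID from the sample triples above.

package main

import (
	"fmt"
	"math/rand"
)

const actorTemplate = `{
  me ( _xid_ : %s ) {
    type.object.name.en
    film.actor.film { film.performance.film { type.object.name.en } }
  }
}`

const directorTemplate = `{
  me ( _xid_ : %s ) {
    type.object.name.en
    film.director.film { film.film.genre { type.object.name.en } }
  }
}`

// randomQuery picks a category at random, then a random XID from that
// category, and fills in the corresponding query template.
func randomQuery(actorXIDs, directorXIDs []string) string {
	if rand.Intn(2) == 0 {
		return fmt.Sprintf(actorTemplate, actorXIDs[rand.Intn(len(actorXIDs))])
	}
	return fmt.Sprintf(directorTemplate, directorXIDs[rand.Intn(len(directorXIDs))])
}

func main() {
	actors := []string{"m.0bxtg"}       // Tom Hanks, from the sample triples above
	directors := []string{"m.0example"} // hypothetical director XID
	fmt.Println(randomQuery(actors, directors))
}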

Performance

Let us look at some graphs obtained by varying the machine specs and the number of nodes in the cluster under different loads.

Vary the number of cores in a single instance

[Figures: throughput, 50th percentile latency, 95th percentile latency, and mean latency on varying the number of cores]

  • With the same number of cores, when we increase the number of connections, i.e. the load on the system, both the throughput and the latency increase.
  • Throughput increases up to a point and then flattens out. This is the point where the computational capacity is almost fully utilized.
  • As expected, the latency increases almost linearly with the number of connections.
  • When we increase the number of cores, the latency decreases and the throughput increases.

Vary the number of instances

[Figures: throughput, 50th percentile latency, 95th percentile latency, and mean latency on varying the number of instances]

  • When we increase the number of parallel connections, the throughput increases, but then flattens out. This is the point where the computational capacity is being utilized almost fully.
  • The latency increases almost linearly with the number of connections.
  • Latency in the case of a single instance is observed to be equal to (or a bit lower than) that of the distributed configurations, as the former doesn’t require any network calls. However, as the number of requests (i.e. the load) increases, the cumulative computational power comes into play and overshadows the latency incurred by network calls. Hence, the latency is lower in the distributed version under higher loads.
  • Comparing the one, two and five node clusters, we can see that both the latency and the throughput are better for configurations with more nodes, i.e., when there is more computational capacity at our disposal. The throughput increases as we have greater computational power and can handle more queries.

Conclusion

From the above experiments, we can see a relationship between the throughput, the latency and the overall computational power of the cluster. The graphs show that the throughput increases as the computational power increases, which can be achieved either by increasing the number of cores on each server or the number of nodes in the cluster.

The latency increases as the amount of load on the database increases. However, the rate of the increase differs based on how much computational power we have available.

This experiment also shows that there is a limit on how much computational power a single node can have, and once we reach that limit, scaling horizontally is the right option. Not only that, it also shows that scaling horizontally improves performance. Hence, having more replicas and distributing the dataset optimally across machines are some of the factors that help improve the throughput and reduce the latency that users face.

Based on this experiment, our recommendation for running Dgraph would be:

  • Use as many cores as possible
  • Have the servers geographically close by so that network latency is reduced
  • Distribute the data among servers and query them in a round-robin fashion for greater throughput

These might seem pretty obvious recommendations for a distributed system, but this experiment proves that the underlying design of Dgraph is scalable.

Hope this helps you get a sense of what sort of performance you could expect out of Dgraph!

This post is derived from my report for my B.Tech project on “A Distributed Implementation of the Graph Database System, Dgraph”. The full report is available for download here.