Datetime Indexes in Dgraph

I recently started working at Dgraph Labs in Bengaluru as a Software Engineer. One of the first issues that I worked on was related to the Dgraph’s datetime datatype. This article covers how I discovered and resolved the issue. The article assumes a basic understanding of Dgraph, but you can learn everything you need in docs.dgraph.io.

Bug Discovery

While working on Flock, I observed that timestamp comparisons were not working correctly in Dgraph when the data or query had timezone information in it. To see the issue, let’s first quickly set up Dgraph using the commands below.

The easiest way is to use Docker Compose. Create a docker-compose.yml file with the contents below.

version: "3.2"
services:
  zero:
    image: dgraph/dgraph:v1.0.16
    restart: on-failure
    command: dgraph zero --my=zero:5080
  server:
    image: dgraph/dgraph:v1.0.16
    ports:
      - 8080:8080
    restart: on-failure
    command: dgraph alpha --my=server:7080 --lru_mb=2048 --zero=zero:5080
  ratel:
    image: dgraph/dgraph:v1.0.16
    ports:
      - 8000:8000
    command: dgraph-ratel

Next, from the folder containing the newly created file, run:

docker-compose up

We will use Ratel to showcase the issue. Ratel allows you to run queries, mutations and manage the Dgraph cluster. To open Ratel, you can go to the URL http://localhost:8000/?latest in your browser after running the commands above.

Now, let’s set up the schema. To set up the schema, you have to go to the Schema tab on the left navigation bar in Ratel, click on Bulk Edit and paste the following schema:

created_at : datetime @index(hour) .

Then, click on Apply Schema. This will create a new predicate created_at of type datetime.

Now you can perform the following mutation to add data to Dgraph in the Console tab after choosing Mutate option. This will store two tweets with a creation time and author name.

{
    "set": [
        {
            "uid" :"_:user1",
            "created_at": "2019-03-28T14:00:00-06:00",
            "author_name": "Rahul"
        },
        {
            "uid" :"_:user2",
            "created_at": "2019-03-28T18:00:00+01:00",
            "author_name": "Ashish"
        }
    ]
}

Finally, run the following query to get all the tweets that were created after UTC March 28th, 3 PM. You can run queries in the Console after selecting Query option:

{
  tweets(func: gt(created_at, "2019-03-28T15:00:00+00:00")) {
    created_at
    uid
  }
}

This is the response we get:

{
  "extensions": { ... },
  "data": {
    "tweets": [
      {
        "created_at": "2019-03-28T18:00:00+01:00",
        "uid": "0x2"
      }
    ]
  }
}

The created_at timestamp for the both the users is as follows. For brevity, we will remove the date (2019-03-28) from timestamps in the rest of the article.

Created At In UTC Greater than query ts (15h +00:00)
user1 14h -06:00 20h +00:00 True
user2 18h +01:00 17h +00:00 True

Even when the created_at timestamps for both the users are larger than the timestamp in the query, only user2 shows up in the response of the query. Before I explain how we found and fixed the root cause of the issue, let me first explain how indexes work in Dgraph.

Indexes in Dgraph

Along with JSON, Dgraph takes RDF data format as input which looks like <Subject> <Predicate> <Object> (SPO Triples). The Subject is generally a UID value representing a node in the graph whereas the Predicate represents a relationship of the node with the Object value. Indexes are created on a given Predicate to quickly find nodes in the graph for the provided Object value.

Dgraph indexes are stored in BadgerDB, an embedded key-value store written purely in Go. For each object value, one or more tokens are generated based on the provided tokenizer. For example, in the case of a hash tokenizer, the tokenizer computes the hash of the Object value and returns it as a token.

Once the tokens are generated, data is stored in badgerDB in the following format. Each key in badgerDB, is a combination of predicate and generated token, and the value is the list of UIDs that match the predicate and the token (the key). An example is shown below for created_at predicate:

key value
< created_at, Tokenizer-Tok1 > List(uid1, uid3)
< created_at, Tokenizer-Tok2 > List(uid2)

We prepend a unique Tokenizer identifier in the key (whether it’s a hash tokenizer, int tokenizer, hour tokenizer etc.) to the token generated, as shown above, to be able to traverse through all the tokens belonging to only that particular tokenizer (index).

Datetime Tokenizers

For a datetime index, Dgraph provides tokenizers for hour, day, month and year granularity. Each key generated by the tokenizer can be treated as a time bucket of specified granularity, storing the UIDs with a date falling within that bucket. The granularity of a datetime index should be carefully chosen to keep the size of each bucket reasonable. Choosing a lower granularity time index would wipe out the performance benefits of UID consolidation into time buckets (think one entry per bucket). On the other hand, choosing very high granularity would end up putting too many UIDs within the same bucket, which could also affect performance (millions of entries per bucket).

Root Cause

In order to generate tokens for datetime datatype, we use the Hour(), Day(), Month() and Year() methods of the time package as you can see in the (source code). These methods return the information accounting for the timezone information in the timestamp. For example, for timestamp 14h +06:00, the Hour() method returns the value 14 and for timestamp 14h +00:00, the Hour() method still returns the same value 14. Due to this behavior, the index is not correctly built.

In our example above, the data has two timestamps 1) 14h -06:00 and 2) 18h +01:00 and we have created an hour index on the created_at predicate. The hour index incorrectly stores roughly the following information:

<created_at, HourTokenizer-2019-03-28T14> -> List(0x01)
<created_at, HourTokenizer-2019-03-28T18> -> List(0x02)

The query is trying to find out all users having created_at timestamp bigger than 15h +00:00. While processing the query, Dgraph computes the same hour tokenizer for this timestamp to be 2019-03-28T15. Then, it scans through all the users in the index having the token bigger than 2019-03-28T15 and only returns the user with uid 0x02.

Fixing the Bug

UTC

Once the issue was clear, the fix was pretty straight forward. Instead of building tokens directly on the given timestamp, we first convert the timestamp into UTC() timezone and then compute various tokens. This normalizes all timestamps into single timezone and the Hour(), Day(), Month() and Year() information can be correctly compared. The PR for this fix is available here. Once this change is included, the index looks like this instead:

<created_at, HourTokenizer-2019-03-28T17> -> List(0x01)
<created_at, HourTokenizer-2019-03-28T20> -> List(0x02)

The timestamp in the query now generates the token as 2019-03-28T15 and we see both the users while scanning through the index. The correct response, therefore, looks like as follows:

{
  "extensions": { ... },
  "data": {
    "tweets": [
      {
        "created_at": "2019-03-28T14:00:00-06:00",
        "uid": "0x1"
      },
      {
        "created_at": "2019-03-28T18:00:00+01:00",
        "uid": "0x2"
      }
    ]
  }
}

You can see that both the users are included in this response.

Conclusion

In this article, we discussed how indexes are implemented in Dgraph. I also showed how I discovered and resolved the issue with datetime indexes. While backporting it to v1.0 series would require change in data format on disk, the fix will be available in v1.1 which we plan to release end of this month. Stay tuned!