Migrating data from SQL to Dgraph

Dgraph is rapidly gaining reputation as an easy to use database to build apps upon. Many new users of Dgraph have existing relational databases that they want to migrate from. In particular, we get asked a lot about how to migrate data from MySQL to Dgraph. In this article, we present a tool that makes this migration really easy: all a user needs to do is write a small 3 lines configuration file and type in 2 commands. In essence, this tool bridges one of the best technologies of the 20th century with one of the best ones of the 21st (if you ask us).

Two ways of understanding the universe

Before we dive into the details, let's quickly glance over the data models of SQL databases and Dgraph. Real life objects have certain attributes: a person has a name, age, and date of birth; a movie has a name, release date, and genre. SQL databases represent all the attributes of an object by storing them into a single row in a table. Dgraph, on the other hand, represents these attributes as edges around a node.

Both systems work equally well if we stop here and consider attributes of objects as the only type of relationships in the universe. However, the real universe is more complex than that. The 2nd type of relationship is between objects.

As a concrete example, an actor is a person that has starred in a number of movies; a movie has one or more directors, one or more writers, and a generally larger number of actors. In order to logically link a concept (e.g. a movie) with another (e.g. an actor), a SQL database would normally have separate tables to represent the relationships. Thus, finding the names of the people that starred in a given movie requires:

locating the movie identifier movie_id by searching for its name in the movie table,
finding the identifiers of the people who starred in that movie person_id from a second acting table, and finally
finding the rows in the people table with the people_ids above and extracting the values in the name column.

If that seems already cumbersome, you might now want to consider how much more difficult it can get for recursive relationships. Imagine there is a people table, and a children table mapping each person_id to the person_ids of their children. The logic to find all the descendants of a person with SQL is not trivial and requires a bunch of back and forth interactions between the application and the database.

Graph Databases come to rescue in situations where SQL would have needed many nested join operations: in Dgraph, inter-object relationships are represented by edges and edge traversal from one node to another is supported as a first-class operation. This seemingly simple difference makes an entirely new class of operations possible. Finding all people who are related to a movie regardless of how they are related (using the expand(_all_) operation), finding all of a person's descendants regardless of how many generations they are apart (using the @recurse directive), finding the shortest connection path between two people in a social network (using the shortest function), etc. The possibilities are limitless.

Meet the SQL to Dgraph migration tool

If you feel Dgraph is the right fit for your use case, let's dive a little deeper and see how the migration tool works by studying an example. Below, you can see a simplified version of the database schema supporting StackOverflow:

create table users (
	Id int,
	DisplayName varchar(200),
	PRIMARY KEY (Id)
);

create table posts (
	Id                    int,
	AcceptedAnswerId      int NULL,
	Body                  text,
	OwnerUserId           int NULL,
	LastEditorUserId      int NULL,
	Title                 varchar(1000),
	PRIMARY KEY (Id),
	FOREIGN KEY (OwnerUserId) REFERENCES users (Id),
	FOREIGN KEY (LastEditorUserId) REFERENCES users (Id),
	FOREIGN KEY (AcceptedAnswerId) REFERENCES posts (Id)
);

create table comments (
	Id           int,
	PostId       int,
	Text         varchar(1000),
	UserId       int,
	PRIMARY KEY (Id),
	FOREIGN KEY (PostId) REFERENCES posts (Id),
	FOREIGN KEY (UserId) REFERENCES users (Id)
);


create table votes(
	Id           int,
	PostId       int,
	CreationDate varchar(100),
	PRIMARY KEY (Id),
	FOREIGN KEY (PostId) references posts (Id)
);

Each of the four tables stores objects that have their own attribute(s):

A user in users has a DisplayName;
A post in posts has a Title and a Body;
A comment in comments has some Text;
And a vote in votes has its CreationDate.

Besides the basic attributes, the objects are also interconnected:

A post in posts has an owner and hence a column OwnerUserId referencing a row in the users table.
A comment belongs to a post and hence the comments table has a PostId column referencing a row in the posts table.

To illustrate the migration process, we've selected some sample rows from these tables. We will walk through each table, and explain how they can be converted into N-Quad entries. The RDF N-Quad format is one of the formats that can be used for importing data into Dgraph (the other accepted format is JSON.)

# users

| Id    | DisplayName       |
| ----- | ----------------- |
| -1    | NULL              |
| 14    | Jimmy Hoffa       |
| 15    | Mooseman          |
| 2089  | MrPhooky          |
| 12370 | Paul Wesselkamper |

# posts

| Id    | AcceptedAnswerId | Body      | OwnerUserId | LastEditorUserId | Title     |
| ----- | ---------------- | --------- | ----------- | ---------------- | --------- |
| 9     | 11075            | <p>I try  | 14          | 2089             | How can.. |
| 11075 | NULL             | <p>When I | 12370       | -1               |           |

# comments

| Id  | PostId | Text                                  | UserId |
| --- | ------ | ------------------------------------- | ------ |
| 4   | 9      | Of what fabric are the blankets made? | 15     |

# votes

| Id  | PostId | CreationDate            |
| --- | ------ | ----------------------- |
| 10  | 9      | 2014-12-09T00:00:00.000 |

Converting SQL rows to RDF N-Quad entries

N-Quad entries have the format <subject> <predicate> <object> ., so the single row in the comments table is converted into the following N-Quad entries:

_:comments.4 <comments.Id> "4" .
_:comments.4 <comments.Text> "Of what fabric are the blankets made?" .

JSON equivalent (for reference):

{
  "uid": "_:comments.4",
  "comments.Id": 4,
  "comments.Text": "Of what fabric are the blankets made?"
}

Take the first entry _:comments.4 <comments.Id> "4" . as an example:

_:comments.4 is the subject,
comments.Id is the predicate,
"4" is the object, and finally
. is the terminal symbol of each N-Quad statement.

In Dgraph, a subject that begins with _: is called a blank node label, and it represents an object that is still to be created.

Thus the two N-Quad entries tell Dgraph that

a new object needs to be created for the comment,
the new object has two attributes comments.Id and comments.Text, whose values are "4", and "Of what..." respectively.

Notice that the blank node label for a given object must be unique across all the generated N-Quad entries. To ensure uniqueness, we leverage the fact that the values in the primary key columns of a table are unique so we construct the blank node labels by appending the primary key values after the table name. The row in the comments table whose primary key is 4 has the blank node label _:comments_4. However some SQL tables do not have any primary keys. To solve the problem, we would maintain a counter to generate unique labels for each row in such tables.

Another interesting design decision is that we dropped the values in the PostId and UserId columns, because they should be represented using edges in Dgraph. We will explain this in more details a little later.

Going back to the four N-Quad entries, it demonstrates the first and second rule for the conversion:

(1) To generate unique blank node labels, the values in the primary key columns are appended to table names if the table has a primary key; otherwise an increasing counter will be used.
(2) Each SQL table cell except those in foreign key columns is stored under a predicate name, which is constructed by concatenating the table name and the corresponding column name.

Similarly the 3 other tables "posts", "users", and "votes" can be converted by following the two rules:

_:posts.9       <posts.Body>                "<p>I try to use..." .
_:posts.9       <posts.Id>                  "9" .
_:posts.9       <posts.Title>               "How can I keep 2 blankets together on a bed?" .
_:posts.11075   <posts.Body>                "<p>When I was in..." .
_:posts.11075   <posts.Id>                  "11075" .
_:posts.11075   <posts.Title>               "" .
_:users.-1      <users.DisplayName>         "" .
_:users.-1      <users.Id>                  "-1" .
_:users.14      <users.DisplayName>         "Jimmy Hoffa" .
_:users.14      <users.Id>                  "14" .
_:users.15      <users.DisplayName>         "Mooseman" .
_:users.15      <users.Id>                  "15" .
_:users.2089    <users.DisplayName>         "MrPhooky" .
_:users.2089    <users.Id>                  "2089" .
_:users.12370   <users.DisplayName>         "Paul Wesselkamper" .
_:users.12370   <users.Id>                  "12370" .
_:votes.10      <votes.CreationDate>        "2014-12-09T00:00:00.000" .
_:votes.10      <votes.Id>                  "10" .

Note: the column formatting was added for your reading, the migration tool would not add the extra whitespaces.

Besides the basic attributes, we also need N-Quad entries to represent the inter-object relationships. Consider how we converted the PostId and UserId foreign keys in the comments table:

_:comments.4    <comments.PostId>          _:posts.9 .
_:comments.4    <comments.UserId>          _:users.15 .

Taking the first N-Quad entry _:comments.4 <comments.PostId> _:posts.9 . as an example. The subject is a blank node _:comments.4. The predicate comments.PostId is again the table name concatenated with the column name. The object _:posts.9 is different from a plain attribute value -- it is the blank node label of a row in the posts table, and derived by looking up the value 9 in the posts table.

From these entries come our 3rd rule of conversion:

(3) Inter-object relationships are derived by following foreign key constraints, where the blank node labels in the N-Quad objects are calculated through lookups in the target table.

Following this rule, we can establish relationships between posts and users, between posts and posts, and between votes and posts:

_:posts.9 <posts.AcceptedAnswerId> _:posts.11075 .
_:posts.9 <posts.OwnerUserId> _:users.14 .
_:posts.9 <posts.LastEditorUserId> _:users.2089 .
_:votes.10 <votes.PostId> _:posts.9 .

Deriving a Dgraph schema from SQL

Besides converting the data, the migration tool also needs to derive the schema of each predicate. We have two simple rules for converting the schema:

(1) For plain attributes, there is usually a one-to-one mapping between a SQL data type and the Dgraph datatype. For instance, the Body column in the posts table is of type text, and hence, the predicate posts.Id is of type string: posts.Body: string .
(2) The predicates representing inter-object relationships, e.g. posts.OwnerUserId., simply have the type [uid], meaning following the predicate leads us to a set of other objects.

These two rules give us the following schema when converting the SQL tables above:

comments.Id: int .
comments.Text: string .
comments.PostId: [uid] .
comments.UserId: [uid] .
posts.Body: string .
posts.Id: int .
posts.Title: string .
posts.AcceptedAnswerId: [uid] .
posts.OwnerUserId: [uid] .
posts.LastEditorUserId: [uid] .
users.DisplayName: string .
users.Id: int .
votes.CreationDate: string .
votes.Id: int .
votes.PostId: [uid] .

Once the migration tool finishes, two files will have been created:

an RDF file is created containing all the N-Quad entries, and
a schema file.

The two files can then be imported into Dgraph via the Dgraph Live Loader or Bulk Loader. It's worth pointing out that sometimes you may want to customize the schema, e.g. adding an index to a predicate, or by changing an inter-object predicate from unidirectional to bidirectional links by adding the @reverse directive. If you would like such customizations, you should do it by editing the schema file generated by the migration tool before feeding the files to the Live Loader or Bulk Loader.

Demo time!

It's time to sit back, relax, grab something to drink and enjoy a demo with the migration tool in action:

The step-by-step instructions for running the demo can be found here.

Besides the demo above, we've tested the tool using a larger data set from the lifehacks site on StackExchange, which showed us that for a dataset with around 10,000 rows the migration takes around 12 seconds (might vary on your machine).

Summary

To recap, we presented the design of a migration tool that exports SQL tables into NQuad entries. To make this blog easier to understand, we have simplified the problem, and left out some tricky details such as multi-column foreign key constraints. As a result, the gist of the design is straightforward and can be summarized into 3 rules:

To generate unique blank node labels, the values in the primary key columns are appended to the table names if the table has a primary key; otherwise a counter is used.
Each SQL table cell is stored under a predicate name, which is constructed by concatenating the table name and the corresponding column name.
Inter-object relationships are derived by following foreign key constraints, where the blank nodes in the N-Quad objects are calculated through lookups in the target table.

From there, the generated NQuad entries can be imported into Dgraph with the Live Loader or Bulk Loader.

If you find this tool useful or have any suggestions for improvements, please drop us a note at our discussion forum.