Open Source Alternative to Amazon Neptune

Amazon just announced their new graph database service, called Amazon Neptune. As per a TechCrunch article,

Amazon Neptune has been optimized to handle billions of relationships and run queries within milliseconds. Neptune supports fast-failover, point-in-time recovery and Multi-AZ deployments. And you can also encrypt data at rest.

This is very exciting news for the entire tech ecosystem. It clearly shows that graph databases are going mainstream. Already many tech companies are using existing graph solutions or building their own graph-like systems.

Many devs would be happy users of Neptune, given Amazon is running this as a managed service. People trust Amazon; they have done a great job with AWS so far. So, this is good news for adoption.

But, in this article, I want to look past the blinding light of Amazon — into the design of this new graph database. Despite being built by Amazon, Neptune’s design is obsolete right out of the gate. Let’s look through the design features of Neptune.

Licensing: Nothing is mentioned about the availability of an open source version. It’s safe to presume that Neptune is closed source.

Scalability: Neptune is vertically scalable instead of being horizontally scalable. That means that as your data size increases, it must still fit on a single server, because there is no way to shard the data.

ACID Transactions: There’s no mention of transactions on their features page, which indicates that Neptune potentially does not support ACID transactions.

Replication: The replicas are global copies of the original data, with all writes going to one server and the rest being read-only servers. Neptune provides eventual consistency for replication and looks to be built on top of Amazon Aurora: the write server writes to this storage, and the rest of the servers read from it. They provide fast reads by bringing the data into RAM (“in-memory optimized architecture”).

APIs: Neptune supports Gremlin and SPARQL.

There are benefits and drawbacks to this architecture.

On the benefits side, read replicas are great for increasing query throughput. Using Aurora as storage would ensure that, once written, your data has the same guarantees as Aurora (protection from disk losses, etc.). Amazon has a long track record of providing hosted services, so a managed service from them is great for adoption.

It comes with a fair number of drawbacks as well. Vertical scaling introduces a single point of failure in the system: a single server crashing could bring the entire system down. While Amazon has strict SLAs on their managed service, this is still a questionable system design from the tech giant.

Vertical scalability is also far more expensive than horizontal scalability. Fitting the entire dataset on a single machine requires specialized hardware, which gets disproportionately more expensive than just buying another server. In fact, that ship sailed a long time ago, when the internet exploded in the early 2000s. Bigtable, MongoDB, Cassandra, Elasticsearch, Spanner, CockroachDB, etc.: all the new technologies since the 2000s have been designed to be horizontally scalable. So, Neptune is at least 10–20 years behind the needs of the market. Is vertical scalability even a thing anymore?

Lack of ACID transactions is the biggest complaint about NoSQL databases. Distributed transactions are a hard but solvable problem. But Neptune is a single-server architecture, so it’s almost unbelievable that they don’t say a single word about support for ACID transactions. I’m going to be optimistic and presume that they do support them but just failed to mention it in the feature list. It’s commonly believed that support for transactions is what distinguishes a database from a datastore.

Why are these drawbacks important?

Let’s take a step back and look at database evolution. For a long time, we only had relational databases which ran on a single server. You could potentially have an eventually consistent replica, but all the data still had to fit into one machine.

Exploding Kittens

The web exploded, and Google, being the search engine for the web, had to keep up. They launched MapReduce in 2004, which, among other things, helped run the indexing system at Google to process all the data. Google realized that single-server databases were not able to handle the data growth and launched Bigtable in 2006. Bigtable was designed solely with the idea of building a horizontally scalable database, something which relational DBs didn’t provide. This was a remarkable feat at the time, but it was achieved at a cost to the developer: a lack of joins and transactions.

Bigtable kickstarted the entire NoSQL revolution, out of which MongoDB, RethinkDB, and Cassandra came. But, lack of transactions continued to remain an issue for developers. Within Google, Megastore and Percolator were born to deal with these shortcomings, but at the cost of performance.

With the lack of transactions in Bigtable described as the “biggest mistake as an engineer,” Google worked to build Spanner, a horizontally scalable, multi-version, globally distributed, and synchronously replicated database with distributed transactions. Again, this kickstarted a whole new revolution, which people are calling NewSQL, with CockroachDB and TiDB following on Spanner’s heels.

All of this shows where the database market is going. A database of the 2000s is horizontally scalable, supports distributed transactions, and is consistently replicated. Without these guarantees, building applications on top becomes complicated, something Google’s Bigtable team and NoSQL users learned the hard way.

Given these advancements, Amazon Neptune’s design is pre-2000. Single server vertically scaled, asynchronously replicated, lack of transactions — all this screams outdated.

Open Source Alternative

Now, let’s consider an open source alternative: Dgraph, a distributed graph database.

Licensing: Dgraph is available under the Apache 2.0 license, which allows anyone to use it, run it, modify it, or build proprietary services on top of it.

Scalability: This is 2017! Dgraph is horizontally scalable. As you add more servers, data automatically gets sharded and moved to fill the new ones. Dgraph also automatically rebalances shards among servers to ensure that load is evenly distributed.

ACID Transactions: While Neptune doesn’t mention anything about transactions, Dgraph runs lock-free distributed ACID transactions, with snapshot isolation, designed for performance. Transactions make it a lot easier for application developers to think through database behavior, removing any data integrity issues.
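To make "snapshot isolation" concrete, here is a minimal single-process sketch in Python. It is not Dgraph's implementation: the `Store` and `Txn` classes, the integer clock, and the first-committer-wins rule are all assumptions chosen for illustration. Each transaction reads from the snapshot taken when it began and aborts at commit time if another transaction has since committed a write to one of the same keys.

```python
class Store:
    """Toy multi-version key-value store (hypothetical, for illustration only)."""

    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), oldest first
        self.clock = 0      # timestamp of the latest commit

    def begin(self):
        # A transaction's snapshot is everything committed up to now.
        return Txn(self, self.clock)

    def read_at(self, key, ts):
        # Return the newest value committed at or before ts.
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= ts:
                return value
        return None

    def last_commit_ts(self, key):
        versions = self.versions.get(key)
        return versions[-1][0] if versions else 0


class Txn:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.writes = {}  # buffered writes, invisible to others until commit

    def get(self, key):
        if key in self.writes:
            return self.writes[key]
        return self.store.read_at(key, self.start_ts)

    def set(self, key, value):
        self.writes[key] = value

    def commit(self):
        # First committer wins: abort if a key we wrote changed after our snapshot.
        for key in self.writes:
            if self.store.last_commit_ts(key) > self.start_ts:
                return False
        self.store.clock += 1
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((self.store.clock, value))
        return True


store = Store()
setup = store.begin()
setup.set("balance", 100)
setup.commit()

t1 = store.begin()
t2 = store.begin()  # same snapshot as t1
t1.set("balance", t1.get("balance") - 30)
t2.set("balance", t2.get("balance") - 50)

first_ok = t1.commit()   # succeeds
second_ok = t2.commit()  # conflicts with t1 and aborts
final_balance = store.begin().get("balance")
print(first_ok, second_ok, final_balance)
```

Neither transaction takes a lock while it runs; the conflict is detected only at commit, and the losing transaction would be retried by the application.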

Replication: Dgraph provides consistent (synchronous) replication, using Raft, a consensus algorithm, to tolerate k servers dying in a group of 2k+1 replicas. Each data shard can be replicated an odd number of times, with the replicas located in different datacenters or availability zones to allow faster communication from your clients.
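The 2k+1 arithmetic follows from majority quorums, and a few lines of Python show why odd group sizes are the natural choice. This is only the counting argument, not Raft itself, and the function names are made up for this sketch.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a majority quorum."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can die while a majority is still alive."""
    return replicas - majority(replicas)


for n in (1, 2, 3, 4, 5):
    print(f"{n} replicas: quorum of {majority(n)}, tolerates {tolerated_failures(n)} failures")
```

Going from 3 to 4 replicas raises the quorum size without tolerating any additional failure, which is why an even-sized group buys nothing over the odd size below it.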

Linearizable reads: Neptune's eventual consistency model can cause issues where a write is not yet visible when it is read back through a replica, requiring the application to build logic around this limitation. Dgraph avoids this problem by guaranteeing atomic consistency of writes, which means that irrespective of which replica serves the read, any write completed before it is guaranteed to be visible. This guarantee makes it a lot easier to build applications.
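The read-after-write problem is easy to reproduce in a toy model. The Python sketch below is a simulation under stated assumptions, not a description of how Neptune or Aurora replicate internally: `Primary` and `AsyncReplica` are hypothetical classes, and replication happens only when `replicate()` is called, which stands in for replication lag.

```python
class Primary:
    """Accepts writes and keeps them in an ordered log."""

    def __init__(self):
        self.log = []  # list of (key, value) in write order

    def write(self, key, value):
        self.log.append((key, value))
        return len(self.log)  # log position of this write


class AsyncReplica:
    """Read-only copy that applies the primary's log lazily, so it can lag."""

    def __init__(self, primary):
        self.primary = primary
        self.applied = 0  # how many log entries have been applied
        self.data = {}

    def replicate(self):
        for key, value in self.primary.log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.primary.log)

    def read(self, key):
        return self.data.get(key)


primary = Primary()
replica = AsyncReplica(primary)

position = primary.write("alice.follows", "bob")
stale = replica.read("alice.follows")   # the write has not reached the replica yet
caught_up_before = replica.applied >= position

replica.replicate()
fresh = replica.read("alice.follows")   # now the write is visible
print(stale, caught_up_before, fresh)
```

The `position` check is the kind of extra logic an application ends up carrying when replicas are eventually consistent: it has to track what it wrote and verify that a replica has caught up before trusting a read.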

APIs: Dgraph supports a variant of GraphQL, a modern language to replace REST. Cypher and Gremlin support are in the pipeline.

Others: While Neptune is built on top of another DB (Amazon Aurora), Dgraph is a native graph database. As such, data storage is designed for constant-time edge traversals. Dgraph also has a unique design that uses posting lists, a concept from search engines, which makes joins highly efficient. It’s designed for a real-world cluster: it minimizes network calls and disk seeks, runs automatic failovers in case of communication delays, and runs all operations concurrently. All these innovations show in the low query latency provided by Dgraph.
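To see why posting lists help with joins, consider the search-engine version of the idea: each list holds the sorted IDs of the nodes that match one condition, and a join becomes an intersection of sorted lists. The Python sketch below shows only that general technique; the data and the `intersect` helper are invented for illustration, and Dgraph's actual storage format is not shown here.

```python
def intersect(a, b):
    """Intersect two sorted lists of node IDs in O(len(a) + len(b)) time."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical posting lists: sorted node IDs matching each condition.
follows_bob = [3, 7, 12, 19, 42]
lives_in_sf = [7, 8, 19, 23, 42, 57]

# "Who follows Bob and lives in SF?" is a single linear pass over both lists.
print(intersect(follows_bob, lives_in_sf))  # [7, 19, 42]
```

Because both lists are already sorted, the join never needs a hash table or a nested loop, and each list is read sequentially.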

Conclusion

Amazon’s Neptune is an exciting new graph database, a clear signal that graph databases are being used by companies small and large. I hope that Amazon provides more content to help developers create and switch their applications to utilize the power of graph DBs.

While Neptune comes with the convenience of a managed service, its design is outdated and the technology proprietary. Therefore, potential users of graph databases who want to build smart, scalable applications and need ACID transactions, a horizontally scalable design, and strong replication consistency should look into the various open source graph databases available, like Dgraph.

- This post was originally written on Medium.