January 21, 2014

NoSQL Distilled - A Good Guide in the NoSQL Jungle

The book NoSQL Distilled is a useful and compact guide when trying to navigate the NoSQL jungle out there. Read it, or at least, read this book review.

The subtitle: A Brief Guide to the Emerging World of Polyglot Persistence says much about the content and how the authors envisions the future of data storage. NoSQL Distilled by Martin Fowler and Pramod J. Sadalage (2013) is divided into two parts. The first treats the concepts that are important when considering choosing a NoSQL database. The second part is focused on how to implement a data storage system with NoSQL.

The book works fine for someone with little prior knowledge of NoSQL, but is still a fruitful read for those with more background knowledge. The text is easy to navigate and it is easy to skip the material that might not be of importance to the reader.

The book starts by describing the value of traditional SQL databases with focus on transactions and the advantage of the standardization that SQL brings to the these databases. The object-relational impedance mismatch is described and is seen as one of the driving forces behind the NoSQL movement. The other highlighted force behind NoSQL is horizontal scalability to be able to handle larger amounts of data.

The book does not offer a precise definition of "NoSQL", but lists some common traits: they do not use the relational model, they can run distributed on a cluster of servers, open-source, are build for the web and are schemaless. The authors simply states that "NoSQL" is very ill-defined. Defining the meaning of NoSQL is tricky, but I would have appreciated an attempt by the authors. At least a definition for how it is used in the book would have been suitable. In general, I would have appreciated a more rigorous definition of important terms and data models. I understand the authors, however. Terms and definitions will change as they slowly get more accepted and standardized, so there is a risk to "be wrong". But on the other hand, the authors miss an opportunity to contribute to the definition and maturity of the concepts.

The chapter "Distribution Models" describes elegantly the different ways the function of a database can be shared between a set of servers (or not shared): single master servers, master-slave replication, peer-to-peer replication (multi-master).

The book contains a chapter on data consistency and treats, among other things, the CAP theorem (which is not a theorem). The CAP concept has been used and misused frequently the last years. The authors have an interesting take on it:

It is usually better to think not about the tradeoff between consistency and availability, but rather between consistency and latency.

The chapter contains fresh thought on durability. Let me quote: “it’s useful for individual calls to indicate what level of durability they need”. I completely agree. In a larger project I ran into the incredibly frustrating problem that the database did not support what I like to call "durability control". At least not on a per-transaction basis. For this actual case, we needed to wait until a transaction (larger payment) had been replicated to a remote location before the transaction is conformed to the user. The lack of durability eventually led is to change database.

The book handles map-reduce in a simple, practical and accessible way. Map-reduce is a way to run database queries in parallel on multiple servers without the need for the client to explicitly divide the task into pieces suitable for each server. This is handled by the database engine completely transparent the the client.

The authors have chosen to divide the NoSQL databases into four categories: key-value, document database, column-family stores, and graph databases. This categorization is reasonable, but not obvious. It would have been interesting to know why the authors chose these specific categories. The chapters on these database categories are well-written and gives the reader a valuable overview. Each chapter contains a section with typical use cases that suits certain types of databases. Very useful.

The chapters "Polyglot Persistence", "Beyond NoSQL", and "Choosing Your Database" finish the book. Polyglot Persistence describes how a heterogeneous collection of databases are growing in fertile soil while the total dominance of SQL databases is fading. Different storage needs are best solved by different database types. The chapter on how to choose a database is short and has no advice on specific database products. Men the book is short and introductory, so this is reasonable. One could easily write a whole book on the topic of how to choose database given specific storage needs.

"Beyond NoSQL" is indeed interesting reading. The authors here describes storage technology that does not fit under the NoSQL umbrella. Among other things, they write something I find fundamental and worth repeating:

When we think of data storage, we tend to think of a single-point-of-time worldview, which is very limiting compared to the complexity supported by a version control system. It’s therefore surprising that data storage tool haven’t borrowed some of the ideas from version control systems. After all, many situations require historic queries and support for multiple views of the world.

I completely agree! The lack of support for historic queries in existing databases is one of the main driving forces for developing BergDB (bergdb.com). BergDB handles historic queries just like a query for the last state of the database. All states (the most current state as well as historic states) are available for queries at the same time.

The book is good and I really enjoyed reading it. But there is one, nearly unforgivable, mistake. The book does not cover consistent hashing (original paper, introduction). Consistent hashing is used by Riak, Cassandra, Memcached and is fundamental to achieving reads and writes that scales horizontally and linearly.

So, read the book. But also read about consistent hashing.

Good luck in the NoSQL jungle out there!

January 16, 2014

One in a Million, PUT

Here's the performance result for PUT operations. See the previous post for information on the setup. A PUT operations updates the value of an existing key. One million keys are stored. The PUT operations are run one-by-one by a single thread. 10,000 or 100,000 PUT operations (random keys) are made in sequence and timed. The time per PUT is recorded.

This result should be taken with a grain of salt! The default settings are used. For MySQL 5.5 this means that the InnoDB storage engine is used and it flushes to disk for every write (innodb_flush_log_at_trx_commit=1) before it returns to the client. Presumably, this is the case for Riak too, while MongoDB seems to return to the client without flushing data to disk.

It would be interesting to dig further into this... As is always the case with performance testing. My ugly source code is available upon request.

As often is the case, I/O limits performance, not the CPU. When the data is cached and available in-process, a PUT operation only needs to take in the order of 10 us (BergDB, Berkeley, BerkeleyT). Thus, a throughput in the order of 100,000 transactions/s is achieved. Prevayler is a bit slower. This is likely because of the overhead of serializing Java objects.

The in-process databases write data for every transaction, but they do not flush the data all the way to physical disk storage for every transaction. It is impossible to do if one wants to achieve a throughput in the order of 100,000 transactions/s. The default approach by BergDB is to flush data to disk every 0.1 seconds.

There are many important things to say about disk flushing, how to define "durable", and what I like to call durability control. Hopefully, I will be able to address these important topics in future posts.

January 10, 2014

One in a Million

GET performance of MongoDB, Riak, BergDB, MySQL, and more

Ola Rende (a colleague at Citerus) and I will hold a presentation of a few NoSQL databases on January 21st in Stockholm. Welcome to participate! Contact Citerus for details and to sign-up.

While preparing, I wanted to write some code to test the databases for one specific use case. I chose a simple test that works for any database that can store key-value pairs.

First, the database is populated with one million 4-byte keys with corresponding 4-byte values. The value is the same as the key. The keys are the integers from 0 to 999,999 encoded in binary. They are added in random order. The key set happens to be dense (all non-negative integers < 1,000,000), but this is just a coincidence. The database must store the entries in a way that would allow any set of integers as keys.

Then the GET performance of the database is tested. This is done by accessing 100,000 of the entries selected randomly. This is repeated three times and the best result (best average GET time) is recorded. This means that the cache of the database should be filled and presumably, there is no need to read from disk for the 2nd and 3rd run. The access is made from a Java process on my computer. My setup: Ubuntu 12.04 LTS, Toshiba Z930 laptop, i7 processor, SSD disk, JVM 1.7 from Oracle. The database is either run in-process for the in-processes databases, or as a separate process on localhost. By default, I used the default settings of the databases.

Here is the result:


For a data set that fits into cache, the tested in-process databases are in the order of 100 times faster that the out-of-process databases (MySQL, MongoDB, Riak). This is not surprising. On my computer, I measured the network round-trip (TCP, one byte back and forth) between two processes on localhost to be 50 us.

For many applications, all databases tested can be considered fast. Riak responds in a little more than 1 ms which is acceptable to many applications. Note that the tests are run with client and server on localhost. The network round-trip overhead could of course be 10 ms or something for computers further away from each other. Also, if data would not fit into cache, the disk access time may be as high as 10 ms and it would limit the performance. Consider using SSD disks!

Of, course the result should not be seen as some overall evaluation of the databases. For example, Riak scales horizontally and provides high-availability and BergDB supports historic queries. There is much, much more to these databases then what is tested here.

Other comments:
  • IO limits performance. Most likely, the network or the disk IO will limit the performance of your database setup. When data is cached and available in-process, lookups can be done within a few microseconds, while disk and network access times often are in the order of milliseconds.
  • Consider an in-process database. If you want an application database, not an integration database, an in-process database may be a performant alternative.
  • Little benefits of all-in-memory databases? I question the benefits of all-in-memory databases. The in-process databases Berkeley DB and BergDB perform on par with the all-in-memory solutions (TreeMap, Prevayler). So why use an all-in-memory database with its problem of a slow startup (all data must be read from disk to RAM at startup time which may take a long time)?
  • You might need 1000 servers to beat one. Riak and Cassandra may scale horizontally and linearly, but a single server can have a very impressive throughput for some use cases. So, don't get a huge cluster of servers if you only need one server. See The LMAX Architecture by Martin Fowler as an example.

Per database comments

TreeMap. This is the java.util.TreeMap class. Not a real database, but included for comparison. Access to it is made in a synchronized block.

Prevayler. Prevayler 2.6. Prevayler is an all-in-memory database. All data is stored in memory and must be read to RAM at startup. The data is stored in one big serializable Java object. Since it is stored as a Java object in the JVM, the performance for random access is optimal. It takes less than one microsecond to get a value given its key. This is the same performance as a TreeMap; actually, a TreeMap is used to store the data for this performance test.

BergDB. BergDB is a database I created. When data is cached, the GET time is on par with what is offered by an all-in-memory solution like Prevayler or TreeMap.

Berkeley. Berkeley DB, Java Edition, 5.0.97 is a stable, high-performance in-process database. "BerkeleyT" is Berkeley DB used with transactions enabled. When comparing performance, note that BergDB and Prevayler always supports transactions (cannot be disabled).

MySQL. MySQL 5.5 with default settings (InnoDB storage engine, isolation level: repeatable read). The officially supported JDBC Java Driver is used. To save time, only 100k keys were used. The actual GET time for 1M keys could be somewhat higher.

MongoDB. MongoDB with their official Java driver. Default settings.

Riak. Latest stable release of Riak with Java client 1.4.2. For this database, I used only 100k key-value entries to save some time. So the actual GET time for 1M keys could be somewhat higher.