July 25, 2014

BergDB 2014.1

I made a release of BergDB today. bergdb-2014.1.zip can be downloaded from bergdb.com.

The release is focused on stability, clean code, and an improved API. The software is getting stable. There are no known issues and 650 units tests that say it works (85 per cent code coverage). The API is getter more stable between releases and the binary database file format is not changing much.

So, in my opinion, it is now production ready for many types of development projects.

July 20, 2014

Database Mathematics

We need mathematical rigor in the theory of databases. The current situation is that we have a very weak foundation for database theory. Not even the basic concepts such as atomicity and durability have generally accepted definitions. Instead some ill-defined (yet often somewhat useful) concepts are taken for granted without clear definitions.

As mentioned before in post ACID Does Not Make Sense, the concept of ACID transactions is ill-defined. This blog also has a post (The CAP Theorem Is Not a Theorem) on the so called CAP Theorem where I agree with Mark Burgess that the so called CAP Theorem is not a theorem at all. Brewers conjecture has not been proven with mathematical rigour.

Data storage is mathematics. I believe it is possible to describe the function of modern databases in rigorous mathematical terms using existing or mostly existing mathematics. I wish I had the resources and the skills myself to develop the much needed database mathematics (or "data storage theory", "data mathematics", "database theory").

The need for more rigor is even more important now when there is a more diverse set of database products available. Yes, I do talk about the NoSQL movement. The relational model of SQL databases has a useful mathematical background. However, a decreasing subset of databases use the relational model introduced by Edgar F. Codd in 1969.

Durability could be defined using probability theory. What is the probability that we can read a value written to a database after after one second, one hour, one day, and after ten years? What is acceptable?

Atomicity would also benefit from a mathematical treatment. With global atomicity and transaction with serializable isolation level (SQL speak), the database state evolves through a sequence of distinct states. A transaction is then a mathematical function that takes the database from one state to the next. I have a draft blog entry on the different types of atomicity that are used in common database products and programming languages. We will see if and when it will be published.

There is so much more to say and think about this. It would be interesting to get in contact with a mathematician who would be interested in making a contribution in this field. And maybe I missed something in the existing literature? Maybe there is significant work on this already?

May 8, 2014

There Is No "Now"

Let's consider a common database with distributed clients. Often we tend to think about the "current state" of the database or the "now" state. But, really, in general, there is no commonly agreed on global state that all clients would agree on.

There is no now!

Even for a centralized database running on a single server, there is no consistent current state. Clients will not observe a commonly agreed on database state "now". The observers are distributed and there is always a delay until an update has reached a client. And this delay is not deterministic. No matter how short, there is always a delay.

There can be a central consistent view of the state, but not at distributed clients / "observers". This is equivalent to the Theory of Relativity. Information travels with a finite speed. Two observers will in general observe an update at different times.

So, eventual consistency is the only consistency there ever is!

However, all observers of a centralized database can agree on a global history of the database state. If the updates are sent from the central server to the clients in a given order, the whole history of the data up to the time of the last update that has reached a client can be agreed on between all clients that have received the update. All data is old, there is no agreed on now, but the history up until the last update received, can be globally agreed on.

This is one of the reasons why databases should be append-only (called "immutable" by some) and support historic queries. They should remember the history. This is the only thing that can be globally agreed on.


The video: The Value of Values, Rich Hickley

The Special Theory of Relativity for distributed systems

January 21, 2014

NoSQL Distilled - A Good Guide in the NoSQL Jungle

The book NoSQL Distilled is a useful and compact guide when trying to navigate the NoSQL jungle out there. Read it, or at least, read this book review.

The subtitle: A Brief Guide to the Emerging World of Polyglot Persistence says much about the content and how the authors envisions the future of data storage. NoSQL Distilled by Martin Fowler and Pramod J. Sadalage (2013) is divided into two parts. The first treats the concepts that are important when considering choosing a NoSQL database. The second part is focused on how to implement a data storage system with NoSQL.

The book works fine for someone with little prior knowledge of NoSQL, but is still a fruitful read for those with more background knowledge. The text is easy to navigate and it is easy to skip the material that might not be of importance to the reader.

The book starts by describing the value of traditional SQL databases with focus on transactions and the advantage of the standardization that SQL brings to the these databases. The object-relational impedance mismatch is described and is seen as one of the driving forces behind the NoSQL movement. The other highlighted force behind NoSQL is horizontal scalability to be able to handle larger amounts of data.

The book does not offer a precise definition of "NoSQL", but lists some common traits: they do not use the relational model, they can run distributed on a cluster of servers, open-source, are build for the web and are schemaless. The authors simply states that "NoSQL" is very ill-defined. Defining the meaning of NoSQL is tricky, but I would have appreciated an attempt by the authors. At least a definition for how it is used in the book would have been suitable. In general, I would have appreciated a more rigorous definition of important terms and data models. I understand the authors, however. Terms and definitions will change as they slowly get more accepted and standardized, so there is a risk to "be wrong". But on the other hand, the authors miss an opportunity to contribute to the definition and maturity of the concepts.

The chapter "Distribution Models" describes elegantly the different ways the function of a database can be shared between a set of servers (or not shared): single master servers, master-slave replication, peer-to-peer replication (multi-master).

The book contains a chapter on data consistency and treats, among other things, the CAP theorem (which is not a theorem). The CAP concept has been used and misused frequently the last years. The authors have an interesting take on it:

It is usually better to think not about the tradeoff between consistency and availability, but rather between consistency and latency.

The chapter contains fresh thought on durability. Let me quote: “it’s useful for individual calls to indicate what level of durability they need”. I completely agree. In a larger project I ran into the incredibly frustrating problem that the database did not support what I like to call "durability control". At least not on a per-transaction basis. For this actual case, we needed to wait until a transaction (larger payment) had been replicated to a remote location before the transaction is conformed to the user. The lack of durability eventually led is to change database.

The book handles map-reduce in a simple, practical and accessible way. Map-reduce is a way to run database queries in parallel on multiple servers without the need for the client to explicitly divide the task into pieces suitable for each server. This is handled by the database engine completely transparent the the client.

The authors have chosen to divide the NoSQL databases into four categories: key-value, document database, column-family stores, and graph databases. This categorization is reasonable, but not obvious. It would have been interesting to know why the authors chose these specific categories. The chapters on these database categories are well-written and gives the reader a valuable overview. Each chapter contains a section with typical use cases that suits certain types of databases. Very useful.

The chapters "Polyglot Persistence", "Beyond NoSQL", and "Choosing Your Database" finish the book. Polyglot Persistence describes how a heterogeneous collection of databases are growing in fertile soil while the total dominance of SQL databases is fading. Different storage needs are best solved by different database types. The chapter on how to choose a database is short and has no advice on specific database products. Men the book is short and introductory, so this is reasonable. One could easily write a whole book on the topic of how to choose database given specific storage needs.

"Beyond NoSQL" is indeed interesting reading. The authors here describes storage technology that does not fit under the NoSQL umbrella. Among other things, they write something I find fundamental and worth repeating:

When we think of data storage, we tend to think of a single-point-of-time worldview, which is very limiting compared to the complexity supported by a version control system. It’s therefore surprising that data storage tool haven’t borrowed some of the ideas from version control systems. After all, many situations require historic queries and support for multiple views of the world.

I completely agree! The lack of support for historic queries in existing databases is one of the main driving forces for developing BergDB (bergdb.com). BergDB handles historic queries just like a query for the last state of the database. All states (the most current state as well as historic states) are available for queries at the same time.

The book is good and I really enjoyed reading it. But there is one, nearly unforgivable, mistake. The book does not cover consistent hashing (original paper, introduction). Consistent hashing is used by Riak, Cassandra, Memcached and is fundamental to achieving reads and writes that scales horizontally and linearly.

So, read the book. But also read about consistent hashing.

Good luck in the NoSQL jungle out there!

January 16, 2014

One in a Million, PUT

Here's the performance result for PUT operations. See the previous post for information on the setup. A PUT operations updates the value of an existing key. One million keys are stored. The PUT operations are run one-by-one by a single thread. 10,000 or 100,000 PUT operations (random keys) are made in sequence and timed. The time per PUT is recorded.

This result should be taken with a grain of salt! The default settings are used. For MySQL 5.5 this means that the InnoDB storage engine is used and it flushes to disk for every write (innodb_flush_log_at_trx_commit=1) before it returns to the client. Presumably, this is the case for Riak too, while MongoDB seems to return to the client without flushing data to disk.

It would be interesting to dig further into this... As is always the case with performance testing. My ugly source code is available upon request.

As often is the case, I/O limits performance, not the CPU. When the data is cached and available in-process, a PUT operation only needs to take in the order of 10 us (BergDB, Berkeley, BerkeleyT). Thus, a throughput in the order of 100,000 transactions/s is achieved. Prevayler is a bit slower. This is likely because of the overhead of serializing Java objects.

The in-process databases write data for every transaction, but they do not flush the data all the way to physical disk storage for every transaction. It is impossible to do if one wants to achieve a throughput in the order of 100,000 transactions/s. The default approach by BergDB is to flush data to disk every 0.1 seconds.

There are many important things to say about disk flushing, how to define "durable", and what I like to call durability control. Hopefully, I will be able to address these important topics in future posts.

January 10, 2014

One in a Million

GET performance of MongoDB, Riak, BergDB, MySQL, and more

Ola Rende (a colleague at Citerus) and I will hold a presentation of a few NoSQL databases on January 21st in Stockholm. Welcome to participate! Contact Citerus for details and to sign-up.

While preparing, I wanted to write some code to test the databases for one specific use case. I chose a simple test that works for any database that can store key-value pairs.

First, the database is populated with one million 4-byte keys with corresponding 4-byte values. The value is the same as the key. The keys are the integers from 0 to 999,999 encoded in binary. They are added in random order. The key set happens to be dense (all non-negative integers < 1,000,000), but this is just a coincidence. The database must store the entries in a way that would allow any set of integers as keys.

Then the GET performance of the database is tested. This is done by accessing 100,000 of the entries selected randomly. This is repeated three times and the best result (best average GET time) is recorded. This means that the cache of the database should be filled and presumably, there is no need to read from disk for the 2nd and 3rd run. The access is made from a Java process on my computer. My setup: Ubuntu 12.04 LTS, Toshiba Z930 laptop, i7 processor, SSD disk, JVM 1.7 from Oracle. The database is either run in-process for the in-processes databases, or as a separate process on localhost. By default, I used the default settings of the databases.

Here is the result:


For a data set that fits into cache, the tested in-process databases are in the order of 100 times faster that the out-of-process databases (MySQL, MongoDB, Riak). This is not surprising. On my computer, I measured the network round-trip (TCP, one byte back and forth) between two processes on localhost to be 50 us.

For many applications, all databases tested can be considered fast. Riak responds in a little more than 1 ms which is acceptable to many applications. Note that the tests are run with client and server on localhost. The network round-trip overhead could of course be 10 ms or something for computers further away from each other. Also, if data would not fit into cache, the disk access time may be as high as 10 ms and it would limit the performance. Consider using SSD disks!

Of, course the result should not be seen as some overall evaluation of the databases. For example, Riak scales horizontally and provides high-availability and BergDB supports historic queries. There is much, much more to these databases then what is tested here.

Other comments:
  • IO limits performance. Most likely, the network or the disk IO will limit the performance of your database setup. When data is cached and available in-process, lookups can be done within a few microseconds, while disk and network access times often are in the order of milliseconds.
  • Consider an in-process database. If you want an application database, not an integration database, an in-process database may be a performant alternative.
  • Little benefits of all-in-memory databases? I question the benefits of all-in-memory databases. The in-process databases Berkeley DB and BergDB perform on par with the all-in-memory solutions (TreeMap, Prevayler). So why use an all-in-memory database with its problem of a slow startup (all data must be read from disk to RAM at startup time which may take a long time)?
  • You might need 1000 servers to beat one. Riak and Cassandra may scale horizontally and linearly, but a single server can have a very impressive throughput for some use cases. So, don't get a huge cluster of servers if you only need one server. See The LMAX Architecture by Martin Fowler as an example.

Per database comments

TreeMap. This is the java.util.TreeMap class. Not a real database, but included for comparison. Access to it is made in a synchronized block.

Prevayler. Prevayler 2.6. Prevayler is an all-in-memory database. All data is stored in memory and must be read to RAM at startup. The data is stored in one big serializable Java object. Since it is stored as a Java object in the JVM, the performance for random access is optimal. It takes less than one microsecond to get a value given its key. This is the same performance as a TreeMap; actually, a TreeMap is used to store the data for this performance test.

BergDB. BergDB is a database I created. When data is cached, the GET time is on par with what is offered by an all-in-memory solution like Prevayler or TreeMap.

Berkeley. Berkeley DB, Java Edition, 5.0.97 is a stable, high-performance in-process database. "BerkeleyT" is Berkeley DB used with transactions enabled. When comparing performance, note that BergDB and Prevayler always supports transactions (cannot be disabled).

MySQL. MySQL 5.5 with default settings (InnoDB storage engine, isolation level: repeatable read). The officially supported JDBC Java Driver is used. To save time, only 100k keys were used. The actual GET time for 1M keys could be somewhat higher.

MongoDB. MongoDB with their official Java driver. Default settings.

Riak. Latest stable release of Riak with Java client 1.4.2. For this database, I used only 100k key-value entries to save some time. So the actual GET time for 1M keys could be somewhat higher.