The book NoSQL Distilled is a useful and compact guide when trying to navigate the NoSQL jungle out there. Read it, or at least, read this book review.
The subtitle: A Brief Guide to the Emerging World of Polyglot Persistence says much about the content and how the authors envisions the future of data storage. NoSQL Distilled by Martin Fowler and Pramod J. Sadalage (2013) is divided into two parts. The first treats the concepts that are important when considering choosing a NoSQL database. The second part is focused on how to implement a data storage system with NoSQL.
The book works fine for someone with little prior knowledge of NoSQL, but is still a fruitful read for those with more background knowledge. The text is easy to navigate and it is easy to skip the material that might not be of importance to the reader.
The book starts by describing the value of traditional SQL databases with focus on transactions and the advantage of the standardization that SQL brings to the these databases. The object-relational impedance mismatch is described and is seen as one of the driving forces behind the NoSQL movement. The other highlighted force behind NoSQL is horizontal scalability to be able to handle larger amounts of data.
The book does not offer a precise definition of "NoSQL", but lists some common traits: they do not use the relational model, they can run distributed on a cluster of servers, open-source, are build for the web and are schemaless. The authors simply states that "NoSQL" is very ill-defined. Defining the meaning of NoSQL is tricky, but I would have appreciated an attempt by the authors. At least a definition for how it is used in the book would have been suitable. In general, I would have appreciated a more rigorous definition of important terms and data models. I understand the authors, however. Terms and definitions will change as they slowly get more accepted and standardized, so there is a risk to "be wrong". But on the other hand, the authors miss an opportunity to contribute to the definition and maturity of the concepts.
The chapter "Distribution Models" describes elegantly the different ways the function of a database can be shared between a set of servers (or not shared): single master servers, master-slave replication, peer-to-peer replication (multi-master).
The book contains a chapter on data consistency and treats, among other things, the CAP theorem (which is not a theorem). The CAP concept has been used and misused frequently the last years. The authors have an interesting take on it:
It is usually better to think not about the tradeoff between
consistency and availability, but rather between consistency and
The chapter contains fresh thought on durability. Let me quote: “it’s useful for individual calls to indicate what level of durability they need”. I completely agree. In a larger project I ran into the incredibly frustrating problem that the database did not support what I like to call "durability control". At least not on a per-transaction basis. For this actual case, we needed to wait until a transaction (larger payment) had been replicated to a remote location before the transaction is conformed to the user. The lack of durability eventually led is to change database.
The book handles map-reduce in a simple, practical and accessible way. Map-reduce is a way to run database queries in parallel on multiple servers without the need for the client to explicitly divide the task into pieces suitable for each server. This is handled by the database engine completely transparent the the client.
The authors have chosen to divide the NoSQL databases into four categories: key-value, document database, column-family stores, and graph databases. This categorization is reasonable, but not obvious. It would have been interesting to know why the authors chose these specific categories. The chapters on these database categories are well-written and gives the reader a valuable overview. Each chapter contains a section with typical use cases that suits certain types of databases. Very useful.
The chapters "Polyglot Persistence", "Beyond NoSQL", and "Choosing Your Database" finish the book. Polyglot Persistence describes how a heterogeneous collection of databases are growing in fertile soil while the total dominance of SQL databases is fading. Different storage needs are best solved by different database types. The chapter on how to choose a database is short and has no advice on specific database products. Men the book is short and introductory, so this is reasonable. One could easily write a whole book on the topic of how to choose database given specific storage needs.
"Beyond NoSQL" is indeed interesting reading. The authors here describes storage technology that does not fit under the NoSQL umbrella. Among other things, they write something I find fundamental and worth repeating:
When we think of data storage, we tend to think of a
single-point-of-time worldview, which is very limiting compared to the
complexity supported by a version control system. It’s therefore
surprising that data storage tool haven’t borrowed some of the ideas
from version control systems. After all, many situations require
historic queries and support for multiple views of the world.
I completely agree! The lack of support for historic queries in existing databases is one of the main driving forces for developing BergDB (bergdb.com). BergDB handles historic queries just like a query for the last state of the database. All states (the most current state as well as historic states) are available for queries at the same time.
The book is good and I really enjoyed reading it. But there is one, nearly unforgivable, mistake. The book does not cover consistent hashing (original paper, introduction). Consistent hashing is used by Riak, Cassandra, Memcached and is fundamental to achieving reads and writes that scales horizontally and linearly.
So, read the book. But also read about consistent hashing.
Good luck in the NoSQL jungle out there!
Post a Comment