"In the beginning Codd gave Relational Databases..."
On What Was Before
In 1970 Edgar Frank Codd, at the time an employee at IBM, proposed the relational model of data management, which represented the theoretical foundation for relational databases.
In relational databases, data is stored in tables as rows and columns according to a strict, predefined schema; normalization keeps data redundancy to a minimum; and we can manipulate data in a standardized way via SQL.
They also support what I think is their most important attribute: ACID transactions.
If you have taken a relational databases course, or done any research on the topic on your own, you probably know that:
A stands for Atomicity
C stands for Consistency
I stands for Isolation
D stands for Durability
Without going deep into the subject, this fancy acronym is essentially about the strong consistency of our data, which is the most important guarantee of relational databases.
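To make the "A" a bit more tangible, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names in it, and the transfer helper are made up for illustration. It shows that when the second statement of a transaction fails, the first one is undone as well.

```python
import sqlite3

# In-memory database with a hypothetical "accounts" table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move money between two accounts as one all-or-nothing transaction."""
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if an exception is raised inside the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the credit above was rolled back too.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
```

After both calls, only the first transfer is visible in the balances; the failed one left no partial update behind.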
In the years that followed, these databases became the default option for data storage in the industry.
The concept of NoSQL databases has been around for a long time, just like the concept of relational databases.
It was not until the early 2000s, however, that they gradually started to gain the popularity they enjoy nowadays.
But what does NoSQL stand for?
There is some debate about the origin and the precise meaning of the term NoSQL, but the essence of the term is that it refers to data storage technologies that do not use the relational data model - non-relational databases.
Relational databases have been around for so long and they have been doing their job pretty well, so what is the reason for NoSQL to "take over"?
What has changed?
Data, Data, Data
We could philosophize about the change that had the most significant impact on the popularity of NoSQL databases.
It might be higher internet availability across the globe, with more bandwidth and more speed, a huge number of smartphones, more social networking websites and more activity on them, cloud computing, and so on.
The truth is that all of the above, directly or indirectly, led to what I believe is the main reason for the popularity of NoSQL: huge data growth.
The velocity of new data being generated every moment is huge.
Let us take a look at the graph below, which shows the data growth over just the last 10 years.
EB stands for exabyte which is 10^18 bytes.
There you are, now you have one more reason to specialize in Big Data.
As we can see the data growth rate is exponential.
Another very important thing to notice is that the biggest percentage of the data is actually unstructured data.
What is the difference between structured and unstructured data?
Think of structured data as the kind of data that we usually store in relational databases or spreadsheets, the kind of data that we can display in rows and columns.
This data adheres to a data model or a schema that we define in advance.
Unstructured data, on the other hand, does not have a predefined data model, and it is the kind of data that we usually cannot display in a tabular fashion.
It includes things like images, videos, audio files, different kinds of logs, PDF documents, surveys, ratings, reviews, likes, sensor data (with the rise of IoT), digital surveillance data, and so on.
We are living in an era where pretty much everything is being stored somewhere for whatever the reason might be, but let us leave my conspiracy theories for now.
Up Or Out
With enormous amounts of data out there, we have to make our systems able to process those amounts of data and to remain functional.
We try to add scalability to our systems.
If we search for the meaning of the word “scalability” in Cambridge Dictionary, one of the definitions is:
The ability of a business or a system to grow larger.
There are two basic approaches to making our system scalable:
Vertical scaling or scaling up
Horizontal scaling or scaling out
Let us take a look at the difference between the two.
Vertical scaling means adding more resources to the existing machine whether the resources are CPU power, memory, or storage.
If we have a single machine that stores our data set, what we do is we buy more powerful and more expensive components and add them to our machine to make it more capable of handling the workload.
But we can reach the limits of this type of scaling fairly quickly.
We simply cannot scale vertically beyond what the hardware market allows.
We can only buy as much CPU power, memory, or storage as is available at a given point in time.
Another issue with this approach is that we might have issues with the availability of our data.
Our machine might be overloaded with requests for data, or it might crash, so our data would be unavailable for a certain amount of time.
Horizontal scaling means adding more machines to our system.
We can add a bunch of less-expensive machines so that we end up with a cluster of machines that act and perform as one.
With this approach we can, at least theoretically, scale infinitely.
In case of some issues with one machine, other machines can continue to operate and by that keep our data available.
What we are doing with this approach is that we are creating some kind of a distributed system.
If we search for the meaning of the word “distribute” in Cambridge Dictionary, one of the definitions is:
To spread or scatter something over an area.
We spread our data set over a cluster of machines.
There are two basic techniques for distributing our data set:
Replication
Sharding
Replication implies having a whole copy of our data set on multiple machines in a cluster.
For example, if we have a database that stores Posts, Tags, and Comments, then we copy all the records on multiple machines.
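A minimal sketch of the idea, with plain Python dictionaries standing in for machines. The ReplicatedStore class and the record keys are hypothetical, and real systems replicate over a network with far more care.

```python
class ReplicatedStore:
    """Toy replication: every node holds a full copy of the data set."""

    def __init__(self, node_count):
        # Each dict stands in for one machine in the cluster.
        self.nodes = [{} for _ in range(node_count)]

    def write(self, key, record):
        # A write is applied to every replica.
        for node in self.nodes:
            node[key] = record

    def read(self, key, node_index=0):
        # Any replica can serve a read, since all hold the same data.
        return self.nodes[node_index][key]


store = ReplicatedStore(node_count=3)
store.write("post:1", {"title": "Hello NoSQL"})
store.write("tag:1", {"name": "databases"})
store.write("comment:1", {"post": "post:1", "text": "Nice read"})
```

Every node ends up with all Posts, Tags, and Comments, so a read can go to whichever node is least busy or still alive.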
Sharding implies having different subsets of data on different machines in a cluster.
An example would be storing Posts that were published from 2005 to 2009 on one machine, Posts that were published from 2010 to 2014 on another machine, Posts that were published from 2015 to 2019 on another one, and so on.
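Range-based sharding like this could be sketched as follows. The shard boundaries, the post fields, and the helper names are assumptions made for illustration; I also assume that a boundary year such as 2010 belongs to the later range.

```python
from bisect import bisect_right

# Lower bound (inclusive) of each year range; one shard per range.
# The last shard is open-ended in this sketch.
SHARD_START_YEARS = [2005, 2010, 2015]
shards = [[] for _ in SHARD_START_YEARS]


def shard_for(year):
    """Return the index of the shard responsible for a publication year."""
    index = bisect_right(SHARD_START_YEARS, year) - 1
    if index < 0:
        raise ValueError(f"no shard covers year {year}")
    return index


def store_post(post):
    """Append the post to the shard that owns its publication year."""
    shards[shard_for(post["year"])].append(post)


for post in [
    {"title": "First post", "year": 2006},
    {"title": "Boundary post", "year": 2010},
    {"title": "Recent post", "year": 2019},
]:
    store_post(post)
```

Each machine now holds only a subset of the Posts, and a query for a given year needs to visit just one shard.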
If we have a huge data set, then replicating it on multiple nodes might not be the most efficient and economical approach.
It is usually the combination of both replication and sharding that gives us the optimal horizontal scalability. We can do sharding first, and then create replica sets based on those shards.
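Here is a rough sketch of that combination: the data is split into shards, and each shard is kept on a small replica set. The class names, the modulo placement, and the way a crash is simulated are all assumptions chosen for brevity.

```python
class ReplicaSet:
    """One shard's data, copied onto several nodes."""

    def __init__(self, replica_count):
        self.replicas = [{} for _ in range(replica_count)]

    def write(self, key, value):
        # Apply the write to every replica that is still up.
        for replica in self.replicas:
            if replica is not None:
                replica[key] = value

    def read(self, key):
        # Read from the first replica that is still up (not None).
        for replica in self.replicas:
            if replica is not None:
                return replica[key]
        raise RuntimeError("all replicas of this shard are down")

    def fail(self, index):
        # Simulate a machine crash by dropping one replica.
        self.replicas[index] = None


class Cluster:
    """Shards first, then a replica set per shard."""

    def __init__(self, shard_count, replica_count):
        self.shards = [ReplicaSet(replica_count) for _ in range(shard_count)]

    def _shard_for(self, post_id):
        # Simple modulo placement, chosen only to keep the sketch short.
        return self.shards[post_id % len(self.shards)]

    def write(self, post_id, post):
        self._shard_for(post_id).write(post_id, post)

    def read(self, post_id):
        return self._shard_for(post_id).read(post_id)


cluster = Cluster(shard_count=3, replica_count=2)
for post_id in range(6):
    cluster.write(post_id, {"title": f"Post {post_id}"})

cluster.shards[1].fail(0)   # one machine of shard 1 crashes
survivor = cluster.read(4)  # 4 % 3 == 1, served by the remaining replica
```

No machine stores the whole data set, yet losing a single machine does not make any of the data unavailable.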
There is more to it but I do not want to make this article too long.
With the exponential data growth and the need for horizontal scaling, certain weaknesses of relational databases surfaced.
Expensive Reads
Since there is minimal data redundancy in relational databases, when connecting different pieces of data we need to perform joins.
Joins are relatively expensive operations since, conceptually, they involve:
- Cartesian product
- Selection
- Projection
So if our data set is huge and we perform a lot of joins, that might cause our read operations to be inefficient.
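The three steps listed above can be spelled out in a few lines of Python. The posts and authors rows are invented for the example, and this is the naive, textbook way of evaluating a join; real query planners usually avoid building the full Cartesian product.

```python
from itertools import product

posts = [
    {"post_id": 1, "author_id": 10, "title": "Intro to NoSQL"},
    {"post_id": 2, "author_id": 11, "title": "CAP in practice"},
]
authors = [
    {"author_id": 10, "name": "Ana"},
    {"author_id": 11, "name": "Boris"},
]

# 1. Cartesian product: every post paired with every author.
pairs = list(product(posts, authors))

# 2. Selection: keep only the pairs where the join condition holds.
matching = [(p, a) for p, a in pairs if p["author_id"] == a["author_id"]]

# 3. Projection: keep only the columns we are interested in.
result = [{"title": p["title"], "author": a["name"]} for p, a in matching]
```

With 2 posts and 2 authors the product has only 4 pairs, but it grows with the product of the table sizes, which is where the cost comes from.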
Difficult To Scale Horizontally
I would say that this is the most important thing.
Relational databases were designed to run on a single machine because of the pure nature of the relational model.
Remember that strong consistency is the most important attribute of relational databases?
When we scale horizontally, that is, when we distribute our data by replicating/sharding it, it becomes really difficult to keep the data always consistent.
How can we keep our data always consistent in a cluster of a hundred or a thousand nodes?
Imagine that we have to make a transaction that includes locking tables T1, T2, and T3.
Now imagine that those tables are located on different machines.
Interesting.
Let us take a look at another possible issue.
Imagine that we are performing a read operation and that we have to perform 5 joins as part of the operation.
Now imagine that the data that we want to join is located on 5 different machines.
Interesting again.
This is what I meant by the "pure nature of the relational model" bit.
Expensive Schema Changes
In the context of huge amounts of data that is being generated and evolves rapidly, our systems will probably need to make a lot of schema changes.
Changes to the schema of relational databases have proven to be expensive and time-consuming, and they often involve downtime or service interruptions.
Down The Partitioning Road
When our data lives in a cluster, it becomes really hard to maintain strong consistency.
This sentence is somehow incomplete.
Let us try again.
When our data lives in a cluster, it becomes really hard to maintain strong consistency without sacrificing availability.
If we want to maintain strong consistency, as is the case with relational databases, then we have to make many machines communicate with each other.
The communication in this case would mean performing some data synchronization process that will make sure that our data is consistent across the whole cluster.
This approach can make our data unavailable for the duration of the synchronization.
Clients that requested the data might not be very keen to wait for the synchronization process to finish.
This is what the CAP Theorem is about, and it boils down to the following:
If we are going down the partitioning road (think of partitioning as splitting data to different machines), then we have to decide between two things:
- Consistency
- Availability
However, it is not the case that we have to make an exclusive choice between consistency and availability.
We can tune certain levels of consistency and certain levels of availability, but the data in our system cannot be fully consistent and fully available at the same time.
The levels are determined by our business needs.
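One common way of tuning those levels is quorum-based replication, which the text above does not name, so treat it as my own illustrative choice. With N replicas, a write waits for W of them and a read asks R of them. The QuorumStore class below is a toy sketch with no networking or failures, and the read deliberately contacts the replicas least likely to have seen the write.

```python
class QuorumStore:
    """Toy quorum replication over N replicas (no networking, no failures)."""

    def __init__(self, n, w, r):
        self.n, self.w, self.r = n, w, r
        # Each replica holds a (version, value) pair.
        self.replicas = [(0, None) for _ in range(n)]
        self.version = 0

    def write(self, value):
        # Only the first W replicas acknowledge the write before we return;
        # the rest stay stale, as if synchronization had not reached them yet.
        self.version += 1
        for i in range(self.w):
            self.replicas[i] = (self.version, value)

    def read(self):
        # Worst case for the reader: it contacts the LAST R replicas.
        contacted = self.replicas[self.n - self.r:]
        # The value with the highest version number wins.
        return max(contacted, key=lambda pair: pair[0])[1]


strict = QuorumStore(n=3, w=2, r=2)   # R + W > N: read and write sets overlap
strict.write("v1")

relaxed = QuorumStore(n=3, w=1, r=1)  # R + W <= N: sets may not overlap
relaxed.write("v1")
```

In this sketch the strict configuration always reads the latest value, while the relaxed one can return stale data. The price of the strict setting is that every operation has to wait for more machines, so it tolerates fewer unavailable machines before it blocks.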
This is a fairly complex topic that introduces another dimension of problems and discussions, so we will not dive deep into it.
Nevertheless, I thought it was worth mentioning.
On NoSQL Data Model And Families
As we have seen already, NoSQL stands for non-relational databases - data storage technologies that do not use the relational data model.
What are their attributes?
Perhaps the most important attribute of NoSQL databases is that they can scale horizontally to clusters of machines.
Unlike relational databases, NoSQL databases were initially designed to support horizontal scaling relatively easily.
This attribute makes them suitable for storing huge amounts of data.
It is no wonder that some of the earliest and most influential NoSQL systems were created by Amazon and Google: Dynamo and Bigtable, respectively.
These companies have been working with huge amounts of data for quite a while now, and they were amongst the first that faced the problem of scaling their systems effectively.
Another important attribute is that NoSQL databases have a flexible, non-strict schema.
This attribute makes NoSQL databases more suitable for working with unstructured data.
When discussing their data model it is important to say that NoSQL databases are aggregate oriented.
What does that mean? What is an aggregate?
Think of an aggregate like this:
You take different objects of the same domain and you bundle them into a single unit of data management.
A simple example would be Post, Author and Tags.
In relational databases, each piece of the data would be stored in a separate table.
In NoSQL databases, we would have a post aggregate, a single structure that contains data about the post, its author, and relevant post tags, and which we can access in one go.
Because aggregates tell us which data is going to be accessed together frequently, they are the natural unit for distribution inside clusters.
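To contrast the two shapes, here is a small sketch in Python. The rows, field names, and the build_post_aggregate helper are hypothetical; lists of dictionaries stand in for relational tables and a plain dictionary stands in for the NoSQL store.

```python
# Relational shape: three separate "tables" linked by identifiers.
posts = [{"post_id": 1, "author_id": 10, "title": "Intro to NoSQL"}]
authors = [{"author_id": 10, "name": "Ana"}]
post_tags = [{"post_id": 1, "tag": "nosql"}, {"post_id": 1, "tag": "databases"}]


def build_post_aggregate(post_id):
    """Assemble one post aggregate from the relational-style rows."""
    post = next(p for p in posts if p["post_id"] == post_id)
    author = next(a for a in authors if a["author_id"] == post["author_id"])
    tags = [t["tag"] for t in post_tags if t["post_id"] == post_id]
    return {
        "post_id": post["post_id"],
        "title": post["title"],
        "author": {"name": author["name"]},
        "tags": tags,
    }


# Aggregate shape: everything about the post lives in one unit,
# stored and fetched under a single key.
aggregate_store = {1: build_post_aggregate(1)}
post = aggregate_store[1]
```

Reading the aggregate is a single lookup, and because the whole unit travels together it can be placed on one machine of the cluster.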
Family First
NoSQL databases are classified into four NoSQL database families, based on their data model.
- Key-Value
- Document
- Wide-Column
- Graph
Key-Value stores are the simplest NoSQL databases; they look like gigantic, persistable hash tables.
We have a unique key that points to some value.
The value is opaque to the database: it can be any sequence of bytes, and it represents the aggregate.
Based on the key, we can perform operations such as retrieving, adding, updating or deleting, just like with hash tables.
These databases are simple to use, very fast, and scalable.
Unfortunately, if we need more advanced data handling they might not be the best choice.
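A toy version of such a store might look like this. The KeyValueStore class and its method names are made up and do not mirror the API of any real product; the sketch is also purely in-memory, so it skips the persistence part.

```python
import json


class KeyValueStore:
    """Toy key-value store: keys map to opaque byte strings."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Adding and updating are the same operation: overwrite the value.
        if not isinstance(value, bytes):
            raise TypeError("values are opaque bytes")
        self._data[key] = value

    def get(self, key):
        # Returns None when the key is unknown.
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


kv = KeyValueStore()
# The application serializes the aggregate; the store never looks inside.
kv.put("post:1", json.dumps({"title": "Intro", "tags": ["nosql"]}).encode())

# To change one field we must fetch, decode, modify and rewrite the whole value.
post = json.loads(kv.get("post:1"))
post["title"] = "Intro to NoSQL"
kv.put("post:1", json.dumps(post).encode())
```

The last three lines hint at the limitation mentioned above: the store cannot update or query part of a value, because it does not know what the bytes mean.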
Some representatives of this family are:
- Riak
- Voldemort
- Redis
- DynamoDB
Document-oriented NoSQL databases are generally very similar to Key-Value stores.
The values are represented as documents where each document has a key, or an identifier, associated with it.
Documents are usually in JSON format, but they could be in BSON, XML, or YAML as well, and they represent the aggregate.
They contain values that the database is aware of - be it a number, string, boolean, dictionary, or array.
The values are not opaque to the database, as is the case with Key-Value stores.
An advantage over Key-Value stores is that we can manage data in a more advanced way - we can retrieve or update portions of the document.
With Key-Value stores it is an all-or-nothing approach.
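The difference can be sketched with a toy document store that understands the structure of what it holds. The DocumentStore class, its dotted-path syntax, and the sample documents are assumptions for illustration, not the API of any real database.

```python
class DocumentStore:
    """Toy document store: the database can see inside each document."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, document):
        self._docs[doc_id] = document

    def get_field(self, doc_id, path):
        """Retrieve one portion of a document, e.g. 'author.name'."""
        value = self._docs[doc_id]
        for part in path.split("."):
            value = value[part]
        return value

    def set_field(self, doc_id, path, new_value):
        """Update one portion of a document in place."""
        *parents, last = path.split(".")
        target = self._docs[doc_id]
        for part in parents:
            target = target[part]
        target[last] = new_value

    def find(self, field, expected):
        """Return ids of documents whose top-level field equals expected."""
        return [i for i, d in self._docs.items() if d.get(field) == expected]


docs = DocumentStore()
docs.insert("post:1", {"title": "Intro", "author": {"name": "Ana"}, "likes": 3})
docs.insert("post:2", {"title": "CAP", "author": {"name": "Boris"}, "likes": 3})

# Update a single nested field without rewriting the whole document.
docs.set_field("post:1", "author.name", "Ana K.")
```

Because the values are visible to the store, it can also answer questions about them, such as which documents have a given number of likes.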
Some representatives of this family are:
- MongoDB
- CouchDB
- RavenDB
With Wide-Column databases we again have a scenario where there exists a key that maps to some value.
The key is called a row key and it maps to certain column families.
Column families are sets of columns that are frequently accessed together.
A column family represents the aggregate.
Therefore, the most important attribute of these databases is column orientation.
With this approach, we no longer have row orientation, where we have to read all the columns even if we are trying to retrieve only a certain set of them.
The column orientation attribute makes Wide-Column databases a good choice for analytical applications and Big Data.
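The row key to column family mapping can be pictured as nested dictionaries. This is only a sketch of the logical data model; the table layout, family names, and helper functions are invented, and it says nothing about how a real Wide-Column database arranges data on disk.

```python
from collections import defaultdict

# row key -> column family -> column name -> value
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, column, value):
    """Store one column value inside a column family of a row."""
    table[row_key][family][column] = value


def get_family(row_key, family):
    """Read one column family of a row without touching the others."""
    return dict(table[row_key][family])


put("user:42", "profile", "name", "Ana")
put("user:42", "profile", "city", "Berlin")
put("user:42", "activity", "last_login", "2020-05-01")
# Rows need not share the same columns.
put("user:43", "profile", "name", "Boris")

profile = get_family("user:42", "profile")
```

Fetching the profile family returns only the columns that are usually needed together, leaving the activity family unread.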
Some representatives of this family are:
- Cassandra
- HBase
- Scylla
This family of NoSQL databases for me personally was a bit tricky to understand at first, possibly because I was unaware of the underlying whys of column orientation.
Therefore, I plan to explain them in more detail in one of my future articles.
Graph-oriented databases are a somewhat special case in that they are not aggregate-oriented.
The data model is based on graph theory and is represented as a bunch of vertices and edges.
Vertices represent the data and edges represent the relationships between the data.
The most important attribute of these databases is that the accent is on the relationships between data rather than its structure.
The data is even more granular than in relational databases and that is the reason why they are not aggregate-oriented.
With Graph-oriented databases, we can explore data very efficiently based on the relationships.
If we have a relationship-dominant data model, where we are, for example, performing tons of joins in our queries (because we are using relational databases), then Graph-oriented databases might be a better fit since they are optimized for those scenarios.
Due to the nature of their data model, these databases are good for recommendation engines, fraud detection, identity and access management, some artificial intelligence systems, and so on.
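As a tiny taste of a relationship-driven query, here is a sketch of a friends-of-friends recommendation over an in-memory graph. The user names, the edges, and the recommend function are hypothetical; a real graph database would express this in its own query language.

```python
from collections import Counter

# Vertices are user names; edges are undirected "friend" relationships.
edges = [
    ("ana", "boris"), ("ana", "clara"),
    ("boris", "dino"), ("clara", "dino"), ("clara", "eva"),
]

# Adjacency list: vertex -> set of directly connected vertices.
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def recommend(user):
    """Suggest friends-of-friends, ranked by number of mutual friends."""
    friends = graph[user]
    counts = Counter(
        candidate
        for friend in friends
        for candidate in graph[friend]
        if candidate != user and candidate not in friends
    )
    # Sort by mutual-friend count (descending), then by name for determinism.
    return sorted(counts, key=lambda name: (-counts[name], name))


suggestions = recommend("ana")
```

The query only follows edges outward from one vertex, so its cost depends on the size of the neighbourhood rather than on the size of the whole data set.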
Some representatives of this family are:
- Neo4j
- ArangoDB
- Virtuoso
On Non-Relation Thoughts
There are many more things to talk about which relate to NoSQL but which we, unfortunately, could not fit into this article.
The idea of this article was to provide the theoretical minimum which I find is very important when we want to work with NoSQL databases.
The other idea was for us to try to think in a non-relational way.
When would we want to think in a non-relational way?
We might want to think in a non-relational way when we have huge amounts of data.
Huge amounts of data will most certainly introduce the need for efficient horizontal scaling, something we can achieve more easily with NoSQL than with relational databases.
We might want to think in a non-relational way when we want a flexible schema, a developer-friendly data storage technology, or a good integration with JSON.
We might want to think in a non-relational way because we need a fast time to market, or because we do not yet know what kind of data storage technology our business actually needs.
But what if we are really rebellious and we want to use neither "SQL" nor NoSQL databases?
If that is the case then it might be worth taking a look at NewSQL.