Category: NoSQL

Theory Behind NoSQL

1.1 CAP

CAP is first published by Eric Brewer in 2000 and is the basic theorem that describes any distributed system. (Brewer, 2000)  Figure10 explains what CAP stands for.

 

cap

Figure 1. CAP stands for Consistency, Availability and Tolerance to network Partitions.

  • Consistency – Any node in the system has the same data with other nodes.
  • Availability – A request will always be responded even though there is a node going down.
  • Partition Tolerance – The system continues to operate even there is a partition or communication break between two nodes. (Brewer, 2000)

The truth is that it’s impossible to meet all the three guarantees.  If consistency is focused, the failure of write operation caused by system unavailability may happen. If availability is focused,  It’s possible that read operation can’t get the latest value of write operation. It’s the focus of system who determines the strategy and normally a combination of two guarantees are chosen.

1.2 Election of CAP guarantee

cap2.png

Figure 2. Databases pick up  two guarantees from CAP

Figure shows how different databases discussed in this paper make their choice. Caution should be taken that each database picks up two guarantee, but it does not mean it losing everything in the third guarantee.

The explanation of possible selections is discussed as following.

  • CA – Data among all nodes is same (consistency) and in order to meet this condition, read or write operation can be accessed in any node (availability). Nevertheless, if there is a communication break occurred in the system, the data won’t be synchronized and won’t be re-synchronized even when the partition is resolved.
  • CP – Consistency and Partition Tolerance are guaranteed. If partition occurs in the system,  in order to ensure consistency, nodes will be regarded as going down (losing availability).
  • AP – Nodes can still be reached even though there is communication failure. Meanwhile, the data can be re-synchronized if the communication break is resolved.

1.3 ACID and BASE

To guarantee the database transactions, a set of properties of ACID is applied.

ACID stands for:

  • Atomicity – Everything in a transaction works successfully or failed.
  • Consistency – The transaction should ensure the database transfer from one consistent state to another consistent state. The consistent state means data in database should meet the integrity constraint.
  • Isolation – Transactions are isolated which means different transactions won’t affect each other.
  • Durability – If data is committed, it remains its state even though there are errors, crashes or power loss.(Gray,& Reuter, 2007)

In order to handle transactions and keep data safe, traditional RDBMS works following the principle of ACID. NoSQL databases on the other hand need to guarantee the availability and scalability for storing lots more data and using a distributed set of servers working together. ACID in this case cannot guarantee this property and then BASE comes up.

BASE stands for Basically Available, Soft State and Eventually Consistency.  (Mapanga & Kadebu, 2013)

  • Basically Available – The system gives up the high availability as regards CAP theorem which means the data can be failed to be responded when it’s in an inconsistent or changing state. But basically, availability is guaranteed.
  • Soft State – The state of the system can change over time due to eventual consistency property. Partition tolerance is supported, but the state can be asynchronous for some time.

evntual consis.png

Figure 3. Eventual consistency model

  • Eventual Consistency – Observing figure12, suppose Node A is written with DataN, the system should guarantee that any read operation obtains the latest value DataN if there are no other write operations to Node A, B and C before the read operations of Node B and C are completed. However, the inconsistency problem could occur due to the interactive delay, system load or the number of replications.

strong consis.png

Figure 4. Strong consistency model

On the contrary to eventual consistency, strong consistency is quite different.

  • Strong consistency – Observing figure13, if Node A is written with DataN, the latest value DataN is always returned from Node B and C. However,  in order to ensure consistency, any read operation is blocked until the update of replications.

BASE model is against to ACID model, as it sacrifices high consistency to achieve availability or reliability.  Relational database is designed based on strong consistency guarantee. To some extent, it loses scalability and performance. In current modern companies like Google, Amazon, Twitter etc., scalability availablity and performance are concerned more, as a result, most NoSQL databases follows BASE model.

1.4 Concurrency control

Assuming a data record is accessed by two users, there is no problem when they read the data record at the same time, but what will happen if they update the data record simultaneously? Collision happens then. Concurrency control is used to deal with the issue which allows multiple users to access the same data simultaneously.

Concurrency control in databases is normally working with transactions. A Transaction is a series of data operations that have to follow ACID guarantee. Most relational management systems like MySQL support transaction, as users of RDBMS consider consistency and integrity of data  as high preference.  However, not all modern new databases support transaction such as MongoDB. If MongoDB operates on a single document, operations are always atomic. But operations on multiple documents are not atomic. In this case, multi-document transactions can not be executed. Fortunately, MongoDB can apply two-phase commits to offer transaction-like semantics. (MongoDB online manual – Perform Two Phase Commits). Whereas, NoSQL databases discussed in this paper Redis, HBASE and Neo4j support transaction.

If interleaving operations are allowed, without proper concurrency control, problems such as the lost update problem, the dirty read problem and the incorrect summary problem can occur.

  • The lost update problem – from figure 14, normally transaction C wanted to update value3, but it reads 8 actually.

lost update.png

Figure 5. The lost update problem

  • The dirty read – transaction B reads a value from transaction A. But transact A has aborted the value at the same time. Transaction A does not want transaction B to read that value actually.
  • The incorrect summary – – transaction A contains data item X(10) and data item Y(20). Both two values add 10. At the same time transaction is reading the result form transaction A. After the calculation, the data items in transaction A are X(20) and Y(30). But transaction may get the result X(20) and Y(20) which Y should be 30.

The mechanism of concurrency control categories (Bernstein, Hadzilacos & Goodman, 1987):

  • Pessimistic lock blocks every operation violating data integrity. It supposes that the probability of other users who attempt to visit or change the object which you are visiting or changing is so high that it will lock the object until you have committed the update operation. Definitely pessimistic lock enhances the operational integrity, but the drawback is that the delay may be too long which can restrict the access of other users.
  • Optimistic lock will check whether data integrity has been violated only when committing the operations. Contrary to pessimistic lock, it assumes that the probability of other users who attempt to visit or change the object which you are visiting or changing is low. Therefore, it will keep unlocking when you read or change the object and lock the object when you are ready to commit the change of object. The concurrent performance is improved, but it cannot solve the dirty read problem. If the second user reads the object before the first user commits the change, the database detects the change before the first user finishes the change and wants to commit, as a result, the first user has to re-read the object and then make the change.

The selection of two lock categories depends on the application requirements. If concurrency doesn’t happen often and dirty read problem is not allowed, pessimistic lock is preferred.  If there are a lot of concurrency issues, pessimistic lock can degrade the performance significantly, as a result, optimistic lock is preferred.

There are a lot of specific concurrency methods (Bernstein, Hadzilacos & Goodman, 1987).

  • Locking methods

Read lock – If one user applies read lock on one object, other users can also read the object. But any write operation of the object is blocked. Figure 15 illustrates how read lock works.

read lock.png

Figure 6. A flow chart to explain how read lock is performed.

Write lock – If one user applies write lock, other users can not read nor write on the object. Figure16 illustrates how write lock works.

write clock.png

Figure 7. A flow chart to explain how write lock performance

  • Multiversion concurrency control (MVCC) generates a new version of  a database object when the object is written and read the most recent version of the object based on timestamp order or increasing transaction ID.

Consider two concurrent writes shown in figure 17.

concurr write.png

Figure 8. Two writes to the same row

The performance of MVCC based on row lock operation is illustrated in the following figure 18.

MVVC

Figure 9. A flow char to explain how MVCC is performed.

Considering the features of different databases, the methods of concurrent control is varied.

  • Mysql supports MVCC in InnoDB store engine and Row-level locking and MVCC in InnoDB store engine and locking the whole table in MyISAM store engine. For the two lock levels: table lock (All rows in the locked table are locked) and row lock (Some rows of a table are locked, but other rows are not locked).
  • As Redis is in single-process and single-thread mode, it adopts queue mode to replace concurrency access into serial access. There is no locking concept in Redis.
  • MongoDB supports read lock and write lock for concurrency control. However, read and write operations can yield their locks. It uses adaptive algorithms based predicted disk access patterns to resolve this problem. (MongoDB FAQ)
  • Hbase guarantees high effective concurrency control through row level lock and MVCC methods.
  • Neo4j applies default locking behavior. When there are changes on properties, nodes and relationships, write locks will be taken.  (Neo4j Manual)

Reference

Bernstein, P. A., Hadzilacos, V., & Goodman, N. (1987). Concurrency control and recovery in database systems (Vol. 370). New York: Addison-wesley.

Brewer, E. A. (2000, July). Towards robust distributed systems. In PODC (p. 7).

Gray, J., & Reuter, A. (2007). Transaction processing: concepts and techniques, 1993.

Mapanga, I., & Kadebu, P. (2013). Database Management Systems: A NoSQL Analysis. International journal of Modern Communication Technologies and Research, 1(7).

Advertisements

RDBMS vs NoSQL

Database is an organized collection of data. To make operations like definition, querying, update, and administration of database, the special designed software application called Database Management System (DBMS) is necessary. DBMS helps the user to capture and analyze data. It’s classified by database model, for example the most famous model called relational model as the data model is relational. The database model is used to determine the logical structure of database and which manner data can be stored, analyzed and manipulated.

Relational database model is based on first-order predict logic which data is represented by tuples and grouped by relations.

However, when data is not structured and relational, relational DBMS is not capable to manage such kind of data. But relational database model is not good at adapting the change. Because of the various data formats such as hierarchies, cubes, linked-lists and unstructured data, it’s not capable to organize data into tables.

As a solution, NoSQL (not only SQL) comes up. NoSQL database management systems enable data to be stored in a variety of formats like key-value store, column store, graph store and document store. NoSQL called not only SQL is to emphasize that SQL-like query languages may also be supported. But it does not guarantee the true ACID (atomicity, consistency, integrity, and durability) principle. NoSQL database management systems remove hard constraints, such as tabular row store and strict data definition, and have distributed architectures to support high performance throughput. NoSQL databases are widely used in big data and real-time web applications.

 

Relational databases

Relational databases are the most popular and widely used databases. The data model  organizes data as tables or relations. Each table consists of rows and columns which is illustrated in figure1. Each row has an unique key.

图片1

Figure 1. An example of relational database model

Different from NoSQL databases, the data model of relational databases is fixed and the data is structured. It supports transaction management and guarantees true ACID principle. Relational database management system (RDBMS) which is based on relational model has been developed for several decades and still dominates current database market. It’s widely deployed in banks, schools, hospitals, governments and so on due to its properties.

 

Key-value store database

Key-value store is one of the most simple database management system. Data is stored by key and value illustrated in figure2, and can be retrieved when the key is known so that the complex querying and management functionality of RDBMS is not needed.

A string can represent the key and the actual data can be represented by value. The data can be any kind of data types in programming language such as string, integer, array and so on, or an abstract object which bindings to the key.  The data model is flexible so that the requirement for the formatted data is less strict.

图片2

Figure 2. An example of Key-Value database model

Compared to common SQL databases, it contains the advantage of fast speed in storing and retrieving data.  This happens when relations, correlations or collations of data are not necessary. An SQL table can be organized into two columns, a key and a value. In this case, for querying, just find the value and return it. This is very fast.

Specifically speaking, SQL language has great advantages of dealing with structured data and allows highly dynamic queries. However, for current web applications, it’s another case. It’s an object oriented way of thinking such as the back-end database of MVC (Model-View-Control) pattern, instead of a highly dynamic range of queries which are full of outer and inner joins, unions and complex calculations over large tables. Meanwhile, it will result in complex hierarchies of tables if relational data models are transferred into object oriented models because of large amounts of normalization. For key-value store databases, the data model is schema less and an object can be just represented by a value with a key to identify the object. Therefore, the storage of arbitrary data indexed using a single key to allow retrieval  is allowed. That’s why key-vale store is also called simple store.

The code tends to look clean and simple compared with embedded SQL strings in the programming language. As for object-relational mapping frameworks, a lot of complex code between an SQL database and an object-oriented programming language will be added.

 

Document store databases 

Data is stored in the document store databases with the data format  such as XML, PDF, JSON etc. The document contains a unique key “ID” to identify a document explicitly and a collection of documents. The example of the data model is illustrated in figure3. Documents of document store databases are similar to records in relational databases. The difference is that the data model in document oriented databases is more flexible as its property of schema-less. New documents, no matter which kinds of attribute are contained, can be stored as adding new attributes  in existing documents at runtime.

图片3

Figure 3. The left figure is an example of document format of JSON and the right figure is an example of document format of XML.

Unlike relational databases whose records inside the same database have same data fields, document databases have the property that document may have similar as well as dissimilar data. The data model of document databases is slightly more complex than that of key-value databases, which instead of key-value store, the data model of document store can be represented as key-document pairs. If the database has a lot of relations and normalization, it’s not appropriate to use document database. Instead document stores are used for content management system, blog software etc.

 

Column store databases

Column store databases support the standard relational logical data model. Databases consist of a collection of tables and each table has a named collection of attributes which are columns instead of rows for relational data model. Attributes can form a unique primary key or foreign key referring to another primary key in another table.

The most different point of the two kinds of databases is that the data model of the relational databases is row oriented, however, on the contrast, the data model of column store databases are column oriented.  Figure4 can illustrate the difference simply and clearly.

 图片4

Figure 4. The upper figure illustrates the row- oriented data model and the lower figure illustrates the column-oriented data model.

From figure4, in a row-oriented database management system, the data would be stored as “1, John Smith, 19; 2, Jim Green, 18; 3, Lucy King, 16; 4, Freda Ford, 15”. Whereas in a column-oriented database management system, the data would be store as “1, 2, 3, 4; John Smith, Jim Green, Lucy King, Freda Ford; 19, 18 ,16, 15”.

The column store databases store the data with the way to be aggregated rapidly with less I/O activities and offer high scalability in data storage. They are efficient in applications including customer relationship management (CRM) systems, electronic library car catalogs, data warehousing and other ad-hoc query systems.

 

Graph store databases

Graph store databases are common used to handle relationships due to their efficient management of heavily linked data. (Neo4j)  The data model of graph contains nodes representing entities which hold proper types and  numbers of properties like key-value pairs. Figure5 is an example of the graph data model. The connection between two nodes are revealed by directed, named semantic relationships. The relationships also have properties such as know, own, like etc. Two nodes can have not only one number or type of relationships , but many if there are. As relationships are stored efficiently, this won’t sacrifice performance.

图片5

Figure 5. An example of graph data model. The ellipses represent nodes. Each node is a data entity with types and values.  The arrows represent connections with  relationships and their properties.

 

 

MySQL vs Redis vs MongoDB vs HBase vs Neo4j

1.1 MySQL

MySQL is the second most widely used relational database management system. As it’s open source, nothing needs to be paid for using it. It uses a standard form of the well-known SQL data language and can run on many operating systems and embedded in many programming languages.(MySQL online manual)

1.2 Redis

Redis is one of  the most popular key-value store database.  The data type of keys can be not only simple strings, but also hashes, lists, sets, sorted sets and bitmaps, with a number of server-side atomic operations associated to these data types, which atomic operation means that the processing of read or write operation is not disturbed by other read or write operations until it is completed.

The data sets of Redis are kept in memory so that it can achieve very high writing and reading speed with the limitation that data sets can’t be larger than memory. Meanwhile, the memory representation of complex data structures is much simpler to manipulate compared to the same data structure on disk, as internal complexity is reduced. The complex data sets are simply stored as a value if they are not necessary to be mapped into relational data structure.  (Redis FAQ)

Redis supports master-slave replication which is simple to use and configure. This makes Redis servers be exact copies of master servers so that it won’t be an issue if even few lost records are not acceptable in the application. Data is written asynchronously. If there is system crash, the last few queries can get lost. However, this is acceptable in many applications. (Redis Manual) Redis is recommended to run on Linux.

1.3 MongoDB

MongoDB is a document-oriented database. MongoDB supports document data with the format BSON (binary JSON).  The data model of JSON provides seamless mapping to native programming language types. As the schema is dynamic, it’s flexible to evolve the data model.

For simple query, the speed in MongoDB where the relational data sets are simply kept in a single document is faster than that in a relational database where relational data is separated into multiple tables. Join operations in relational databases are eliminated in MongoDB.

MongoDB supports indexes. Indexes can make the execution of queries more efficient. MongoDB has to scan every document in a collection to meet the match of query statement without indexes. All indexes in MongoDB are B-tree (Comer, 1979) indexes. In this case,  only the smallest possible number of documents is scanned, which optimizes the performance of query. Figure6 simply illustrates how indexes work in MongoDB.

mongodb

Figure 1. Illustration of how MongoDB indexes work

MongoDB also has property of ease of use with features of easy installation, configuration, maintenance.  MongoDB can run on Linux, Windows and OS X. (MongoDB online manual)

1.4 HBASE

HBase known as Hadoop database, is a column-store oriented and scalable distributed store system with high reliability and high performance. HBase can be used to set up large-scale structured storage clusters on cheap PC servers.

HBase is an open source after Google’s BigTable. The reason that HBase is called Hadoop database, is because HBase relies on Hadoop HDFS as its file storage system, Hadoop MapReduce to realize massive data processing and Zookeeper to do coordination. Figure7 illustrates the role of  HBase in Hadoop ecosystem.(Taylor, 2010)

hadoop ecosys

Figure 2. Diagram of Hadoop ecosystem

As shown in table 1, the data model of HBase contains “table” and “column family”.

hbase

Table 1. Data model of HBase

Table – Collection of rows. When table becomes larger and larger with the increasing number of records, it will spit into multiple parts called regions.  RowKey is the unique primary key of the table.

Column family – Collection of columns. One table can have one or multiple column families and one column family consists of any number of columns. Column family supports dynamic scaling without predefined number or types of columns.

1.5 Neo4j

Neo4j is a graph-store database. Neo4j reveals the relationship of data sets. The data model of Neo4j is also schema less.

Neo4j expresses queries as traversals. A graph is traversed by visiting the nodes following relationships under some rules. In reality, not the whole graph is visited, as not all nodes and relationships are interested in the graph. Neo4j supports fast deep traversal instead of slow SQL queries. Figure8 illustrates how traversal works.

neo4j1.png

Figure 3. Illustration of how traversal works.

Neo4j also supports indexes. It’s quite useful when a specific node or relationship can be found based on its property and the speed is faster than traversal which needs to loop up the entire graph. The figure9 illustrates how index look up works in Neo4j.

neo4j2.png

Figure 4. Illustration of how index look up works in Neo4j.

New Database Models

As we all know database is an organized collection of data. To make operations like definition, querying, update, and administration of database, the special designed software application called Database Management System (DBMS) arises. DBMS helps user to capture and analyze data. It’s classified by database model, for example the most famous model called relational model as represented by the SQL language. The database model is used to determine the logical structure of database and which manner data can be stored, analyzed and manipulated.

Relational database model is based on first-order predict logic which data is represented by tuples and grouped by relations.

However, when data is not structured and relational, relational is not capable to manage such kind of data. Relational database model is not good at adapting the change, as a result, it’s incompatible with agile development. Because of the various data formats such as hierarchies, cubes, linked-lists and unstructured data, it’s not capable to organize data into tables. Relevance, which requires text and data to be stored in document context with links to and from other documents, is also one drawback of relational database model.

Therefore, NoSQL (not only SQL) comes up. NoSQL database management systems enable data to be stored in a variety of formats like key-value store, graph store and document store. NoSQL is call not only SQL is to emphasize that SQL-like query languages may also be supported. But it does not guarantee the true ACID (atomicity, consistency, integrity, and durability) principle. NoSQL database management systems remove hard constraints, such as tabular row store and strict data definition, and have distributed architectures to support high performance throughput .NoSQL database are widely use in big data and real-time web applications.

The latest database management system is called NewSQL. It retains both SQL and ACID, and meanwhile achieves the scalability of NoSQL for online transaction processing (OLTP).