Friday, 24 July 2015

Introducing NoSQL


Broadly, we have three database categories:-

1) RDBMS/OLTP/Real Time --> Oracle, MySQL, MS SQL, DB2
2) DSS/OLAP/DW --> Netezza, SAP HANA, Teradata
3) NoSQL/New SQL/BigData --> MongoDB, Cassandra, HBase, CouchDB

Let's quickly revise OLTP and OLAP:
 


Now, Big Data. Big Data databases, i.e. NoSQL databases, are non-relational: they do not follow the relational model of tables linked by Parent - Child relationships. Nor do most of them support SQL in its basic form. We need utilities like Hive (in the case of the Hadoop framework), which acts as a mediator: it converts the SQL-like queries we DBAs understand into the corresponding Java jobs (such as MapReduce) that run on the cluster.
 
In RDBMS, to increase the transaction-handling capacity, we need 'Vertical Scaling (Scale-Up)', i.e. increasing RAM, CPU etc., which usually requires application downtime. In NoSQL we achieve the same using 'Horizontal Scaling (Scale-Out)', where the number of nodes is increased, typically without downtime. This requires sharding (partitioning a table and keeping the pieces on multiple instances/servers). NoSQL systems can afford this because of the concept of data "partitioning" (not to be confused with Oracle's table partitioning feature, which is just 'horizontal partitioning' within one database). Data partitioning of this kind is often referred to as "Sharding".

 
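Sharding, in short, is breaking a database into partitions and storing them on different servers. Do not confuse this with "Partition Tolerance" from the CAP theorem: that term means the system keeps working even when the network between nodes is broken. Sharded NoSQL systems are designed with it in mind, while a traditional single-server RDBMS is not.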
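The idea of breaking a database into partitions kept on different servers can be sketched in a few lines of Python. This is only an illustration under assumptions: the shard names and customer ids are made up, and simple hash-modulo routing is just one possible scheme (real NoSQL products often use range-based or consistent hashing instead).

```python
import hashlib

# Hypothetical shard servers; the names are illustrative only.
SHARDS = ["shard-0", "shard-1", "shard-2"]


def shard_for(key: str, shards=SHARDS) -> str:
    """Pick a shard by hashing the key, so the same key always maps to the same server."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


# Route a few made-up customer ids to their shards.
placement = {}
for customer_id in ["cust-1001", "cust-1002", "cust-1003", "cust-1004"]:
    placement.setdefault(shard_for(customer_id), []).append(customer_id)

for shard, customers in sorted(placement.items()):
    print(shard, customers)
```

Note that with plain modulo routing, changing the number of shards moves most keys to a different server, which is one reason production systems prefer smarter schemes.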

**********************************************************************************
Shards compared to horizontal partitioning:

Horizontal partitioning splits one or more tables by row, usually within a single instance of a schema and a database server. It may offer an advantage by reducing index size (and thus search effort) provided that there is some obvious, robust, implicit way to identify in which table a particular row will be found, without first needing to search the index, e.g., the classic example of the 'CustomersEast' and 'CustomersWest' tables, where their zip code already indicates where they will be found.
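The 'CustomersEast' / 'CustomersWest' example can be sketched as below. The rule used here (first zip digit 0-4 means East, 5-9 means West) is purely an assumption for illustration; the point is only that the table is chosen from the zip code itself, without any index lookup, and that both tables still live on the same database server.

```python
def table_for_zip(zip_code: str) -> str:
    """Return the partition table for a customer, decided from the zip code alone."""
    if not zip_code or not zip_code[0].isdigit():
        raise ValueError(f"unexpected zip code: {zip_code!r}")
    # Assumed rule for illustration: first digit 0-4 is "East", 5-9 is "West".
    return "CustomersEast" if int(zip_code[0]) <= 4 else "CustomersWest"


for zip_code in ["10001", "94105"]:
    print(zip_code, "->", table_for_zip(zip_code))
```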

Sharding goes beyond this: it partitions the problematic table(s) in the same way, but it does this across potentially multiple instances of the schema. The obvious advantage would be that search load for the large partitioned table can now be split across multiple servers (logical or physical), not just multiple indexes on the same logical server. Beyond partitioning, sharding thus splits large partitionable tables across the servers, while smaller tables are replicated as complete units.

This is also why sharding is related to a "shared nothing architecture" . Once sharded, each shard can live in a totally separate logical schema instance / physical database server / data center / continent. There is no ongoing need to retain shared access (from between shards) to the other unpartitioned tables in other shards. This makes replication across multiple servers easy (simple horizontal partitioning does not). It is also useful for worldwide distribution of applications, where communications links between data centers would otherwise be a bottleneck. This introduces the concept of "distributed computing" followed by NoSQL systems.

There is also a requirement for some notification and replication mechanism between schema instances, so that the unpartitioned tables remain as closely synchronized as the application demands. This is a complex choice in the architecture of sharded systems: approaches range from making these effectively read-only (updates are rare and batched), to dynamically replicated tables (at the cost of reducing some of the distribution benefits of sharding) and many options in between.

A shared nothing architecture keeps replicas of the data (3 copies preferred), which protects the data if any single node fails.

***********************************************************************************
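To see why keeping 3 replicas helps on node failure, here is a minimal sketch. The node names and the "next nodes in the list" placement rule are assumptions for illustration, not how any particular NoSQL product places its replicas.

```python
def replica_nodes(shard_index: int, nodes: list, replication_factor: int = 3) -> list:
    """Place a shard on `replication_factor` consecutive nodes, wrapping around the list."""
    if replication_factor > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return [nodes[(shard_index + i) % len(nodes)] for i in range(replication_factor)]


def surviving_replicas(replicas: list, failed: set) -> list:
    """Return the replicas that are still reachable after some nodes fail."""
    return [node for node in replicas if node not in failed]


# Hypothetical five-node cluster.
NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]

replicas = replica_nodes(3, NODES)
alive = surviving_replicas(replicas, failed={"node-d"})
print(replicas)  # ['node-d', 'node-e', 'node-a']
print(alive)     # ['node-e', 'node-a']
```

Even with one node down, two copies of the shard remain readable, so the application does not lose data.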
 


To summarize: an RDBMS keeps all its data at a single location, i.e. shared storage, while NoSQL shards data across multiple locations.

Next topic:-

Now that we have discussed the philosophy behind NoSQL databases, let's understand what Big Data is. Please note that a NoSQL database is not necessarily a Big Data database. Data must show three characteristics to be termed "Big Data", the 3Vs: Volume, Velocity and Variety.


 
I will continue with this in the next blog post.

Happy Learning!

- Deepesh

