Understanding HBase cluster components

In fully distributed and pseudo-distributed modes, a HBase cluster has many components such as HBase Master, ZooKeeper, RegionServers, HDFS DataNodes, and so on, discussed as follows:

HBase Master: HBase Master coordinates the HBase cluster and is responsible for administrative operations. It is a lightweight process that does not require too many hardware resources. A large cluster might have multiple HBase Master components to avoid cases that have a single point of failure. In this highly available cluster with multiple HBase Master components, only once HBase Master is active and the rest of HBase Master servers get in sync with the active server asynchronously. Selection of the next HBase Master in case of failover is done with the help of the ZooKeeper ensemble.
ZooKeeper: ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. Similar to HBase Master, ZooKeeper is again a lightweight process. By default, a ZooKeeper process is started and stopped by HBase but it can be managed separately as well. The HBASE_MANAGES_ZK variable in conf/hbase-env.sh with the default value true signifies that HBase is going to manage ZooKeeper. We can specify the ZooKeeper configuration in the native zoo.cfg file or its values such as client, port, and so on directly in conf/hbase-site.xml. It is advisable that you have an odd number of ZooKeeper ensembles such as one/three/five for more host failure tolerance. The following is an example of hbase-site.xml with ZooKeeper settings:
```
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2222</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>ZooKeeperhost1, ZooKeeperhost2, ZooKeeperhost3<value>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/opt/zookeeper<value>
</property>
```
RegionServers: In HBase, horizontal scalability is defined with a term called region. Regions are nothing but a sorted range of rows stored continuously. In HBase architecture, a set of regions is stored on the region server. By default, the region server runs on port 60030. In an HBase cluster based on HDFS, Hadoop DataNodes and RegionServers are typically called slave nodes as they are both responsible for server data and are usually collocated in the cluster. A list of the region servers is specified in the conf/regionservers file with each region server on a separate line, and the start/stop of these region servers is controlled by the script files responsible for an HBase cluster's start/stop.
HBase data storage system: HBase is developed using pluggable architecture; hence, for the data storage layer, HBase is not tied with HDFS. Rather, it can also be plugged in with other file storage systems such as the local filesystem (primarily used in standalone mode), S3 (Amazon's Simple Storage Service), CloudStore (also known as Kosmos filesystem) or a self-developed filesystem.

Apart from the mentioned components, there are other considerations as well, such as hardware and software considerations, that are not within the scope of this book.

Note

The backup HBase Master server and the additional region servers can be started in the pseudo-distributed mode using the utility provided bin directory as follows:

[root@localhost hbase-0.96.2-hadoop2]# bin/local-master-backup.sh 2 3

The preceding command will start the two additional HBase Master backup servers on the same box. Each HMaster server uses three ports (16010, 16020, and 16030 by default) and the new backup servers will be using ports 16012/16022/16032 and 16013/16023/16033.

[root@localhost hbase-0.96.2-hadoop2]# bin/local-regionservers.sh start 2 3

The preceding command will start the two additional HBase region servers on the same box using ports 16202/16302.

Start playing

Now that we have everything installed and running, let's start playing with it and try out a few commands to get a feel of HBase. HBase comes with a command-line interface that works for both local and distributed modes. The HBase shell is developed in JRuby and can run in both interactive (recommended for simple commands) and batch modes (recommended for running shell script programs). Let's start the HBase shell in the interactive mode as follows:

[root@localhost hbase-0.98.7-hadoop2]# bin/hbase shell

The preceding command gives the following output:

Type help and click on return to see a listing of the available shell commands and their options. Remember that all the commands are case-sensitive.

The following is a list of some simple commands to get your hands dirty with HBase:

status: This verifies whether HBase is up and running, as shown in the following screenshot:
create '<table_name>', '<column_family_name>': This creates a table with one column family. We can use multiple column family names as well, as shown in the following screenshot:
list: This provides the list of tables, as shown in the following screenshot:
put '<table_name>', '<row_num>', 'column_family:key', 'value': This command is used to put data in the table in a column family manner, as shown in the following screenshot. HBase is a schema-less database and provides the flexibility to store any type of data without defining it:
get '<table_name>', '<row_num>': This command is used to read a particular row from the table, as shown in the following screenshot:
scan '<table_name >': This scans the complete table and outputs the results, as shown in the following screenshot:
delete '<table_name>', '<row_num>', 'column_family:key': This deletes the specified value, as shown in the following screenshot:
describe '<table_name>': This describes the metadata information about the table, as shown in the following screenshot:
drop '<table_name>': This command will drop the table. However, before executing this command, we should first execute disable '<tablename>', as shown in the following screenshot:
Finally, exit the shell and stop using HBase, as shown in the following screenshot: