Learning Hive-HBase Integration

Most people get frustrated working with the native HBase shell commands for data interaction. Don't worry, don't stress!

There are two other techniques for interacting with HBase data:

  1. Using Hive-HBase integration
  2. Using Apache Phoenix

In this blog, we will see how to perform Hive-HBase integration.

Step1: Ensure Hive and HBase are installed. This blog assumes a single-node cluster; however, the steps are the same on a multi-node cluster. You can also try this in a Hortonworks Sandbox or Cloudera CDH QuickStart VM.

If you have not yet installed Hive & HBase, you can find the installation steps in the links given below:

Hive Installation:

HBase Installation:

Step2: Ensure Hadoop and HBase services are live and active

Step3: Start the HBase shell and create the table in HBase that we will integrate with Hive

hbase shell

Version 1.3.1, r930b9a55528fe45d8edce7af42fef2d35e77677a, Thu Apr  6 19:36:54 PDT 2017

hbase(main):001:0> create 'hr_hbase','employee'

0 row(s) in 1.4930 seconds

=> Hbase::Table - hr_hbase
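
Before inserting data, you can optionally confirm that the table and its column family exist by running the following commands in the same HBase shell session (output will vary with your HBase version):

```
list
describe 'hr_hbase'
```

`list` shows all user tables, and `describe` shows the table's column families and their settings.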

Step4: Let us insert some data into the employee column family. For this scenario, we will use empid as the row key.

hbase(main):002:0> put 'hr_hbase','1','employee:fname','Prashant'

0 row(s) in 0.1260 seconds

 

hbase(main):003:0> put 'hr_hbase','1','employee:lname','Nair'

0 row(s) in 0.0030 seconds

 

hbase(main):004:0> put 'hr_hbase','1','employee:city','Mumbai'

0 row(s) in 0.0060 seconds

 

hbase(main):005:0> put 'hr_hbase','1','employee:state','Maharashtra'

0 row(s) in 0.0050 seconds

 

hbase(main):006:0> put 'hr_hbase','1','employee:country','India'

0 row(s) in 0.0030 seconds

 

hbase(main):007:0> put 'hr_hbase','2','employee:fname','Utkarsha'

0 row(s) in 0.0050 seconds

 

hbase(main):008:0> put 'hr_hbase','2','employee:lname','Nair'

0 row(s) in 0.0160 seconds

 

hbase(main):009:0> put 'hr_hbase','2','employee:city','Aurangabad'

0 row(s) in 0.0030 seconds

 

hbase(main):010:0> put 'hr_hbase','2','employee:state','Maharashtra'

0 row(s) in 0.0040 seconds

 

hbase(main):011:0> put 'hr_hbase','2','employee:country','India'

0 row(s) in 0.0030 seconds

hbase(main):012:0> scan 'hr_hbase'

ROW          COLUMN+CELL

 1           column=employee:city, timestamp=1498254419237, value=Mumbai
 1           column=employee:country, timestamp=1498254446996, value=India
 1           column=employee:fname, timestamp=1498254396342, value=Prashant
 1           column=employee:lname, timestamp=1498254407336, value=Nair
 1           column=employee:state, timestamp=1498254435503, value=Maharashtra
 2           column=employee:city, timestamp=1498254493925, value=Aurangabad
 2           column=employee:country, timestamp=1498254510077, value=India
 2           column=employee:fname, timestamp=1498254464771, value=Utkarsha
 2           column=employee:lname, timestamp=1498254470755, value=Nair
 2           column=employee:state, timestamp=1498254501865, value=Maharashtra

2 row(s) in 0.0360 seconds

hbase(main):013:0> exit

Step5: Start Hive. Create an external table and link it to the HBase table.

hive

hive> show databases;

OK

bigdataclassmumbai

default

Time taken: 0.02 seconds, Fetched: 2 row(s)

hive> use bigdataclassmumbai;

OK

Time taken: 0.026 seconds

hive> create EXTERNAL table hr_hive (empid int,fname string,lname string,city string,state string,country string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES
    > ("hbase.columns.mapping" = ":key,employee:fname,employee:lname,employee:city,employee:state,employee:country")
    > TBLPROPERTIES ("hbase.table.name" = "hr_hbase");
OK
Time taken: 0.759 seconds
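
Writes can also go the other way: a row inserted through Hive lands in the underlying HBase table. A sketch (assumes Hive 0.14+ for the INSERT ... VALUES syntax; the empid 3 row is a hypothetical example):

```sql
-- Inserts a new row through Hive; it is stored in the hr_hbase table in HBase
INSERT INTO TABLE hr_hive
VALUES (3, 'Asha', 'Nair', 'Pune', 'Maharashtra', 'India');
```

A subsequent scan 'hr_hbase' in the HBase shell should then show a third row with row key 3.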

Step6: Since we have already entered data in HBase, let us confirm that the same data is visible in Hive.

hive> show tables;
OK

hr_hive

Time taken: 0.037 seconds, Fetched: 1 row(s)

hive> select * from hr_hive;

OK

1       Prashant        Nair    Mumbai  Maharashtra     India
2       Utkarsha        Nair    Aurangabad      Maharashtra     India

Time taken: 0.781 seconds, Fetched: 2 row(s)
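
Since hr_hive is just a mapping over the live HBase data, ordinary HiveQL filters work too. For example (a sketch; whether the row-key predicate is pushed down to HBase as a direct lookup depends on your Hive version):

```sql
-- Filter on the row-key column defined by the :key mapping
SELECT fname, city FROM hr_hive WHERE empid = 2;
```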

And that is how you complete a successful integration. Hope you liked this tutorial.

Prashant Nair

Bigdata Consultant | Author | Corporate Trainer | Technical Reviewer. Passionate about new trends and technologies. Contact me for training and consulting!

9 thoughts on “Learning Hive-HBase Integration”

  1. Hey Prashant, I see Hive and HBase on the same node here. When Hive and HBase reside on two different nodes, how will Hive know the whereabouts of the HBase node?

    I was also keen on knowing how I can connect Elasticsearch with Hive. The stack is as follows:

    HBase --> Hive --> Elasticsearch

    Right now I am using HBaseStorageHandler to transfer data from HBase to Hive,
    and then ElasticStorageHandler to transfer data from Hive to Elasticsearch.

    Is there any other way to transfer data from HBase to Elasticsearch directly?

    So the following two questions concern me from the above context.
    1. When Hive and HBase reside on two different nodes, how will Hive know the whereabouts of the HBase node?
    2. Is there any other way to transfer data from HBase to Elasticsearch directly?

    1. Question 1. When Hive and HBase reside on two different nodes, how will Hive know the whereabouts of the HBase node?

      Out of the box, Hive assumes HBase's ZooKeeper quorum is reachable locally, which is why the integration works seamlessly when both components are installed on the same server. If the services are installed on different nodes, you can use either of the following two methods to complete the integration.

      Method 1:
      You need to configure your hive-site.xml with the following properties:

      Property Name: hbase.zookeeper.quorum
      Property Value: 192.168.1.1
      Description: A comma-separated list (no spaces) of the IP addresses of all ZooKeeper servers in the cluster.

      Property Name: hbase.zookeeper.property.clientPort
      Property Value:2181
      Description: The ZooKeeper client port

      HBaseStorageHandler contacts ZooKeeper to find out where the HBase HMaster is running. Once it has that information, it performs the necessary operations.
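
      For reference, the corresponding hive-site.xml snippet would look like this (the IP addresses shown are placeholders for your own ZooKeeper nodes):

      ```xml
      <property>
        <name>hbase.zookeeper.quorum</name>
        <value>192.168.1.1,192.168.1.2,192.168.1.3</value>
        <description>Comma-separated list (no spaces) of ZooKeeper servers</description>
      </property>
      <property>
        <name>hbase.zookeeper.property.clientPort</name>
        <value>2181</value>
        <description>ZooKeeper client port</description>
      </property>
      ```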

      Method 2:

      Start hive shell using the following command in Linux terminal:

      env HIVE_OPTS="-hiveconf hbase.zookeeper.quorum=zookeeper1,zookeeper2,zookeeper3" hive

      where zookeeper1, zookeeper2, zookeeper3 are the ZooKeeper server addresses. Hive will then know where to find HBase.

      Question 2. Is there any other way around to transfer data from Hbase to ElasticSearch directly ?

      The approach you are following is the correct one. Another method you can try is:

      HBase —> Phoenix —> ElasticSearch

      You can follow the tutorial at https://lessc0de.github.io/connecting_hbase_to_elasticsearch.html for more details. The only condition is that Apache Phoenix must be installed to achieve this. However, some production systems discourage using Apache Phoenix since it can introduce breaking changes.

      Hope this helps! Stay tuned and don't forget to subscribe to the website for updates!

      1. Hey, thanks for the quick reply, it was very helpful. I did try Phoenix as a mediator between HBase and Elasticsearch, but it's not supported in the current version of Elasticsearch. If you come across any other solution, do let me know. Thanks 🙂

  2. I wish to use HBase as the back-end database in one of my projects. I want to name the versions of a particular column qualifier according to application needs, i.e. customized names for the versions.

    For example, consider the following scenario:

    There is a cricket match between two teams. There are six balls in an over, but the bowler bowls a no ball and a wide ball, so the over becomes 8 balls.

    Now I wish to record details of 8 balls instead of six, so I was planning to keep the HBase row key as the over id, add a column qualifier representing over_number, with each over_number having different versions, where the versions represent the ball number.

    So the column family goes as follows :

    Here the column family name is "bowling", and there will be another column family named "batting".
    rowKey = "TournamentName - MatchNumber - TeamName - OverNumber"

    cf = bowling, bowling:name=overnumber bowling:value=over1 bowling:runs=1, bowling:wide=0, bowling:noball=0, bowling:remarks=single, version= 0.1
    cf = bowling, bowling:name=overnumber bowling:value=over1 bowling:runs=2, bowling:wide=0, bowling:noball=0, bowling:remarks=double, version= 0.2
    cf = bowling, bowling:name=overnumber bowling:value=over1 bowling:runs=4, bowling:wide=0, bowling:noball=0, bowling:remarks=four version= 0.3
    cf = bowling, bowling:name=overnumber bowling:value=over1 bowling:runs=2, bowling:wide=0, bowling:noball=0, bowling:remarks=double, version= 0.4
    cf = bowling, bowling:name=overnumber bowling:value=over1 bowling:runs=2, bowling:wide=0, bowling:noball=0, bowling:remarks=double, version= 0.5
    cf = bowling, bowling:name=overnumber bowling:value=over1 bowling:runs=2, bowling:wide=0, bowling:noball=0, bowling:remarks=double, version= 0.6
    "cf = bowling, bowling:name=overnumber bowling:value=over1 bowling:runs=1, bowling:wide=1, bowling:noball=0, bowling:remarks=wide, version= 0.7" <-- this is what i want to achieve

    1. So I thought, since HBase supports dynamic columns, why not use HBase for this purpose? Am I going in the right direction?

      One HBase table for all tournaments of a game, i.e. cricket.

      Thanks

  3. Hi
    Can you please help me fix this issue?
    hbase(main):004:0* create 'employee', 'sal_details'

    ERROR: Can't get master address from ZooKeeper; znode data == null
