Integrating Apache Spark with Apache Hive

Apache Hive is one of the most widely used components of the Hadoop ecosystem for data analysis. Its appeal lies in its familiar SQL-like syntax for data crunching, cleansing, and analysis.

Tools used:

  1. Apache Hadoop 2.7.3
  2. Apache Spark 1.6.3
  3. Apache Hive 1.2.2

Steps:

Step 1: Ensure the Hadoop and Spark services are up and running.
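A quick way to verify this is the jps utility, which lists the running JVM processes. A minimal check might look like the following (exact daemon names can vary by distribution and deployment mode):

    # List running JVM processes; the Hadoop and Spark daemons should appear
    jps
    # Typical entries:
    #   NameNode, DataNode, ResourceManager, NodeManager  <- Hadoop
    #   Master, Worker                                    <- Spark standalone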

Step 2: Copy hive-site.xml from the Hive configuration folder (e.g. /home/spark/hive/conf) to the Spark configuration folder (e.g. /home/spark/spark/conf).
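Using the example paths above, the copy is a single command:

    # Make Spark aware of the Hive metastore configuration
    cp /home/spark/hive/conf/hive-site.xml /home/spark/spark/conf/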


Step 3: Copy the MySQL Connector JAR file into the conf folder of Spark, so Spark can reach the MySQL-backed Hive metastore.
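Assuming the connector JAR sits in Hive's lib folder (the exact filename and version will differ on your system), the copy might look like this:

    # Copy the MySQL JDBC driver so Spark can talk to the metastore database
    # (adjust the source path and version to match your installation)
    cp /home/spark/hive/lib/mysql-connector-java-5.1.38.jar /home/spark/spark/conf/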

Step 4: Create an example database and a table in Hive to verify the integration, as sketched below.
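As a sketch, the following creates a database and an emp table with four rows; the database name and schema here are illustrative assumptions, not from the original post:

    # Create sample objects through the Hive CLI
    hive -e "
      CREATE DATABASE IF NOT EXISTS testdb;
      USE testdb;
      CREATE TABLE emp (id INT, name STRING);
      INSERT INTO TABLE emp VALUES (1,'A'), (2,'B'), (3,'C'), (4,'D');
    "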


Step 5: Execute a COUNT on the emp table and note how long it takes.
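On plain Hive this query runs as a MapReduce job, which is where the latency comes from (table names follow the sketch in Step 4):

    # COUNT on Hive alone; this launches a MapReduce job
    hive -e "SELECT COUNT(*) FROM testdb.emp;"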

Step 6: Quit Hive and start the Spark Thrift Server. Type jps and check for SparkSubmit.
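Something like the following starts the Thrift Server and verifies it is up (the Spark home path follows the example used earlier):

    # Start the Spark Thrift Server (the JDBC/ODBC endpoint)
    /home/spark/spark/sbin/start-thriftserver.sh

    # Verify: the Thrift Server appears as "SparkSubmit" in jps output
    jps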

Step 7: Start the Beeline interface and connect it to the Thrift Server on the default port (10000).
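Beeline ships with both Hive and Spark; connecting on the default port looks like this:

    # Connect Beeline to the Spark Thrift Server on port 10000
    # (add -n <username> if your setup requires a login)
    beeline -u jdbc:hive2://localhost:10000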

Step 8: Explore the Hive databases from Beeline.
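The usual HiveQL commands work through the Thrift Server; using the example database assumed in Step 4:

    # Explore Hive databases and tables via Beeline
    beeline -u jdbc:hive2://localhost:10000 -e "SHOW DATABASES;"
    beeline -u jdbc:hive2://localhost:10000 -e "SHOW TABLES IN testdb;"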

Step 9: Run the same COUNT operation, this time through Spark.
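This is the same query as Step 5, now executed by Spark's engine instead of MapReduce:

    # COUNT through the Spark Thrift Server; no MapReduce job is launched
    beeline -u jdbc:hive2://localhost:10000 -e "SELECT COUNT(*) FROM testdb.emp;"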

As demonstrated in Step 5, counting 4 records through plain Hive took close to 30 seconds. After the Spark integration, the same query took close to 3 seconds. This shows the power of Spark: RDDs and in-memory execution are what make batch processing this fast. Hope you liked this tutorial.

Prashant Nair

Big Data Consultant | Author | Corporate Trainer | Technical Reviewer. Passionate about new trends and technologies, and more than a bit geeky. Contact me for training and consulting!
