Apache Hive is a widely used Hadoop ecosystem component for data analysis. Its appeal lies in reusing familiar SQL-like syntax (HiveQL) for data crunching, cleansing, and analysis.
Tools used:
- Apache Hadoop 2.7.3
- Apache Spark 1.6.3
- Apache Hive 1.2.2
Step 1: Ensure the Hadoop and Spark services are up and running.
Step 2: Copy hive-site.xml from the Hive configuration folder (e.g. /home/spark/hive/conf) to the Spark configuration folder (e.g. /home/spark/spark/conf).
Step 3: Copy the MySQL Connector JAR file into the conf folder of Spark. Spark needs this driver when the Hive metastore is backed by MySQL.
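Steps 2 and 3 can be sketched as a small shell helper. The two directories are the example paths from Step 2. The JAR file name is a hypothetical placeholder, so point it at the Connector JAR you actually downloaded.

```shell
# Example locations from Step 2; adjust them to your installation.
HIVE_CONF_DIR="${HIVE_CONF_DIR:-/home/spark/hive/conf}"
SPARK_CONF_DIR="${SPARK_CONF_DIR:-/home/spark/spark/conf}"
# Hypothetical file name: replace with the path to your MySQL Connector JAR.
MYSQL_JAR="${MYSQL_JAR:-$HOME/mysql-connector-java.jar}"

# Copy Hive's configuration and the MySQL driver into Spark's conf folder.
# The second copy runs only if the first one succeeded.
copy_hive_integration_files() {
  cp "$HIVE_CONF_DIR/hive-site.xml" "$SPARK_CONF_DIR/" &&
  cp "$MYSQL_JAR" "$SPARK_CONF_DIR/"
}

# Example call:
#   copy_hive_integration_files && ls "$SPARK_CONF_DIR"
```

Listing the Spark conf folder afterwards is a quick way to confirm that both files arrived.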
Step 4: Create an example database and a table in Hive to verify that the integration works.
Step 5: Run a COUNT query on the emp table in Hive and note how long it takes to return.
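A minimal sketch of Steps 4 and 5 follows. The original screenshots are not reproduced here, so the database name demo_db, the column layout of emp, and the four sample rows are all hypothetical. Only the table name emp and the row count of 4 come from this tutorial. The helper writes the HiveQL to a script file so it can be run in one go.

```shell
# Hypothetical script location; any writable path works.
HQL_FILE="${HQL_FILE:-/tmp/emp_demo.hql}"

# Write the HiveQL for Steps 4 and 5 to a script file.
write_emp_demo_hql() {
  cat > "$HQL_FILE" <<'EOF'
-- demo_db, the columns, and the sample rows are illustrative only.
CREATE DATABASE IF NOT EXISTS demo_db;
USE demo_db;
CREATE TABLE IF NOT EXISTS emp (id INT, name STRING, salary DOUBLE);
-- Re-running the script appends these four rows again.
INSERT INTO TABLE emp VALUES
  (1, 'emp_a', 1000.0), (2, 'emp_b', 2000.0),
  (3, 'emp_c', 3000.0), (4, 'emp_d', 4000.0);
-- Step 5: note the time Hive reports for this query.
SELECT COUNT(*) FROM emp;
EOF
}

# Example call:
#   write_emp_demo_hql && hive -f "$HQL_FILE"
```

The Hive CLI prints a "Time taken" line after each statement, which gives the baseline duration for the later comparison.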
Step 6: Quit Hive and start the Spark Thrift Server. Then run jps and check that a SparkSubmit process appears in the output.
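Step 6 might look like the sketch below. It assumes SPARK_HOME matches the example layout from Step 2 and that jps is on the PATH. The function names are hypothetical helpers, not part of Spark.

```shell
# Assumed to match the example layout from Step 2; adjust as needed.
SPARK_HOME="${SPARK_HOME:-/home/spark/spark}"

# Launch the Thrift Server with the script shipped in Spark's sbin folder.
start_thrift_server() {
  "$SPARK_HOME/sbin/start-thriftserver.sh"
}

# Succeeds when jps lists a SparkSubmit process.
# Other Spark applications also show up as SparkSubmit, so treat this
# as a quick sanity check rather than proof.
thrift_server_running() {
  jps | grep -q "SparkSubmit"
}

# Example calls:
#   start_thrift_server
#   thrift_server_running && echo "Thrift Server is up"
```

If the server fails to load the MySQL driver, note that start-thriftserver.sh accepts the usual spark-submit options, so the JAR can also be passed explicitly with --driver-class-path.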
Step 7: Now start the Beeline interface and connect it to the Thrift Server on the default port (10000).
Step 8: Explore the Hive databases from Beeline.
Step 9: Let’s run the same COUNT query again.
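Steps 7 to 9 can be sketched with Beeline's non-interactive mode. This assumes the Thrift Server runs on the local machine and that the Beeline client shipped in Spark's bin folder is used. The helper name and demo_db are hypothetical, so substitute the database you created in Step 4.

```shell
SPARK_HOME="${SPARK_HOME:-/home/spark/spark}"
BEELINE="${BEELINE:-$SPARK_HOME/bin/beeline}"
# Local Thrift Server on the default port from Step 7.
JDBC_URL="${JDBC_URL:-jdbc:hive2://localhost:10000}"

# Run one SQL statement through Beeline and print the result.
# Each call opens its own connection, so the queries use qualified names.
run_in_beeline() {
  "$BEELINE" -u "$JDBC_URL" -e "$1"
}

# Example calls (demo_db is a hypothetical database name):
#   run_in_beeline "SHOW DATABASES;"
#   run_in_beeline "SHOW TABLES IN demo_db;"
#   run_in_beeline "SELECT COUNT(*) FROM demo_db.emp;"
```

In an interactive Beeline session, the equivalent is to type !connect jdbc:hive2://localhost:10000 at the prompt and then enter the same statements. Beeline reports the elapsed time with each result, which is the figure to compare against Step 5.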
As Step 5 showed, counting 4 records through Hive took close to 30 seconds. The same count through Spark in Step 9 took close to 3 seconds. This shows the power of Spark: its RDD-based, in-memory execution avoids much of the job startup overhead that Hive pays when it launches a batch job for each query. I hope you liked this tutorial.