In this section, we will learn how to set up Hadoop in Standalone mode. In Standalone mode, all Hadoop operations run in a single JVM: your local file system is used as the storage, and that single JVM performs all MapReduce-related operations. Let us see how to set up the CLI MiniCluster:
Step 1: Ensure package lists are updated.
sudo apt-get update
Step 2: Install Java 7. We are going to use OpenJDK 7; however, you are free to use Oracle JDK 7 instead.
sudo apt-get install openjdk-7-jdk
java -version
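The later steps assume the JDK is installed under /usr/lib/jvm/java-1.7.0-openjdk-amd64. The exact directory name can vary by machine, so you may want to confirm it before setting JAVA_HOME:
ls /usr/lib/jvm/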
Step 3: Install SSH
sudo apt-get install openssh-server
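Standalone mode itself does not need SSH, but if you want to confirm that the SSH server is running before moving on, one quick check on Ubuntu is:
sudo service ssh status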
Step 4: Extract the Hadoop binary tar file that you have built and copied to the home folder. In case you missed how to create your own binaries, you can refer to my post on Building Apache Hadoop 2.8.0 from Scratch.
tar -xvzf hadoop-2.8.0.tar.gz
Step 5: Rename the extracted folder. This is done purely for convenience.
mv hadoop-2.8.0 hadoop2
Step 6: Set up environment variables so that Hadoop executables, configurations, and dependencies can be located. You will need to edit the .bashrc file in the home folder.
vi .bashrc

#Add the below lines at the start of the file.
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
export HADOOP_INSTALL=/home/hadoop/hadoop2
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
export YARN_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
export PATH=$PATH:$HADOOP_CONF_DIR/bin
export PATH=$PATH:$YARN_CONF_DIR/sbin
Step 7: Refresh and apply the environment variables
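Assuming .bashrc was edited in the hadoop user's home folder, one way to apply the changes in the current session is:
source ~/.bashrc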
Step 8: Inform Hadoop where Java is
vi /home/hadoop/hadoop2/libexec/hadoop-config.sh

#Add the following line at the start of file
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
Step 9: Set up the Hadoop environment variables. These are mostly used by the HDFS shell scripts present in the sbin location of the Hadoop framework.
vi /home/hadoop/hadoop2/etc/hadoop/hadoop-env.sh

#Add the following lines at the start of file
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64/
export HADOOP_INSTALL=/home/hadoop/hadoop2
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
Step 10: Set up the YARN variables. These are mostly used by the YARN shell scripts present in the sbin location of the Hadoop framework.
vi /home/hadoop/hadoop2/etc/hadoop/yarn-env.sh

#Add the following lines at the start of file
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64/
export HADOOP_HOME=/home/hadoop/hadoop2
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
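At this point, a quick sanity check is to ask Hadoop for its version. If the PATH and JAVA_HOME settings above have been picked up correctly, the command below should report version 2.8.0 along with the build details:
hadoop version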
With all the environment variables in place, Hadoop is now installed in Standalone mode. Let us now test the installation. To do so, create two Putty sessions: one for monitoring JVMs using the jps command, and the other for executing a MapReduce application in Standalone mode.
Step 11: We will use the example JAR file provided by the framework, which is available in the /home/hadoop/hadoop2/share/hadoop/mapreduce folder. In the first Putty session, type the following command:
watch -n 1 jps
As a result, the jps command will run every second, allowing you to monitor any newly created JVMs. In the second session, type the following command to execute the WordCount program:
hadoop jar /home/hadoop/hadoop2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar wordcount /home/hadoop/hadoop2/README.txt /home/hadoop/OutputWC
The syntax is as follows:
hadoop jar <jar_file> <prog_name> <input> <output>
jar_file – The location of the MapReduce JAR file to be executed on the Hadoop cluster
prog_name – The name of the class that contains the main method
input – The location of the input file to be processed
output – The location of the output folder where the results will be stored
While the program is executing, observe the first Putty session. You will see that a JVM has been invoked; it is entirely responsible for executing your MapReduce program. Officially, this is called the CLI MiniCluster in the Hadoop documentation, where all the required operations are performed in a single JVM.
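Once the job completes, you can inspect the result in the output folder. With the standard examples JAR, the reducer output is typically written to a file named part-r-00000 (the exact file name may vary):
ls /home/hadoop/OutputWC
cat /home/hadoop/OutputWC/part-r-00000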
You have now learned how to set up a CLI MiniCluster. One of the most frequently asked questions about this setup is, “Why do we need to learn this? Nobody uses this kind of setup in production anymore.” You might also question what benefit you gain from it. The answer to both questions is that, in order to set up a multinode cluster, you first need to set up Hadoop in Standalone mode on each node participating in the cluster.