In this tutorial, we will learn how to use the HDFS storage balancer effectively.
We will also look at the main options that can be passed to the hdfs balancer command.
HDFS stores data using a ‘Write Once’ paradigm: once a file is written, it can only be appended to, not modified in place.
In production, blocks can end up unevenly distributed across the datanodes of the cluster.
The probable reasons could be:
- Datanode Failure
- Network Lag
- Load Balancing issues
The balancer moves blocks between datanodes until the data is evenly spread across the cluster. Seasoned Hadoop admins recommend running it at least once every 10 days on a cluster that is up around the clock, or once every 5 weeks on a cluster that is brought up only when processing is needed.
Steps to run the balancer:
Step 1: SSH to the Namenode machine using PuTTY or any equivalent tool
Step 2 – 1: Run the balancer with its default settings, that is, hdfs balancer with no options (the default threshold is 10 percent).
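Step 2 – 2: Alternatively, set a threshold explicitly. The threshold is the maximum allowed difference, in percentage points, between the disk utilization of each datanode and the overall utilization of the cluster. With a threshold of 30, for example, the cluster counts as balanced once every datanode is within 30 percentage points of the cluster-wide utilization.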
hdfs balancer -threshold 30
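To make the threshold concrete, here is a minimal sketch of the comparison it implies. It is not the balancer's own code: the function name classify_nodes, the three-column input format and the sample figures are all hypothetical, chosen only for illustration. Each node's utilization is compared against the cluster-wide utilization plus or minus the threshold.

```shell
#!/usr/bin/env bash
# classify_nodes THRESHOLD
# Reads lines of "name used capacity" on stdin (a made-up format for this
# sketch, not the output of any hdfs command) and prints, for each node,
# its utilization in percent and whether it is over, under, or ok
# relative to the cluster-wide utilization +/- THRESHOLD.
classify_nodes() {
  awk -v t="$1" '
    { name[NR] = $1; used[NR] = $2; cap[NR] = $3; tu += $2; tc += $3 }
    END {
      if (tc <= 0) exit 1            # no input or zero capacity
      avg = 100 * tu / tc            # cluster-wide utilization in percent
      for (i = 1; i <= NR; i++) {
        u = 100 * used[i] / cap[i]   # utilization of this node in percent
        if (u > avg + t)      s = "over"
        else if (u < avg - t) s = "under"
        else                  s = "ok"
        printf "%s %.1f %s\n", name[i], u, s
      }
    }'
}

# Hypothetical three-node cluster, cluster-wide utilization 50 percent.
printf 'dn1 900 1000\ndn2 500 1000\ndn3 100 1000\n' | classify_nodes 30
```

With these sample figures and a threshold of 30, dn1 (90 percent) is over-utilized and dn3 (10 percent) is under-utilized, so the balancer would have work to do. With a threshold of 45 all three nodes would already count as balanced.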
Step 2 – 3: You can also raise the number of concurrent block moves to speed up balancing. To do so, add the following property to hdfs-site.xml on the datanodes:
<property>
  <name>dfs.datanode.balance.max.concurrent.moves</name>
  <value>20</value>
</property>
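Please note that the default value of dfs.datanode.balance.max.concurrent.moves is 5 in older Hadoop releases (newer releases ship a higher default, so check hdfs-default.xml for your version). Once the property is set, you can apply it without restarting the datanode service by running the following command:
hdfs dfsadmin -reconfig datanode <dn_addr>:<ipc_port> start
where
dn_addr is the datanode IP address/hostname
ipc_port is the datanode's IPC port, set by dfs.datanode.ipc.address (the default is 50020 in Hadoop 2.x and 9867 in Hadoop 3.x; 50010 is the data transfer port, not the IPC port)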
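On a cluster with many datanodes, the reconfiguration command has to be repeated for each one. The sketch below only prints the commands so they can be reviewed first. The function name reconfig_commands and the hostnames are hypothetical, and port 50020 assumes the Hadoop 2.x default for dfs.datanode.ipc.address. The status subcommand of -reconfig reports whether a reconfiguration task that was started has finished.

```shell
#!/usr/bin/env bash
# reconfig_commands IPC_PORT HOST...
# Prints (does not run) the dfsadmin commands that start a reconfiguration
# on each datanode and then query its status.
reconfig_commands() {
  local port=$1
  shift
  local host
  for host in "$@"; do
    echo "hdfs dfsadmin -reconfig datanode ${host}:${port} start"
    echo "hdfs dfsadmin -reconfig datanode ${host}:${port} status"
  done
}

# Hypothetical hostnames; 50020 is an assumed IPC port.
reconfig_commands 50020 dn1.example.com dn2.example.com
```

Once the printed commands look right, the output can be piped to sh on a machine where the hdfs client is configured.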