Thursday, June 4, 2015

All About the Hadoop Balancer


Hadoop Data Balancing



Hadoop Balancer:

The balancer is a tool that evens out disk usage across a Hadoop cluster. Over time some nodes can become overutilized while others are underutilized. This typically happens after new nodes are added, since the new nodes start out nearly empty, or when the cluster has too few nodes and the existing ones fill up. Only one balancer instance runs in a cluster at a time: a second instance exits when it detects that another balancer is already running, so starting it from several machines does not speed up balancing. Balancing speed is governed by the bandwidth setting described further down, and raising that setting sharply increases bandwidth usage.
Running the tool requires administrator rights on the Hadoop cluster.




Syntax of the balancer:

bin/start-balancer.sh [-threshold <threshold>]

Here start-balancer.sh resides in the bin directory of the Hadoop installation folder. The threshold parameter sets the balancing target: it is a percentage of disk capacity, between 0 and 100, and defaults to 10% when no threshold value is passed.
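
As a minimal sketch of the syntax above, the helper below validates a threshold and prints the command it would run, without actually starting the balancer. The function name balancer_cmd is hypothetical and not part of Hadoop; it assumes the command is issued from the Hadoop installation folder.

```shell
#!/bin/sh
# Hypothetical helper, not part of Hadoop: validates the threshold and
# prints the balancer command instead of running it (a dry run).
balancer_cmd() {
  threshold="${1:-10}"   # fall back to the 10% default when no value is given
  case "$threshold" in
    *[!0-9.]*)
      # reject anything other than digits and a decimal point
      echo "threshold must be a number, got: $threshold" >&2
      return 1
      ;;
  esac
  echo "bin/start-balancer.sh -threshold $threshold"
}

balancer_cmd 5    # a tighter target than the default
balancer_cmd      # no argument: uses 10
```

Printing the command first makes it easy to review the threshold before running the real script on a busy cluster.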

The balancer moves blocks between nodes, which generates network traffic. On a production cluster it must be used with caution, as it can lead to missing-block errors or slow responses from the cluster.
The process can be stopped at any time with the following command:


bin/stop-balancer.sh

It can also be stopped on the machine where it is running using:

bin/hadoop-daemon.sh stop balancer

These commands can be used at any time to stop the balancing process, for example in case of errors or delayed responses. It is advisable to run the balancer when there is little activity on the cluster.

A cluster is said to be balanced if, for each Datanode, the utilization of the node (ratio of used space at the node to total capacity of the node) differs from the utilization of the cluster (ratio of used space in the cluster to total capacity of the cluster) by no more than the threshold value. The smaller the threshold, the more balanced the cluster will become, but the balancer takes longer to run for small threshold values. With a very small threshold the cluster may also never reach the balanced state while applications are writing and deleting files concurrently.
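
The balance condition can be checked by hand for a single node. The sketch below is only an illustration of the arithmetic, not the balancer's own code: node_balanced is a hypothetical function, the numbers are made-up sizes in GB, and the threshold is taken as a percentage.

```shell
#!/bin/sh
# Hypothetical check of the balance condition for one Datanode.
# Usage: node_balanced NODE_USED NODE_CAPACITY CLUSTER_USED CLUSTER_CAPACITY THRESHOLD_PCT
# Returns 0 (success) if the node is within the threshold, 1 otherwise.
node_balanced() {
  awk -v nu="$1" -v nc="$2" -v cu="$3" -v cc="$4" -v t="$5" 'BEGIN {
    node = 100 * nu / nc        # node utilization in percent
    cluster = 100 * cu / cc     # cluster utilization in percent
    diff = node - cluster
    if (diff < 0) diff = -diff  # absolute difference
    exit ((diff <= t) ? 0 : 1)
  }'
}

# Cluster: 6000 GB used of 10000 GB, i.e. 60% utilization.
# A node at 650/1000 GB (65%) is within a 10% threshold ...
node_balanced 650 1000 6000 10000 10 && echo "node A is balanced"
# ... while a node at 800/1000 GB (80%) is not.
node_balanced 800 1000 6000 10000 10 || echo "node B needs balancing"
```

With a threshold of 10, every node in this example cluster would have to sit between 50% and 70% utilization for the cluster to count as balanced.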

Running the balancer always increases network activity, and if used aggressively it may cause network congestion and degraded response times. The bandwidth it uses can therefore be limited with the parameter “dfs.balance.bandwidthPerSec”. Its default is defined in hdfs-default.xml, and the value can be overridden by adding the property to hdfs-site.xml. The value is given in bytes per second; the default is 1 MB/s (1048576 bytes per second) and can be adjusted to suit the available network bandwidth.
It can be added as follows:

<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>1000000</value>
</property>

If activity on the cluster is low, this value can be raised to speed up balancing; if activity is high, it can be lowered to avoid network congestion and errors.
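
Because the property takes a value in bytes per second, it helps to convert from MB/s before editing hdfs-site.xml. The snippet below is a small sketch of that conversion; mb_per_sec_to_bytes is a hypothetical helper name, and it assumes 1 MB = 1024 * 1024 bytes.

```shell
#!/bin/sh
# Hypothetical helper: convert a bandwidth in MB/s (whole number)
# to the bytes-per-second value the property expects.
mb_per_sec_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

mb_per_sec_to_bytes 1     # prints 1048576
mb_per_sec_to_bytes 10    # prints 10485760
```

The printed number goes into the <value> element of the property. On releases that provide it, the same byte value can also be passed to hdfs dfsadmin -setBalancerBandwidth to change the limit without editing the configuration file.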

The balancer is worth running when:
·         A rack is added to the cluster.
·         A node is added to the cluster.
·         Some disks are underutilized.

Before running it, adjust the default bandwidth value to suit your network.

To be avoided:

·         Don’t set the threshold too low or too high.
·         Don’t run the balancer on the Namenode machine; run it from a Datanode machine instead.

Note:

The balancer stops and exits automatically when:
·         The cluster is already balanced, or balancing has finished.
·         There is no block to move, or no block can be moved.
·         No block could be moved in 3 consecutive tries.


Happy Balancing :)
