Wednesday, January 29, 2014

Some optimization tricks for Hadoop & MapReduce

Here are some parameters we can use to optimize and utilize Hadoop and MapReduce a bit better.

These parameters and their values are not fixed; testing with different values must be done to tune Hadoop closely to the setup and the type of machines in the cluster. A sample configuration sketch follows the list below.

io.sort.factor --> 64
io.sort.mb --> 254
mapred.reduce.parallel.copies
--> (number of machines * number of mappers)/2 (generally)
mapred.tasktracker.(map|reduce).tasks.maximum
--> map: less than the number of cores (if 8 cores, then 5-10) (generally)
--> reduce: less than the number of mappers (4, 6, or 8) (generally)
--> number of map + reduce slots > number of cores (generally)
mapred.(map|reduce).tasks.speculative.execution --> true
--> the same task is executed on more than one machine in parallel
tasktracker.http.threads
--> HTTP threads should be enough to support parallel copies in the sort and shuffle phase.
Use LZO compression.
Use a combiner.
Implement a custom partitioner.
Input split size --> ~64/128/256 MB (size of each file or block)
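
As a minimal sketch, here is how a few of these settings could look in mapred-site.xml. The values are illustrative starting points assuming an 8-core slave node, not fixed recommendations:

<!-- mapred-site.xml: illustrative starting values, tune per cluster -->
<configuration>
  <property>
    <name>io.sort.mb</name>
    <value>254</value>   <!-- buffer size (MB) for sorting map output -->
  </property>
  <property>
    <name>io.sort.factor</name>
    <value>64</value>    <!-- number of streams merged at once while sorting -->
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>6</value>     <!-- assumes an 8-core node -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>true</value>  <!-- rerun slow map tasks on another machine in parallel -->
  </property>
  <property>
    <name>tasktracker.http.threads</name>
    <value>40</value>    <!-- serve parallel copies in the shuffle phase -->
  </property>
</configuration>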

MySQL: SUBSTRING_INDEX - Select Patterns

Consider a MySQL table having values in a column like the following:

SELECT location  FROM geo LIMIT 3;

"location"
"India.Karnataka.Shimoga.Gopala"
"India.Karnataka.Bengaluru.BTM"
"India.Karnataka.Chikmaglore.Koppa"

My requirement is to take only the 4th value from each of the rows (such as Gopala, BTM, Koppa).
I don't want to display the remaining values.
It is similar to what the 'cut' command does in Linux.

For this, we can use the SUBSTRING_INDEX function.

SELECT SUBSTRING_INDEX(location,'.',-1) AS location FROM geo LIMIT 3;
"location"
"Gopala"
"BTM"
"Koppa"

Syntax: SUBSTRING_INDEX(string, delimiter, count)
Here, count is the number of delimiter occurrences: a positive count returns everything to the left of the count-th occurrence of the delimiter.
A negative count returns everything to the right of the count-th occurrence, counting from the end of the string.

So, if I give '-2' instead of  '-1':

SELECT SUBSTRING_INDEX(location,'.',-2) AS location FROM geo LIMIT 3;
"location"
"Shimoga.Gopala"
"Bengaluru.BTM"
"Chikmaglore.Koppa"

Monday, January 20, 2014

Hadoop over-utilization of HDFS

Do you face the problem of overuse of HDFS on a datanode, where disk usage frequently reaches 100% and results in an imbalanced cluster? To solve this, we can set a parameter called "dfs.datanode.du.reserved". It reserves disk space for non-HDFS use, leaving some space free on each volume and preventing HDFS from over-using the disk.
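
A minimal sketch of how the setting looks in hdfs-site.xml; the value is in bytes per volume, and the 10 GB below is just an assumed example:

<!-- hdfs-site.xml: reserve disk space for non-HDFS use on each datanode volume -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>10737418240</value>  <!-- 10 GB in bytes; example value, tune per disk -->
</property>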

Thursday, January 9, 2014

Hadoop small files and block allocation

One of the misconceptions about Hadoop is that smaller files (smaller than the block size, 64 MB by default) will still use the whole block on the filesystem, and that there will be space wastage on HDFS. This is not true in reality. Smaller files occupy exactly as much disk space as they require (a 1 MB file on the local disk will take roughly the same space on HDFS). But this does not mean that having many small files uses HDFS efficiently. Regardless of a file's size, its metadata on the NameNode occupies roughly the same amount of memory. As a result, a large number of small HDFS files (smaller than the block size) will use a lot of the NameNode's memory, thus negatively impacting HDFS scalability and performance.

So HDFS blocks are not a storage allocation unit, but a replication unit.
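
A quick way to verify this, assuming a running cluster and the hadoop client on the path (file names below are illustrative):

dd if=/dev/zero of=small.bin bs=1M count=1   # create a 1 MB test file locally
hadoop fs -put small.bin /tmp/small.bin      # copy it into HDFS
hadoop fs -du /tmp/small.bin                 # reports ~1 MB used, not the 64 MB block size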

List of different Hadoop distributions

Cloudera CDH,Manager, and Enterprise

Based on Hadoop 2, CDH (version 4.1.2 as of this writing) includes HDFS, YARN, HBase, MapReduce, Hive, Pig, Zookeeper, Oozie, Mahout, Hue, and other open source tools (including Impala, the real-time query engine).

Hortonworks Data Platform

Sunday, January 5, 2014

Increment a variable in the Linux shell

Following are some methods we can use to increment a variable in shell-script loops (a small loop sketch follows the list):

  • j=$((i++))
  • j=$(( i + 1 ))
  • j=`expr $i + 1`
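
A minimal loop sketch using the arithmetic-expansion form. Note that j=$((i++)) post-increments: j receives the old value of i, while the second and third forms set j without changing i at all:

#!/bin/bash
# count from 1 to 5 by incrementing i with arithmetic expansion
i=1
while [ "$i" -le 5 ]; do
    echo "iteration $i"
    i=$(( i + 1 ))   # increments i itself; j=$(( i + 1 )) alone would leave i unchanged
done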
Happy scripting :)
