Monday, September 8, 2014

WPF Architecture



WPF has a multilayered architecture. It consists of three layers: managed code, unmanaged code, and the core operating system. We can think of these layers as the sets of assemblies that build up the entire framework.

PresentationFramework, PresentationCore, and the Media Integration Layer (milcore) are the major components of the WPF architecture.

Friday, August 22, 2014

Limit Disk Usage by Datanodes in Hadoop

In some scenarios the disks attached to a datanode may become over-utilized, and you end up unable to perform any operation on the datanode because no space is left on the system. To avoid this, we have the option of defining a limit on the space the datanode daemon can use, with the following configuration:

<property>
 <name>dfs.datanode.du.reserved</name>
 <value>182400</value>
 <description>Reserved space in bytes per volume. Always leave this much space free for non-DFS use.</description>
</property>
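
After editing hdfs-site.xml, the datanode has to be restarted for the change to take effect. A minimal sketch of applying and verifying it (assuming the stock hadoop-daemon.sh helper is on the PATH and you run this on the datanode host):

# restart the datanode so it re-reads hdfs-site.xml
hadoop-daemon.sh stop datanode
hadoop-daemon.sh start datanode

# the configured capacity reported for each datanode should now exclude the reserved space
hdfs dfsadmin -report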

Tuesday, July 1, 2014

Reading a sequence file, compressed file, or TextRecordInputStream file from Hadoop

As we know, sequence files are binary files of key/value pairs built specially for Hadoop. If we want to read a file on Hadoop we have the -cat option, but it will not show the contents of a sequence or compressed file correctly. In this case we can use the -text option of the hadoop command, as follows:

hadoop fs -text   <file path name/filename>

which will show the correct content of the sequence file.
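
For example, on a job output stored as a sequence file (the path below is only a hypothetical placeholder):

hadoop fs -cat /data/output/part-00000     # prints unreadable binary
hadoop fs -text /data/output/part-00000    # decodes and prints the key/value pairs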

Sunday, February 23, 2014

HBase Backup

HBase backup can be done in two ways: online and offline.

Online Backup:

This is categorized in three ways (command sketches follow the list):
Replication: in this method you need a second cluster, which keeps a replica of the data from the first cluster.
Hadoop/HBase Export command: this runs a MapReduce job to copy a table from one cluster to the same cluster or to another Hadoop cluster, and it does not require any downtime for backing up/exporting the data. In this method we export the data, and if we need to restore it we do so with the Import command.
CopyTable: this is also an online backup method; it copies a table from one cluster to another cluster or within the same cluster.
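
Here is a minimal sketch of the Export/Import and CopyTable invocations (the table name, backup path, and peer ZooKeeper address are hypothetical placeholders):

# export a table to an HDFS directory (runs a MapReduce job, no downtime)
hbase org.apache.hadoop.hbase.mapreduce.Export mytable /backups/mytable

# restore the exported data into an existing table with the same schema
hbase org.apache.hadoop.hbase.mapreduce.Import mytable /backups/mytable

# copy a table to another cluster, identified by its ZooKeeper quorum
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=backupzk:2181:/hbase mytable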

Offline Backup (see the sketch at the end of this section):
Distcp: this is a kind of filesystem backup; it copies a directory from HDFS to the same cluster or to another cluster.
copyToLocal: this is a less reliable way of copying directories from HDFS to a local backup drive. If a large amount of data is involved, a lot of Hadoop tuning is needed for the copy to succeed.

Offline backup methods are full-shutdown backup methods: if you want to copy HBase this way you need to stop your HBase cluster for the backup to succeed, because the files are continuously moved and modified while the cluster is online, and copying in that state may fail.
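
A sketch of both offline options, assuming HBase has already been stopped and using placeholder namenode addresses and paths:

# copy the HBase root directory to another cluster with distcp
hadoop distcp hdfs://nn1:8020/hbase hdfs://backup-nn:8020/hbase-backup

# or pull it down to a local backup drive (less reliable for large data)
hadoop fs -copyToLocal /hbase /mnt/backup/hbase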



Monday, February 10, 2014

Linux: Crontab - Brief


I always get confused whenever I want to set a new cron job. The confusion is with regard to the options to be set!
For those who are new to 'cron', it is nothing but an event scheduler in Linux. That means you can schedule any script to run at any time you want; the system/server just has to be up and running!

A cron job is specific to each user in Linux/Unix. So one can't see another user's crontab unless the necessary privileges or sudo/root access are given.

Anyway, here are the options in cron:


To check cron jobs:

[root@localhost kiran]# crontab -l
no crontab for root


To set cron jobs:

[root@localhost kiran]# crontab -e


After adding, here is how it looks:

[root@localhost kiran]# crontab -l
##Script to test
00 */2 1-31 * 0,2,3   sh /home/kiran/test.sh >> /dev/null

Every cron job is specified with 5 time fields:


  • minute -> 0-59
  • hour -> 0-23
  • day of month -> 1-31
  • month -> 1-12 
  • day of week -> 0-7 (0 is Sunday )

In the above example:

00 -- 0th Minute
*/2 -- Every 2 hours
1-31 -- Every day (1 to 31)
* -- Every Month
0,2,3 -- Sunday,Tuesday,Wednesday
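
As one more example (the script and log paths here are hypothetical), an entry that runs a script every day at 2:30 AM and appends its output to a log file would be:

30 2 * * *   sh /home/kiran/backup.sh >> /var/log/backup.log 2>&1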


Wednesday, January 29, 2014

Some optimization tricks for Hadoop & MapReduce

Here are some parameters we can use to optimize and utilize Hadoop and MapReduce a bit better.

These parameters and their values are not fixed; tests with different parameter values must be done to tune Hadoop closely to the setup and type of machines in the cluster.

io.sort.factor --> 64
io.sort.mb --> 254
mapred.reduce.parallel.copies
--> (number of machines * number of mappers) / 2 (generally)
mapred.tasktracker.(map|reduce).tasks.maximum
--> map: fewer than the number of cores (if 8 cores, then 5-10) (generally)
--> reduce: fewer than the mappers (4-6-8) (generally)
--> number of map + reduce slots > number of cores (generally)
mapred.(map|reduce).tasks.speculative.execution --> true
--> the same task is executed on more than one machine in parallel
tasktracker.http.threads
--> HTTP threads should be enough to support the parallel copies in the sort and shuffle phase.
We can use LZO compression.
Use a combiner.
Implement a custom partitioner.
Input split size
~64/128/256 MB (size of each file or block)
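
A rough sketch of overriding a few of the job-level parameters at submission time (the jar and driver class names are hypothetical, and this assumes the driver uses ToolRunner so that -D options are picked up):

hadoop jar my-job.jar com.example.MyDriver \
  -D io.sort.factor=64 \
  -D io.sort.mb=254 \
  -D mapred.reduce.parallel.copies=20 \
  -D mapred.map.tasks.speculative.execution=true \
  -D mapred.reduce.tasks.speculative.execution=true \
  /input /output

Cluster-wide settings such as mapred.tasktracker.map.tasks.maximum and tasktracker.http.threads go into mapred-site.xml on the tasktrackers instead.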

MySQL: SUBSTRING_INDEX - Select Patterns

Consider, a MySQL table having values in a column like below:

SELECT location  FROM geo LIMIT 3;

"location"
"India.Karnataka.Shimoga.Gopala"
"India.Karnataka.Bengaluru.BTM"
"India.Karnataka.Chikmaglore.Koppa"

My requirement is to take only the 4th value from each of the rows (such as Gopala, BTM, Koppa).
I don't want to display the remaining values.
It is the same as what the 'cut' command does in Linux (a shell equivalent is sketched at the end of this post).

For this, we can use SUBSTRING_INDEX function.

SELECT SUBSTRING_INDEX(location,'.',-1)  from geo LIMIT 3;
"location"
"Gopala"
"BTM"
"Koppa"

Syntax: SUBSTRING_INDEX(string, delimiter, count)
Here count is the number of delimiter-separated fields to return.
A negative value indicates that the fields are counted from the right side.

So, if I give '-2' instead of  '-1':

SELECT SUBSTRING_INDEX(location,'.',-2)  from geo LIMIT 3;
"location"
"Shimoga.Gopala"
"Bengaluru.BTM"
"Chikmaglore.Koppa"

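As noted above, this mirrors the Linux 'cut' command. A rough shell equivalent, assuming the same values live one per line in a plain text file called locations.txt (a hypothetical file):

cut -d'.' -f4 locations.txt
# Gopala
# BTM
# Koppa

cut -d'.' -f3,4 locations.txt
# Shimoga.Gopala
# Bengaluru.BTM
# Chikmaglore.Koppa

Note that cut counts fields from the left, so -f3,4 matches SUBSTRING_INDEX(..., -2) here only because every row has exactly four fields.
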
Monday, January 20, 2014

Hadoop: over-utilization of HDFS

Do you face the problem of HDFS over-use on a datanode, where the disk frequently hits 100% and results in an imbalanced cluster? Thinking of how to solve this problem? What we can do is set a parameter called "dfs.datanode.du.reserved": this reserves disk space for non-HDFS use, leaving some space free for non-HDFS purposes and preventing HDFS from overusing the disk.
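
Note that the reservation only takes effect after a datanode restart. If the cluster is already imbalanced, the balancer can even it out; a minimal sketch:

# move blocks around until every datanode is within 10% of the average utilization
hdfs balancer -threshold 10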

Thursday, January 9, 2014

Hadoop small files and block allocation

One of the misconceptions about Hadoop is that smaller files (smaller than the block size, 64 MB by default) still use a whole block on the filesystem and that there is wasted space on HDFS. This is not true in reality. Smaller files occupy exactly as much disk space as they require (a 1 MB file on the local disk takes roughly the same space on HDFS). But this does not mean that having many small files uses HDFS efficiently: regardless of a file's size, its metadata on the NameNode occupies about the same amount of memory. As a result, a large number of small HDFS files (smaller than the block size) will use a lot of the NameNode's memory, thus negatively impacting HDFS scalability and performance.

So HDFS blocks are not a storage allocation unit, but a replication unit.
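
A quick way to see this for yourself (the file name and path below are hypothetical):

# put a roughly 1 MB file into HDFS
hadoop fs -put small-file.txt /tmp/small-file.txt

# disk usage shows roughly 1 MB, not a full 64 MB block
hadoop fs -du /tmp/small-file.txt

# fsck shows a single block whose length matches the file size
hdfs fsck /tmp/small-file.txt -files -blocks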

List of different Hadoop distributions

Cloudera CDH,Manager, and Enterprise

Based on Hadoop 2, CDH (version 4.1.2 as of this writing) includes HDFS, YARN, HBase, MapReduce, Hive, Pig, Zookeeper, Oozie, Mahout, Hue, and other open source tools (including the real-time query engine — Impala).

Hortonworks Data Platform

Sunday, January 5, 2014

Increment a variable in the Linux shell

The following are methods by which we can increment a variable inside shell-script looping statements (a small usage sketch follows the list):

  • j=$((i++))
  • j=$(( i + 1 ))
  • j=`expr $i + 1`
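
For example, a minimal counting loop using the arithmetic-expansion form (the loop body here is just a placeholder):

#!/bin/bash
i=0
while [ "$i" -lt 5 ]; do
    echo "iteration $i"
    i=$(( i + 1 ))    # increment the counter
done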
Happy scripting :)
