An Outlook to Hadoop MapReduce Framework


MapReduce Operators

Mapper
Reducer
Combiner
Partitioner
Sorting [GroupingComparator and KeyComparator]

MapReduce Framework

Shuffle: The process of moving map outputs to the reducers is known as shuffling.
Sort: Each reduce task is responsible for reducing the values associated with several intermediate keys. Hadoop automatically sorts the set of intermediate keys on a single node before they are presented to the Reducer.
Speculative Execution: As most of the tasks in a job come to a close (nearly 95%), the Hadoop platform schedules redundant copies of the remaining tasks across several nodes that have no other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon those tasks and discard their outputs. Speculative execution is enabled by default.
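
The shuffle and sort steps described above can be illustrated with a minimal sketch in plain Java. It does not use any Hadoop classes; the class name ShuffleSortSketch and the sample data are assumptions made for illustration. It only shows the effect a reducer observes: values grouped per key, with keys in sorted order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; no Hadoop classes are used.
class ShuffleSortSketch {

    // Groups (key, value) map outputs by key. A TreeMap keeps the keys sorted,
    // mimicking the sorted order in which a reducer is presented its keys.
    static Map<String, List<Integer>> shuffleAndSort(List<Map.Entry<String, Integer>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutputs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Made-up map outputs, deliberately out of key order.
        List<Map.Entry<String, Integer>> mapOutputs = List.of(
                Map.entry("pear", 1), Map.entry("apple", 1), Map.entry("pear", 1));
        System.out.println(shuffleAndSort(mapOutputs)); // prints {apple=[1], pear=[1, 1]}
    }
}
```

In a real cluster this grouping happens across machines and on disk; the sketch keeps everything in memory purely to show the ordering guarantee.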

Common MapReduce Operations

Filtering or Grepping
Parsing, Conversion
Counting, Summing
Binning, Collating
Distributed Tasks
Simple Total Sorting
Chained Jobs

Advanced MapReduce Operations

GroupBy
Distinct
Secondary Sort
CoGroup/Join
Distributed Total Sort
Distributed Cache
Reading from HDFS programmatically

Very Advanced Operations

Classification
Clustering
Regression
Dimension Reduction
Evolutionary Algorithms

Combiner

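A Combiner pre-aggregates map output on the map side, before it is shuffled to the reducers. The best part is that we often do not need to write any additional code to take advantage of it: if a reduce function is both commutative and associative, then the same function can be used as a Combiner as well.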
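
To see why commutativity and associativity matter, here is a small plain-Java sketch (no Hadoop classes; CombinerSketch and the sample values are hypothetical). It applies the same summing reduce function once directly, and once as a per-map-task combiner followed by a final reduce, and shows that both give the same result.

```java
import java.util.List;

// Plain-Java illustration only; no Hadoop classes are used.
class CombinerSketch {

    // The reduce function: sums counts. Addition is commutative and associative.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical outputs of two map tasks for the same key.
        List<Integer> mapTask1 = List.of(1, 1, 1);
        List<Integer> mapTask2 = List.of(1, 1);

        // Without a combiner: all five values travel to the reducer.
        int direct = reduce(List.of(1, 1, 1, 1, 1));

        // With the reducer reused as combiner: one partial sum per map task is shuffled.
        int combined = reduce(List.of(reduce(mapTask1), reduce(mapTask2)));

        System.out.println(direct + " " + combined); // prints 5 5
    }
}
```

A function such as a plain average would not pass this check, because an average of partial averages generally differs from the overall average.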

Partitioner

A Partitioner decides which reducer receives each intermediate record, through its getPartition(key, value, numReduceTasks) method. The key and value are the intermediate key and value produced by the map function. The numReduceTasks is the number of reducers used in the MapReduce program, and it is specified in the driver program. The partitioning phase takes place after the map/combine phase and before the reduce phase. The number of partitions is equal to the number of reducers, and the data gets partitioned across the reducers according to the partitioning function. It is possible to have empty partitions with no data (when the partitioning function produces fewer distinct partition numbers than there are reducers). We take the assigned partition number modulo numReduceTasks to avoid illegal partitions when the job runs with fewer reducers than the assigned partition number.
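
The modulo step can be sketched in plain Java as follows. This is an illustration under assumptions, not Hadoop's Partitioner interface itself: PartitionSketch is a hypothetical class, the value argument is omitted because it is not used here, and the hash-then-modulo scheme is similar in spirit to hash-based partitioning.

```java
// Plain-Java illustration only; no Hadoop classes are used.
class PartitionSketch {

    // Maps an intermediate key to a partition number in [0, numReduceTasks).
    // Masking with Integer.MAX_VALUE clears the sign bit, so a negative
    // hashCode can never produce a negative (illegal) partition number.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 4; // assumed reducer count, normally set in the driver
        for (String key : new String[] {"apple", "pear", "plum"}) {
            System.out.println(key + " -> partition " + getPartition(key, numReduceTasks));
        }
    }
}
```

With only three distinct keys and four reducers, at least one partition stays empty, which is the empty-partition case mentioned above.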

Sequence File

SequenceFiles are flat files consisting of binary key/value pairs. SequenceFile provides SequenceFile.Writer, SequenceFile.Reader and SequenceFile.Sorter classes for writing, reading and sorting respectively.

Counting and Summing

For instance, given a log file in which each record contains a response time, calculate the average response time.
Applications: Log Analysis, Data Querying
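
The average response time example can be sketched in plain Java as below. No Hadoop classes are used; AverageSketch, its Partial helper type and the sample response times are all hypothetical. The sketch carries a (sum, count) pair between phases, since that pair can be merged in any order, whereas partial averages cannot.

```java
import java.util.List;

// Plain-Java illustration only; no Hadoop classes are used.
class AverageSketch {

    // Partial aggregate carried between phases (hypothetical helper type).
    static final class Partial {
        final long sum;
        final long count;

        Partial(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        Partial merge(Partial other) {
            return new Partial(sum + other.sum, count + other.count);
        }
    }

    // Map/combine side: fold one input split into a single partial aggregate.
    static Partial combine(List<Integer> responseTimesMs) {
        long sum = 0;
        for (int t : responseTimesMs) {
            sum += t;
        }
        return new Partial(sum, responseTimesMs.size());
    }

    // Reduce side: merge all partials, then divide once at the very end.
    static double reduce(List<Partial> partials) {
        Partial total = new Partial(0, 0);
        for (Partial p : partials) {
            total = total.merge(p);
        }
        return (double) total.sum / total.count;
    }

    public static void main(String[] args) {
        Partial split1 = combine(List.of(100, 200)); // made-up response times in ms
        Partial split2 = combine(List.of(300));
        System.out.println(reduce(List.of(split1, split2))); // prints 200.0
    }
}
```

Averaging the two split averages (150 and 300) would give 225 instead, which is why the division is deferred to the reduce step.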

Binning and Collating

IN>> word : list of page numbers containing word
Applications: Inverted Indexes, ETL, Max, Min
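
As an illustration of collating, the following plain-Java sketch builds a small inverted index that maps each word to the page numbers containing it. It uses no Hadoop classes; InvertedIndexSketch, the whitespace tokenization and the sample pages are assumptions made for the example.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java illustration only; no Hadoop classes are used.
class InvertedIndexSketch {

    // Input: page number -> page text. Output: word -> sorted page numbers.
    // The inner loop plays the role of the mapper (emit word, page);
    // collecting pages per word plays the role of the reducer.
    static Map<String, SortedSet<Integer>> build(Map<Integer, String> pages) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> page : pages.entrySet()) {
            for (String word : page.getValue().toLowerCase().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // skip the empty token produced by leading whitespace
                }
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(page.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Made-up pages for illustration.
        Map<Integer, String> pages = Map.of(1, "hadoop map reduce", 2, "map side join");
        System.out.println(build(pages));
        // prints {hadoop=[1], join=[2], map=[1, 2], reduce=[1], side=[2]}
    }
}
```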

Filtering or Grepping

Applications: Log Analysis, Data Querying, ETL, Data Validation

Distributed Task

There is a large computational problem that can be divided into multiple parts, and the results from all parts can be combined to obtain a final result.
Applications: Physical and Engineering Simulations, Numerical Analysis, Performance Testing
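
A minimal plain-Java sketch of this divide-and-combine pattern is shown below, using numerical integration as the example problem. It does not use Hadoop; DistributedTaskSketch, the choice of integrating x^2 over [0, 1] and the step counts are all assumptions for illustration. Each sub-interval is an independent part, and the final sum plays the role of the reduce step.

```java
import java.util.stream.IntStream;

// Plain-Java illustration only; no Hadoop classes are used.
class DistributedTaskSketch {

    // One independent part: midpoint-rule integral of x^2 over [a, b].
    static double integratePart(double a, double b, int steps) {
        double h = (b - a) / steps;
        double sum = 0;
        for (int i = 0; i < steps; i++) {
            double x = a + (i + 0.5) * h;
            sum += x * x * h;
        }
        return sum;
    }

    // Splits [0, 1] into equal parts, evaluates them in parallel ("map"),
    // and adds the partial results together ("reduce").
    static double integrate(int parts, int stepsPerPart) {
        return IntStream.range(0, parts)
                .parallel()
                .mapToDouble(p -> integratePart((double) p / parts, (double) (p + 1) / parts, stepsPerPart))
                .sum();
    }

    public static void main(String[] args) {
        // The exact value of the integral is 1/3; the printed value should be close to it.
        System.out.println(integrate(4, 1000));
    }
}
```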

Chaining jobs

Method 1: First create the JobConf object “job1” for the first job and set all the parameters, with “input” as the input directory and “temp” as the output directory. Execute this job with JobClient.runJob(job1), which blocks until the job completes. Immediately below it, create the JobConf object “job2” for the second job and set all the parameters, with “temp” as the input directory and “output” as the output directory. Finally, execute the second job: JobClient.runJob(job2).
Method 2: Create two JobConf objects and set all the parameters in them just as in Method 1, except that you do not call JobClient.runJob. Then create two Job objects with the JobConfs as parameters. Using the JobControl object, you specify the job dependencies and then run the jobs:

 // Wrap each JobConf in an org.apache.hadoop.mapred.jobcontrol.Job
 Job job1 = new Job(jobconf1);
 Job job2 = new Job(jobconf2);
 JobControl jbcntrl = new JobControl("jbcntrl");
 jbcntrl.addJob(job1);
 jbcntrl.addJob(job2);
 job2.addDependingJob(job1); // job2 runs only after job1 succeeds
 // JobControl is a Runnable whose run() loops until stop() is called
 new Thread(jbcntrl).start();
 while (!jbcntrl.allFinished()) Thread.sleep(500);
 jbcntrl.stop();

Distinct Values (Unique Items Counting)

More to come here…

WhiteBox Developer
