On Traffic-Aware Partition and Aggregation in MapReduce for Big Data Applications

The MapReduce programming model simplifies large-scale data processing on commodity clusters by exploiting parallel map tasks and reduce tasks. Although many efforts have been made to improve the performance of MapReduce jobs, they largely ignore the network traffic generated in the shuffle phase, which plays a critical role in job performance. Traditionally, a hash function is used to partition intermediate data among reduce tasks. This is not traffic-efficient, because neither the network topology nor the data size associated with each key is taken into consideration. In this paper, we study how to reduce the network traffic cost of a MapReduce job by designing a novel intermediate data partition scheme. Furthermore, we jointly consider the aggregator placement problem, where each aggregator can reduce the merged traffic from multiple map tasks. A decomposition-based distributed algorithm is proposed to deal with the large-scale optimization problem for big data applications, and an online algorithm is designed to adjust data partition and aggregation dynamically.

EXISTING SYSTEM

  • Intermediate data are shuffled according to a hash function in Hadoop.
  • A combiner has already been adopted by Hadoop, but it operates immediately after a map task and only on the data that task generated.
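
To make the hash-based shuffle concrete, the following is a minimal sketch in plain Java with no Hadoop dependency. The class and method names are hypothetical; only the arithmetic in `partition` mirrors what Hadoop's default hash partitioner computes. Note that the reduce task is chosen from the key's hash alone.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: hypothetical class, not part of Hadoop.
class HashPartitionSketch {

    // Mask the sign bit of the hash, then take the remainder by the
    // number of reduce tasks (the arithmetic of a default hash partitioner).
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Assigns every key to a reduce task using only the key's hash.
    // Neither the data size per key nor the map task location is consulted.
    static Map<String, Integer> assign(String[] keys, int numReduceTasks) {
        Map<String, Integer> assignment = new LinkedHashMap<String, Integer>();
        for (String key : keys) {
            assignment.put(key, partition(key, numReduceTasks));
        }
        return assignment;
    }
}
```

Because the inputs to `partition` are just the key and the task count, a key carrying a large volume of data can land on a reduce task far from the map tasks that produced it.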

 

Disadvantages

  • Shuffling intermediate data by hash leads to large network traffic, because it ignores the network topology and the data size associated with each key.
  • The combiner fails to exploit data aggregation opportunities among multiple tasks on different machines.
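
The second disadvantage can be illustrated with a small word-count style sketch. All names here are hypothetical, and the "aggregator" is simply modeled as a merge over the combined outputs of several map tasks. The sketch counts how many key/value records would leave the map side in each case.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: hypothetical names, word-count style data.
class CombinerScopeSketch {

    // Combiner: merges counts for equal keys within ONE map task's output.
    static Map<String, Integer> combine(String[] words) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String w : words) {
            Integer c = counts.get(w);
            counts.put(w, c == null ? 1 : c + 1);
        }
        return counts;
    }

    // Records shipped when each map task runs only its own combiner.
    static int recordsWithCombinerOnly(String[][] mapOutputs) {
        int total = 0;
        for (String[] out : mapOutputs) {
            total += combine(out).size();
        }
        return total;
    }

    // Records shipped when an assumed aggregator first merges the
    // combined outputs of all the given map tasks.
    static int recordsWithAggregator(String[][] mapOutputs) {
        Map<String, Integer> merged = new HashMap<String, Integer>();
        for (String[] out : mapOutputs) {
            for (Map.Entry<String, Integer> e : combine(out).entrySet()) {
                Integer c = merged.get(e.getKey());
                merged.put(e.getKey(), c == null ? e.getValue() : c + e.getValue());
            }
        }
        return merged.size();
    }
}
```

For two map tasks emitting `{"x", "x", "y"}` and `{"x", "y", "y"}`, the combiner-only path ships four records (two per task), while merging across both tasks ships two.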

 


PROPOSED SYSTEM

  • Data partition and aggregation for a MapReduce job are considered jointly, with the objective of minimizing the total network traffic.
  • A distributed algorithm is proposed for big data applications by decomposing the original large-scale problem into several subproblems that can be solved in parallel.
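
The idea of splitting a large problem into subproblems that run in parallel can be sketched as follows. This is not the decomposition used in the paper; it is a deliberately simplified illustration that assumes each key can be placed independently, that reduce tasks have no capacity limits, and that traffic cost is data size multiplied by a distance between map task and reduce task. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only: a simplified per-key decomposition.
class PerKeySubproblemSketch {

    // Cost of sending one key's data to a reducer: sum over map tasks of
    // (size of this key at the map task) x (distance to the reducer).
    static long cost(long[] sizeAtMapTask, int[][] distance, int reducer) {
        long total = 0;
        for (int m = 0; m < sizeAtMapTask.length; m++) {
            total += sizeAtMapTask[m] * distance[m][reducer];
        }
        return total;
    }

    // Subproblem for one key: pick the reducer with the lowest cost.
    static int bestReducer(long[] sizeAtMapTask, int[][] distance) {
        int best = 0;
        for (int r = 1; r < distance[0].length; r++) {
            if (cost(sizeAtMapTask, distance, r) < cost(sizeAtMapTask, distance, best)) {
                best = r;
            }
        }
        return best;
    }

    // Solves all per-key subproblems in parallel on a small thread pool.
    static int[] solveAll(final long[][] sizeByKey, final int[][] distance) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Integer>> futures = new ArrayList<Future<Integer>>();
            for (final long[] sizes : sizeByKey) {
                futures.add(pool.submit(new Callable<Integer>() {
                    public Integer call() {
                        return bestReducer(sizes, distance);
                    }
                }));
            }
            int[] result = new int[sizeByKey.length];
            for (int k = 0; k < result.length; k++) {
                result[k] = futures.get(k).get();
            }
            return result;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Here `sizeByKey[k][m]` is the amount of data for key `k` produced at map task `m`, and `distance[m][r]` is the assumed distance from map task `m` to reduce task `r`. Once aggregators are shared among keys the subproblems are no longer independent, which is why the simplifying assumptions above matter.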

 

Advantages

  • To tackle the problem incurred by the traffic-oblivious partition scheme, both task locations and the data size associated with each key are considered.
  • By assigning keys with larger data size to reduce tasks closer to map tasks, network traffic can be significantly reduced.
  • To further reduce network traffic within a MapReduce job, data with the same keys are aggregated before being sent to remote reduce tasks.
  • Network traffic cost is reduced in both the offline and online cases.
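
The benefit of aggregating before sending to a remote reduce task can be shown with a small cost comparison. The cost model below (data size multiplied by distance) and every name in it are assumptions made for illustration, and the numbers in the usage note are made up rather than measured.

```java
// Illustrative sketch only: assumed cost model, hypothetical names.
class AggregationTrafficSketch {

    // Traffic when every map task sends its data for one key
    // straight to the reduce task.
    static long directCost(long[] sizes, int[] distToReducer) {
        long total = 0;
        for (int m = 0; m < sizes.length; m++) {
            total += sizes[m] * distToReducer[m];
        }
        return total;
    }

    // Traffic when map tasks first send to a nearby aggregator, which
    // then forwards one merged result to the reduce task.
    static long aggregatedCost(long[] sizes, int[] distToAggregator,
                               long mergedSize, int aggregatorToReducer) {
        long total = 0;
        for (int m = 0; m < sizes.length; m++) {
            total += sizes[m] * distToAggregator[m];
        }
        return total + mergedSize * aggregatorToReducer;
    }
}
```

With two map tasks holding 40 units each at distance 4 from the reduce task, the direct cost is 320. If both are at distance 1 from an aggregator that merges the data down to 50 units and sits at distance 3 from the reduce task, the cost is 80 + 150 = 230. Aggregation only pays off when the merged size is sufficiently smaller than the sum of the inputs.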

SOFTWARE SPECIFICATION

Programming Language   : Java (JDK 1.5 or higher)

Database                  : MySQL 5.0
