Self-Adjusting Slot Configurations for Homogeneous and Heterogeneous Hadoop Clusters
The MapReduce framework and its open-source implementation Hadoop have become the de facto platform for scalable analysis of large data sets in recent years. One of the primary concerns in Hadoop is how to minimize the completion length (i.e., makespan) of a set of MapReduce jobs. Current Hadoop only allows a static slot configuration, i.e., fixed numbers of map slots and reduce slots throughout the lifetime of a cluster. However, we found that such a static configuration may lead to low system resource utilization as well as a long completion length. Motivated by this, we propose simple yet effective schemes which use the slot ratio between map and reduce tasks as a tunable knob for reducing the makespan of a given job set. By leveraging the workload information of recently completed jobs, our schemes dynamically allocate resources (or slots) to map and reduce tasks. The experimental results demonstrate the effectiveness and robustness of our schemes under both simple workloads and more complex mixed workloads.
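The slot-ratio knob described above can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's implementation: it splits a fixed slot pool between map and reduce tasks in proportion to the estimated remaining work of each phase, where the estimates would come from statistics of recently completed jobs. All names and the proportional-split rule are assumptions.

```python
def assign_slots(total_slots, est_map_work, est_reduce_work):
    """Split a node's slot pool between map and reduce tasks in
    proportion to the estimated remaining work of each phase.
    (Illustrative sketch; estimates would be derived from the
    statistics of recently completed jobs.)"""
    ratio = est_map_work / (est_map_work + est_reduce_work)
    map_slots = max(1, int(total_slots * ratio))   # floor, but keep at least one map slot
    map_slots = min(map_slots, total_slots - 1)    # leave at least one reduce slot
    return map_slots, total_slots - map_slots

# a map-heavy phase gets more map slots; as reduce work starts to
# dominate, the same knob shifts slots the other way:
print(assign_slots(10, 600, 400))  # (6, 4)
print(assign_slots(10, 100, 900))  # (1, 9)
```

The point of the tunable ratio is that neither extreme is fixed for the cluster's lifetime, unlike the static configuration the abstract criticizes.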
A Hadoop system provides execution and multiplexing of many tasks in a common datacenter. There is a rising demand for sharing Hadoop clusters amongst various users, which leads to increasing system heterogeneity. However, heterogeneity is a neglected issue in most Hadoop schedulers.
Various types of applications submitted by different users require the consideration of new aspects in designing a scheduling system for Hadoop. One of the most important aspects which should be considered is heterogeneity in the system.
Hadoop has been developed as a solution for performing large-scale data-parallel applications in Cloud computing. A Hadoop system can be described based on three factors: cluster, workload, and user. Each factor is either heterogeneous or homogeneous, which reflects the heterogeneity level of the Hadoop system. This paper studies the effect of heterogeneity in each of these factors on the performance of Hadoop schedulers. Three schedulers which consider different levels of Hadoop heterogeneity are used for the analysis.
Hadoop is a data-intensive cluster computing system, in which incoming jobs are defined based on the MapReduce programming model. MapReduce is a popular paradigm for performing computations on BigData in Cloud computing systems.
It is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. Under such models of distributed computing, many users can share the same cluster for different purposes. Situations like these can lead to scenarios where different kinds of workloads need to run on the same data center. For example, these clusters could be used for mining data from logs, which mostly depends on CPU capability.
The computer industry is being challenged to develop methods and techniques for affordable data processing on large datasets at optimum response times. The technical challenges in dealing with the increasing demand to handle vast quantities of data are daunting and on the rise. One of the recent processing models, offering a more efficient and intuitive solution for rapidly processing large amounts of data in parallel, is called MapReduce. It is a framework defining a template approach to programming for performing large-scale data computation on clusters of machines in a cloud computing environment. MapReduce provides automatic parallelization and distribution of computation across several processors. It hides the complexity of writing parallel and distributed programming code.
The need to process large amounts of data has grown in the areas of engineering, science, commerce, and economics. The ability to process huge data from multiple data sources remains a critical challenge. Many organizations face difficulties when dealing with large amounts of data: they are unable to manage, manipulate, process, share, and retrieve it with traditional software tools, which are costly and time-consuming for such processing. The term large-scale processing is focused on how to handle applications with massive data. MapReduce has been introduced by Google as a programming framework to analyse massive amounts of data. It is used for distributed data processing on large datasets across a cluster of machines. Since the input data is too large, the computation needs to be distributed across thousands of machines within a cluster in order to finish each part of the computation in a reasonable amount of time. This distributed design makes it easy to parallelize computations, and uses re-execution as the main technique for fault tolerance.
- The choice of system sizes in this system is only for ease of presentation, the same issues arise in larger systems.
- For such workloads, as the FIFO algorithm does not take job sizes into account, it has the problem that small jobs can get stuck behind large ones.
- Consequently, the average completion time will be increased. The FIFO and Fair Sharing algorithms both have the problem of resource and job mismatch, as they do not consider heterogeneity in scheduling.
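A toy calculation (not from the paper) makes the FIFO problem in the notes above concrete: on a single resource, one large job that arrives first inflates the completion time of every small job queued behind it.

```python
def avg_completion(job_sizes):
    """Average completion time when jobs run back-to-back,
    in the given order, on a single resource."""
    finish, total = 0, 0
    for size in job_sizes:
        finish += size      # this job finishes when its predecessor did, plus its own size
        total += finish
    return total / len(job_sizes)

jobs = [100, 1, 1, 1, 1]               # one large job arrives before four small ones
print(avg_completion(jobs))            # FIFO order: 102.0
print(avg_completion(sorted(jobs)))    # size-aware order: 22.8
```

Under FIFO every small job waits for the large one, so the average completion time is 102.0; a size-aware ordering of the same jobs brings it down to 22.8.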
Proposed system
The Hadoop system distributes tasks among the resources to reduce a job's completion time. However, Hadoop does not consider communication costs. In a large cluster with heterogeneous resources, maximizing a task's distribution may result in overwhelmingly large communication overhead. As a result, a job's completion time will be increased. COSHH considers the heterogeneity and distribution of resources in the task assignment.
To increase locality, the scheduler should increase the probability that tasks are assigned to resources which also store their input data. COSHH makes a scheduling decision based on the suggested set of job classes for each resource. Therefore, the required data of the suggested classes of a resource can be replicated on that resource. This can lead to increased locality, in particular in large Hadoop clusters, where locality is more critical.
The results show that COSHH has significantly better performance in reducing the average completion time and satisfying the required minimum shares. Moreover, its performance on the locality and fairness metrics is very competitive with the other two schedulers. Furthermore, we demonstrate the scalability of COSHH with respect to the number of jobs and resources in the Hadoop system. The sensitivity of the proposed algorithm to errors in the estimated job execution times is also examined.
Queuing process. The two main approaches used in the queuing process are classification-based and optimization-based approaches. At a high level, when a new job arrives, the classification approach determines the job's class and stores the job in the corresponding queue. If the job does not fit any of the current classes, the list of classes is updated to add a class for the incoming job. The optimization approach is used to find an appropriate matching of job classes and resources. An optimization problem is defined based on the properties of the job classes and the features of the resources. The result of the queuing process, which is sent to the routing process, contains the list of job classes and the suggested set of classes for each resource.
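The classification step of the queuing process can be sketched as follows. The class-matching criterion used here (equal priority and mean execution time within a relative tolerance) and all names are illustrative assumptions, not COSHH's actual feature set.

```python
from dataclasses import dataclass, field

@dataclass
class JobClass:
    mean_exec_time: float   # representative execution time of the class
    priority: int
    queue: list = field(default_factory=list)

def enqueue(job, classes, tol=0.2):
    """Place `job` in an existing class whose priority matches and whose
    mean execution time is within `tol` (relative); otherwise create a
    new class for it. The criterion is an illustrative assumption."""
    for c in classes:
        if (c.priority == job["priority"]
                and abs(c.mean_exec_time - job["exec_time"]) <= tol * c.mean_exec_time):
            c.queue.append(job)
            return c
    c = JobClass(job["exec_time"], job["priority"], [job])
    classes.append(c)
    return c
```

A job close to an existing class joins that class's queue; an outlier creates a new class, which the optimization step then matches against resources.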
Routing process. When the scheduler receives a heartbeat message from a free resource, say Rj, it triggers the routing process. The routing process receives the sets of suggested classes SCR and SC'R from the queuing process, and uses them to select a job for the current free resource. This process selects a job for each free slot in the resource Rj, and sends the selected job to the task scheduling process. The task scheduling process chooses a task of the selected job and assigns the task to its corresponding slot. It should be noted that the scheduler does not limit each job to just one resource. When a job is selected, the task scheduling process assigns a number of appropriate tasks of this job to the available slots of the current free resource. If the number of available slots is fewer than the number of uncompleted tasks of the selected job, the job remains in the waiting queue. Therefore, at the next heartbeat message from a free resource, this job is again considered in making the scheduling decision; however, tasks already assigned are no longer considered. When all tasks of a job have been assigned, the job is removed from the waiting queue. The routing algorithm has two stages for selecting jobs for the available slots of the current free resource. In the first stage, jobs of classes in SCR are considered, and jobs are selected in the order of their minimum share satisfaction: a user with the largest distance to achieving her minimum share gets a resource share sooner. In the second stage, jobs of classes in SC'R are considered, and jobs are selected in the order defined by the current shares and priorities of their users.
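The two-stage selection above can be sketched as below. The job fields and ordering keys are assumptions; the sketch only captures the stated order (largest minimum-share deficit first for SCR classes, then current share and user priority for SC'R classes) and the rule that partially assigned jobs stay in the waiting queue.

```python
def route(free_slots, sc_r, sc_prime_r, waiting):
    """Select (job_id, task_count) pairs for a free resource's slots.
    Stage 1: jobs whose class is in SC_R, largest min-share deficit first.
    Stage 2: jobs whose class is in SC'_R, ordered by current share,
    then by higher user priority. (Field names are assumptions.)"""
    stage1 = sorted((j for j in waiting if j["cls"] in sc_r),
                    key=lambda j: j["min_share"] - j["cur_share"], reverse=True)
    stage2 = sorted((j for j in waiting if j["cls"] in sc_prime_r and j not in stage1),
                    key=lambda j: (j["cur_share"], -j["priority"]))
    assigned = []
    for job in stage1 + stage2:
        if free_slots == 0:
            break
        n = min(free_slots, job["pending"])   # a partially assigned job stays queued
        assigned.append((job["id"], n))
        free_slots -= n
    return assigned
```

When the free slots run out mid-job, the remaining tasks of that job wait for the next heartbeat, matching the behavior described above.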
Heterogeneity is for the most part neglected in designing Hadoop scheduling systems. In order to keep the algorithm simple, minimal system information is used in making scheduling decisions, which in some cases can result in poor performance. Growing interest in applying the MapReduce programming model in various applications gives rise to greater heterogeneity, whose impact on performance must therefore be considered. It has been shown that it is possible to estimate system parameters in a Hadoop system. Using this system information, we designed a scheduling algorithm which classifies jobs based on their requirements and finds an appropriate matching of resources and jobs in the system.
- The floor function is used to ensure that the slot assignments are integer values.
- A data generator in TPC-H can be used to create a database of a customized size.
- The mixed workloads introduced in the previous section and the TPC-H benchmarks are used to validate the effectiveness and robustness of our scheme.
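The floor-based integer slot assignment mentioned in the first note above can be illustrated with a minimal sketch; the function name and signature are assumptions.

```python
import math

def integer_map_slots(total_slots, map_ratio):
    """Floor the fractional map-slot count so that slot assignments
    stay integral; the remainder becomes reduce slots."""
    map_slots = math.floor(total_slots * map_ratio)
    return map_slots, total_slots - map_slots

print(integer_map_slots(7, 0.6))  # (4, 3): floor(4.2) = 4
```

Flooring rather than rounding guarantees the map-slot count never exceeds its fractional target, with the leftover slot going to the reduce side.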