FastRAQ: A Fast Approach to Range-Aggregate Queries in Big Data Environments

Range-aggregate queries apply an aggregate function to all tuples within a given query range. Existing approaches to range-aggregate queries cannot quickly provide accurate results in big data environments. This work proposes FastRAQ, a fast approach to range-aggregate queries in big data environments. FastRAQ first divides big data into independent partitions with a balanced partitioning algorithm, and then generates a local estimation sketch for each partition. When a range-aggregate query request arrives, FastRAQ obtains the result directly by summarizing local estimates from all partitions.

EXISTING SYSTEM

In the existing system, the Prefix-Sum Cube (PC) approach was introduced to solve the numeric data cube aggregation problem in OLAP. The essential idea of PC is to pre-compute prefix sums of cells in the data cube, which can then be used to answer range-aggregate queries at run time. The updates to the prefix sums, however, are proportional to the size of the data cube. A range query applies an aggregation operation (e.g., SUM) over all selected cells of an OLAP data cube, where the selection is specified by providing ranges of values for numeric dimensions.

Range-sum queries on data cubes are a powerful analysis tool. Many application domains require that data cubes be updated often and that the information provided by analysis tools be current or "near current". Existing techniques for range-sum queries on data cubes, however, can incur update costs on the order of the size of the data cube. Since the size of a data cube is exponential in the number of its dimensions, rebuilding the entire data cube is very costly and not realistic. To cope with this dynamic data cube problem, a new approach was introduced that achieves constant time per range-sum query while constraining the cost of each update to a function of d and n, where d is the number of dimensions of the data cube and n is the number of distinct values of the domain in each dimension. A newer algorithm further reduces the time required for each range-sum query and each update, improving the update time by a significant factor over the earlier solution. Like all existing techniques, this approach to answering range-sum queries is also based on precomputed auxiliary information (prefix sums) that is used to answer ad hoc queries at run time. Under both the product model and a newly introduced cost model, the total cost for updates and range queries of this algorithm is the smallest among all known algorithms, so it reduces the overall time complexity of range-sum queries significantly.

Fast algorithms have been presented for range queries with two types of aggregation operations: SUM and MAX. These two operations cover the techniques required for the most popular aggregation operations, such as those supported by SQL. For range-sum queries, the essential idea is to precompute some auxiliary information (prefix sums) that is used to answer ad hoc queries at run time. By maintaining auxiliary information of the same size as the data cube, all range queries for a given cube can be answered in constant time, irrespective of the size of the sub-cube circumscribed by a query. Alternatively, one can keep auxiliary information that is 1/b^d of the size of the d-dimensional data cube, trading reduced storage for somewhat higher query cost.
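
To make the prefix-sum idea concrete, the following is a minimal Python sketch of a two-dimensional prefix-sum cube, assuming a small illustrative cube (not data from any paper): prefix sums are precomputed once, and any range-SUM query is then answered in constant time by inclusion-exclusion.

def build_prefix_sums(cube):
    """P[i][j] = sum of cube[0..i][0..j], computed once up front."""
    rows, cols = len(cube), len(cube[0])
    P = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            P[i][j] = (cube[i][j]
                       + (P[i - 1][j] if i > 0 else 0)
                       + (P[i][j - 1] if j > 0 else 0)
                       - (P[i - 1][j - 1] if i > 0 and j > 0 else 0))
    return P

def range_sum(P, r1, c1, r2, c2):
    """SUM over cube[r1..r2][c1..c2] in O(1), independent of range size."""
    total = P[r2][c2]
    if r1 > 0:
        total -= P[r1 - 1][c2]          # subtract rows above the range
    if c1 > 0:
        total -= P[r2][c1 - 1]          # subtract columns left of the range
    if r1 > 0 and c1 > 0:
        total += P[r1 - 1][c1 - 1]      # add back the doubly subtracted corner
    return total

cube = [[3, 1, 4], [1, 5, 9], [2, 6, 5]]
P = build_prefix_sums(cube)
print(range_sum(P, 1, 1, 2, 2))  # 5 + 9 + 6 + 5 = 25

Note that updating a single cell of the cube forces updates to all prefix sums below and to the right of it, which is exactly the update cost problem the dynamic data cube work addresses.
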
Scalable execution of expensive continuous queries over massive data streams requires input streams to be split into parallel sub-streams. The query operators are continuously executed in parallel over these sub-streams. Stream splitting involves both partitioning and replication of incoming tuples, depending on how the continuous query is parallelized. A stream-splitting operator enables such customized stream splitting; however, it is critical that the stream splitting itself keeps up with input streams of high volume, which is a problem when the stream-splitting predicates carry some cost. Therefore, to enable customized splitting of high-volume streams, a parallelized stream-splitting operator called parasplit was introduced. Its performance was investigated using a cost model and experimentally, and based on these results a heuristic was devised to automatically parallelize the execution of parasplit. The maximum stream rate of parasplit is network bound, and the parallelization is energy efficient.

Although MapReduce was originally designed as a batch-oriented system, it is often used for interactive data analysis: a user submits a job to extract information from a data set, and then waits to view the results before proceeding with the next step in the data analysis process. This trend has accelerated with the development of high-level query languages that are executed as MapReduce jobs. Traditional MapReduce implementations provide a poor interface for interactive data analysis, because they do not emit any output until the job has been executed to completion. In many cases, however, an interactive user would prefer a "quick and dirty" approximation over a correct answer that takes much longer to compute. In the database literature, online aggregation has been proposed to address this problem, but the batch-oriented nature of traditional MapReduce implementations makes these techniques difficult to apply. A pipelined Hadoop implementation has been extended to support online aggregation within a single job and between multiple jobs; online aggregation has a minimal impact on job completion times and can often yield an accurate approximate answer long before the job has finished executing.

The Hadoop JobTracker had to be retrofitted to support pipelining between jobs. In regular Hadoop, jobs are submitted one at a time; a job that consumes the output of one or more other jobs cannot be submitted until the producer jobs have completed. To address this, the Hadoop job submission interface was modified to accept a list of jobs, where each job in the list depends on the job before it. The client interface traverses this list, annotating each job with the identifier of the job that it depends on.
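
The job-list submission described above can be illustrated with a short sketch. The Job class and submit() function below are hypothetical stand-ins, not the actual Hadoop API; the sketch only shows how a client interface can traverse an ordered job list and annotate each job with the identifier of its predecessor.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Job:
    job_id: str
    depends_on: Optional[str] = None   # filled in during traversal
    config: dict = field(default_factory=dict)

def submit(job):
    # Hypothetical submission hook; a real client would call the cluster here.
    print(f"submitted {job.job_id} (depends on {job.depends_on})")

def submit_pipeline(jobs):
    """Annotate each job with its predecessor's id, then submit in order."""
    previous = None
    for job in jobs:
        if previous is not None:
            job.depends_on = previous.job_id   # job consumes previous output
        submit(job)
        previous = job

submit_pipeline([Job("extract"), Job("aggregate"), Job("report")])
# submitted extract (depends on None)
# submitted aggregate (depends on extract)
# submitted report (depends on aggregate)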

Disadvantages:

  • Users cannot obtain an approximate answer with satisfactory accuracy in the early stages of a query.
  • It cannot measure the quality of tuple distributions accurately.
  • It cannot support accurate multi-dimensional cardinality queries.

PROPOSED SYSTEM

The proposed system is a fast approach to range-aggregate queries in big data environments. FastRAQ first divides big data into independent partitions with a balanced partitioning algorithm, and then generates a local estimation sketch for each partition. When a range-aggregate query request arrives, FastRAQ obtains the result directly by summarizing local estimates from all partitions. It can measure the quality of tuple distributions accurately and can support accurate multi-dimensional cardinality queries. FastRAQ is implemented on the Linux platform, and its performance has been evaluated on billions of data records. According to the authors, FastRAQ provides a good starting point for real-time big data analysis: it solves the 1:n format range-aggregate query problem, although the m:n format problem remains open.

Even though Hadoop is the most popular framework for distributed computing, its simple partitioning method does not preserve the correlation between data chunks. A partitioning framework is therefore needed for FastRAQ, in which partitioning balances data chunks across the respective partitions; these partitions hold the data in a way that increases processing speed. The partitioning algorithm separates and analyzes large data records by field and assigns records from large data tables to small data tables. The partitioning algorithm thus plays an important role in boosting query performance.

Because the data volume is huge, sampling is necessary for analysis of big data. Among the various sampling methods, the best known is stratified sampling, which divides the data into independent groups and draws a sample from each group to improve accuracy and reduce errors. This project implements the partitioning algorithm on the idea of stratified sampling relative to a maximum error value: it first divides the value space into different groups, and then subdivides each group into different portions according to the server space available for each partition. Although Hadoop provides good performance for big data on a given data set, its chunks are uncontrolled; in this project they are balanced with the help of the balanced partitioning algorithm described below. The Hadoop system by itself is designed to perform only searching and sorting of particular data within a massive amount of data.

The client fires a query, according to its need for data, against the database; this query is sent as a request message to the server. The data set, a file holding a number of records, is the main input to the system. This file is first processed with the help of the partitioning algorithm, which outputs chunks of data. These chunks are then processed to generate a histogram, which is scattered per partition so that it is readily available for processing the query. The balanced partitioning algorithm works with a stratified sampling model.
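
A minimal Python sketch of this stratified partitioning idea follows. The group boundaries, record format, and server count are illustrative assumptions rather than details from the paper: records are stratified by the value of the attribute of interest, and each stratum is spread over the available servers so that partition sizes stay balanced.

from collections import defaultdict

def stratify(records, boundaries):
    """Assign each record to a group (stratum) by its attribute value."""
    groups = defaultdict(list)
    for value, payload in records:
        gid = sum(1 for b in boundaries if value >= b)  # index of stratum
        groups[gid].append((value, payload))
    return groups

def subdivide(groups, num_servers):
    """Spread every group over the servers so record counts stay balanced."""
    partitions = [[] for _ in range(num_servers)]
    for gid in sorted(groups):
        for i, rec in enumerate(groups[gid]):
            partitions[i % num_servers].append(rec)  # round-robin within group
    return partitions

records = [(5, "r1"), (12, "r2"), (33, "r3"), (7, "r4"), (29, "r5"), (41, "r6")]
groups = stratify(records, boundaries=[10, 30])   # 3 strata: <10, 10-29, >=30
partitions = subdivide(groups, num_servers=2)
print([len(p) for p in partitions])               # [3, 3] -> balanced load

Because every stratum is spread over every server, each server holds a representative slice of every value range, which is what allows per-partition estimates to be summarized into a global answer.
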
It divides all data into different groups with regard to their attribute values of interest, and further separates each group into multiple partitions according to the current data distributions and the number of available servers. The algorithm can bound the sample errors in each partition, and can balance the number of records adaptively among servers when the data distribution and/or the number of servers changes. In each partition, the sample and the histogram are updated respectively by the attribute values of each incoming record. When a query request arrives, it is delivered to every partition, and a cardinality estimator (CE) for the queried range is first built from the histogram in each partition. To keep data balanced on each server, the partitioning algorithm subdivides each group into a number of partitions according to the current data distributions and sends each partition to one server.

The system uses the common K-Means clustering method to analyze the vector set and produce K clusters, assigning a unique Cluster ID to each cluster. A list of key-value pairs in the format <tag, Cluster ID> is constructed from the K clusters and sorted by tag in alphabetical order. The buckets in the histogram are built from the sorted pairs; the key idea is to merge pairs with the same Cluster ID into the same bucket. If some tag's occurrence frequency differs significantly from the others, its Cluster ID after K-Means clustering is different, and it is put into an independent bucket in the histogram (see the sketch below).

FastRAQ uses approximate answering approaches, such as sampling, histograms, and cardinality estimation, to improve the performance of range-aggregate queries. Relative error serves as the statistical tool for accuracy analysis: it is widely used in approximate answering systems, and the relative errors of combined estimate variables are easy to compute in a distributed environment such as FastRAQ's. On this basis, the estimated relative error and the confidence interval of the final range-aggregate query result are analyzed.
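
The bucket-construction step can be sketched in code. The following is a minimal sketch, not the actual FastRAQ implementation: the tag frequencies and the choice of K are illustrative assumptions, and a hand-rolled one-dimensional K-Means stands in for whatever clustering library the system uses. Tags are clustered by occurrence frequency, <tag, Cluster ID> pairs are sorted by tag, and adjacent pairs sharing a Cluster ID are merged into one bucket.

def kmeans_1d(values, k, iters=20):
    """Plain 1-D K-Means; returns a cluster id for every input value."""
    centers = sorted(values)[:: max(1, len(values) // k)][:k]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centers[c]))
                  for v in values]
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)  # recenter cluster
    return labels

freqs = {"apple": 5, "banana": 7, "cherry": 950, "date": 6, "elder": 8}
tags = sorted(freqs)                              # alphabetical order
labels = kmeans_1d([freqs[t] for t in tags], k=2)
pairs = list(zip(tags, labels))                   # <tag, Cluster ID> pairs

buckets, current = [], None
for tag, cid in pairs:                            # merge same-ID neighbours
    if current and current["cid"] == cid:
        current["tags"].append(tag)
    else:
        current = {"cid": cid, "tags": [tag]}
        buckets.append(current)
print(buckets)  # "cherry" (an outlier frequency) lands in its own bucket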

Advantages:

  • It acquires accurate estimations quickly for range-aggregate queries in big data environments.
  • It is used in data center networks to improve the manageability and availability of big data.
  • It achieves a better tradeoff between sampling errors and the number of groups.
  • It can measure the quality of tuple distributions accurately and can support accurate multi-dimensional cardinality queries.
  • It is widely used to process extremely large data sets on commodity hardware, for example at Facebook.
