Scalable Keyword Search on Large RDF Data

IEEE 2014 Data Mining Java Project

Keyword search is a useful tool for exploring large RDF datasets. Existing techniques rely either on constructing a distance matrix to prune the search space or on building summaries from the RDF graphs for query processing. These techniques have serious limitations when dealing with realistic, large RDF data with tens of millions of triples. Furthermore, the existing summarization techniques may lead to incorrect or incomplete results. To address these issues, an effective summarization algorithm is proposed to summarize the RDF data. Given a keyword query, the summaries lend significant pruning power to exploratory keyword search and result in much better efficiency than previous work. Unlike existing techniques, this search algorithm always returns correct results. In addition, the summaries can be updated incrementally and efficiently. Experiments on both benchmark and large real RDF datasets show that these techniques are scalable and efficient.

Keyword search on generic graphs
For keyword search on generic graphs, many techniques assume that graphs fit in memory, an assumption that breaks for big RDF graphs.
The existing approaches maintain a distance matrix for all vertex pairs, and clearly do not scale to graphs with millions of vertices.
Furthermore, these works do not consider how to handle updates. A typical approach used here for keyword search is backward search. When backward search is used to find a Steiner tree in the data graph, the underlying problem is NP-hard.
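To make the idea of backward search concrete, here is a minimal Java sketch. It assumes the graph fits in memory (the very assumption criticized above) and only looks for the first vertex that can reach at least one match of every keyword; it does not compute a minimal Steiner tree. The class name, method name, and the incoming-edge map are illustrative assumptions, not part of the original work.

```java
import java.util.*;

// Hypothetical sketch: names and data layout are illustrative assumptions.
class BackwardSearchSketch {

    // incoming.get(v) lists the vertices that have an edge pointing to v.
    // keywordVertices.get(i) is the set of vertices matching keyword i.
    // Returns the first vertex reached by the backward expansion of every
    // keyword, or null if no such vertex exists.
    static String findRoot(Map<String, List<String>> incoming,
                           List<Set<String>> keywordVertices) {
        int k = keywordVertices.size();
        if (k == 0) return null;
        // reachedBy.get(v) = indices of keywords whose expansion reached v
        Map<String, Set<Integer>> reachedBy = new HashMap<>();
        Deque<Map.Entry<String, Integer>> queue = new ArrayDeque<>();
        for (int i = 0; i < k; i++) {
            for (String v : keywordVertices.get(i)) {
                Set<Integer> seen = reachedBy.computeIfAbsent(v, x -> new HashSet<>());
                if (seen.add(i)) {
                    if (seen.size() == k) return v;
                    queue.add(Map.entry(v, i));
                }
            }
        }
        // Breadth-first expansion against the edge direction.
        while (!queue.isEmpty()) {
            Map.Entry<String, Integer> cur = queue.poll();
            for (String pred : incoming.getOrDefault(cur.getKey(), List.of())) {
                Set<Integer> seen = reachedBy.computeIfAbsent(pred, x -> new HashSet<>());
                if (seen.add(cur.getValue())) {
                    if (seen.size() == k) return pred;
                    queue.add(Map.entry(pred, cur.getValue()));
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Edges a -> b and a -> c, stored as incoming lists.
        Map<String, List<String>> incoming = new HashMap<>();
        incoming.put("b", List.of("a"));
        incoming.put("c", List.of("a"));
        System.out.println(findRoot(incoming, List.of(Set.of("b"), Set.of("c"))));
    }
}
```

The per-vertex bookkeeping grows with both the graph and the number of keywords, which illustrates why approaches of this kind become costly on graphs with millions of vertices.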
Large graph data
The graph data are first partitioned into small subgraphs by heuristics. In this version of the problem, the authors assume that edges across the boundaries of the partitions are weighted. Each partition is treated as a supernode, and edges crossing partitions are superedges. The supernodes and superedges form a new graph, which is considered a summary of the underlying graph data. By recursively partitioning and building summaries, a large graph can eventually be reduced to a summary small enough to fit into memory for query processing.
During query evaluation, the supernodes containing the queried keywords are unfolded, and the corresponding portions of the graph are fetched from external memory for query processing. This approach cannot be applied to RDF.
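The supernode and superedge construction described above can be sketched in Java as follows. This is a minimal, single-level illustration: the partition assignment is taken as given (the partitioning heuristics are not reproduced), and using the number of crossing edges as the superedge weight is an assumption made here for illustration. All names are hypothetical.

```java
import java.util.*;

// Hypothetical sketch: one level of summarization over a given partitioning.
class SupergraphSketch {

    // edges: each entry is {fromVertex, toVertex}.
    // partitionOf: partition id of every vertex (each vertex must be assigned).
    // Returns superedges "P<from>->P<to>" mapped to the number of data edges
    // that cross between those two partitions (assumed weight).
    static Map<String, Integer> superedges(List<String[]> edges,
                                           Map<String, Integer> partitionOf) {
        Map<String, Integer> weights = new TreeMap<>();
        for (String[] e : edges) {
            int from = partitionOf.get(e[0]);
            int to = partitionOf.get(e[1]);
            if (from != to) {
                // Edge crosses a partition boundary: it becomes part of a superedge.
                weights.merge("P" + from + "->P" + to, 1, Integer::sum);
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        List<String[]> edges = new ArrayList<>();
        edges.add(new String[] {"a", "b"});
        edges.add(new String[] {"b", "c"});
        edges.add(new String[] {"c", "d"});
        edges.add(new String[] {"a", "d"});
        Map<String, Integer> partitionOf = Map.of("a", 0, "b", 0, "c", 1, "d", 1);
        System.out.println(superedges(edges, partitionOf));
    }
}
```

Applying the same step again to the resulting supergraph would give the recursive summarization that the paragraph above describes.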
Keyword search for RDF data
Search is first applied to the schema or summary of the data to identify promising relations that could contain all the queried keywords. These relations are then translated into search patterns and executed against the RDF data to retrieve the actual subgraphs.
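As an illustration of translating schema-level relations into a search pattern, the following Java sketch builds a SPARQL query string from a list of typed relations. The Relation class, the variable-naming scheme, and the example IRIs are assumptions made for this sketch; keyword matching on literals is omitted, and the query is only constructed, not executed against a store.

```java
import java.util.*;

// Hypothetical sketch: turns schema-level relations into a SPARQL pattern.
class PatternToSparqlSketch {

    // One relation: a subject of some type linked by a predicate to an object
    // of some type. Shared variable names join relations on common vertices.
    static final class Relation {
        final String subjectVar, subjectType, predicate, objectVar, objectType;
        Relation(String subjectVar, String subjectType, String predicate,
                 String objectVar, String objectType) {
            this.subjectVar = subjectVar;
            this.subjectType = subjectType;
            this.predicate = predicate;
            this.objectVar = objectVar;
            this.objectType = objectType;
        }
    }

    static String toSparql(List<Relation> relations) {
        StringBuilder sb = new StringBuilder("SELECT * WHERE {\n");
        for (Relation r : relations) {
            // "a" is SPARQL shorthand for rdf:type.
            sb.append("  ?").append(r.subjectVar)
              .append(" a <").append(r.subjectType).append("> .\n");
            sb.append("  ?").append(r.subjectVar)
              .append(" <").append(r.predicate).append("> ?")
              .append(r.objectVar).append(" .\n");
            sb.append("  ?").append(r.objectVar)
              .append(" a <").append(r.objectType).append("> .\n");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        System.out.println(toSparql(List.of(new Relation(
            "p", "http://example.org/Person",
            "http://example.org/worksAt",
            "c", "http://example.org/Company"))));
    }
}
```

Because the pattern is derived from the schema alone, it may match nothing in the data, or the schema step may have discarded relations that do match.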

Incorrect answers: the keyword search may return answers that do not correspond to real subgraphs, or may miss valid matches, in the underlying RDF data.
Inability to scale to typical RDF datasets with tens of millions of triples.

To design a scalable and exact solution that handles realistic RDF datasets with tens of millions of triples.
To use the SPARQL query language to process the RDF data efficiently.
To retrieve every partition from the data efficiently by combining SPARQL queries with any RDF store, without explicitly storing the partitions.
This approach starts by splitting the RDF graph into multiple smaller partitions. It then defines a minimal set of common type-based structures that summarizes the partitions. Intuitively, the summary keeps track of the distinct structures found across all the partitions.
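The intuition of keeping only the distinct type-based structures can be sketched in Java as follows. This is a deliberately simplified illustration: partitions are given as lists of triples, each vertex has a single known type, and two partitions count as the same structure only when their type-level triples are identical. The original work's notion of a minimal set of structures may be more refined than this equality test, and all names here are hypothetical.

```java
import java.util.*;

// Hypothetical sketch: summarize partitions by their distinct type-level structures.
class TypeSummarySketch {

    static final class Triple {
        final String subject, predicate, object;
        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    // Replaces every vertex in a partition by its type.
    static Set<String> structureOf(List<Triple> partition, Map<String, String> typeOf) {
        Set<String> structure = new TreeSet<>();
        for (Triple t : partition) {
            structure.add(typeOf.get(t.subject) + " -" + t.predicate + "-> "
                    + typeOf.get(t.object));
        }
        return structure;
    }

    // Keeps one copy of each distinct structure across all partitions.
    static Set<Set<String>> summarize(List<List<Triple>> partitions,
                                      Map<String, String> typeOf) {
        Set<Set<String>> summary = new LinkedHashSet<>();
        for (List<Triple> partition : partitions) {
            summary.add(structureOf(partition, typeOf));
        }
        return summary;
    }

    public static void main(String[] args) {
        Map<String, String> typeOf = Map.of(
            "alice", "Person", "bob", "Person",
            "acme", "Company", "initech", "Company");
        List<List<Triple>> partitions = List.of(
            List.of(new Triple("alice", "worksAt", "acme")),
            List.of(new Triple("bob", "worksAt", "initech")));
        System.out.println(summarize(partitions, typeOf));
    }
}
```

In the usage example, both partitions collapse to the same Person-worksAt-Company structure, so the summary holds a single entry however many such partitions the data contains.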

Better efficiency
Overcome scalability issues
Correct results
