Technology Used: Java / J2EE
Knowledge and Data Engineering, 2011
The Semantic Web is an emerging area that aims to augment human reasoning, and various technologies are being developed to support it. These technologies have been standardized by the W3C. One such standard is the Resource Description Framework (RDF). With the explosion of Semantic Web technologies, large RDF graphs are commonplace. Current frameworks do not scale to large RDF graphs and, as a result, do not address these challenges. In this paper, we describe a framework that we built using Hadoop to store and retrieve large numbers of RDF triples by exploiting the cloud computing paradigm. We describe a scheme to store RDF data in the Hadoop Distributed File System (HDFS). More than one Hadoop job may be needed to answer a query because a triple pattern in a query cannot take part in more than one join in a single Hadoop job. To determine the jobs, we present a greedy algorithm that generates a query plan with bounded worst-case cost for answering a SPARQL query. We use Hadoop’s MapReduce framework to answer the queries. Our results show that we can store large RDF graphs in Hadoop clusters built with cheap commodity-class hardware. Furthermore, we show that our framework is scalable and efficient and can handle large amounts of RDF data, unlike traditional approaches.
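The rule that a triple pattern can take part in at most one join per Hadoop job can be illustrated with a small sketch. The Java code below is not the paper's planner. It is a simplified greedy grouping written under two assumptions: each triple pattern is represented only by the set of its variable names, and within a job the variable shared by the most still-unused patterns is joined first. The class name GreedyJobPlanner and the method plan are hypothetical, and no Hadoop API is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: a simplified greedy job planner, not the algorithm from the paper.
final class GreedyJobPlanner {

    /**
     * Assumption: each triple pattern is given as the set of its variable names.
     * Returns one list per job, holding the join variables handled in that job.
     */
    static List<List<String>> plan(List<Set<String>> patterns) {
        List<Set<String>> current = new ArrayList<>();
        for (Set<String> p : patterns) {
            current.add(new TreeSet<>(p));
        }
        List<List<String>> jobs = new ArrayList<>();

        while (current.size() > 1) {
            boolean[] used = new boolean[current.size()];
            List<String> joinVars = new ArrayList<>();
            List<Set<String>> next = new ArrayList<>();

            while (true) {
                // Count how many unused patterns mention each variable.
                Map<String, Integer> counts = new TreeMap<>();
                for (int i = 0; i < current.size(); i++) {
                    if (used[i]) continue;
                    for (String v : current.get(i)) {
                        counts.merge(v, 1, Integer::sum);
                    }
                }
                // Greedy choice: the variable shared by the most unused patterns
                // (ties broken alphabetically by the TreeMap iteration order).
                String best = null;
                int bestCount = 1;
                for (Map.Entry<String, Integer> e : counts.entrySet()) {
                    if (e.getValue() > bestCount) {
                        best = e.getKey();
                        bestCount = e.getValue();
                    }
                }
                if (best == null) break; // no join left that fits in this job

                // Join all unused patterns on 'best'; each is consumed for this job.
                Set<String> merged = new TreeSet<>();
                for (int i = 0; i < current.size(); i++) {
                    if (!used[i] && current.get(i).contains(best)) {
                        used[i] = true;
                        merged.addAll(current.get(i));
                    }
                }
                next.add(merged);
                joinVars.add(best);
            }

            if (joinVars.isEmpty()) break; // remaining patterns share no variable

            // Patterns not joined in this job are carried over unchanged.
            for (int i = 0; i < current.size(); i++) {
                if (!used[i]) next.add(current.get(i));
            }
            jobs.add(joinVars);
            current = next;
        }
        return jobs;
    }
}
```

For example, the patterns {x, y}, {y, z}, {z, w} need two jobs in this sketch: the middle pattern is consumed by the join on y in the first job, so the join on z has to wait for the second. Patterns that share a single variable, such as {x, y}, {x, z}, {x, w}, are joined in one job.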