Research Abstracts - 2007

Scalable Semantic Web Data Management

Daniel J. Abadi, Kate Hollenbach, Samuel R. Madden & Adam Marcus

Project Overview

Efficiently managing RDF/Semantic Web data is an important factor in realizing the Semantic Web vision. Performance and scalability become increasingly important as Semantic Web technology is applied to real-world applications. In this project, we examine why current data management solutions for RDF data scale poorly and explore the fundamental scalability limitations of these approaches. Our initial results confirm that storing Semantic Web content as RDF in a naive triple format (a three-column schema in a relational database table) scales very poorly. We find that property table schemas [2] should be used instead. We compare several implementations on column-oriented and traditional row-oriented databases against current Semantic Web content stores.

Problems Addressed

RDF (Resource Description Framework) [5] is the data model behind the Semantic Web vision, whose goal is to enable integration and sharing of data across different applications and organizations. RDF describes a particular resource (e.g., a Website, a Web page, or a real-world entity) using a set of RDF statements of the form <subject, property, object>. The subject is the resource, the property is the characteristic being described, and the object is the value for that characteristic: either a literal or another resource. For example, an RDF triple might look like: <http://www.example.org/index.html, http://www.example.org/ontology/dateCreated, "12/11/06">.
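As a minimal sketch, the triple above can be modelled as a three-field record. The class name `Triple`, the field names, and the `object_is_literal` flag are illustrative assumptions, not part of RDF or of this project's code.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement. All names here are hypothetical, for illustration."""
    subject: str                    # the resource being described
    prop: str                       # the characteristic being described
    object: str                     # the value: a literal or another resource
    object_is_literal: bool = True  # False when the object is itself a resource


# The example statement from the text: a page and its creation date.
example = Triple(
    subject="http://www.example.org/index.html",
    prop="http://www.example.org/ontology/dateCreated",
    object="12/11/06",
)

print(example.subject, example.prop, example.object)
```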

The naive way to store a set of RDF statements in a relational database is with a single table containing columns for subject, property, and object. While simple, this schema quickly hits scalability limitations: common queries, such as those that match several property-object pairs for the same subject, require one self-join on subject for each additional pair.
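The self-join can be illustrated with a small sketch using Python's built-in sqlite3 module. The table name `triples`, the `ex:` identifiers, and the sample data are assumptions made for this example; they do not come from the project's data set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Naive triple store: a single three-column table.
conn.execute("CREATE TABLE triples (subject TEXT, property TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:doc1", "ex:type", "ex:Page"),
        ("ex:doc1", "ex:dateCreated", "12/11/06"),
        ("ex:doc2", "ex:type", "ex:Page"),
        ("ex:doc2", "ex:dateCreated", "01/05/07"),
        ("ex:doc3", "ex:type", "ex:Image"),
        ("ex:doc3", "ex:dateCreated", "12/11/06"),
    ],
)

# "Pages created on 12/11/06" needs two property-object pairs for one subject,
# so the table is joined with itself on subject. A third pair would add a
# third copy of the table to the join.
SELF_JOIN_QUERY = """
    SELECT a.subject
    FROM triples AS a
    JOIN triples AS b ON a.subject = b.subject
    WHERE a.property = 'ex:type' AND a.object = 'ex:Page'
      AND b.property = 'ex:dateCreated' AND b.object = '12/11/06'
"""
matches = [row[0] for row in conn.execute(SELF_JOIN_QUERY)]
print(matches)  # ['ex:doc1']
```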

Initial Progress

One common way to address the self-join problem is to create separate tables for subjects that tend to have common properties defined. The rows in the table are subjects, columns are properties, and values are objects (i.e., a row in this table is a set of object values for some predefined properties of a particular subject). NULLs are used if a subject does not have a property defined. These tables are called property tables [2] or subject-property matrix materialized join views [1].
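A property table can be sketched in the same way. In the example below, the table name `property_table`, the two chosen properties, and the sample triples are hypothetical, and the sketch assumes each subject has at most one value per property. Multi-valued properties would need additional handling that is not shown.

```python
import sqlite3

triples = [
    ("ex:doc1", "ex:type", "ex:Page"),
    ("ex:doc1", "ex:dateCreated", "12/11/06"),
    ("ex:doc2", "ex:type", "ex:Page"),          # no dateCreated for doc2
    ("ex:doc3", "ex:type", "ex:Image"),
    ("ex:doc3", "ex:dateCreated", "12/11/06"),
]

# Predefined properties that become columns of the property table.
COLUMNS = {"ex:type": "type", "ex:dateCreated": "date_created"}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE property_table "
    "(subject TEXT PRIMARY KEY, type TEXT, date_created TEXT)"
)

# Pivot the triples: one row per subject, None (SQL NULL) where undefined.
rows = {}
for subject, prop, obj in triples:
    row = rows.setdefault(subject, {"type": None, "date_created": None})
    row[COLUMNS[prop]] = obj

conn.executemany(
    "INSERT INTO property_table VALUES (?, ?, ?)",
    [(s, r["type"], r["date_created"]) for s, r in rows.items()],
)

# The two-condition query is now a single-table scan with no self-join.
pages_on_date = [r[0] for r in conn.execute(
    "SELECT subject FROM property_table "
    "WHERE type = 'ex:Page' AND date_created = '12/11/06'"
)]
# Subjects lacking the property show up as NULLs.
missing_date = [r[0] for r in conn.execute(
    "SELECT subject FROM property_table WHERE date_created IS NULL"
)]
print(pages_on_date, missing_date)  # ['ex:doc1'] ['ex:doc2']
```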

Our initial experimentation has been performed on a set of queries inspired by Longwell [4], a faceted browsing tool. We aim to make these queries run on the order of seconds to support an interactive browsing environment. With current Semantic Web store technology, these queries take hundreds of seconds. We have found that using property tables on traditional row-oriented databases such as PostgreSQL reduces the running time of these queries by an order of magnitude. We look to projects such as C-Store [3], a column-oriented database, to bring us the rest of the way to real-time querying.

Since Semantic Web data is often semi-structured, storing data in property tables can result in very sparse, wide tables as more subjects or properties are added. This characteristic of the property table schema results in poor performance in a row-oriented database. We have found that column-oriented stores such as C-Store pay a much smaller performance penalty for wide, sparse data than row-oriented databases do, so an ultra-wide property table might be sufficient to store an entire RDF database. As a result of this observation, we are building a new database management system, architected specifically to store Semantic Web data and designed to scale better and achieve higher performance than existing RDF stores.
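The intuition behind the sparsity argument can be shown with a toy comparison of the two layouts. This is a deliberately simplified illustration under assumed sample data, not a description of how C-Store or any row-oriented database actually lays out data on disk.

```python
# A wide, sparse property table: each subject defines only a few properties.
# Subjects and properties are made up for this example.
subjects = {
    "ex:doc1": {"type": "ex:Page", "date_created": "12/11/06"},
    "ex:doc2": {"type": "ex:Page", "author": "ex:alice"},
    "ex:img1": {"type": "ex:Image", "width": "640"},
}
all_properties = sorted({p for props in subjects.values() for p in props})

# Row-oriented layout: every row materializes every column, NULLs included.
row_store = [
    [subject] + [props.get(p) for p in all_properties]
    for subject, props in subjects.items()
]

# Simplified column-oriented layout: each column keeps only the entries
# that are actually defined, keyed by subject.
column_store = {p: {} for p in all_properties}
for subject, props in subjects.items():
    for p, value in props.items():
        column_store[p][subject] = value

row_cells = sum(len(row) - 1 for row in row_store)  # property cells per row
null_cells = sum(cell is None for row in row_store for cell in row)
column_cells = sum(len(col) for col in column_store.values())

print(row_cells, null_cells, column_cells)  # 12 6 6
# A query on one property reads one column instead of every full row.
print(column_store["type"])
```

In this toy data set half of the row-oriented cells are NULL, and the fraction grows as more rarely used properties are added as columns.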

References:

[1] Eugene Inseok Chong, Souripriya Das, George Eadon, and Jagannathan Srinivasan. An Efficient SQL-based RDF Querying Scheme. In VLDB, pp. 1216-1227, Trondheim, Norway, 2005.

[2] Kevin Wilkinson. Jena Property Table Implementation. Presented at the Second International Workshop on Scalable Semantic Web Knowledge Base Systems, Athens, Georgia, USA, November 2006.

[3] Michael Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Samuel Madden, Elizabeth J. O'Neil, Patrick E. O'Neil, Alex Rasin, Nga Tran, and Stanley B. Zdonik. C-Store: A Column-oriented DBMS. In VLDB, pp. 553-564, Trondheim, Norway, 2005.

[4] Longwell website. http://simile.mit.edu/longwell/, 2007.

[5] RDF Primer. W3C Recommendation. http://www.w3.org/TR/rdf-primer, 2004.

 


Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu