Refereed Original Article
A key property of linked data, i.e., the web-based representation and publication of data as interconnected labeled graphs, is that it enables querying and navigating through datasets distributed across the network. SPARQL 1.1, the current standard query language for RDF-based linked data, defines a construct called property paths (PP) to navigate between the entities of a graph. This is potentially very useful in a number of use cases, e.g., in the biomedical domain, where large datasets are available as linked data graphs. However, PP in SPARQL 1.1 can be used only on a single local graph, which requires merging all distributed datasets into one large, centrally stored graph and thereby reduces the value of using linked data in the first place. We propose an index-based approach, called QPPDs, for answering path queries that span multiple distributed datasets. We provide a heuristic-based source selection mechanism that selects the relevant datasets (also called data sources) for a given path query, and a technique that federates queries to the selected sources and assembles (merges) the partial or complete paths retrieved from those remote datasets. We demonstrate our approach on a genomics use case, where the description of biological entities (e.g., genes, diseases, and drugs) is scattered across multiple datasets. In our preliminary investigation, we evaluate the QPPDs approach with real-world path queries on highly heterogeneous biological data, in terms of performance (overall path retrieval time) and result completeness, i.e., the number of paths retrieved.
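The general idea of index-guided source selection followed by path assembly can be illustrated with a minimal sketch. This is not the QPPDs implementation: the index layout, the function names `build_index` and `find_paths`, the `max_hops` limit, and the toy triples are all assumptions made for illustration, and in-memory lists stand in for remote datasets that would in practice be queried over the network.

```python
from collections import defaultdict, deque

# Toy stand-ins for remote datasets (illustrative data, not from the article):
# dataset name -> list of (subject, predicate, object) triples.
DATASETS = {
    "genes": [("gene:BRCA1", "associatedWith", "disease:BreastCancer")],
    "drugs": [("disease:BreastCancer", "treatedBy", "drug:Tamoxifen")],
    "other": [("gene:TP53", "associatedWith", "disease:LiFraumeni")],
}


def build_index(datasets):
    """Hypothetical index: map each subject node to the datasets
    that hold outgoing edges for it."""
    index = defaultdict(set)
    for name, triples in datasets.items():
        for subject, _, _ in triples:
            index[subject].add(name)
    return index


def find_paths(start, goal, datasets, index, max_hops=4):
    """Breadth-first search that expands a node only in the datasets
    the index selects, stitching partial paths across datasets."""
    paths = []
    queue = deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if node == goal and path:
            paths.append(path)
            continue
        if len(path) >= max_hops:
            continue
        # Nodes already on this path, used to avoid cycles.
        visited = {start} | {edge[2] for edge in path}
        for name in sorted(index.get(node, ())):  # source selection step
            for s, p, o in datasets[name]:
                if s == node and o not in visited:
                    # Record which dataset contributed each edge.
                    queue.append((o, path + [(s, p, o, name)]))
    return paths


if __name__ == "__main__":
    idx = build_index(DATASETS)
    for found in find_paths("gene:BRCA1", "drug:Tamoxifen", DATASETS, idx):
        print(found)
```

In this sketch the path from the gene to the drug is assembled from one edge in the `genes` dataset and one in the `drugs` dataset, while the `other` dataset is never consulted because the index does not list it for any node on the path.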
Digital Object Identifier (DOI):
National University of Ireland, Galway (NUIG)
Open access repository: