Efficiently Handling Skew in Outer Joins on Distributed Systems

Authors: 

Long Cheng, Spyros Kotoulas, Tomas Ward, Georgios Theodoropoulos

Publication Type: 
Refereed Conference Meeting Proceeding
Abstract: 
Outer joins are ubiquitous in databases and big data systems. The question of how best to execute outer joins in large parallel systems is particularly challenging, as real-world datasets are characterized by data skew, which leads to performance issues. Although skew handling techniques have been extensively studied for inner joins, there is little published work solving the corresponding problem for parallel outer joins. Conventional approaches to this problem, such as those based on hash redistribution, often lead to load balancing problems, while duplication-based approaches incur significant overhead in terms of network communication. In this paper, we propose a new algorithm, query with counters (QC), for directly handling skew in outer joins on distributed architectures. We present an efficient implementation of our approach based on the asynchronous partitioned global address space (APGAS) parallel programming model. We evaluate the performance of our approach on a cluster of 192 cores (16 nodes) and datasets of 1 billion tuples with different skew. Experimental results show that our method is scalable and, in cases of high skew, faster than the state-of-the-art.
Conference Name: 
Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on
Proceedings: 
Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on
Digital Object Identifier (DOI): 
10.1109/CCGrid.2014.35
Publication Date: 
26/05/2014
Conference Location: 
United States of America
Research Group: 
Institution: 
NUIM
Open access repository: 
No