RDF Data Partitioning for Efficient SPARQL Query Processing with Spark SQL


Research field: Databases



Conference: International Conference on Information Integration and Web Intelligence


Abstract

In the age of big data, the volume of RDF data has been exploding due to the growing demand for open data, including Linked Open Data (LOD), semantic data processing, and knowledge graphs. A large-scale RDF dataset may contain millions to hundreds of millions of triples, each comprising a subject, a predicate, and an object, which makes fast query processing on such datasets challenging. To address this issue, distributed parallel processing systems such as Apache Spark have been used successfully. One of the key issues in such systems is how to partition the data so as to maximize performance while balancing the load and minimizing communication between processing nodes, taking into account the characteristics of the dataset and the workload. In this study, we propose an RDF data partitioning method for efficient query processing with Spark SQL. We exploit statistics of the RDF data and workload information representing typical user queries, which allows us to group strongly related RDF triples into the same partition. Moreover, we employ indexes so that only the partitions needed to answer a query are loaded, which reduces the amount of data to be processed and improves query processing performance. Our evaluation experiments showed that the proposed scheme outperformed the comparative methods in table load time and in query time for most benchmark queries in a single-node setting.
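The two ideas in the abstract, workload-aware grouping of triples and an index that limits which partitions are loaded, can be illustrated with a small plain-Python sketch. This is not the authors' algorithm: it assumes, purely for illustration, that predicates appearing together in a workload query belong in the same partition, and it ignores the dataset statistics and load balancing the paper mentions. The triples, the workload, and all function names are hypothetical.

```python
from collections import defaultdict

# Toy RDF triples (subject, predicate, object); all data here is made up.
TRIPLES = [
    ("alice", "name", "Alice"),
    ("alice", "worksAt", "acme"),
    ("bob", "name", "Bob"),
    ("bob", "worksAt", "acme"),
    ("acme", "locatedIn", "Tsukuba"),
    ("alice", "likes", "tea"),
]

# Hypothetical workload: each query is the set of predicates it touches.
WORKLOAD = [{"name", "worksAt"}, {"worksAt", "name"}, {"locatedIn"}]


def group_predicates(workload, predicates):
    """Merge predicates that co-occur in a workload query (union-find)."""
    parent = {p: p for p in predicates}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path compression
            p = parent[p]
        return p

    for query in workload:
        known = sorted(q for q in query if q in parent)
        for other in known[1:]:
            parent[find(other)] = find(known[0])
    return {p: find(p) for p in predicates}


def build_partitions(triples, workload):
    """Return (partitions, index); index maps predicate -> partition key."""
    predicates = sorted({p for _, p, _ in triples})
    index = group_predicates(workload, predicates)
    partitions = defaultdict(list)
    for s, p, o in triples:
        partitions[index[p]].append((s, p, o))
    return dict(partitions), index


def partitions_for(query_predicates, index):
    """Use the index to pick only the partitions a query needs."""
    return {index[p] for p in query_predicates if p in index}


partitions, index = build_partitions(TRIPLES, WORKLOAD)
needed = partitions_for({"name", "worksAt"}, index)
loaded = [t for key in needed for t in partitions[key]]
print(f"loaded {len(loaded)} of {len(TRIPLES)} triples")
```

In a Spark SQL setting, each group might be stored as its own table or file so that a query reads only the groups selected by the index. That mapping is an assumption here, not something the abstract specifies.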


Author Profile
Kosuke Yamasaki

Graduate School of Science and Technology, University of Tsukuba, Tsukuba, Japan

Andorra
Author Profile
Toshiyuki Amagasa

Graduate School of Science and Technology, University of Tsukuba, Tsukuba, Japan

Andorra

Paper Information

Publication year: 2023
Citations: 0
Country of publication: Andorra
Site: Springer

Related papers (105)