Research field: Databases
Conference: International Conference on Information Integration and Web Intelligence
In the age of big data, the volume of RDF data has been exploding due to the growing demand for open data, including Linked Open Data (LOD), semantic data processing, and knowledge graphs. Large-scale RDF datasets may contain millions to hundreds of millions of triples, each comprising a subject, a predicate, and an object, which makes fast query processing on such datasets challenging. To address this issue, distributed parallel processing systems such as Apache Spark have been used successfully. One of the key issues in such systems is how to partition the data so as to maximize performance while balancing the load and minimizing communication between processing nodes, taking into account the dataset’s characteristics and the workload. In this study, we propose an RDF data partitioning method for efficient query processing with Spark SQL. We exploit statistics of the RDF data and workload information representing typical user queries, which allows us to group strongly related RDF triples into the same partition. Moreover, we employ indexes so that only the partitions needed to answer a query are loaded, reducing the amount of data to be processed and improving query processing performance. Our evaluation experiments showed that the proposed scheme outperformed the comparative methods in table load time and query time for most benchmark queries in a single-node setting.
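The idea of workload-aware grouping plus a partition index can be illustrated with a small sketch. This is not the paper's algorithm: it assumes, purely for illustration, that "strongly related" means "predicates that co-occur in the same workload query", it ignores the dataset statistics the abstract mentions, and it uses plain Python in place of Spark SQL. The names `group_predicates`, `build_partitions`, and `partitions_for_query` are hypothetical.

```python
from collections import defaultdict


def group_predicates(workload):
    """Group predicates that co-occur in a workload query (union-find).

    Assumption: co-occurrence in a query stands in for "strongly related".
    """
    parent = {}

    def find(p):
        parent.setdefault(p, p)
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path compression
            p = parent[p]
        return p

    for query_predicates in workload:
        preds = list(query_predicates)
        for p in preds:
            find(p)  # register every predicate
        for other in preds[1:]:
            root_a, root_b = find(preds[0]), find(other)
            if root_a != root_b:
                parent[root_b] = root_a
    return {p: find(p) for p in list(parent)}


def build_partitions(triples, workload):
    """Assign each triple to a partition and build a predicate -> partition index."""
    group_of = group_predicates(workload)
    partitions = defaultdict(list)  # partition id -> list of triples
    index = {}                      # predicate -> partition id
    for s, p, o in triples:
        # Predicates never seen in the workload get a partition of their own.
        pid = group_of.get(p, p)
        partitions[pid].append((s, p, o))
        index[p] = pid
    return dict(partitions), index


def partitions_for_query(query_predicates, index):
    """Return the ids of the partitions a query actually needs to load."""
    return {index[p] for p in query_predicates if p in index}


# Toy data, invented for this example.
triples = [
    ("alice", "knows", "bob"),
    ("alice", "name", "Alice"),
    ("bob", "name", "Bob"),
    ("paper1", "title", "RDF"),
    ("paper1", "author", "alice"),
]
workload = [("knows", "name"), ("title", "author")]

partitions, index = build_partitions(triples, workload)
needed = partitions_for_query(["knows", "name"], index)
```

In this toy run, a query over `knows` and `name` touches a single partition, so the `title`/`author` triples never need to be loaded. In a Spark SQL setting the same effect would typically come from storing each group as a separate table or partition directory and consulting the index before reading.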
| Publication year | 2023 |
|---|---|
| Citation count | 0 |
| Publication country | Andorra |
| Site | Springer |