Max–Min semantic chunking of documents for RAG application


Research field: Strategies



Venue: Discover Computing


Abstract

Retrieval-augmented generation (RAG) systems have emerged as a powerful approach to enhance large language model (LLM) outputs; however, their effectiveness depends heavily on the document chunking strategy. Current methods, which often rely on arbitrary or size-based segmentation, fail to preserve semantic coherence, leading to suboptimal retrieval and reduced output quality. To overcome this limitation, we introduce Max–Min semantic chunking, a novel method that uses semantic similarity and a Max–Min algorithm to identify semantically coherent segments of text. We evaluated our approach on three distinct datasets, assessing clustering efficiency via adjusted mutual information (AMI) and generation coherence through accuracy on a RAG-based multiple-choice question answering test. Across the datasets, Max–Min semantic chunking achieved superior performance, with average AMI scores of 0.85 and 0.90 and an average accuracy of 0.56 (averaged across LLMs). It significantly outperformed the next best method, the Llama Semantic Splitter (AMI: 0.68, 0.70; accuracy: 0.53). The improvements in the AMI scores were statistically significant.
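The abstract does not spell out the Max–Min decision rule, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes sentence embeddings are already available as plain lists of floats, and that a sentence joins the current chunk when its maximum cosine similarity to the chunk's sentences is at least the minimum pairwise similarity already inside the chunk. The function names and the `first_pair_threshold` parameter are hypothetical, introduced here for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def max_min_chunks(embeddings, first_pair_threshold=0.5):
    """Group consecutive sentence indices into chunks.

    ASSUMED rule (not taken from the paper): sentence i joins the current
    chunk if max similarity(i, chunk members) >= min pairwise similarity
    inside the chunk. A one-sentence chunk has no pairwise similarity, so
    the hypothetical `first_pair_threshold` is used instead.
    """
    chunks = []
    current = []
    for i, vec in enumerate(embeddings):
        if not current:
            current = [i]
            continue
        # Best match between the new sentence and any sentence in the chunk.
        max_sim = max(cosine(vec, embeddings[j]) for j in current)
        if len(current) == 1:
            threshold = first_pair_threshold
        else:
            # Weakest link among sentences already in the chunk.
            threshold = min(
                cosine(embeddings[a], embeddings[b])
                for k, a in enumerate(current)
                for b in current[k + 1:]
            )
        if max_sim >= threshold:
            current.append(i)
        else:
            chunks.append(current)
            current = [i]
    if current:
        chunks.append(current)
    return chunks


# Toy 2-D "embeddings": three sentences on one topic, then two on another.
toy = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]
result = max_min_chunks(toy)
```

With the toy vectors, the first three sentences fall into one chunk and the last two into another. In practice the embeddings would come from a sentence-embedding model, and the threshold for the first pair would need tuning per corpus.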


Author Profile
Csaba Kiss

Department of Stochastics, Institute of Mathematics, Budapest University of Technology and Economics, Műegyetem rkp. 3., Budapest 1111, Hungary

Andorra
Author Profile
Marcell Nagy

Department of Stochastics, Institute of Mathematics, Budapest University of Technology and Economics, Műegyetem rkp. 3., Budapest 1111, Hungary

Andorra
Author Profile
Péter Szilágyi

Nokia Bell Labs, Budapest, Hungary

Hungary

📄 Paper information

Publication year: 2025
Citations: 0
Country of publication: Hungary, Andorra
Site: Springer
