ITRT (IT Research Trends)

Max–Min semantic chunking of documents for RAG application

Research field: Strategies

Paper keywords: #algorithm #clustering #chunking #splitter #overcome

Venue: Discover Computing

Abstract

Retrieval-augmented generation (RAG) systems have emerged as a powerful approach to enhance large language model (LLM) outputs; however, their effectiveness heavily depends on document chunking strategies. Current methods, often arbitrary or size-based segmentation, fail to preserve semantic coherence, leading to suboptimal retrieval and reduced output quality. To overcome this limitation, we introduce Max–Min semantic chunking, a novel method utilizing semantic similarity and a Max–Min algorithm to identify semantically coherent text. We evaluated our approach on three distinct datasets, assessing clustering efficiency via adjusted mutual information (AMI) and generation coherence through accuracy on a RAG-based multiple-choice question answering test. Across the datasets, Max–Min semantic chunking achieved superior performance with average AMI scores of 0.85, 0.90, and an average accuracy of 0.56 (averaged across LLMs). This significantly outperformed the next best method, the Llama Semantic Splitter (AMI: 0.68, 0.70; accuracy: 0.53). The improvements in the AMI scores were statistically significant.
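The abstract names the ingredients (sentence-level semantic similarity plus a Max–Min rule) but does not spell out the procedure. The sketch below is one plausible reading, not the authors' implementation: a new sentence joins the current chunk when its strongest similarity to any sentence in the chunk (the "max") is at least the weakest pairwise similarity already inside the chunk (the "min"). The function name `max_min_chunks`, the `fallback_threshold` parameter used for single-sentence chunks, and the toy 2-D vectors standing in for real sentence embeddings are all assumptions made for illustration.

```python
from math import sqrt
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def max_min_chunks(
    embeddings: List[Vector],
    fallback_threshold: float = 0.5,  # hypothetical: used while a chunk has one sentence
) -> List[List[int]]:
    """Group consecutive sentence indices into chunks (illustrative rule only)."""
    chunks: List[List[int]] = []
    current: List[int] = []
    for i, emb in enumerate(embeddings):
        if not current:
            current = [i]
            continue
        # "Max": strongest link between the new sentence and the current chunk.
        max_sim = max(cosine(emb, embeddings[j]) for j in current)
        # "Min": weakest pairwise link already inside the current chunk.
        if len(current) > 1:
            min_sim = min(
                cosine(embeddings[a], embeddings[b])
                for k, a in enumerate(current)
                for b in current[k + 1:]
            )
        else:
            min_sim = fallback_threshold
        if max_sim >= min_sim:
            current.append(i)
        else:
            chunks.append(current)
            current = [i]
    if current:
        chunks.append(current)
    return chunks


# Toy vectors: three sentences on one topic, then two on another.
toy = [(1.0, 0.0), (0.9, 0.1), (0.95, 0.05), (0.0, 1.0), (0.1, 0.9)]
print(max_min_chunks(toy))
```

In practice the vectors would come from a sentence-embedding model, and the exact acceptance rule and thresholds should be taken from the paper itself rather than from this sketch.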

📄 Paper information

Publication year	2025
Citations	0
Country of publication	Hungary, Andorra
Site	Springer

Csaba Kiss

Marcell Nagy

Péter Szilágyi
