A multilingual translator to SQL with database schema pruning to improve self-attention


Research field: Databases



Venue: International Journal of Information Technology


Abstract

Databases hold a large amount of information that can be accessed through the Structured Query Language (SQL), but this language requires technical knowledge. One way to facilitate access to this information is to pose queries in natural language and use an artificial intelligence model to translate them into SQL. Transformer-based language models have been highly successful in this regard. However, transformers are limited by the size of the input text; therefore, long input sequences can degrade the quality of the results. We present two techniques to improve results. The first is an innovative technique that allows long text sequences to be handled by transformers with up to 512 input tokens: we run database schema pruning (removal of table names and column names that are useless for the query of interest) during the fine-tuning process. The second technique is a multilingual approach. The model is fine-tuned using a data-augmented Spider dataset (a specialized dataset for natural language to SQL, NL2SQL) in four languages simultaneously: English, Portuguese, Spanish, and French. The combination of these techniques increased the exact set match accuracy from 0.718 to 0.736 on our validation dataset. Improving results is challenging because NL2SQL techniques are already significantly optimized, and the two techniques presented here are important because they are applied to the training dataset, allowing them to be used with any current technique. Source code, evaluations, and checkpoints are available at https://github.com/C4AI/gap-text2sql.
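To make the two techniques concrete, the following is a minimal Python sketch, not the authors' implementation (which is in the repository linked above). It assumes that a table or column name counts as useful when it appears as an identifier in the reference SQL query of a training example, and that the model input is the question followed by a flat listing of the schema. The function names (`prune_schema`, `serialize`, `build_examples`), the `|` separator, the toy schema, and the translated questions are all hypothetical and chosen only for illustration.

```python
import re


def prune_schema(schema, gold_sql):
    """Keep only tables and columns whose names occur in the reference SQL.

    Assumption: a name is "useful" if it appears as an identifier in gold_sql.
    This is a simple token match, not a real SQL parser.
    """
    tokens = set(re.findall(r"[a-z_][a-z0-9_]*", gold_sql.lower()))
    pruned = {}
    for table, columns in schema.items():
        if table.lower() in tokens:
            pruned[table] = [c for c in columns if c.lower() in tokens]
    return pruned


def serialize(question, schema):
    """Flatten a question and a schema into one input string (hypothetical format)."""
    parts = [f"{table}: {', '.join(columns)}" for table, columns in schema.items()]
    return " | ".join([question] + parts)


def build_examples(questions_by_lang, schema, gold_sql):
    """Pair each translated question with the same pruned schema and target SQL."""
    pruned = prune_schema(schema, gold_sql)
    return [(serialize(q, pruned), gold_sql) for q in questions_by_lang.values()]


# Toy data, invented for this sketch.
schema = {
    "singer": ["singer_id", "name", "country", "age"],
    "concert": ["concert_id", "concert_name", "year"],
    "stadium": ["stadium_id", "location", "capacity"],
}
gold_sql = "SELECT name FROM singer WHERE age > 30"
questions_by_lang = {
    "en": "Which singers are older than 30?",
    "pt": "Quais cantores têm mais de 30 anos?",
    "es": "¿Qué cantantes tienen más de 30 años?",
    "fr": "Quels chanteurs ont plus de 30 ans ?",
}

examples = build_examples(questions_by_lang, schema, gold_sql)
for source, target in examples:
    print(source, "->", target)
```

Because pruning here relies on the reference SQL, it applies only when building the fine-tuning data, which matches the abstract's point that both techniques act on the training dataset. A token match of this kind can also keep a name that appears only inside a string literal, so a real pipeline would likely need a proper SQL parser.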


Authors

Marcelo Archanjo Jose
Institute for Advanced Studies, University of São Paulo, R. do Anfiteatro 513, São Paulo, São Paulo 05508-060, Brazil

Fabio Gagliardi Cozman
Center for Artificial Intelligence (C4AI), Av. Prof. Lúcio Martins Rodrigues 370, São Paulo, São Paulo 05508-010, Brazil

Paper information

Publication year: 2023
Citations: 6
Country of publication: Brazil
Site: Springer
