Masked Language Modeling for Resource Constrained Biological Natural Language Processing


Research field: Artificial Intelligence



Conference: 2023 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC)


Abstract

Recent advances in Natural Language Processing (NLP) have produced state-of-the-art results on several sequence-to-sequence (seq2seq) tasks. Improvements in embedders and their training methodologies have yielded significant gains on downstream tasks. Word vector models such as Word2Vec, FastText, and GloVe were widely preferred over one-hot encoded vectors for years, until the advent of deep contextualized embedders. Protein sequences consist of 20 naturally occurring amino acids and can be treated as the language of nature. These amino acids, in combination with one another, give rise to biological functions. The choice of vector representation and architecture design for a biological task depends heavily on the nature of the task. We use unlabelled protein sequences to train a Convolution and Gated Recurrent Network (CGRN) embedder with the Masked Language Modeling (MLM) technique. The embedder shows a significant performance boost in a resource-constrained setting on two downstream tasks: an F1-score (Q8) of 73.1% on Secondary Structure Prediction (SSP) and an F1-score of 84% on Intrinsically Disordered Region Prediction (IDRP). We also compare different architectures on the downstream tasks to show how the nature of a biological task affects model performance.
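The masking step behind MLM on protein sequences can be illustrated with a minimal sketch. This is not the authors' implementation: the `<mask>` token, the 15% masking rate (borrowed from BERT-style setups), and the function name `mask_sequence` are all assumptions made for illustration. The sketch only shows how an unlabelled amino acid sequence could be turned into an (input, target) pair for an embedder to learn from.

```python
import random

# The 20 standard amino acids, written as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK_TOKEN = "<mask>"  # hypothetical mask symbol, not taken from the paper
IGNORE = None          # label for positions the training loss should skip


def mask_sequence(sequence, mask_rate=0.15, seed=None):
    """Return (masked_tokens, labels) for one protein sequence.

    mask_rate=0.15 is an assumed value, not a figure from the paper.
    """
    tokens = list(sequence)
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask at least one residue so every sequence yields a training signal.
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    # Inputs: masked positions are hidden, all others are left unchanged.
    masked = [MASK_TOKEN if i in positions else tok
              for i, tok in enumerate(tokens)]
    # Targets: the original residue at masked positions, IGNORE elsewhere.
    labels = [tok if i in positions else IGNORE
              for i, tok in enumerate(tokens)]
    return masked, labels


masked, labels = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ", seed=0)
```

An embedder trained this way would receive `masked` as input and be scored only on the positions where `labels` holds a residue, which is what allows training without structural or functional annotations.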


Author Profile
Haasha Bin Atif

National University of Computer and Emerging Sciences, Islamabad, Pakistan

Andorra
Author Profile
Hamza Alvi

National University of Computer and Emerging Sciences, Islamabad, Pakistan

Andorra
Author Profile
Hammad Naveed

National University of Computer and Emerging Sciences, Islamabad, Pakistan

Andorra

Paper information

Publication year: 2023
Citations: 1
Country of publication: Andorra
Site: IEEE

Related papers (2)