Research field: Cryptography
Conference: European Conference on Parallel Processing
Recent advances in large language models (LLMs) have shown that smaller, fine-tuned models can match or outperform larger general-purpose models on domain-specific knowledge, even when quantized. However, these models suffer from two issues in production systems: under-utilized memory and potential data security risks. We propose a new method of mixture-of-experts (MoE) inference that combines GPU partitioning with single-root IO virtualization (SRIOV), enabling better GPU memory utilization and scalability while keeping model weights secure. LLMs today come in a variety of sizes and quantization levels, each with its own memory requirement. Using SRIOV, we can partition the GPU into one or more virtual functions (VFs), adjusting the allocated memory and compute to fit the needs of these LLMs. With AMD Instinct™ MI300X [1], for example, one VF can have 24 to 192 GB of high-bandwidth memory (HBM), scaling to 1.5 TB per node. These SRIOV-enabled virtual machines also address the load imbalance inherent in MoE models, eliminating the need for an auxiliary load-balancing loss, while maintaining a fast interconnect between all components and thus low latency during inference. Additionally, the isolation built into SRIOV provides native data security because virtual functions are isolated from each other, opening up new use cases in which different vendors each contribute their own expert to the mixture.
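
The memory figures quoted in the abstract can be sanity-checked with a little arithmetic. The sketch below is only an illustration, not the authors' method. It assumes the GPU's 192 GB of HBM is split evenly across VFs, that the allowed VF counts are 1, 2, 4, and 8 (a hypothetical set chosen so the smallest VF matches the 24 GB figure), and that a node holds eight GPUs (which would give the stated 1.5 TB). The helper names are invented for this example, and the footprint estimate counts weights only, ignoring KV cache and activations.

```python
from typing import Optional, Sequence

HBM_PER_GPU_GB = 192  # upper bound per VF quoted in the abstract
GPUS_PER_NODE = 8     # assumption: eight GPUs per node, giving 1536 GB (1.5 TB)


def vf_memory_gb(num_vfs: int, hbm_gb: int = HBM_PER_GPU_GB) -> int:
    """HBM available to each VF, assuming an even split across VFs."""
    if num_vfs < 1 or hbm_gb % num_vfs != 0:
        raise ValueError("num_vfs must divide the GPU memory evenly")
    return hbm_gb // num_vfs


def weight_footprint_gb(params_billion: float, bits_per_weight: int) -> float:
    """Rough weights-only footprint in GB (1 GB taken as 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def largest_vf_count(
    model_gb: float, allowed: Sequence[int] = (1, 2, 4, 8)
) -> Optional[int]:
    """Largest allowed VF count whose per-VF memory still holds the model.

    Returns None when the model does not fit even on an unpartitioned GPU.
    The `allowed` set is a hypothetical choice for illustration.
    """
    for n in sorted(allowed, reverse=True):
        if vf_memory_gb(n) >= model_gb:
            return n
    return None


if __name__ == "__main__":
    for params, bits in [(7, 4), (13, 8), (70, 16)]:
        size = weight_footprint_gb(params, bits)
        print(f"{params}B @ {bits}-bit: {size} GB, VFs per GPU: {largest_vf_count(size)}")
```

Under these assumptions, a small quantized expert can share a GPU with up to seven others, while a large unquantized one takes the whole device, which is the utilization trade-off the abstract describes.
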
| Publication year | 2025 |
|---|---|
| Citations | 0 |
| Country of publication | United States |
| Site | Springer |