Scaling large language models (LLMs) has become one of the most reliable ways to improve how well they process and understand human language. However, scaling LLMs runs into computational and memory bottlenecks. A technique called Mixture-of-Experts (MoE) was developed to work around these constraints.
MoE allows LLMs to grow their parameter count without a proportional increase in computational cost. Instead of running every input through the full model capacity, MoE architectures use specialized “expert” modules, each tailored to handle a particular kind of input. Because only a few experts are activated for each token, computation per input uses just a fraction of the model’s total parameters. Several popular LLMs, including Mixtral, DBRX, Grok, and reportedly GPT-4, use MoE.
However, current MoE techniques can only accommodate a relatively small number of experts, which limits how far these models can scale. To address this limitation, Google DeepMind has introduced a novel architecture called Parameter Efficient Expert Retrieval (PEER). PEER allows MoE models to scale to millions of experts, further improving the performance-compute tradeoff of large language models.
Scaling LLMs has been shown to enhance their performance and capabilities. However, there is a limit to how much a model can be scaled before running into computational and memory constraints. Each transformer block in an LLM consists of attention layers and feedforward (FFW) layers. The FFW layers account for the majority of the model’s parameters and are a bottleneck in scaling: in classic transformer architectures, all of the FFW parameters are used for every input during inference, making them computationally intensive.
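As a rough illustration of why the FFW layers dominate, the back-of-the-envelope calculation below assumes a standard transformer block with hidden size `d_model` and the common FFW expansion factor of 4; exact ratios vary across models.

```python
# Rough parameter counts for one transformer block, assuming a standard
# architecture with hidden size d_model and an FFW expansion factor of 4.
# (Illustrative only; real models differ in these ratios.)
d_model = 4096
d_ff = 4 * d_model

# Attention: query, key, value, and output projections (biases ignored).
attention_params = 4 * d_model * d_model

# Feedforward: up-projection and down-projection (biases ignored).
ffw_params = 2 * d_model * d_ff  # = 8 * d_model^2

print(f"attention params per block:   {attention_params / 1e6:.1f}M")  # ~67.1M
print(f"feedforward params per block: {ffw_params / 1e6:.1f}M")        # ~134.2M
print(f"FFW share: {ffw_params / (ffw_params + attention_params):.0%}")  # ~67%
```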
MoE addresses this challenge by replacing the FFW layers with sparsely activated expert modules. Each expert specializes in a specific area and contains only a fraction of the parameters of the full dense layer. By adding more experts, the capacity of the LLM can be expanded without a matching increase in inference cost, since only a few experts run per token.
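A minimal sketch of this idea, assuming a simple softmax router with top-k expert selection; the class name, hyperparameters, and routing details here are illustrative, not any particular model's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Minimal sparsely activated MoE layer: a softmax router picks the
    top-k experts per token, and only those experts are evaluated."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = self.router(x)                           # (tokens, num_experts)
        top_vals, top_idx = scores.topk(self.k, dim=-1)   # k experts per token
        weights = F.softmax(top_vals, dim=-1)             # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in top_idx[:, slot].unique():
                mask = top_idx[:, slot] == e               # tokens routed to expert e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

layer = TopKMoELayer(d_model=64, d_hidden=256, num_experts=8, k=2)
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64])
```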
The optimal number of experts in an MoE model depends on various factors, such as the number of training tokens and the compute budget. Studies have shown that MoEs consistently outperform dense models with the same compute resources when these variables are balanced. Increasing the “granularity” of an MoE model, which refers to the number of experts, can lead to performance gains, especially when combined with an increase in model size and training data.
Current MoE approaches have limitations that make them hard to scale. They often rely on fixed routers designed for a specific number of experts, which must be readjusted whenever new experts are added. DeepMind’s PEER architecture addresses these limitations by replacing the fixed router with a learned index. PEER uses a fast initial computation to create a shortlist of potential expert candidates before selecting and activating the top experts. This allows MoE models to handle a very large number of experts without sacrificing performance.
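To give a sense of how a shortlist over a huge expert pool can be computed cheaply, the sketch below follows the spirit of the product-key retrieval technique that PEER builds on: the query is split into two halves, each half is scored against a small set of sub-keys, and only the Cartesian combinations of the top candidates are re-ranked. Names and shapes are illustrative.

```python
import torch

def product_key_topk(query, sub_keys_1, sub_keys_2, k):
    """Shortlist experts from an N*N grid using two sets of N sub-keys.

    Instead of scoring the query against all N*N expert keys, score each
    half of the query against N sub-keys, combine the top-k of each half,
    and re-rank the k*k candidates. Cost: O(N + k^2) instead of O(N^2)."""
    d = query.shape[-1]
    q1, q2 = query[: d // 2], query[d // 2 :]

    scores_1 = sub_keys_1 @ q1          # (N,)
    scores_2 = sub_keys_2 @ q2          # (N,)

    top1_val, top1_idx = scores_1.topk(k)
    top2_val, top2_idx = scores_2.topk(k)

    # Candidate experts' scores are sums of their two sub-key scores.
    cand_scores = top1_val[:, None] + top2_val[None, :]               # (k, k)
    cand_ids = top1_idx[:, None] * sub_keys_2.shape[0] + top2_idx[None, :]

    best_val, best_flat = cand_scores.flatten().topk(k)
    return cand_ids.flatten()[best_flat], best_val

# Example: 1,024 sub-keys per half index 1,024^2 ≈ 1M experts,
# yet only 2 * 1,024 dot products are computed per query.
torch.manual_seed(0)
sub_keys_1 = torch.randn(1024, 64)
sub_keys_2 = torch.randn(1024, 64)
expert_ids, scores = product_key_topk(torch.randn(128), sub_keys_1, sub_keys_2, k=16)
print(expert_ids.shape, scores.shape)  # torch.Size([16]) torch.Size([16])
```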
Unlike previous MoE architectures, PEER uses tiny experts with a single neuron in the hidden layer. This design enables the sharing of hidden neurons among experts, improving knowledge transfer and parameter efficiency. To compensate for the small size of the experts, PEER employs a multi-head retrieval approach similar to the multi-head attention mechanism, which is commonly used in transformer models.
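The sketch below shows, under simplified assumptions, how such single-neuron experts might be aggregated across retrieval heads: each expert is just a down-projection vector and an up-projection vector, and each head's retrieved experts are weighted by a softmax over their retrieval scores. The function name, activation choice, and shapes are illustrative, not DeepMind's code.

```python
import torch
import torch.nn.functional as F

def peer_style_output(x, down_proj, up_proj, retrieved_ids, retrieved_scores):
    """Aggregate single-neuron experts, in the spirit of PEER.

    Each expert i is a single hidden neuron: a down-projection vector
    down_proj[i] and an up-projection vector up_proj[i]. For each retrieval
    head, the retrieved experts are weighted by a softmax over their
    retrieval scores, and the heads' outputs are summed."""
    # x: (d,); down_proj, up_proj: (num_experts, d)
    # retrieved_ids, retrieved_scores: (num_heads, k)
    out = torch.zeros_like(x)
    for ids, scores in zip(retrieved_ids, retrieved_scores):
        weights = F.softmax(scores, dim=-1)             # (k,)
        hidden = F.relu(down_proj[ids] @ x)             # (k,) one activation per expert
        out = out + (weights * hidden) @ up_proj[ids]   # weighted sum of up-projections
    return out

# Illustrative shapes, kept small: 65,536 experts, 8 heads, 16 experts per head.
d, num_experts, num_heads, k = 128, 65_536, 8, 16
down_proj = torch.randn(num_experts, d) * 0.02
up_proj = torch.randn(num_experts, d) * 0.02
ids = torch.randint(num_experts, (num_heads, k))
scores = torch.randn(num_heads, k)
print(peer_style_output(torch.randn(d), down_proj, up_proj, ids, scores).shape)  # torch.Size([128])
```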
PEER can be added as a layer in existing transformer models or used to replace an FFW layer. It is also related to parameter-efficient fine-tuning (PEFT) techniques, which modify the smallest possible number of parameters when adapting a model to a new task. In PEER, parameter efficiency means reducing the number of active parameters in the MoE layer, which lowers computation and memory costs during pre-training and inference. PEER could also potentially be adapted to select PEFT adapters at runtime, allowing new knowledge and features to be added to LLMs dynamically.
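For a sense of how such a layer slots into a transformer, here is a minimal block with a pluggable feedforward sublayer; the dense FFW could be swapped for an MoE or PEER-style module with the same input/output shape. This is a generic sketch, not DeepMind's implementation.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Transformer block whose feedforward sublayer is pluggable, so a dense
    FFW can be swapped for a sparse layer (e.g. an MoE or PEER-style layer)."""
    def __init__(self, d_model: int, n_heads: int, ffw_layer: nn.Module):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffw = ffw_layer                  # dense FFW, MoE, or PEER-style layer
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffw(self.norm2(x))

# Dense baseline; a sparse layer with the same interface could be dropped in instead.
dense_ffw = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
block = TransformerBlock(d_model=64, n_heads=4, ffw_layer=dense_ffw)
print(block(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```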
The researchers at DeepMind evaluated PEER on various benchmarks, comparing it against transformer models with dense feedforward layers and other MoE architectures. Their experiments showed that PEER models achieve a better performance-compute tradeoff, reaching lower perplexity scores with the same computational budget. Increasing the number of experts in a PEER model reduced perplexity further.
The findings from this research challenge the belief that MoE models reach peak efficiency with a limited number of experts. PEER demonstrates that with the right retrieval and routing mechanisms, MoE can scale to millions of experts. This scalability has the potential to significantly reduce the cost and complexity of training and serving very large language models.
In conclusion, MoE is a valuable technique for scaling large language models. However, existing MoE techniques have limitations that restrict their scalability. DeepMind’s Parameter Efficient Expert Retrieval (PEER) architecture addresses these limitations and enables MoE models to scale to millions of experts. PEER outperforms other MoE architectures and achieves a better performance-compute tradeoff. The findings from this research have implications for reducing the cost and complexity of training and serving large language models.