News
Neural Magic Partners with Meta to Provide vLLM Support in Llama 3.1
- By John K. Waters
- 07/23/2024
Neural Magic has announced a partnership with Meta to provide vLLM support for the Llama 3.1 family of multilingual large language models (LLMs). The new series boasts such advanced features as a context length of up to 128K tokens and model sizes of up to 405 billion parameters.
Initially developed at UC Berkeley, vLLM is a high-throughput, memory-efficient LLM serving engine that supports more than 40 types of open-source LLMs and a range of hardware platforms, including Nvidia and AMD GPUs, AWS Inferentia, Google TPUs, and Intel CPUs and GPUs. It also offers four key inference optimizations, detailed below.
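For readers unfamiliar with the engine, here is a minimal offline-inference sketch using vLLM's Python API; the model name and sampling settings are illustrative assumptions, not details from the announcement:

```python
# Minimal vLLM offline-inference sketch (model name and settings are illustrative).
from vllm import LLM, SamplingParams

# Load a Llama 3.1 checkpoint; vLLM handles batching and KV-cache management.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")

# Sampling settings are arbitrary example values.
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

outputs = llm.generate(["Summarize the benefits of paged attention."], params)
print(outputs[0].outputs[0].text)
```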
Neural Magic is a significant contributor to the project and has developed a supported enterprise distribution called nm-vllm, which is tailored for robust, high-performance inference across those same hardware platforms. Its other contributions to the project include automatic prefix caching, Marlin INT4 GPTQ kernels, and production monitoring with Prometheus.
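As a hedged sketch of one of those contributions, recent vLLM releases expose automatic prefix caching through an engine flag; the snippet below assumes a current vLLM version, and the model name is again illustrative:

```python
# Sketch: enabling automatic prefix caching (assumes a recent vLLM release).
from vllm import LLM

# With prefix caching on, requests that share a prompt prefix reuse previously
# computed KV-cache blocks instead of recomputing them per request.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative choice
    enable_prefix_caching=True,
)
```

When vLLM is run as an OpenAI-compatible server, it also exposes Prometheus-format metrics on a /metrics endpoint, which is what the production monitoring integration scrapes.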
According to GitHub, "This collaboration allows enterprises to deploy state-of-the-art open-source LLMs like Llama 3.1 with significant performance improvements and cost savings."
The vLLM community has incorporated several enhancements to ensure the smooth operation of these larger models, including chunked prefill, FP8 quantization, and pipeline parallelism. These updates are expected to optimize memory usage and reduce processing interruptions, significantly improving efficiency, the company said.
- Chunked Prefill: This feature enables efficient handling of the large context window by segmenting the input, which helps manage memory usage and reduces interruptions.
- FP8 Quantization: FP8, or 8-bit floating point, reduces memory footprint and increases throughput with minimal accuracy drops. This method is recommended for single-node setups with GPUs like H100 and MI300x. Users can run FP8 quantized models on an 8xH100 or 8xA100 setup using vLLM.
- Pipeline Parallelism: This method splits the model into smaller sets of layers and runs them across multiple nodes, enhancing performance without requiring expensive interconnects. It's suitable for setups with multiple GPUs across several nodes.
- Tensor Parallelism: For users needing to shard the model across multiple GPUs, vLLM supports tensor parallelism, which divides the model across GPUs within a node. This method is suitable for large setups with fast interconnects (a configuration sketch combining these options follows this list).
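To make the options concrete, here is a minimal sketch of how the four techniques map onto vLLM engine arguments. The model name, GPU counts, and the availability of each argument in a given vLLM version are assumptions rather than details from the article:

```python
# Sketch: combining the four optimizations via vLLM engine arguments
# (argument availability assumes a recent vLLM release).
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",  # illustrative FP8 checkpoint
    quantization="fp8",            # FP8 weights to cut memory footprint
    tensor_parallel_size=8,        # shard each layer across 8 GPUs within a node
    pipeline_parallel_size=2,      # split groups of layers across 2 nodes
    enable_chunked_prefill=True,   # segment long prompts to smooth memory usage
    max_model_len=131072,          # Llama 3.1's 128K-token context window
)
```

In practice, tensor and pipeline parallelism compose: tensor parallelism handles the GPUs inside each node over fast interconnects, while pipeline parallelism spans nodes where interconnects are slower.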
Performance metrics for FP8 quantized models show that the server can sustain 2.82 requests per second with an average input length of 1024 tokens and an output length of 128 tokens, the company said.
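To put that figure in context, a back-of-the-envelope calculation (assuming throughput counts both prompt and generated tokens) converts the request rate into token throughput:

```python
# Back-of-the-envelope token throughput from the reported figures.
requests_per_sec = 2.82
input_tokens, output_tokens = 1024, 128

total = requests_per_sec * (input_tokens + output_tokens)  # ~3249 tokens/s processed
generated = requests_per_sec * output_tokens               # ~361 tokens/s generated
print(f"~{total:.0f} total tokens/s, ~{generated:.0f} generated tokens/s")
```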
Neural Magic's contributions to the vLLM project include tools and expertise in model optimization techniques, such as quantization and sparsity. The company also supports scalable deployments with Kubernetes and integrates telemetry and key monitoring systems to enhance performance and efficiency.
The UC Berkeley researchers who developed vLLM detailed their efforts in their 2023 paper, "Efficient Memory Management for Large Language Model Serving with PagedAttention." The resulting technology leverages advanced memory management strategies to optimize LLM performance. By implementing PagedAttention, a memory optimization method that partitions the key-value (KV) cache into blocks accessed through a lookup table, the system manages memory resources efficiently, which reduces latency and increases throughput.
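As a toy illustration of the idea (not vLLM's actual implementation), the sketch below partitions a sequence's KV cache into fixed-size blocks allocated from a shared pool and tracked through a per-sequence block table:

```python
# Toy sketch of PagedAttention-style KV-cache paging (not vLLM's real code).
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative size)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # shared physical block pool
        self.block_tables = {}                      # seq_id -> list of block ids

    def append_token(self, seq_id: int, num_tokens: int) -> int:
        """Return the physical block holding a sequence's next token slot."""
        table = self.block_tables.setdefault(seq_id, [])
        if num_tokens % BLOCK_SIZE == 0:            # current block full: grab a new one
            table.append(self.free_blocks.pop())
        return table[-1]

    def free(self, seq_id: int):
        """Release a finished sequence's blocks back to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=1024)
for t in range(40):                                 # 40 tokens -> 3 blocks of 16
    cache.append_token(seq_id=0, num_tokens=t)
print(cache.block_tables[0])                        # the sequence's block table
cache.free(seq_id=0)
```

Because blocks are fixed-size and allocated on demand, fragmentation stays low and blocks can be shared across sequences, which is where the latency and throughput gains come from.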
This innovation is potentially crucial as LLMs continue to grow in size and complexity, requiring more sophisticated methods to handle their extensive data processing needs.
About the Author
John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at jwaters@converge360.com.