MLNews

INT4 quantization: making Large Language Model (LLM) inference efficient on CPUs

No need to worry about extensive storage capacity and high memory bandwidth when running Large Language Models (LLMs) on your own hardware.

Researchers Haihao Shen, Hanwen Chang, Bo Dong, Yu Luo and Hengyu Meng from Intel present a reliable strategy that makes LLM inference more efficient and resource-friendly.


LLMs have been very useful for the AI field when deployed effectively. The key challenges for their deployment are storage capacity and memory utilization. The researchers propose an automatic INT4 weight-only quantization flow paired with a specialized LLM runtime. The approach is applied to Llama2, Llama and GPT-NeoX, and shows remarkable inference performance on CPUs.

Effective LLM on CPU

Prior Work Related to LLMs

Large Language Models (LLMs) have made tremendous progress on a wide range of natural language tasks, from sentiment classification to translation. LLMs with more than 100 billion parameters have recently shown an impressive ability to tackle difficult tasks by generating multiple reasoning steps. However, open-source models such as Llama and Llama-2 are not efficient when it comes to memory and storage utilization.

Serving these models at full precision requires high memory bandwidth and extensive storage capacity, which undermines efficiency: the naive approach consumes the maximum amount of memory and storage.

Details about LLM Inference and Quantization

Large Language Models (LLMs) have generated a great deal of excitement in the world of technology. These models are recognized for being extremely large and have enabled incredible results in natural language processing. The issue with LLMs is that they require a lot of compute power and memory, which increases cost.

Quantization is the technique of lowering the numeric precision of a neural network's weights (and sometimes activations) to decrease the computation and memory cost of inference. INT8 quantization is the most commonly used technique for balancing model accuracy and inference performance. However, deviations from the expected behavior have been observed that limit the general applicability of INT8 quantization; this issue has been addressed by some researchers. The FP8 data type has also been introduced, but it is not widely usable yet due to the lack of suitable hardware. Weight-only quantization, by contrast, maintains model accuracy well, which is why it is the most commonly used approach for LLMs.
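To make the idea concrete, here is a minimal NumPy sketch of symmetric per-tensor INT8 quantization and dequantization. It only illustrates the formula, not the implementation used in the paper; the toy weight matrix and helper names are assumptions.

import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats into [-127, 127]."""
    scale = np.abs(w).max() / 127.0                   # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate FP32 tensor from the INT8 values and the scale."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)          # toy weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max absolute error:", np.abs(w - w_hat).max())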

The open-source community has embraced such low-bit weight-only quantization and offers CPP-based implementations, for example starcoder.cpp and llama.cpp, built on the ggml library. These implementations, however, are typically optimized for CUDA and are not tailored to CPU architectures. It therefore becomes necessary to address the challenge of making LLM inference effective and efficient on CPUs.

This paper proposes an effective and efficient approach for LLM inference on CPUs. Inspired by the ggml library, a tensor library for CPUs was developed that supports all mainstream instruction sets. The results show an average generation latency of 20 ms to 80 ms per token on LLMs with 6B to 20B parameters, using just a single socket of a 4th Generation Intel® Xeon® Scalable Processor, while keeping accuracy within 1% of the FP32 baseline.

The research presents the following key contributions to the field of Large Language Models (LLMs):

  • An automatic INT4 quantization flow that generates high-quality INT4 models with an accuracy loss of less than 1% from the FP32 baseline.
  • A tensor library that supports common CPU instruction sets; on top of it, an efficient LLM runtime was developed to accelerate inference.
  • The inference solution was applied to popular LLM models from 3B to 20B parameters, achieving a per-token generation latency of 20 ms to 80 ms, which is extremely fast.

Research Paper And Code Accessibility

The code for this work is available on GitHub. The research paper is openly accessible on arXiv, and the datasets used for evaluation are open source and can be accessed easily.

Technical Domain of LLM

The newly proposed approach consists of two major components:

  • an automatic INT4 quantization flow
  • an efficient LLM runtime
Quantization Flow

Automatic INT4 quantization flow: The INT4 quantization flow was developed on top of Neural Compressor, a popular quantization tool for deep learning frameworks. The flow is tuned over different granularities (channel-wise or group-wise) and different group sizes (32, 64, 128, ..., 1024).
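The actual flow is driven by Neural Compressor and tunes its recipes against model accuracy; the NumPy sketch below only illustrates the underlying idea of group-wise INT4 weight quantization evaluated at several group sizes. The function names and the reconstruction-error metric are illustrative assumptions, not the Neural Compressor API.

import numpy as np

def fake_quant_int4_groupwise(w: np.ndarray, group_size: int) -> np.ndarray:
    """Symmetric group-wise INT4 quantization along the input-channel axis.

    Each group of `group_size` consecutive weights shares one FP32 scale and
    INT4 values live in [-7, 7]. Returns the dequantized ("fake quantized")
    weights so the reconstruction error can be inspected.
    """
    out_ch, in_ch = w.shape
    w_g = w.reshape(out_ch, in_ch // group_size, group_size)
    scales = np.abs(w_g).max(axis=-1, keepdims=True) / 7.0
    q = np.clip(np.round(w_g / scales), -7, 7)
    return (q * scales).reshape(out_ch, in_ch)

def compare_group_sizes(w: np.ndarray, candidates=(32, 64, 128, 1024)) -> dict:
    """Report the relative reconstruction error for each candidate group size.

    The real flow instead evaluates end-to-end model accuracy and keeps the
    cheapest recipe that stays within the accuracy target.
    """
    errors = {}
    for gs in candidates:
        if w.shape[1] % gs == 0:
            w_hat = fake_quant_int4_groupwise(w, gs)
            errors[gs] = float(np.linalg.norm(w - w_hat) / np.linalg.norm(w))
    return errors

w = np.random.randn(16, 1024).astype(np.float32)   # toy weight matrix
print(compare_group_sizes(w))                      # smaller groups -> lower error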

Efficient LLM runtime: The LLM runtime is designed to provide efficient inference of LLMs on CPUs. The image below describes the whole architecture. The components in green (CPU tensor library and LLM optimizations) are specialized for LLM inference, while the components in blue (memory management, thread scheduler, operator optimization and fusion) are required for a general runtime.

Efficient Runtime


CPU Tensor Library: A CPU tensor library was developed that offers comprehensive support of INT4 kernels for x86 CPUs, as shown in the table below. The library supports dynamic quantization of the input along the batch dimension or per group of input channels, and weight quantization in both symmetric and asymmetric schemes.

CPU Tensor Library
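As a rough illustration of these schemes (not the library's actual kernels), the sketch below contrasts asymmetric per-row INT4 weight quantization, which uses a zero point, with dynamic symmetric INT8 quantization of activations computed at runtime; the shapes and names are illustrative assumptions.

import numpy as np

def quant_int4_asymmetric(w: np.ndarray):
    """Asymmetric INT4: a per-row scale and zero point map [min, max] onto [0, 15]."""
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0
    zero_point = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale) + zero_point, 0, 15)
    return q, scale, zero_point

def dynamic_quant_activations(x: np.ndarray):
    """Dynamic symmetric INT8 quantization of activations: one scale per row,
    computed at runtime for every new input batch."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(8, 64).astype(np.float32)        # toy weights
x = np.random.randn(2, 64).astype(np.float32)        # toy activation batch
qw, sw, zp = quant_int4_asymmetric(w)
qx, sx = dynamic_quant_activations(x)
w_hat = (qw - zp) * sw                               # dequantized weights
x_hat = qx.astype(np.float32) * sx                   # dequantized activations
print("max matmul error:", np.abs(x @ w.T - x_hat @ w_hat.T).max())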

LLM Optimization: Most recent LLMs are decoder-only, Transformer-based models. Given the unique characteristics of next-token generation, the KV cache becomes performance-critical for LLM inference. The optimization is illustrated below.

LLM Optimization

Panel (a) shows the default KV cache, where generating a new token requires memory reallocation for all the tokens (5 in this example), whereas panel (b) shows the optimized KV cache with pre-allocated KV memory, where only the new token is updated each time.
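A minimal sketch of the pre-allocation idea in panel (b) is shown below: the K/V buffers are allocated once up to a maximum sequence length and only the new token's entries are written each step, avoiding the repeated reallocation of panel (a). The class and shapes are illustrative assumptions, not the runtime's actual data structures.

import numpy as np

class PreallocatedKVCache:
    """KV cache with memory reserved up front for `max_seq_len` tokens."""

    def __init__(self, max_seq_len: int, n_heads: int, head_dim: int):
        shape = (max_seq_len, n_heads, head_dim)
        self.k = np.zeros(shape, dtype=np.float32)   # allocated once
        self.v = np.zeros(shape, dtype=np.float32)
        self.length = 0                              # number of valid tokens

    def append(self, k_new: np.ndarray, v_new: np.ndarray):
        """Write only the new token's K/V; old tokens are never copied or reallocated."""
        self.k[self.length] = k_new
        self.v[self.length] = v_new
        self.length += 1

    def view(self):
        """Return the valid portion of the cache for attention."""
        return self.k[: self.length], self.v[: self.length]

# The naive alternative (panel (a)) concatenates tensors every step, which
# reallocates and copies the whole cache for each generated token.
cache = PreallocatedKVCache(max_seq_len=2048, n_heads=8, head_dim=64)
for _ in range(5):
    cache.append(np.random.randn(8, 64), np.random.randn(8, 64))
k, v = cache.view()
print(k.shape, v.shape)   # (5, 8, 64) (5, 8, 64)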

Evaluation

For the evaluation, LLM architectures with parameter sizes from 7B to 20B were used. The accuracy of both the INT4 and FP32 models was evaluated using open-source datasets from lm-evaluation-harness, including lambada_openai, wikitext and others.

The average accuracy is shown in the table below. The accuracy of the INT4 model is nearly on par with that of the FP32 model, within 1% relative loss from the FP32 baseline.

Result

The latency of next-token generation was measured using both the proposed LLM runtime and the open-source ggml-based implementation. The following table presents the latency under a proxy configuration with 32 input and 32 output tokens.

Result
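For readers who want to run a similar measurement on their own hardware, here is a generic sketch using Hugging Face Transformers that times a 32-token-in, 32-token-out generation and reports milliseconds per generated token. The model name is a placeholder (a 20B model needs substantial memory), and this is not the benchmarking harness used in the paper.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neox-20b"   # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

prompt = "Once upon a time, " * 8                      # roughly 32-token prompt
inputs = tokenizer(prompt, return_tensors="pt")
inputs = {k: v[:, :32] for k, v in inputs.items()}     # clamp to 32 input tokens

with torch.no_grad():
    model.generate(**inputs, max_new_tokens=1)          # warm-up pass
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{1000 * elapsed / new_tokens:.1f} ms per generated token")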

Wrap-Up

The researchers presented a comprehensive INT4 LLM inference solution, including automatic INT4 model quantization and an efficient LLM runtime. They demonstrated its generality on a set of popular LLMs and its performance advantage over the open-source solution on CPUs.

The research can be extended by further improving the CPU tensor library and adding Hugging Face extension APIs to support INT4 LLM inference. Given the broad availability of CPUs, the approach can also be brought to personal computers (PCs) to meet the growing demand for AI-generated content and help empower generative AI on PCs.
