Accelerating llama-cpp-python LLM inference with a GPU

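llama.cpp is a large language model inference tool implemented in C++. By optimizing low-level computation and memory management, it speeds up inference without sacrificing model quality.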

Method 1 (using the python:3.10-bullseye Docker image)

1. Pull the Python image (Docker)

# Pulls the Python 3.10 image based on Debian 11
$ docker pull python:3.10-bullseye

2. Download the CUDA Toolkit

$ mkdir -p /src/ && cd /src/
# CUDA 12.0 supports Debian 11
$ wget https://developer.download.nvidia.com/compute/cuda/12.0.0/local_installers/cuda_12.0.0_525.60.13_linux.run

# Install only the toolkit
$ sh cuda_12.0.0_525.60.13_linux.run --silent --toolkit

# Remove the CUDA installer
$ rm -f cuda_12.0.0_525.60.13_linux.run

# Add nvcc to PATH
$ export PATH=$PATH:/usr/local/cuda/bin

# Confirm that the CUDA Toolkit was installed successfully
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Mon_Oct_24_19:12:58_PDT_2022
Cuda compilation tools, release 12.0, V12.0.76
Build cuda_12.0.r12.0/compiler.31968024_0
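
The same check can be scripted, for example in an image build step. This is a minimal sketch that assumes `nvcc -V` prints a line containing `release X.Y`, as in the output above. The helper names `parse_nvcc_release` and `installed_cuda_release` are made up for illustration.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(text: str) -> Optional[str]:
    """Return the CUDA release (e.g. "12.0") found in `nvcc -V` output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", text)
    return match.group(1) if match else None


def installed_cuda_release() -> Optional[str]:
    """Run `nvcc -V` if nvcc is on PATH and return its release string."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None  # nvcc is not on PATH; re-check the PATH export
    result = subprocess.run([nvcc, "-V"], capture_output=True, text=True, check=False)
    return parse_nvcc_release(result.stdout)
```

A return value of `None` from `installed_cuda_release()` means either that nvcc was not found or that its output did not match the assumed format.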

3. Install the llama-cpp-python library

# To build with CUDA support, pass `-DLLAMA_CUDA=on` to CMake through the `CMAKE_ARGS` environment variable before installing (newer llama-cpp-python releases use `-DGGML_CUDA=on` instead):

$ CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
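
After installation it helps to confirm that the wheel was actually built with GPU support, because pip may silently fall back to a CPU-only build. The sketch below assumes the installed version exposes the low-level binding `llama_supports_gpu_offload`. It looks the function up defensively and returns `None` if the package or the binding is missing. The wrapper name `gpu_offload_supported` is hypothetical.

```python
from typing import Optional


def gpu_offload_supported() -> Optional[bool]:
    """Return True/False if the installed build reports GPU offload support,
    or None if llama_cpp cannot be imported or lacks the helper."""
    try:
        import llama_cpp
    except (ImportError, OSError, RuntimeError):
        return None
    # Assumption: this low-level binding exists in the installed version.
    check = getattr(llama_cpp, "llama_supports_gpu_offload", None)
    if check is None:
        return None
    return bool(check())
```

If this returns `False`, reinstalling with `--force-reinstall --no-cache-dir` added to the pip command usually forces a fresh build with the CMake flags applied.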

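4. Download a GGUF model file
GGUF is a file format for storing pretrained model weights. Compared with the .bin checkpoint files used by Hugging Face and PyTorch, it uses a compact binary encoding, optimized data structures, and memory mapping, which makes storing and loading a model more efficient. llama-cpp-python works primarily with model files in GGUF format.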

Download from Hugging Face

$ wget -O Meta-Llama-3-8B-Instruct-Q4_K_M.gguf "https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf?download=true"
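
Once the download finishes, a quick sanity check is to read the file header. The sketch below assumes the GGUF v2-and-later header layout (the 4-byte magic `GGUF`, then a little-endian uint32 version, a uint64 tensor count, and a uint64 metadata key/value count). The names `GGUFHeader`, `parse_gguf_header`, and `read_gguf_header` are illustrative and are not part of llama-cpp-python.

```python
import struct
from typing import NamedTuple


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def parse_gguf_header(data: bytes) -> GGUFHeader:
    """Parse the first 24 bytes of a GGUF file (v2+ layout assumed)."""
    if data[:4] != b"GGUF":
        raise ValueError(f"not a GGUF file: magic={data[:4]!r}")
    if len(data) < 24:
        raise ValueError("truncated GGUF header")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", data[4:24])
    return GGUFHeader(version, tensor_count, kv_count)


def read_gguf_header(path: str) -> GGUFHeader:
    """Read and parse the header of the GGUF file at `path`."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(24))
```

Calling `read_gguf_header("./Meta-Llama-3-8B-Instruct-Q4_K_M.gguf")` on an HTML error page or a partial download would raise `ValueError` instead of failing later at model load time.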

5. Run inference with llama-cpp-python

from llama_cpp import Llama

# Load the model
llm = Llama(
      model_path="./Meta-Llama-3-8B-Instruct-Q4_K_M.gguf",
      n_gpu_layers=-1,  # -1 offloads all layers to the GPU
)

# Run inference through the OpenAI-compatible chat completion interface
output = llm.create_chat_completion(
      messages = [
          {"role": "system", "content": "You are an assistant who perfectly describes images."},
          {
              "role": "user",
              "content": "Describe this image in detail please."
          }
      ]
)
# Print the result
print(output)
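
`print(output)` dumps the whole response dictionary. Because the response follows the OpenAI chat completion shape, the reply text and token usage can be pulled out as sketched below. The helper names and the `example` response are hypothetical, and the example is trimmed to the fields used here.

```python
from typing import Any, Dict


def extract_reply(completion: Dict[str, Any]) -> str:
    """Return the assistant's text from an OpenAI-style chat completion dict."""
    return completion["choices"][0]["message"]["content"]


def total_tokens(completion: Dict[str, Any]) -> int:
    """Return the total token count, or 0 if the usage block is missing."""
    return completion.get("usage", {}).get("total_tokens", 0)


# Hypothetical response, trimmed to the fields used here.
example = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hello!"}, "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 10, "completion_tokens": 2, "total_tokens": 12},
}
print(extract_reply(example))  # Hello!
```

With the real model, `extract_reply(output)` would return just the generated text.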

6. Check GPU usage (nvidia-smi)

$ nvidia-smi 
Fri Apr 26 22:17:42 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:07.0 Off |                    0 |
| N/A   39C    P0    81W /  70W |   5153MiB / 15360MiB |     61%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4788      C   python                           5150MiB |
+-----------------------------------------------------------------------------+
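
For monitoring from a script, nvidia-smi can emit the same numbers as CSV through its `--query-gpu` and `--format` options. The following is a rough sketch: `QUERY_FIELDS`, `parse_gpu_csv`, and `query_gpus` are illustrative names, and the parser assumes every queried field is reported as an integer (some GPUs report `[N/A]` for certain fields, which this sketch does not handle).

```python
import subprocess
from typing import Dict, List

QUERY_FIELDS = ["memory.used", "memory.total", "utilization.gpu"]


def parse_gpu_csv(text: str) -> List[Dict[str, int]]:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        # Assumption: every field is an integer (MiB or percent).
        values = [int(v.strip()) for v in line.split(",")]
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpus() -> List[Dict[str, int]]:
    """Run nvidia-smi and return memory and utilization figures per GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)
```

On the Tesla T4 shown above, `query_gpus()` would be expected to report roughly 5153 MiB used out of 15360 MiB while the model is loaded.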

With that, llama-cpp-python is using the GPU to accelerate inference.

Method 2 (building the image with Cog)

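Cog is an open-source tool that packages machine learning models into standard, production-ready containers. You can deploy the packaged model to your own infrastructure or to the Replicate platform.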

For details on building the image with Cog, see the following GitHub repository.
https://github.com/shideqin/cog-codellama3-cpp