A quality community tutorial: deploying CPM-Bee and creating an API service

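△ This article is reposted from a CSDN blog. The original title is "CPM-BEE 开源大模型介绍、部署以及创建接口服务" (An introduction to the CPM-BEE open-source large model, its deployment, and creating an API service), and it is reposted with the permission of the author, "feifeiyechuan".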

Service deployment

—

01 Server configuration

Configuration details

GPU: a server with 8 x 3080 Ti cards (a single card with 24 GB of VRAM is enough)

CUDA: 12.1
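Before installing anything, it can help to confirm which cards on the machine meet the 24 GB figure. The following is a minimal sketch, not part of the original tutorial: it assumes nvidia-smi is on the PATH, and the helper names (parse_gpu_csv, query_gpus, big_enough) and the 24,000 MiB threshold are illustrative choices. The threshold sits slightly below 24 * 1024 because a card may report a little less than its nominal size.

```python
import subprocess

# Assumed threshold: roughly 24 GB, expressed in MiB as nvidia-smi reports it.
MIN_MEMORY_MIB = 24_000


def parse_gpu_csv(text):
    """Parse 'name, memory_mib' lines into (name, memory_mib) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        # rsplit keeps any comma inside the GPU name intact
        name, memory = line.rsplit(",", 1)
        gpus.append((name.strip(), int(memory)))
    return gpus


def query_gpus():
    """Ask nvidia-smi for each GPU's name and total memory in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)


def big_enough(gpus, min_mib=MIN_MEMORY_MIB):
    """Return the indices of GPUs with at least min_mib of memory."""
    return [i for i, (_, mib) in enumerate(gpus) if mib >= min_mib]
```

Calling big_enough(query_gpus()) on the server returns the indices of the cards that qualify; one of those indices is what CUDA_VISIBLE_DEVICES would be set to later.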

02 Environment setup

cat requirements.txt

(python38) root@-NF5468M5: cat requirements.txt
torch>=1.10
bmtrain>=0.2.1
jieba
tqdm
tensorboard
numpy>=1.21.0
spacy
opendelta

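To avoid conflicts between the CUDA environment and the PyTorch version, install the packages one at a time.

1) Install PyTorch built for CUDA 12.1

Reference: https://pytorch.org/get-started/locally/

Note: installing with conda is slow, so use pip3 instead.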

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
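After this step it is worth checking that the installed PyTorch build actually matches the CUDA setup. The sketch below is an illustration rather than part of the original tutorial; torch_cuda_summary is a hypothetical helper name, and it returns None when torch is not importable so that it can be run at any point.

```python
def torch_cuda_summary():
    """Report the PyTorch build and whether it can see a CUDA device.

    Returns None if torch is not installed in the current environment.
    """
    try:
        import torch
    except ImportError:
        return None
    return {
        "torch_version": torch.__version__,
        # CUDA version the wheel was built against (None for CPU-only builds)
        "built_for_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


summary = torch_cuda_summary()
print(summary)
```

If built_for_cuda is None or cuda_available is False, the wheel most likely does not match the installed driver and CUDA version, and the later model.cuda() call would fail.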

2) Install bmtrain

pip install bmtrain

3) Install the remaining dependencies

pip install jieba tqdm tensorboard numpy spacy opendelta
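Because the packages are installed one at a time, a quick check that each of them ended up in the environment can save a failed run later. The following is a minimal sketch using only the standard library (importlib.metadata, available from Python 3.8); the function name installed_versions is hypothetical, and the package list is copied from requirements.txt.

```python
from importlib import metadata

# Package names taken from requirements.txt
PACKAGES = ["torch", "bmtrain", "jieba", "tqdm",
            "tensorboard", "numpy", "spacy", "opendelta"]


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

The script only reports what is installed; comparing the versions against the lower bounds in requirements.txt (torch>=1.10, bmtrain>=0.2.1, numpy>=1.21.0) is left to the reader.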

03 Model download

Download URL: https://huggingface.co/openbmb/cpm-bee-10b/tree/main

1) Clone the code

git clone https://github.com/OpenBMB/CPM-Bee.git

2) Download the model (19 GB)

Download path: ./model

The download path can be set to any location you prefer.
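The tutorial does not say how the files were fetched. One possible way, shown here as a sketch under assumptions, is the huggingface_hub package (not listed in requirements.txt, so it would need to be installed separately). The helper names download_model and checkpoint_path are hypothetical; the repository id comes from the download URL above, and the checkpoint file name is the one the Flask script later loads.

```python
from pathlib import Path

REPO_ID = "openbmb/cpm-bee-10b"        # from the download URL above
CHECKPOINT_NAME = "pytorch_model.bin"  # file name loaded by flask_server.py


def download_model(local_dir="./model"):
    """Download the repository files into local_dir.

    Assumes huggingface_hub is installed; it is imported lazily so the
    rest of this module works without it.
    """
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


def checkpoint_path(local_dir="./model"):
    """Return the checkpoint path if it exists and is non-empty, else None."""
    path = Path(local_dir) / CHECKPOINT_NAME
    if path.is_file() and path.stat().st_size > 0:
        return path
    return None
```

Calling checkpoint_path() before starting the server gives an early signal when the download was interrupted or placed in a different directory.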

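04 Testing

1) Modify the test file

Open the file for editing: vi text_generation.py

Change the model path so that it points to the downloaded checkpoint.

2) Test the model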

python text_generation.py

05 API design (Python version)

1) Create a Flask API

vi flask_server.py

from flask import Flask, request, jsonify
import threading
import torch
from cpm_live.generation.bee import CPMBeeBeamSearch
from cpm_live.models import CPMBeeTorch, CPMBeeConfig
from cpm_live.tokenizers import CPMBeeTokenizer
from opendelta import LoraModel
from flask_cors import CORS
import os

os.environ["CUDA_VISIBLE_DEVICES"] = '6'

app = Flask(__name__)
CORS(app)

# Load the model
config = CPMBeeConfig.from_json_file("config/cpm-bee-10b.json")
ckpt_path = "model/pytorch_model.bin"
tokenizer = CPMBeeTokenizer()
model = CPMBeeTorch(config=config)
model.load_state_dict(torch.load(ckpt_path))
model.cuda()
beam_search = CPMBeeBeamSearch(
    model=model,
    tokenizer=tokenizer,
)

# Create the thread lock and the counter of in-flight requests
lock = threading.Lock()
counter = 0
MAX_CONCURRENT_REQUESTS = 5  # maximum number of concurrent requests


@app.route('/cpmbee/conversation', methods=['POST'])
def conversation():
    global counter

    # Check the limit and claim a slot in one step under the lock,
    # so concurrent requests cannot slip past the limit together
    with lock:
        if counter >= MAX_CONCURRENT_REQUESTS:
            # Overloaded: ask the client to retry later
            return jsonify({'message': '请稍等再试'})
        counter += 1

    try:
        # Read the data from the POST request
        question = request.json['question']

        inference_results = beam_search.generate([
            {'question': question, "<ans>": ""}
        ], max_length=100, repetition_penalty=1.1)

        print('inference_results:', type(inference_results), inference_results)

        result = inference_results[0]["<ans>"]

        print('result:', type(result), result)

        # Return the result
        response = {'result': result}
        return jsonify(response)

    finally:
        # Decrement the counter under the lock
        with lock:
            counter -= 1


@app.route('/cpmbee/select', methods=['POST'])
def select():
    global counter

    # Check the limit and claim a slot in one step under the lock,
    # so concurrent requests cannot slip past the limit together
    with lock:
        if counter >= MAX_CONCURRENT_REQUESTS:
            # Overloaded: ask the client to retry later
            return jsonify({'message': '请稍等再试'})
        counter += 1

    try:
        # Read the data from the POST request
        print(request.json)
        description = request.json['description']
        options = request.json['options']
        # Map each option to a key of the form <option_0>, <option_1>, ...
        options_index2option = {'<option_%s>' % str(index): str(option) for index, option in enumerate(options)}
        question = request.json['question']

        inference_results = beam_search.generate([
            {'input': description, 'options': options_index2option, 'question': question, "<ans>": ""}
        ], max_length=100, repetition_penalty=1.1)

        option_result = inference_results[0]["<ans>"]

        # Translate the returned option key back to the option text;
        # fall back to the raw answer if it is not a known key
        result = options_index2option.get(option_result, option_result)

        # Return the result
        response = {'result': result}
        return jsonify(response)

    finally:
        # Decrement the counter under the lock
        with lock:
            counter -= 1


if __name__ == '__main__':
    print("Flask server started")
    app.run(host='0.0.0.0', port=8000)

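In flask_server.py, the line from flask_cors import CORS imports the CORS class, and calling CORS(app) on the Flask application enables the default CORS configuration, which allows cross-origin access from all origins. Note that requirements.txt does not list flask or flask-cors, so both need to be installed separately (pip install flask flask-cors).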
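CORS only matters for browser clients; a script can call the two endpoints directly. The following is a minimal client sketch using only the standard library. It assumes the server from flask_server.py is reachable at http://127.0.0.1:8000; the helper names build_request and call are hypothetical, and the payload values are placeholders to be filled in.

```python
import json
import urllib.request

# Assumed address: the server running locally on the port from flask_server.py
BASE_URL = "http://127.0.0.1:8000"


def build_request(path, payload, base_url=BASE_URL):
    """Build a JSON POST request for one of the /cpmbee endpoints."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        base_url + path,
        data=body,
        # Flask's request.json needs the JSON content type
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path, payload, timeout=120):
    """Send the request and return the decoded JSON response."""
    request = build_request(path, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Payload shapes expected by the two handlers (values are placeholders)
conversation_payload = {"question": "..."}
select_payload = {
    "description": "...",
    "options": ["...", "..."],
    "question": "...",
}
```

A call such as call("/cpmbee/conversation", conversation_payload) returns a dict with a result key on success, or a message key when the server is at its concurrency limit. For /cpmbee/select, the options list is sent as plain strings; the server builds the <option_i> keys itself and maps the model's answer back to the option text.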

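To avoid GPU memory errors, flask_server.py uses a thread lock, lock, and a counter, counter, to limit the number of concurrent requests. Once the number of in-flight requests reaches MAX_CONCURRENT_REQUESTS, the maximum number of concurrent requests, the server returns the message "请稍等再试" ("please wait and try again").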
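The same limit can also be expressed with a semaphore instead of a lock plus a counter. This is an alternative technique, not the approach used in flask_server.py, and the helper name run_limited is hypothetical. A non-blocking acquire makes the "check and claim a slot" step a single atomic operation, and the finally clause guarantees that the slot is released even if the work raises.

```python
import threading

MAX_CONCURRENT_REQUESTS = 5  # same limit as in flask_server.py
slots = threading.BoundedSemaphore(MAX_CONCURRENT_REQUESTS)


def run_limited(work):
    """Run work() if a slot is free.

    Returns (True, result) when the work ran, or (False, None) right away
    when all slots are busy, so the caller can reply with a retry message.
    """
    if not slots.acquire(blocking=False):
        return False, None
    try:
        return True, work()
    finally:
        # Always give the slot back, even if work() raised
        slots.release()


accepted, value = run_limited(lambda: "inference result")
print(accepted, value)
```

Inside a Flask handler, work would wrap the beam_search.generate call, and the (False, None) case would map to the "请稍等再试" response.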