
Knowledge-Graph-Enhanced Large Language Models Based on NebulaGraph

I. Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG)

Knowledge-graph-based retrieval-augmented generation

"Knowledge-graph-based retrieval-augmented generation" combines a knowledge graph with a generative model in order to improve the performance and quality of information retrieval and text generation tasks. It is typically used for complex natural language processing tasks that require a deep understanding of the relationships between entities and of contextual information.

Concretely, the approach can be broken down into the following steps:

  1. Build the knowledge graph: first, construct a knowledge graph containing entities, relationships, and attributes. The graph should be built from reliable sources and be curated and validated to ensure the accuracy and trustworthiness of the data.
  2. Information retrieval: when information needs to be searched, queried, or retrieved, the structured data in the knowledge graph can be used for retrieval. This is done by querying the entities and relationships in the graph to obtain knowledge relevant to the query.
  3. Generative model: a generative model, such as a large language model (e.g., the GPT series), is used to produce natural language text. These models generate text creatively and flexibly, but they sometimes lack structured knowledge.
  4. Retrieval augmentation: before or after generation, information from the knowledge graph is used as augmentation. This can include entities, relationships, or attributes extracted from the graph, as well as contextual information obtained from it. This information guides the generative model toward more accurate and better-structured text, improving the quality of the output.
  5. Text generation: the generative model produces text that meets the task requirements based on the retrieved knowledge graph information and the input context. The generated text can be more accurate, better organized, and richer in knowledge.

This knowledge-graph-based retrieval-augmented generation approach can be applied to many tasks, such as intelligent question answering, text summarization, and story generation. Its key advantage is that it combines structured knowledge with the creativity of generative models, producing text that is more comprehensive, accurate, and knowledge-rich.
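
As a rough illustration of steps 2-4, the sketch below wires the NebulaGraph space used in this project into llama-index as a KnowledgeGraphIndex and answers a query with it. This is a minimal sketch against the llama-index 0.7/0.8-era API; `documents` and the LLM/embedding configuration are placeholders, and the space, tag, and edge names are the ones created in the deployment section.

# Minimal sketch: knowledge-graph-based RAG with NebulaGraph + llama-index.
# Connection settings are read from the NEBULA_USER / NEBULA_PASSWORD / NEBULA_ADDRESS
# environment variables; `documents` is assumed to be loaded elsewhere.
from llama_index import KnowledgeGraphIndex, ServiceContext, StorageContext
from llama_index.graph_stores import NebulaGraphStore

graph_store = NebulaGraphStore(
    space_name="llamaindex",
    edge_types=["relationship"],
    rel_prop_names=["relationship"],
    tags=["entity"],
)
storage_context = StorageContext.from_defaults(graph_store=graph_store)
service_context = ServiceContext.from_defaults()  # uses the configured LLM and embedding model

# Build the graph: triplets are extracted from the documents and written to NebulaGraph.
kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    service_context=service_context,
    max_triplets_per_chunk=10,
)

# Retrieval + augmentation: retrieved triplets (and optionally their source text)
# are placed into the prompt before the LLM generates the answer.
query_engine = kg_index.as_query_engine(include_text=True, retriever_mode="keyword")
response = query_engine.query("什么是商业银行")
print(response)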

II. Introduction to llama-index

1. Chat modes (chat_mode)

Context mode: basic RAG enhancement

For each chat interaction:

  • first retrieve text from the index using the user message
  • set the retrieved text as context in the system prompt
  • return an answer to the user message
Low-level API
# Context mode: basic RAG enhancement
# NOTE: lazy import
from llama_index.chat_engine import ContextChatEngine

chat_engine = ContextChatEngine.from_defaults(
    retriever=custom_retriever,
)
High-level API
from llama_index.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=1500)
chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    system_prompt="You are a chatbot, able to have normal interactions, as well as talk about an essay discussing Paul Graham's life.",
)

Condense Question mode

For each chat interaction:

  • first generate a standalone question from conversation context and last message, then
  • query the query engine with the condensed question for a response.
Low-level API
from llama_index.chat_engine import CondenseQuestionChatEngine

custom_chat_engine = CondenseQuestionChatEngine.from_defaults(
    query_engine=custom_query_engine,
    service_context=service_context,
)
High-level API
chat_engine = index.as_chat_engine(chat_mode="condense_question", verbose=True)

ReAct mode

This mode does not return sources.

For each chat interaction, the agent enters a ReAct loop:

  • first decide whether to use the query engine tool and come up with appropriate input
  • (optional) use the query engine tool and observe its output
  • decide whether to repeat or give final response
Low-level API
# NOTE: lazy import
import llama_index
from llama_index.agent import ReActAgent
from llama_index.tools.query_engine import QueryEngineTool

# get the LLM from the global service context
service_context = llama_index.global_service_context
llm = service_context.llm

# convert the query engine to a tool and build a ReAct agent around it
query_engine_tool = QueryEngineTool.from_defaults(query_engine=custom_query_engine)
custom_chat_engine = ReActAgent.from_tools(tools=[query_engine_tool], llm=llm, verbose=True)
High-level API
chat_engine = index.as_chat_engine(chat_mode="react", verbose=True)

OpenAI mode

This mode is only available when the LLM is ChatGPT, and it requires manually specifying whether the query_engine_tool should be used (i.e., whether the document vectors are queried).

Low-level API
# NOTE: lazy import
import llama_index
from llama_index.agent import OpenAIAgent, ReActAgent
from llama_index.tools.query_engine import QueryEngineTool

# get the LLM from the global service context
service_context = llama_index.global_service_context
llm = service_context.llm

# convert the query engine to a tool and pass it to the chat engine explicitly
query_engine_tool = QueryEngineTool.from_defaults(query_engine=custom_query_engine)
chat_engine = index.as_chat_engine(tools=[query_engine_tool], chat_mode="openai", verbose=True)
High-level API
chat_engine = index.as_chat_engine(chat_mode="openai", verbose=True)

Best mode

This mode checks the LLM: if it is ChatGPT and the model exposes the function-calling interface (text-davinci-003 does not, gpt-3.5-turbo does), the OpenAI Agent mode is used; otherwise it falls back to the ReAct Agent mode.

Low-level API
# BEST mode: if OpenAI is used, the document indexer will not be used
# NOTE: lazy import
import llama_index
from llama_index.agent import OpenAIAgent, ReActAgent
from llama_index.llms import OpenAI
from llama_index.llms.openai_utils import is_function_calling_model
from llama_index.tools.query_engine import QueryEngineTool

# convert the query engine to a tool
query_engine_tool = QueryEngineTool.from_defaults(query_engine=custom_query_engine)

# get the LLM from the global service context
service_context = llama_index.global_service_context
llm = service_context.llm

if isinstance(llm, OpenAI) and is_function_calling_model(llm.model):
    custom_chat_engine = OpenAIAgent.from_tools(tools=[query_engine_tool], llm=llm, verbose=True)
else:
    custom_chat_engine = ReActAgent.from_tools(tools=[query_engine_tool], llm=llm, verbose=True)
High-level API
chat_engine = index.as_chat_engine(chat_mode="best", verbose=True)

We currently use Condense Question mode, because its responses include sources and the condensed question retains some connection to the previous question.
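
As a small illustration of reading those sources (assuming an index built as in the examples above), the chat response exposes the retrieved nodes roughly like this:

# Minimal sketch: inspect the sources attached to a condense_question chat response.
chat_engine = index.as_chat_engine(chat_mode="condense_question", verbose=True)

response = chat_engine.chat("什么是商业银行")
print(response.response)

# each source node carries the retrieved text and its similarity score
for node_with_score in response.source_nodes:
    print(node_with_score.score, node_with_score.node.get_content()[:100])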

2. Parameter settings

1. retriever_mode='keyword'

Under construction.
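
As a placeholder while this section is written, here is a minimal sketch of how the keyword retriever mode is selected on a knowledge graph index (kg_index is assumed to be the KnowledgeGraphIndex built earlier; "embedding" and "hybrid" are the other documented values):

# Minimal sketch: choose how the knowledge graph index retrieves triplets.
# retriever_mode="keyword" extracts keywords from the question and matches entities;
# "embedding" and "hybrid" are the alternatives.
query_engine = kg_index.as_query_engine(
    retriever_mode="keyword",
    include_text=True,  # also return the source text chunks behind the triplets
)
response = query_engine.query("什么是商业银行")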

3. Response modes (response_mode)

refine: create and refine an answer by sequentially going through each retrieved text chunk. This makes a separate LLM call per Node/retrieved chunk.

Details: the first chunk is used in a query using the text_qa_template prompt. Then the answer and the next chunk (as well as the original question) are used in another query with the refine_template prompt. And so on until all chunks have been parsed.

If a chunk is too large to fit within the window (considering the prompt size), it is split using a TokenTextSplitter (allowing some text overlap between chunks) and the (new) additional chunks are treated as chunks of the original chunks collection (and thus queried with the refine_template as well).

Good for more detailed answers.

compact (default): similar to refine, but compacts (concatenates) the chunks beforehand, resulting in fewer LLM calls.

Details: stuff as much text (concatenated/packed from the retrieved chunks) as can fit within the context window (considering the maximum prompt size between text_qa_template and refine_template). If the text is too long to fit in one prompt, it is split into as many parts as needed (using a TokenTextSplitter and thus allowing some overlap between text chunks).

Each text part is considered a "chunk" and is sent to the refine synthesizer.

In short, it is like refine, but with fewer LLM calls.

tree_summarize: Query the LLM using the text_qa_template prompt as many times as needed so that all concatenated chunks have been queried, resulting in as many answers that are themselves recursively used as chunks in a tree_summarize LLM call and so on, until there’s only one chunk left, and thus only one final answer.

Details: concatenate the chunks as much as possible to fit within the context window using the text_qa_template prompt, and split them if needed (again with a TokenTextSplitter and some text overlap). Then, query each resulting chunk/split against text_qa_template (there is no refine query!) and get as many answers.

If there is only one answer (because there was only one chunk), then it's the final answer.

If there is more than one answer, the answers are themselves treated as chunks and sent recursively through the tree_summarize process (concatenated/split-to-fit/queried).

Good for summarization purposes.

simple_summarize: Truncates all text chunks to fit into a single LLM prompt. Good for quick summarization purposes, but may lose detail due to truncation.

no_text: Only runs the retriever to fetch the nodes that would have been sent to the LLM, without actually sending them. The retrieved nodes can then be inspected by checking response.source_nodes.

accumulate: Given a set of text chunks and the query, apply the query to each text chunk while accumulating the responses into an array. Returns a concatenated string of all responses. Good for when you need to run the same query separately against each text chunk.

compact_accumulate: The same as accumulate, but will “compact” each LLM prompt similar to compact, and run the same query against each text chunk.
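
For reference, the response mode is simply passed when building the query engine; a minimal example with one of the modes listed above:

# Minimal sketch: select a response synthesis mode when building the query engine.
query_engine = index.as_query_engine(response_mode="tree_summarize")
response = query_engine.query("总结一下这篇文档")
print(response)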

4. Index selection

Under construction.
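
As a rough placeholder, the sketch below shows how the index type is swapped in llama-index; documents, service_context, and storage_context are assumed from the earlier examples, and which index fits which use case is still to be evaluated:

# Minimal sketch: the same documents can be loaded into different index types.
from llama_index import (
    VectorStoreIndex,
    KeywordTableIndex,
    KnowledgeGraphIndex,
)

# semantic search over embeddings
vector_index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# keyword-based lookup, no embeddings required
keyword_index = KeywordTableIndex.from_documents(documents, service_context=service_context)

# triplet-based retrieval backed by the NebulaGraph store
kg_index = KnowledgeGraphIndex.from_documents(
    documents, storage_context=storage_context, service_context=service_context
)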

III. LLM model selection

Models

We have used the text-davinci-003, text-davinci-002, and gpt-3.5-turbo models; gpt-3.5-turbo is currently in use.

Rate limits on the free tier:

MODEL            | TPM (tokens/min) | RPM (requests/min) | RPD (requests/day)
gpt-3.5-turbo    | 40,000           | 3                  | 200
text-davinci-003 | 150,000          | 3                  | 200
text-davinci-002 | 150,000          | 3                  | 200

Still being written.

Usage (local models via Xinference)

The ggml format gives reasonably fast CPU inference; the PyTorch format gives faster inference on a GPU.

Model download directory: ${USER}/.xinference/cache
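
As a hedged sketch of the GPU path (the size and quantization values below are illustrative assumptions, not tested settings), launching the same model in PyTorch format would look roughly like this:

# Minimal sketch: launch a model in pytorch format for GPU inference via Xinference.
from xinference.client import RESTfulClient

client = RESTfulClient("http://localhost:9997")
model_uid = client.launch_model(
    model_name="chatglm2",
    model_size_in_billions=6,
    model_format="pytorch",  # GPU-friendly format (vs. "ggmlv3" for CPU)
    quantization="none",
)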

Example code
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)  # logging.DEBUG for more verbose output

# If Xinference can not be imported, you may need to restart jupyter notebook
from llama_index import (
    ListIndex,
    TreeIndex,
    VectorStoreIndex,
    KeywordTableIndex,
    KnowledgeGraphIndex,
    SimpleDirectoryReader,
    ServiceContext,
)
from llama_index.llms import Xinference
from xinference.client import RESTfulClient
from IPython.display import Markdown, display

# Define a client to send commands to xinference
client = RESTfulClient("http://localhost:9997")

# Download and Launch a model, this may take a while the first time
model_uid = client.launch_model(
    model_name="chatglm2",
    # model_name="llama-2-chat",
    # model_name="baichuan-chat",
    model_size_in_billions=6,
    model_format="ggmlv3",
    quantization="q8_0",
    n_ctx=4096,
)

llm = Xinference(endpoint="http://localhost:9997", model_uid=model_uid)
service_context = ServiceContext.from_defaults(llm=llm)

# create index from the data
documents = SimpleDirectoryReader(input_files=["./data/大额管理识别篇.docx"]).load_data()

# change index name in the following line
index = VectorStoreIndex.from_documents(
    documents=documents, service_context=service_context
)

# ask a question and display the answer
query_engine = index.as_query_engine()

question = "什么是商业银行"

response = query_engine.query(question)
display(Markdown(f"<b>{response}</b>"))
Error: Failed to launch model, detail: baichuan-chat model can't run on Darwin system

Cause: the baichuan-chat model cannot run on macOS.

Hardware requirements: on a 4-core / 32 GB CPU-only machine inference is extremely slow (answering one question with the llama-2-chat model took 6m28s); a GPU is recommended.

Integrating a ChatGPT API wrapped by another company

Define the urllib API call
import json
import urllib.request


def get_api_response(prompt: str):
    url = 'https://api-tinystar.meinenghua.com/api/v1/chat/completions'
    params = {
        'questions': prompt,
        'prompt': "你是智能助手",
        'modelOptions': {
            'maxTokens': 512,
            'topP': 1
        },
        'disableSensitiveWordDetection': True
    }
    params = json.dumps(params)
    headers = {
        'Accept-Charset': 'utf-8',
        'Content-Type': 'application/json',
        'App-Key': '',
        'App-Secret': ''
    }
    # convert the JSON string to bytes
    data = bytes(params, 'utf8')
    req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
    res_str = urllib.request.urlopen(req).read().decode('utf-8')
    res_dict = json.loads(res_str)
    response = res_dict['response']
    print(f'---call api response: {response}')

    return response
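
The pending-plans list later in this document mentions adding retries and fault tolerance to this call; a minimal sketch of what that could look like with tenacity (an assumption, not part of the current code):

# Minimal sketch (assumption): retry the external API call with exponential backoff.
from tenacity import retry, stop_after_attempt, wait_exponential


@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
def get_api_response_with_retry(prompt: str) -> str:
    # urllib raises URLError / HTTPError on failure, which triggers a retry here
    return get_api_response(prompt)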

Custom LLM
from typing import Any

from llama_index.llms import (
    CompletionResponse,
    CompletionResponseGen,
    CustomLLM,
    LLMMetadata,
)
from llama_index.llms.base import llm_completion_callback

# set context window size
context_window = 2048
# set number of output tokens
num_output = 256


class OurLLM(CustomLLM):

    @property
    def metadata(self) -> LLMMetadata:
        """Get LLM metadata."""
        return LLMMetadata(
            # context_window=context_window,
            # num_output=num_output,
            model_name="tinystar"
        )

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        # prompt_length = len(prompt)
        # response = pipeline(prompt, max_new_tokens=num_output)[0]["generated_text"]
        # # only return newly generated tokens
        # text = response[prompt_length:]
        text = get_api_response(prompt=prompt)

        return CompletionResponse(text=text)

    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        raise NotImplementedError()
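
Item 10 of the pending plans is a streaming interface for this custom LLM. A rough sketch of how stream_complete could be filled in, under the assumption that the external API can only return the full answer at once (so the stream is simulated by yielding it in fixed-size chunks):

# Minimal sketch (assumption): pseudo-streaming on top of a non-streaming API.
class OurStreamingLLM(OurLLM):

    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        def gen() -> CompletionResponseGen:
            full_text = get_api_response(prompt=prompt)
            text_so_far = ""
            for i in range(0, len(full_text), 20):
                delta = full_text[i : i + 20]
                text_so_far += delta
                yield CompletionResponse(text=text_so_far, delta=delta)

        return gen()
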
Define the service_context
llm = OurLLM()
# use chatgpt
# from llama_index.llms import OpenAI
# llm = OpenAI(model='gpt-3.5-turbo', max_tokens=512)
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model="local",
    chunk_size=512,
)

Model comparison

                      | ChatGPT | xInference/Llama2-chat | xInference/baichuan-chat | xInference/chatGLM
Source                | ChatGPT | xInference             | xInference               | xInference
Usage                 | API     | xInference             | xInference               | xInference
Hardware requirements |         | GPU:                   | GPU:                     | GPU:
Pretraining required  |         |                        |                          |
Chinese support       |         |                        |                          |
Size                  |         | 2.87 GB                |                          | 6.64 GB
Speed                 |         |                        |                          |
Performance score     |         |                        |                          |

IV. Embedding model selection

Models

text-embedding-ada-002: obtained through the ChatGPT API endpoint https://api.openai.com/v1/embeddings

Usage: the default model in llama-index==0.7.17

service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", max_tokens=512),
    chunk_size=512
)

Usage: the default local-mode model in llama-index==0.8.3 (BAAI/bge-small-en)

BAAI/bge-small-zh: planned as the next model for llama-index==0.8.3, since its Chinese support is friendlier

Usage: llama-index==0.8.3
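
No snippet was recorded for this model yet; a minimal sketch of how it would be selected in llama-index 0.8.x (the "local:<model>" shorthand resolves to a Hugging Face embedding model) might look like:

# Minimal sketch: use the Chinese BGE embedding model via the "local:" shorthand.
from llama_index import ServiceContext
from llama_index.llms import OpenAI

service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", max_tokens=512),
    embed_model="local:BAAI/bge-small-zh",
    chunk_size=512,
)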

A model from sentence-transformers

Usage: the default local-mode model in llama-index==0.7.17

service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", max_tokens=512),
    chunk_size=512,
    embed_model="local"
)

A model from sentence-transformers (all-mpnet-base-v2)

Usage: llama-index==0.7.17

from langchain.embeddings import HuggingFaceEmbeddings
from llama_index import LangchainEmbedding, ServiceContext
from llama_index.llms import OpenAI

embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", max_tokens=512),
    chunk_size=512,
    embed_model=embed_model,
)

Model comparison

                                | text-embedding-ada-002 | BAAI/bge-small-en | BAAI/bge-small-zh | all-MiniLM-L6-v2      | all-mpnet-base-v2
Source                          | ChatGPT                |                   |                   | sentence-transformers | sentence-transformers
Usage                           | API                    |                   |                   | huggingface           | huggingface
Hardware requirements           |                        |                   |                   | CPU: 8C32G, GPU:      | CPU: 8C32G, GPU:
Vector dimension                | 1536                   |                   |                   | 384                   | 768
Max sequence length             | 8191                   |                   |                   | 256                   | 384
Size                            |                        |                   |                   | 80 MB                 | 420 MB
Speed (sentences/sec, V100 GPU) |                        |                   |                   | 14200                 | 2800
Performance score               |                        |                   |                   | 58.80                 | 63.30

Model selection

text-embedding-ada-002 has the highest theoretical performance, but it relies on the ChatGPT API, which requires a VPN environment and consumes a large number of tokens.

all-mpnet-base-v2 performs better, but it is five times the size of all-MiniLM-L6-v2 and one fifth as fast, so all-MiniLM-L6-v2 was chosen in the end.

The plan is to switch to BAAI/bge-small-zh later, as its Chinese support is friendlier.

sentence_transformers model download directory: Mac: /Users/${USER}/.cache/torch, Linux: /root/.cache/torch

BAAI model download directory: Mac: /Users/${USER}/Library/Caches/llama_index/

Performance evaluation tools:

V. Triplet extraction model selection

Models

Under construction.

Under construction.

Multilingual version

DeepKE: an open-source, deep-learning-based Chinese knowledge graph extraction framework

https://github.com/zjunlp/DeepKE

DeepKE-LLM: a knowledge graph extraction framework based on large language models

https://github.com/zjunlp/DeepKE/blob/main/example/llm/README_CN.md
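
Whichever extraction model is chosen, llama-index lets it be plugged into index construction through the kg_triplet_extract_fn hook. The sketch below is an assumption-level illustration: extract_triplets_with_local_model is a hypothetical wrapper around a model such as DeepKE or Babelscape/rebel-large, not code from this project.

# Minimal sketch: replace the LLM-based triplet extraction with a local model.
from typing import List, Tuple

from llama_index import KnowledgeGraphIndex


def extract_triplets_with_local_model(text: str) -> List[Tuple[str, str, str]]:
    # Placeholder: call DeepKE / REBEL here and return (subject, relation, object)
    # tuples; an empty list means "no triplets found in this chunk".
    return []


kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    service_context=service_context,
    kg_triplet_extract_fn=extract_triplets_with_local_model,
)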

Model comparison

                      | ChatGPT | Babelscape/rebel-large | Babelscape/mrebel-large | DeepKE | DeepKE-cnSchema | DeepKE-LLM
Source                | ChatGPT | huggingface            | huggingface             |        |                 |
Usage                 | API     | huggingface            | huggingface             |        |                 |
Hardware requirements |         | CPU: 8C32G, GPU:       | CPU: 8C32G, GPU:        |        |                 |
Pretraining required  |         |                        |                         |        |                 |
Chinese support       |         |                        |                         |        |                 |
Size                  |         | 1.63 GB                | 2.6 GB                  |        |                 |
Speed                 |         |                        |                         |        |                 |
Performance score     |         |                        |                         |        |                 |

huggingface model download directory: Mac: /Users/${USER}/.cache/huggingface/hub, Linux: /root/.cache/huggingface/hub

VI. Web application

1. Deployment

1.1 Local environment deployment

1.2 Test environment deployment

1.2.1 Preparation
  • redis

  • NebulaGraph: create the llamaindex graph space

    # Create Space 
    CREATE SPACE `llamaindex` (partition_num = 10, replica_factor = 1, charset = utf8, collate = utf8_bin, vid_type = FIXED_STRING(256));
    :sleep 20;
    USE `llamaindex`;

    # Create Tag:
    CREATE TAG `entity` ( `name` string NULL) ttl_duration = 0, ttl_col = "";

    # Create Edge:
    CREATE EDGE `relationship` ( `relationship` string NULL) ttl_duration = 0, ttl_col = "";
  • Postgres: see "Installing PostgreSQL with Docker" in the Miscellaneous section below

  • Model package migration

    Because the server cannot access the open internet, the cached models have to be uploaded manually: copy macOS ~/Library/Caches/llama_index to Linux /tmp/llama_index/.

  • Install Anaconda and Python 3.10

  • Install the front-end components

    cd frontend
    nvm use
    npm install yarn
    yarn install
1.2.2 Deployment
# 0. Switch to the conda environment
conda activate py310

# 1. Install dependencies
pip install -r ./requirements/test.txt -i https://pypi.tuna.tsinghua.edu.cn/simple/

# 2. Create the database "cookietest" in Postgres

# 3.1 Update the redis / Postgres / nebula addresses in .envs/.test
# 3.2 Point merge_production_dotenvs_in_dotenv at the test environment
vim merge_production_dotenvs_in_dotenv.py
PRODUCTION_DOTENVS_DIR = BASE_DIR / ".envs" / ".test"
# 4. Generate the .env file
python merge_production_dotenvs_in_dotenv.py

# 5. Edit the .env file
vim .env
USE_DOCKER=False

# 6. Run the database migrations
python manage.py migrate

# 7. Create a superuser
python manage.py createsuperuser

# 8. Run Django
nohup uvicorn config.asgi:application --host 0.0.0.0 > application.log 2>&1 &

# 9. Run the celery worker
nohup celery -A config.celery_app worker -P gevent -l INFO > celery.log 2>&1 &

1.2.3 Cleanup
# 1. Clear the stored embeddings
cd /root/Delphic/storage
rm -rf *

# 2. Clear the uploaded documents
cd /root/Delphic/delphic/media/documents
rm -rf *

# 3. Truncate the indexs_collection and indexs_document tables in the Postgres database

# 4. Clear the graph space
clear space llamaindex

2. Version history:

Version                  | Chat                          | Embedding              | Relation Extraction | Deployment     | Status
feat_nebula_mysql        | text-davinci-002              | text-embedding-ada-002 | text-davinci-002    | docker         | completed
feat_nebula_llama_0.7.9  | text-davinci-003              | text-embedding-ada-002 | text-davinci-003    | docker         | completed
feat_nebula_llama_0.7.17 | gpt-3.5-turbo                 | all-MiniLM-L6-v2       | gpt-3.5-turbo       | docker + local | completed
feat_nebula_llama_0.8.3  | third-party API, local models | BAAI/bge-small-zh      | third-party API     | docker + local | completed
feat_nebula_llama_0.8.13 | third-party API, local models | BAAI/bge-small-zh      | DeepKE-cnSchema     | local          | completed
feat_nebula_llama_0.9.1  | Meinenghua API, Hunyuan API   | BAAI/bge-small-zh      | Meinenghua API      | GPU server     | completed

Change log:

feat_nebula_llama_0.9.1

1. Added the Hunyuan LLM API

2. Reworked the deployment for GPU servers

3. Adapted to the optimized keyword graph-query syntax

feat_nebula_llama_0.8.13

1. Locate cited content in the source files

feat_nebula_llama_0.8.3

1. Integrated an external company's GPT API

2. Replaced the embedding model with a Chinese one

3. Translated the prompts into Chinese

feat_nebula_llama_0.7.17

1. Improved the graph-enhanced document chat feature

2. Added a graph chat feature

feat_nebula_llama_0.7.9

1. Graph-enhanced document chat feature

feat_nebula_mysql

1. Added MySQL support

feat_nebula

1. Added NebulaGraph support

3. Pending plans:

1. Replace the API-based triplet extraction with a local model (limited by hardware resources)
2. Translate the prompts into Chinese
3. Upgrade the RAG used for graph chat
4. Clean up unused model files downloaded to the local machine
5. Replace local file storage with OSS
6. Replace the vector files with a vector database
7. Fix the issue that ReAct mode returns no sources
8. Return results in Markdown format
9. Add retries, fault tolerance, and a connection pool to the custom LLM API client
10. Implement the streaming interface of the custom LLM
11. Add a document-path API so the front end can display documents
12. Storing all triplets in a single graph space causes knowledge to get mixed up
13. Raise the token limit and test the effect
14. Locate cited content in the source files

VII. Miscellaneous

1. ChatGPT account registration process

2. Installing PostgreSQL with Docker

docker pull postgres:14-alpine

docker run --name postgres-db -e TZ=PRC -e POSTGRES_USER=bbQXlhtKFPTMHhhCaplnwWQMFFpkTjpP -e POSTGRES_DB=database -e POSTGRES_PASSWORD=ooEjyQ1uDy4W7KMwZeXnLOQNdkgCuX02STBJv4YxP1o0siqCWWW4zVOrj49KWV96 -p 5432:5432 -v pgdata:/var/lib/postgresql/data -d postgres:14-alpine

docker start postgres-db
Error: cannot access '/docker-entrypoint-initdb.d/': Operation not permitted

The older Docker version does not support the standard postgres image, so postgres:14-alpine is used instead.

VIII. Problems encountered

1. Celery timezone issue

Error: Substantial drift from celery@jinmeng may mean clocks are out of sync. Current drift is 28806 seconds

settings
# Solution: set the Chinese timezone in the Django settings
TIME_ZONE = 'Asia/Shanghai'
LANGUAGE_CODE = "zh-hans"

2. Failure to reach ChatGPT from the code

3. Embedding error when deploying to the server

Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

Cause: when CUDA is in use, torch does not support fork-based multiprocessing.

# Solution 1:
torch.multiprocessing.set_start_method('spawn')
# This approach requires modifying the source around where HuggingFaceBgeEmbeddings is created

# Solution 2:
# -P gevent (coroutines) or -P solo avoids the "torch with CUDA does not support multiprocessing" error on the server;
# without it, deployment ends up with one worker process plus several child processes
nohup celery -A config.celery_app worker -P gevent -l INFO > celery.log 2>&1 &
# nohup celery -A config.celery_app worker -P solo -l INFO > celery.log 2>&1 &

Solution 2 is used for now.