Using LLMs to Create a Knowledge Graph for ICD Codes

Representing ICD codes as a knowledge graph

By Anand Subramanian and Sriram Rajkumar

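Patients visit hospitals to consult doctors and medical staff about their health problems, and their diagnoses, treatment and other medical details are recorded as electronic health records (EHRs). In the United States, for example, the share of office-based physicians using an EHR rose from 42% in 2008 to over 88% in 2021 [1], so most medical information is now stored in this format. EHRs feed a range of downstream activities such as ICD coding [2], which links medical diagnoses and conditions to standardized, unique alphanumeric codes.

Since EHRs are organized digitally as text, we can use automated methods to identify medical information and entities in the text and try to map them to the corresponding ICD codes. However, ICD coding is a complex process: it requires accurately identifying the diagnoses in the record, linking them to other related medical information and entities, and then mapping them to the correct ICD codes. The task demands strong natural language understanding (NLU) skills as well as detailed knowledge of the ICD taxonomy. It is usually performed by medical practitioners and clinical coders who have the expertise needed to assign ICD codes accurately. In this blog post, we explore how ICD codes can be represented as a knowledge graph (KG) and how that knowledge graph can be built.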

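Can we apply LLMs directly to ICD coding?

Large language models (LLMs) are increasingly used for clinical and medical applications such as medical report summarization and entity extraction. Recent studies show that LLMs perform on par with, or better than, medical practitioners and doctors on medical qualifying exams and medical question-answering tasks [3, 4].

However, ICD coding is quite different from tasks such as summarization or entity extraction. As mentioned earlier, it requires strong NLU capabilities together with a deep understanding of what each code represents. In the context of ICD coding, NLU involves tasks such as entity extraction and relation extraction, which identify how the medical entities in the text relate to one another.

LLMs are reported to perform well at entity extraction and relation extraction in zero-shot settings [5, 6, 7, 8]. However, relying solely on their internalized parametric knowledge to pick the correct ICD code is not fully dependable. The ICD taxonomy contains over 70,000 hierarchically organized codes, so an LLM must map a clinical diagnosis precisely into this large, structured output space, which is a much harder task.

In fact, given the sheer number of ICD codes, LLMs are likely to hallucinate when asked to predict ICD codes from parametric knowledge alone. We demonstrate this by prompting GPT-4o with a code description, asking it to predict the corresponding ICD code, and observing the hallucinations. In the example on the left, we prompt the LLM with the description of ICD code S46.301, "Unspecified injury of muscle, fascia and tendon of triceps, right arm". In the example on the right, we use the description of ICD code S48.11, "Complete traumatic amputation at level between right shoulder and elbow".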

GPT-4o failing to accurately identify the ICD codes (Image by Authors)

Relying on LLMs to perform ICD coding in a fully automated, end-to-end manner can therefore be challenging.

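However, this does not mean these problems are unsolvable. LLMs can be fine-tuned to encode the knowledge base into their parameters and then used as dense retrievers. Moreover, in the examples shown, the LLM predicts only the single ICD code it considers the best fit for each description. We could improve the retrieval workflow by prompting the LLM to suggest several relevant ICD codes and then applying a post-processing step to prune the predictions. In this post, we focus on how to use LLMs for ICD coding without fine-tuning them, that is, in a zero-shot manner.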
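As a rough illustration of the "suggest several codes, then prune" idea, the sketch below filters a list of LLM suggestions against a reference set of valid codes. It is only a minimal sketch: the function name `prune_predictions`, the toy reference set and the deliberately invalid code `S46.999X` are assumptions made for this example, not part of our pipeline.

```python
def prune_predictions(predicted_codes, valid_codes):
    """Keep only predicted codes that exist in the reference set, preserving order."""
    seen = set()
    kept = []
    for code in predicted_codes:
        normalized = code.strip().upper()
        if normalized in valid_codes and normalized not in seen:
            seen.add(normalized)
            kept.append(normalized)
    return kept


# Toy reference set; in practice this would be the full list of ICD codes.
valid_codes = {"S46.301", "S48.11"}

# Hypothetical LLM suggestions; "S46.999X" is made up to stand in for a hallucinated code.
llm_suggestions = ["s46.301", "S46.999X", "S48.11", "S46.301"]

print(prune_predictions(llm_suggestions, valid_codes))  # ['S46.301', 'S48.11']
```

A filter like this only removes codes that do not exist. It cannot tell whether a surviving code is the right one, which is why we look at an external knowledge lookup next.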

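We approach ICD coding by decomposing it into language parsing and an external knowledge lookup. In particular, we explore how ICD codes can be represented as a knowledge graph that can later be integrated with LLMs as external knowledge for looking up ICD codes in a RAG setup. We also demonstrate how the zero-shot capabilities of LLMs on tasks such as NER and relation extraction can be used to build this knowledge graph efficiently.

What is a knowledge graph?

What exactly is a graph? Simply put, a graph is a data structure that helps us map the connections between different entities, whether they are people, devices, or medical codes. Structurally, a graph consists of nodes and edges. Nodes represent the entities themselves, while edges represent the relationships or interactions between them.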

An example of a graph (Image by Authors)

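Graphs make it easy to visualize and analyze complex systems, such as a social network where each node is a person and each edge a friendship, or a computer network where nodes are devices and edges are the links connecting them. Overall, they are a powerful way to represent connected data and the relationships between different entities.

However, a simple graph tells us nothing about the specific type or nature of the relationships that may exist between entities. For example, given two entities representing persons A and B, a basic graph may only show that the two individuals are connected, without revealing the nature of their relationship. Is it a family tie, a professional relationship, or a friendship?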

An example of a knowledge graph (Image by Authors)

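This is where knowledge graphs come in. A knowledge graph enriches the basic graph structure by embedding semantic information, that is, additional details that define the meaning of the relationships between nodes. Unlike a simple graph, a knowledge graph can capture hierarchies, attributes, and more complex, nuanced associations between entities.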
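To make the difference concrete, here is a small, library-free sketch that contrasts a plain edge with typed nodes and labelled relations. The people, the organization and the relation names are invented purely for illustration.

```python
# A plain graph only records that two nodes are connected.
plain_edges = {("Person A", "Person B")}

# A knowledge graph adds a type to each node and a label to each relation.
node_types = {
    "Person A": "person",
    "Person B": "person",
    "Acme Clinic": "organization",
}
triples = [
    ("Person A", "colleague_of", "Person B"),
    ("Person A", "works_at", "Acme Clinic"),
    ("Person B", "works_at", "Acme Clinic"),
]


def relations_between(triples, a, b):
    """Return the relation labels that connect a and b, in either direction."""
    return sorted({rel for head, rel, tail in triples if {head, tail} == {a, b}})


print(relations_between(triples, "Person A", "Person B"))  # ['colleague_of']
```

The plain edge can only answer "are A and B connected?", whereas the triples can also answer "how are they connected?".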

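Why use KGs for ICD coding?

ICD provides a detailed framework for classifying diseases, symptoms, conditions and medical procedures in a hierarchical structure, where broad categories branch into increasingly specific subtypes. Knowledge graphs are well suited to such a structure because they can model these complex relationships. They naturally capture the parent-child structure and reflect the relationships between different codes, which makes them a good fit for representing the ICD taxonomy.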
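The parent-child structure can be sketched with a simple child-to-parent mapping. The links below are written by hand for the branch containing S46.301, assuming that each parent is obtained by dropping the last character of the code. That assumption is only for illustration and the mapping contains codes only, not descriptions.

```python
# child -> parent links for one branch of the hierarchy (hand-written for this sketch).
parent_of = {
    "S46.301": "S46.30",
    "S46.30": "S46.3",
    "S46.3": "S46",
}


def ancestors(code, parent_of):
    """Walk the parent links upward and return the chain from nearest to broadest."""
    chain = []
    while code in parent_of:
        code = parent_of[code]
        chain.append(code)
    return chain


print(ancestors("S46.301", parent_of))  # ['S46.30', 'S46.3', 'S46']
```

In a knowledge graph, each of these links becomes an edge, so moving from a specific code to its broader category is a short graph traversal.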

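Beyond capturing these hierarchical relationships, knowledge graphs offer another benefit: they link the interrelated entities represented by ICD codes. For example, a knowledge graph can depict not only the structure of the ICD codes but also the relationships among diseases, symptoms and procedures. This gives us a better picture of how a particular condition relates to its associated symptoms, treatments or risk factors.

Knowledge graphs can be integrated with language models, particularly in the context of retrieval-augmented generation (RAG) and access to external knowledge sources [9, 10]. RAG combines a language model with an external knowledge base to improve the accuracy and relevance of its responses. In this setup, the knowledge graph acts as a structured knowledge base that the language model can query during generation, yielding outputs that are more context-aware and factually grounded.

Creating the KG

How do we go about creating a knowledge graph of ICD codes? Each ICD code is typically represented by a plain-text description [2] that captures the essence of the code. Using these descriptions, we build a graph for each ICD code that captures its key relations and attributes. Once all the individual ICD code graphs have been created, we merge them into a single, unified knowledge graph representing the entire ICD coding system.

The graph creation process

Building the KG involves three main steps:

Named entity recognition (NER)
Relation extraction (RE)
Entity linking (EL)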

Representation of the graph creation process (Image by Authors). Representation inspired from “Information Extraction Pipeline” figure in article [11].

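Named entity recognition

Named entity recognition (NER) is the process of identifying and extracting entities from unstructured text. In the context of ICD coding, NER helps us locate and classify key medical terms and entities, such as diseases, symptoms and treatments, in clinical documents. For example, NER can identify "diabetes" as a disease and "chest pain" as a symptom.

NER is an important step in building the knowledge graph because it identifies the nodes of the final graph. The NER tool we use must be highly accurate, since false detections and errors add noise to the knowledge graph.

Relation extraction

Once the entities have been identified, the next step is relation extraction (RE): determining how the different entities relate to one another. In the context of ICD coding, this may include associations between medical entities such as symptoms and diseases or treatments. For example, there may be a relation between the condition "chest pain" and the possible diagnosis "myocardial infarction". This step is essential for building the edges (relations) between the nodes (entities) of our graph.

Entity linking

Entity linking makes the identified entities and relations unambiguous and consistent by connecting them to specific concepts. In medicine, many terms have synonyms or variants, such as "heart attack" and "myocardial infarction". Entity linking aligns these terms to a single unified concept. This step helps keep the graph consistent and allows for more accurate queries.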
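A toy version of this step can be written as a lookup from surface forms to a canonical concept. The table and the function name `link_entity` are hypothetical; a real linker would rely on a medical vocabulary rather than a hand-written dictionary.

```python
# Hypothetical surface-form -> concept table, written by hand for this sketch.
concept_of = {
    "heart attack": "Myocardial Infarction",
    "myocardial infarction": "Myocardial Infarction",
}


def link_entity(mention, concept_of):
    """Map a mention to its canonical concept, or return it unchanged if unknown."""
    return concept_of.get(mention.strip().lower(), mention)


mentions = ["Heart attack", "myocardial infarction", "chest pain"]
linked = [link_entity(m, concept_of) for m in mentions]
print(linked)  # ['Myocardial Infarction', 'Myocardial Infarction', 'chest pain']
```

Both synonyms end up as the same node, so any edge attached to one of them is reachable from the other.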

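Storing and querying the graph

Once the graph is created, it needs to be stored in a database that handles graph data efficiently and is easy to query. We can use a graph database, which is designed specifically to store nodes and edges representing entities and their relationships. This makes it straightforward to query and retrieve information from the graph.

Building the ICD graph

The code and resources for this implementation can be found in this GitHub repository.

How do we use LLMs to create the knowledge graph?

In the initial stages of graph creation, we use LLMs to identify named entities and the relations between them. For entity detection, we observed that the models in libraries such as Scispacy [12] do not fully cover the entity types we need to extract to build the graph. Scispacy's default entity extraction model only detects all medical entities in the text and does not provide their types. We use LLMs for these tasks because they let us flexibly describe the specific entities we want to extract and link, without any dedicated training.

Named entity recognition

The first step is to decide which specific entity types need to be extracted. Categories such as medical conditions, body parts and symptoms are obvious candidates, but it is also important to consider other entity types.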

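To do this, we first randomly sample a large batch of ICD descriptions and prompt the LLM to propose a comprehensive set of relevant entity types to extract. Based on this analysis, we identified 12 main entity types:

Condition
Body part
Severity
Encounter type
Cause
Laterality
Person
Fetus
Procedure
Complication
Trimester
Other information

These entity types cover the range of medical entities found in ICD code descriptions. An explanation of what each entity type represents can be found in the prompt used for NER and RE, provided below.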
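The sampling step that led to these entity types could look roughly like the sketch below. The function name `build_entity_type_prompt` and the wording of the prompt are assumptions for illustration, not the prompt we actually used. The two descriptions are taken from the few-shot examples in the extraction prompt.

```python
import random


def build_entity_type_prompt(descriptions, sample_size=5, seed=0):
    """Sample ICD descriptions and wrap them in a prompt asking for candidate entity types."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(descriptions, min(sample_size, len(descriptions)))
    bullet_list = "\n".join(f"- {d}" for d in sample)
    return (
        "Below are descriptions of ICD codes.\n"
        "Propose a comprehensive set of entity types that should be extracted from them, "
        "with a one-line definition for each type.\n\n"
        f"{bullet_list}"
    )


descriptions = [
    "Laceration with foreign body of right breast, initial encounter",
    "Maternal care for hydrops fetalis, second trimester, fetus 2",
]
print(build_entity_type_prompt(descriptions, sample_size=2))
```

The resulting string would be sent to the LLM, and the proposed types reviewed by hand before fixing the final list.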

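Relation extraction

After identifying the entities in an ICD description, we go on to identify and link related entities. To do so, we consider all possible relations between the medical entities. In most cases, however, the main anchor is the diagnosis corresponding to the ICD code, which acts as the central entity that the other entities link to.

Framing the prompt for NER and RE

We perform NER and RE on each ICD description in a single LLM call, using GPT-4o mini. The first step is to construct an effective prompt.

Constructing the prompt requires:

Defining the entities: we need to clearly define the entity types we want to extract and give an accurate description of each category.
Providing examples: we need to include representative ICD descriptions and use few-shot prompting to illustrate what the output should look like.
A structured format: we must define a structured format in which the LLM returns the extracted output, so that we can parse the necessary information correctly.

We start by outlining the entity extraction task, specifying the entity types and giving examples. Next, we define the relation extraction task by describing the possible relations between entities, again with examples. Finally, we include four few-shot examples that demonstrate the required output format.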

prompt_relation_extraction = """You are an expert medical professional, qualified in ICD coding and medical terminology.
You are given a description of an ICD code, and your task is to extract relevant entities and construct a graph by identifying the relationships between these entities. 

Follow these detailed instructions:

Step 1: Extract Entities
Identify and extract the following entity types from the provided ICD code description.

Instructions: 
1. Do not output any placeholders for entity types that are not mentioned in the description.
2. You must separate entities of the same type with ||.
3. Strictly do not output content that is not present in the description.

Entity Types:
1. condition: Identify the medical conditions and/or injuries described by the ICD code.
2. bodypart: Identify any specific body parts mentioned affected by the conditions.
3. severity: Determine the severity or degree or stage of the conditions if mentioned (e.g., first degree, second degree, third degree, mild, moderate, severe, type I, type II).
4. encounter_type: Identify the type of medical encounter described (e.g., initial encounter, subsequent encounter, sequela).
5. cause: Extract the cause or reasons for the conditions or injuries (e.g., foreign object, poisoning, burn, fall, collision).
6. laterality: Identify if a specific side of the body is affected (e.g., left, right, unspecified).
7. person: Represents the person affected (e.g., unspecified person, suspect, bystander).
8. fetus: Represents the fetus affected in maternal care cases.
9. procedure: Represents any treatment or procedures associated with the conditions (e.g., surgery, prosthesis).
10. complication: Represents any complications arising from the conditions or treatments (e.g., nonunion, delayed healing).
11. trimester: Represents the trimester of pregnancy if applicable.
12. other_info: Represents any other important medical information that are not covered by the above entities.

Step 2: Construct Relationships

From the extracted entities, identify and extract the entities that are related to each other.
Some examples of relations are:
- Example: "Dislocation" (condition) affects "knee" (bodypart).
- Example: "Burn" (condition) has "third degree" (severity).
- Example: "Pre-eclampsia" (condition) occurs during "second trimester" (trimester).
- Example: "Laceration" (condition) caused by "sharp object" (cause).
- Example: "hand" (bodypart) affected is "right" (laterality).
- Example: "Injury" (condition) affects "unspecified person" (person).
- Example: "Chorioamnionitis" (condition) affects "fetus" (fetus).
- Example: "Gastric band complication" (condition) treated by "gastric band procedure" (procedure).
- Example: "Dislocation" (condition) occurs during "initial encounter" (encounter_type).
- Example: "Burn" (condition) has "delayed healing" (complication).
   
Represent each pair of related entities in the format (Entity 1||Entity 2).
Some examples of input and output are provided below:

### Input
ICD Code Description: Laceration with foreign body of right breast, initial encounter

### Output
Entities:
condition: Laceration
bodypart: breast
encounter_type: initial encounter
cause: foreign body
laterality: right

Relations:
(Laceration||foreign body)
(Laceration||breast)
(right||breast)
(Laceration||initial encounter)

### Input
ICD Code Description: Nondisplaced segmental fracture of shaft of radius, unspecified arm, subsequent encounter for open fracture type I or II with delayed healing

### Output
Entities:
condition: Nondisplaced segmental fracture||open fracture
bodypart: shaft of radius||arm
encounter_type: subsequent encounter
severity: type I||type II
complication: delayed healing
laterality: unspecified

Relations:
(Nondisplaced segmental fracture||shaft of radius)
(Nondisplaced segmental fracture||arm)
(unspecified||arm)
(Nondisplaced segmental fracture||subsequent encounter)
(subsequent encounter||open fracture)
(open fracture||type I)
(open fracture||type II)
(open fracture||delayed healing)

### Input
ICD Code Description: Displaced oblique fracture of shaft of left fibula, subsequent encounter for open fracture type IIIA, IIIB, or IIIC with malunion

### Output
Entities:
condition: Displaced oblique fracture||open fracture
bodypart: shaft||fibula
encounter_type: subsequent encounter
severity: type IIIA||type IIIB||type IIIC
complication: malunion
laterality: left

Relations:
(Displaced oblique fracture||shaft)
(shaft||fibula)
(left||fibula)
(Displaced oblique fracture||subsequent encounter)
(subsequent encounter||open fracture)
(open fracture||type IIIA)
(open fracture||type IIIB)
(open fracture||type IIIC)
(open fracture||malunion)

### Input
ICD Code Description: Maternal care for hydrops fetalis, second trimester, fetus 2

### Output
Entities:
condition: hydrops fetalis
trimester: second trimester
fetus: fetus 2

Relations:
(hydrops fetalis||second trimester)
(hydrops fetalis||fetus 2)"""

We use the simple_icd_10_cm library to fetch all ICD codes and keep the description of each leaf code:

import simple_icd_10_cm as cm
from tqdm import tqdm

codes = cm.get_all_codes()

icd_code_description = {}

for item in tqdm(codes):
    if cm.is_leaf(item):
        icd_code_description[item] = cm.get_description(item)

We instantiate the OpenAI client and start extracting entities and relations for the ICD codes with GPT-4o mini:

from openai import OpenAI

client = OpenAI(
    api_key="",
)

def get_completion(prompt, input_text, model="gpt-4o-mini", temperature=0.0):
    # `prompt` is sent as the system message and `input_text` as the user message.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": prompt},
                  {"role": "user", "content": input_text}],
        temperature=temperature,
    )
    return response.choices[0].message.content

extracted_graphs = {}

for key, value in tqdm(icd_code_description.items()):
    input_code_description = "ICD Code Description: " + value
    output = get_completion(prompt_relation_extraction, input_code_description)
    extracted_graphs[key] = output
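Since the prompt instructs the model not to output content that is absent from the description, a cheap sanity check is to flag extracted entities that never occur in the input text. The helper below is a hypothetical sketch, not part of our pipeline, and the extra entity "chest" is made up to simulate a hallucination.

```python
def unsupported_entities(description, llm_output):
    """Return entities from the 'Entities:' section that do not occur in the description."""
    text = description.lower()
    missing = []
    in_entities = False
    for raw_line in llm_output.splitlines():
        line = raw_line.strip()
        if line == "Entities:":
            in_entities = True
        elif line == "Relations:":
            in_entities = False
        elif in_entities and ":" in line:
            _, names = line.split(":", 1)
            for name in names.split("||"):
                name = name.strip()
                if name and name.lower() not in text:
                    missing.append(name)
    return missing


description = "Laceration with foreign body of right breast, initial encounter"
llm_output = """Entities:
condition: Laceration
bodypart: breast||chest
laterality: right

Relations:
(Laceration||breast)"""

print(unsupported_entities(description, llm_output))  # ['chest']
```

A plain substring test is deliberately strict: an entity such as "type II" taken from "type I or II" would also be flagged, so flagged items are better treated as candidates for review than as definite errors.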

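Entity linking

After extracting entities and relations, we normalize all entities by linking them to the Unified Medical Language System (UMLS) [13]. UMLS, developed by the US National Library of Medicine, integrates various health and biomedical vocabularies and standards, such as SNOMED CT and MeSH. Each medical concept is associated with a unique identifier called a Concept Unique Identifier (CUI).

Once an entity is linked to UMLS, we use the canonical name of the linked concept in place of the string extracted by the LLM. For linking, we use Scispacy's entity linker, which matches entities to UMLS concepts based on character 3-gram similarity. Because we have already extracted the entities, we bypass Scispacy's default entity extraction pipeline and pass our extracted entities directly to a CandidateGenerator object. This yields candidate UMLS concept IDs, from which we pick the one with the highest similarity score to our input.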
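To give an intuition for matching on character 3-grams, the sketch below scores two strings by the Jaccard overlap of their trigram sets. This is a simplified stand-in and not Scispacy's actual implementation, which, as far as we understand, works on TF-IDF vectors of character 3-grams with an approximate nearest-neighbour search. The candidate names are hypothetical.

```python
def char_trigrams(text):
    """Return the set of character 3-grams of a lower-cased, space-padded string."""
    padded = f"  {text.lower().strip()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard overlap between the character 3-gram sets of two strings."""
    grams_a, grams_b = char_trigrams(a), char_trigrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def best_match(mention, candidate_names):
    """Pick the candidate name whose trigram set overlaps most with the mention."""
    return max(candidate_names, key=lambda name: trigram_similarity(mention, name))


# Hypothetical concept names; real candidates would come from the UMLS knowledge base.
names = ["Myocardial Infarction", "Laceration", "Eyelid structure"]
print(best_match("lacerations", names))  # Laceration
```

Because the comparison works on sub-word fragments, small spelling or inflection differences still produce a high score, which is what makes this kind of matching useful for noisy mentions.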

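from scispacy.candidate_generation import CandidateGenerator
from scispacy.linking import EntityLinker

candidate_generator = CandidateGenerator(name="umls")
entity_linker = EntityLinker(resolve_abbreviations=True, name="umls", candidate_generator=candidate_generator)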

def get_max_similarity_concept_id(candidates):
    """
    Determines the concept ID with the maximum similarity score from a list of candidate concepts.

    Args:
    candidates (list): A list of MentionCandidate objects, each containing a 'similarities' list and a 'concept_id'.

    Returns:
    any: The concept ID associated with the highest similarity score in the given candidates list, or None if the list is empty.
    """
    max_similarity = float('-inf')
    best_concept_id = None

    for candidate in candidates:
        candidate_max_similarity = max(candidate.similarities)
        
        if candidate_max_similarity > max_similarity:
            max_similarity = candidate_max_similarity
            best_concept_id = candidate.concept_id

    return best_concept_id

def normalize_entities(entities):
    """
    Maps each extracted entity string to the canonical name of its best-matching UMLS concept using scispaCy.

    Parameters:
    - entities (list): Entity strings extracted by the LLM.

    Returns:
    - dict: Mapping from each original entity string to its canonical UMLS name, or to itself when no candidate is found.
    """
    candidates = candidate_generator(entities, k = 5)
    normalized_entity_map = {}
    
    for entity, candidate in tqdm(zip(entities, candidates), total=len(entities)):
        most_likely_cui = get_max_similarity_concept_id(candidate)
        if most_likely_cui is None:
            # No UMLS candidate was returned: keep the original string.
            normalized_entity_map[entity] = entity
            continue
        cui_entity = entity_linker.kb.cui_to_entity[most_likely_cui]
        normalized_entity_map[entity] = cui_entity.canonical_name
    
    return normalized_entity_map
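The snippet that follows calls an `extract_entities` helper that is not shown in this post. A minimal version, written here as an assumption about what it does, collects every entity string listed under "Entities:" in one LLM response. It splits names exactly the way `parse_entities` does later on, so that every name looked up in the normalized map has an entry.

```python
def extract_entities(llm_output):
    """Collect every entity string listed under 'Entities:' in one LLM response."""
    entity_list = []
    in_entities = False
    for raw_line in llm_output.splitlines():
        line = raw_line.strip()
        if line == "Entities:":
            in_entities = True
        elif line == "Relations:":
            in_entities = False
        elif in_entities and ":" in line:
            # Same tokenization as parse_entities: split on '||' and strip whitespace.
            _, names = line.split(":", 1)
            entity_list.extend(name.strip() for name in names.split("||"))
    return entity_list


sample_output = """Entities:
condition: hydrops fetalis
trimester: second trimester
fetus: fetus 2

Relations:
(hydrops fetalis||second trimester)
(hydrops fetalis||fetus 2)"""

print(extract_entities(sample_output))  # ['hydrops fetalis', 'second trimester', 'fetus 2']
```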

all_entities = []
for key, value in tqdm(extracted_graphs.items()):
    entity_list = extract_entities(value)
    all_entities += entity_list

all_entities = list(set(all_entities))

normalized_entity_map = normalize_entities(all_entities)

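Building the KG

After extracting entities and relations from every ICD description, we start building the KG. We first parse all entities and relations from the LLM predictions. Next, we normalize the entities by mapping them to their UMLS canonical names and add them as nodes in the graph. We then create edges between nodes that are related according to the LLM predictions.

In addition, we add a node representing the ICD code to the graph and create an edge between this node and every extracted entity, which makes later graph queries easier when we need to retrieve ICD codes. The graph is built with the NetworkX library.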

A graph constructed for a single ICD code after running our pipeline (Image by Authors)

import networkx as nx

def parse_entities(lines, normalized_entity_map):
    """
    Parses entity information from the provided lines.

    Args:
        lines (list): List of strings containing entity data in the format "EntityType: EntityName1 || EntityName2".
        normalized_entity_map (dict): Dictionary mapping original entity names to their normalized forms.

    Returns:
        tuple: A tuple containing:
            - entities (dict): A dictionary where keys are normalized entity names and values are entity types.
            - overall_entities (set): A set of all entity names encountered.
    """    
    entities = {}
    overall_entities = set()
    for line in lines:
        if ':' in line:
            entity_type, entity_names = line.split(':', 1)
            entity_type = entity_type.strip()
            entity_names = [name.strip() for name in entity_names.split("||")]
            for entity_name in entity_names:
                overall_entities.add(entity_name)
                entities[normalized_entity_map[entity_name]] = entity_type
    return entities, overall_entities

def parse_relations(lines):
    """
    Parses relation information from the provided lines.

    Args:
        lines (list): List of strings containing relation data in the format "(EntityName1 || EntityName2)".

    Returns:
        list: A list of tuples representing relations, where each tuple is a pair of related entity names.
    """    
    relations = []
    for line in lines:
        if line.startswith('(') and line.endswith(')'):
            entity_names = [e.strip() for e in line[1:-1].split('||')]
            if len(entity_names) == 2 and all(entity_names):
                relations.append((entity_names[0], entity_names[1]))
    return relations

def build_graph(input_text, icd_code, icd_description, normalized_entity_map):
    """
    Builds a graph based on entity and relation information extracted from input text.

    Args:
        input_text (str): Text containing entity and relation information.
        icd_code (str): ICD code for the graph root node.
        icd_description (str): Description of the ICD code.
        normalized_entity_map (dict): Dictionary mapping original entity names to their normalized forms.

    Returns:
        networkx.Graph: A NetworkX graph representing the entities, relations, and the ICD root node.
    """    
    lines = input_text.strip().split('\n')
    entities_section, relations_section = split_sections(lines)
    
    entities, overall_entities = parse_entities(entities_section, normalized_entity_map)
    relations = parse_relations(relations_section)
    
    G = create_graph(icd_code, icd_description, entities, overall_entities, relations, normalized_entity_map)
    return G

def split_sections(lines):
    """
    Splits the input lines into separate sections for entities and relations.

    Args:
        lines (list): List of strings containing the lines of the input text.

    Returns:
        tuple: A tuple containing:
            - entities_section (list): List of strings corresponding to the "Entities:" section.
            - relations_section (list): List of strings corresponding to the "Relations:" section.
    """
    entities_section = []
    relations_section = []
    current_section = None
    
    for line in lines:
        line = line.strip()
        if line == 'Entities:':
            current_section = entities_section
        elif line == 'Relations:':
            current_section = relations_section
        elif line and current_section is not None:
            current_section.append(line)
    
    return entities_section, relations_section

def create_graph(icd_code, description, entities, overall_entities, relations, normalized_entity_map):
    """
    Creates a NetworkX graph using the provided ICD code, entities, and relations.

    Args:
        icd_code (str): ICD code for the root node of the graph.
        description (str): Description of the ICD code.
        entities (dict): Dictionary of entities with their types.
        overall_entities (set): Set of all entity names encountered.
        relations (list): List of tuples representing relations between entities.
        normalized_entity_map (dict): Dictionary mapping original entity names to their normalized forms.

    Returns:
        networkx.Graph: A NetworkX graph representing the entities, relations, and the ICD root node.
    """    
    G = nx.Graph()
    G.add_node(icd_code, type="ICD", description = description)
    
    for entity_name, entity_type in entities.items():
        G.add_node(entity_name, type=entity_type)
        G.add_edge(entity_name, icd_code)
    
    for entity1, entity2 in relations:
        if entity1 in overall_entities and entity2 in overall_entities:
            entity1_normalized = normalized_entity_map[entity1]
            entity2_normalized = normalized_entity_map[entity2]
            G.add_edge(entity1_normalized, entity2_normalized)
    
    return G

graphs_list = []
for key, value in tqdm(extracted_graphs.items()):
    icd_description = icd_code_description[key]
    graph = build_graph(value, key, icd_description, normalized_entity_map) 
    graphs_list.append(graph)

kg = nx.compose_all(graphs_list)

After building a graph for each code, we merge all the graphs into a single larger knowledge graph that now represents all ICD codes.
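Conceptually, the merge is a union of nodes and edges in which nodes with the same name collapse into one. The library-free sketch below mimics that behaviour with adjacency sets. The node names, including the placeholder ICD nodes `CODE-A` and `CODE-B`, are invented for illustration.

```python
def compose_adjacency(graphs):
    """Union a list of adjacency dicts; nodes with the same name collapse into one."""
    merged = {}
    for graph in graphs:
        for node, neighbours in graph.items():
            merged.setdefault(node, set()).update(neighbours)
    return merged


# Two toy per-code graphs that share the entity node "Laceration".
graph_a = {
    "CODE-A": {"Laceration", "Eyelid"},
    "Laceration": {"CODE-A", "Eyelid"},
    "Eyelid": {"CODE-A", "Laceration"},
}
graph_b = {
    "CODE-B": {"Laceration", "Breast"},
    "Laceration": {"CODE-B", "Breast"},
    "Breast": {"CODE-B", "Laceration"},
}

kg_sketch = compose_adjacency([graph_a, graph_b])
codes_for_laceration = sorted(n for n in kg_sketch["Laceration"] if n.startswith("CODE-"))
print(codes_for_laceration)  # ['CODE-A', 'CODE-B']
```

Because both per-code graphs share the normalized "Laceration" node, the merged graph links that entity to both codes, which is what later makes entity-based lookups possible.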

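Indexing into a graph database

We index our graph in Neo4j [14], a popular graph database. For our use case, we use the free instance offered online by Neo4j Aura. We first connect to our instance and then index our nodes and relationships in batches using the neo4j Python package.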

from neo4j import GraphDatabase
import networkx as nx

BATCH_SIZE = 1000

uri = ""
username = ""
password = ""

def create_nodes(tx, nodes):
    """
    Creates or updates nodes in the Neo4j database.

    This function takes a list of nodes, and for each node, it either creates a new node or updates an existing one based on the node's `id`. The attributes of each node are set or updated as specified in the `attributes` field.

    Args:
        tx (neo4j.Transaction): The active Neo4j transaction.
        nodes (list): A list of dictionaries, where each dictionary represents a node with the following structure:
            - 'id': The unique identifier for the node.
            - 'attributes': A dictionary of key-value pairs representing node attributes.
    """    
    tx.run(
        """
        UNWIND $nodes AS node
        MERGE (n:Node {id: node.id})
        SET n += node.attributes
        """,
        nodes=nodes
    )

def create_relationships(tx, relationships):
    """
    Creates or updates relationships between nodes in the Neo4j database.

    This function takes a list of relationships and for each relationship, it either creates a new relationship or updates an existing one
    based on the source and target node identifiers (`source_id` and `target_id`). The relationship's attributes are set or updated
    as specified in the `attributes` field.

    Args:
        tx (neo4j.Transaction): The active Neo4j transaction.
        relationships (list): A list of dictionaries, where each dictionary represents a relationship with the following structure:
            - 'source_id': The unique identifier of the source node.
            - 'target_id': The unique identifier of the target node.
            - 'attributes': A dictionary of key-value pairs representing relationship attributes.
    """    
    tx.run(
        """
        UNWIND $relationships AS rel
        MATCH (a:Node {id: rel.source_id})
        MATCH (b:Node {id: rel.target_id})
        MERGE (a)-[r:RELATES_TO]-(b)
        SET r += rel.attributes
        """,
        relationships=relationships
    )

def create_index(tx):
    """
    Creates a unique constraint on the `id` property of the `Node` label in the Neo4j database.

    This function ensures that the `id` property of each `Node` is unique, preventing the creation of nodes
    with duplicate `id` values. It is typically used for enforcing data integrity and speeding up lookup operations.

    Args:
        tx (neo4j.Transaction): The active Neo4j transaction.
    """    
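    # IF NOT EXISTS makes the statement safe to re-run when the constraint is already present.
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (n:Node) REQUIRE n.id IS UNIQUE")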

driver = GraphDatabase.driver(uri, auth=(username, password))

with driver.session() as session:
    # execute_write replaces write_transaction, which is deprecated in version 5 of the neo4j Python driver.
    session.execute_write(create_index)

    nodes = [
        {'id': node_id, 'attributes': attributes}
        for node_id, attributes in kg.nodes(data=True)
    ]

    relationships = [
        {
            'source_id': source_id,
            'target_id': target_id,
            'attributes': attributes
        }
        for source_id, target_id, attributes in kg.edges(data=True)
    ]

    # Write nodes first so that the MATCH clauses in create_relationships can find them.
    for i in range(0, len(nodes), BATCH_SIZE):
        batch = nodes[i:i+BATCH_SIZE]
        session.execute_write(create_nodes, batch)

    for i in range(0, len(relationships), BATCH_SIZE):
        batch = relationships[i:i+BATCH_SIZE]
        session.execute_write(create_relationships, batch)
        
driver.close()

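Querying the graph

With our graph indexed in Neo4j, we can start querying it with Cypher [15], Neo4j's declarative query language for interacting with graph data. Cypher provides a powerful and flexible way to query our graph.

In our first example, we query for all ICD codes related to lacerations that may occur on the right eyelid. Here, "laceration" is the condition, "eyelid" the body part, and "right" the laterality. The relations link the condition to the body part and the laterality to the body part. We encode these constraints in a Cypher query, which lets us retrieve the relevant results.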

A query for ICD codes linked with Lacerations that occur on right eyelids. (Image by Authors)

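The query successfully retrieves the ICD codes that satisfy the specified conditions. We also try a more permissive Cypher query that retrieves all ICD codes linked to the condition "laceration". Relaxing the constraints in this way captures a broader range of relevant ICD codes, giving multiple matches for the condition.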

A query for all ICD codes which are linked to Lacerations. (Image by Authors)

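By adding specific constraints between entities, we can write more targeted and precise queries. Cypher also offers flexible features such as free-text search (for example, with the CONTAINS keyword), case-insensitive matching, and OPTIONAL MATCH to handle different cases.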
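Since the queries above appear only as images, here is a hedged sketch of how they could be assembled in Python. It assumes the schema produced by the indexing code (label `Node`, properties `id`, `type` and `description`, relationship `RELATES_TO`). The function name `build_icd_query` and the exact query shape are our assumptions and may differ from the queries in the screenshots. Matching uses CONTAINS on lower-cased names because the canonical UMLS names stored in `id` are not known in advance.

```python
def build_icd_query(condition, bodypart=None, laterality=None):
    """Return a (cypher, params) pair; bodypart and laterality are optional constraints."""
    match = ["(icd:Node {type: 'ICD'})-[:RELATES_TO]-(c:Node {type: 'condition'})"]
    where = ["toLower(c.id) CONTAINS $condition"]
    params = {"condition": condition.lower()}

    if bodypart is not None:
        # The body part must be linked to both the condition and the ICD node.
        match.append("(c)-[:RELATES_TO]-(b:Node {type: 'bodypart'})")
        match.append("(icd)-[:RELATES_TO]-(b)")
        where.append("toLower(b.id) CONTAINS $bodypart")
        params["bodypart"] = bodypart.lower()

    if laterality is not None:
        match.append("(icd)-[:RELATES_TO]-(l:Node {type: 'laterality'})")
        if bodypart is not None:
            # Laterality qualifies the body part, mirroring pairs such as (right||breast).
            match.append("(l)-[:RELATES_TO]-(b)")
        where.append("toLower(l.id) CONTAINS $laterality")
        params["laterality"] = laterality.lower()

    cypher = (
        "MATCH " + ",\n      ".join(match)
        + "\nWHERE " + " AND ".join(where)
        + "\nRETURN DISTINCT icd.id AS code, icd.description AS description"
    )
    return cypher, params


# Strict query: lacerations of the right eyelid.
strict_query, strict_params = build_icd_query("laceration", bodypart="eyelid", laterality="right")
# Relaxed query: every ICD code linked to a laceration.
relaxed_query, relaxed_params = build_icd_query("laceration")
print(strict_query)
```

Either pair can then be passed to the driver, for example with `session.run(strict_query, strict_params)`. Dropping the optional arguments reproduces the relaxed query that matches on the condition alone.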

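Conclusion and future work

In this blog post, we explored how ICD codes can be represented as a structured knowledge graph (KG), and how that graph can be indexed and queried. Using a knowledge graph in this way allows deeper integration with large language models (LLMs) in a retrieval-augmented generation (RAG) setup, helping us generate more factually grounded outputs. We plan to explore such an integration in upcoming work.

However, our approach has certain limitations.

For entity and relation extraction we use GPT-4o mini, a smaller LLM with performance limitations; its errors can propagate into the final knowledge graph. Addressing this would require more capable or specialized models.
Although the Scispacy entity linker is generally effective, it can introduce entity linking and normalization errors during knowledge graph creation. We also plan to explore whether other medical knowledge bases and repositories, such as PubMed articles, can be integrated into the knowledge graph creation process to include more domain knowledge.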

References

[1] https://www.healthit.gov/data/quickstats/office-based-physician-electronic-health-record-adoption

[2] https://icd.who.int/browse10/2019/zh

[3] Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., … & Natarajan, V. (2023). Large language models encode clinical knowledge. Nature, 620(7972), 172-180.

[4] Singhal, K., Tu, T., Gottweis, J., Sayres, R., Wulczyn, E., Hou, L., … & Natarajan, V. (2023). Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617.

[5] Agrawal, M., Hegselmann, S., Lang, H., Kim, Y., & Sontag, D. (2022). Large language models are few-shot clinical information extractors. arXiv preprint arXiv:2205.12689.

[6] Yan Hu, Qingyu Chen, Jingcheng Du, Xueqing Peng, Vipina Kuttichi Keloth, Xu Zuo, Yujia Zhou, Zehan Li, Xiaoqian Jiang, Zhiyong Lu, Kirk Roberts, Hua Xu. Improving large language models for clinical named entity recognition via prompt engineering. Journal of the American Medical Informatics Association, Volume 31, Issue 9, September 2024, Pages 1812-1820.

[7] Zhou, H., Li, M., Xiao, Y., Yang, H., & Zhang, R. (2023). LLM Instruction-Example Adaptive Prompting (LEAP) Framework for Clinical Relation Extraction. medRxiv: the preprint server for health sciences, 2023.12.15.23300059. https://doi.org/10.1101/2023.12.15.23300059

[8] Wadhwa, Somin, Silvio Amir, and Byron C. Wallace. "Revisiting relation extraction in the era of large language models." Proceedings of the conference. Association for Computational Linguistics. Meeting. Vol. 2023. NIH Public Access, 2023.

[9] Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., … & Larson, J. (2024). From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130.

[10] He, X., Tian, Y., Sun, Y., Chawla, N. V., Laurent, T., LeCun, Y., … & Hooi, B. (2024). G-Retriever: Retrieval-augmented generation for textual graph understanding and question answering. arXiv preprint arXiv:2402.07630.

[11] https://bratanic-tomaz.medium.com/constructing-knowledge-graphs-from-text-using-openai-functions-096a6d010c17

[12] Neumann, M., King, D., Beltagy, I., & Ammar, W. (2019, August). ScispaCy: Fast and robust models for biomedical natural language processing. In Proceedings of the 18th BioNLP Workshop and Shared Task (pp. 319-327).

[13] https://www.nlm.nih.gov/research/umls/index.html

[14] https://neo4j.com/

[15] https://neo4j.com/docs/cypher-manual/current/introduction/