Implementing ‘From Local to Global’ GraphRAG with Neo4j and LangChain: Constructing the Graph

Combine text extraction, network analysis, and LLM prompting and summarization for improved RAG accuracy.

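I am always intrigued by new approaches to implementing Retrieval-Augmented Generation (RAG) over graphs, often called GraphRAG. However, it seems that everyone has a different implementation in mind when they hear the term GraphRAG. In this blog post, we will dive deep into the "From Local to Global GraphRAG" article and implementation by Microsoft researchers. We will cover the knowledge graph construction and summarization part and leave the retrievers for the next blog post. The researchers were kind enough to provide a code repository, and they have a project page as well.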

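The approach taken in the article mentioned above is quite interesting. As far as I understand, it uses a knowledge graph as one step in a pipeline for condensing and combining information from multiple sources. Extracting entities and relationships from text is nothing new. However, the authors introduce a novel (at least to me) idea of summarizing the condensed graph structure and information back into natural language text. The pipeline begins with input text from documents, which is processed to generate a graph. The graph is then converted back into natural language text, where the generated text contains condensed information about specific entities or graph communities that was previously spread across multiple documents.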
High-level indexing pipeline as implemented in the GraphRAG paper by Microsoft — Image by author

At a very high level, the input to the GraphRAG pipeline is a set of source documents containing various information. The documents are processed using an LLM to extract structured information about the entities appearing in them, along with their relationships. This extracted structured information is then used to construct a knowledge graph.

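The advantage of a knowledge graph data representation is that it can quickly and straightforwardly combine information about particular entities from multiple documents or data sources. As mentioned, though, the knowledge graph is not the only data representation. After the knowledge graph has been constructed, the authors use a combination of graph algorithms and LLM prompting to generate natural language summaries of the communities of entities found in the knowledge graph.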

These summaries then contain condensed information about particular entities and communities that was spread across multiple data sources and documents.

For a more detailed understanding of the pipeline, we can refer to the step-by-step description provided in the original paper.

Steps in the pipeline — Image from the GraphRAG paper, licensed under CC BY 4.0

The following is a high-level summary of the pipeline that we will use to reproduce their approach using Neo4j and LangChain.

Indexing: Graph Generation

  • Source Documents to Text Chunks: Source documents are split into smaller text chunks for processing.
  • Text Chunks to Element Instances: Each text chunk is analyzed to extract entities and relationships, producing a list of tuples representing these elements.
  • Element Instances to Element Summaries: The extracted entities and relationships are summarized by the LLM into descriptive text blocks for each element.
  • Element Summaries to Graph Communities: These entity summaries form a graph, which is then partitioned into hierarchical communities using algorithms like Leiden.
  • Graph Communities to Community Summaries: Summaries of each community are generated with the LLM to understand the dataset's global topical structure and semantics.

Retrieval: Answering

  • Community Summaries to Global Answers: Community summaries are used to answer a user query by generating intermediate answers, which are then aggregated into a final global answer.

Note that my implementation was done before their code was available, so there might be slight differences in the underlying approach or the LLM prompts being used. I'll try to explain those differences as we go along.

The code is available on GitHub.

Setting Up the Neo4j Environment

We will use Neo4j as the underlying graph store. The easiest way to get started is to use a free instance of Neo4j Sandbox, which offers cloud instances of the Neo4j database with the Graph Data Science plugin installed. Alternatively, you can set up a local instance of the Neo4j database by downloading the Neo4j Desktop application and creating a local database instance. If you are using a local version, make sure to install both the APOC and GDS plugins. For production setups, you can use the paid, managed AuraDS (Data Science) instance, which provides the GDS plugin.

We start by creating a Neo4jGraph instance, which is the convenience wrapper we added to LangChain:

import os

from langchain_community.graphs import Neo4jGraph

# Connection details for the Neo4j instance (replace with your own)
os.environ["NEO4J_URI"] = "bolt://44.202.208.177:7687"
os.environ["NEO4J_USERNAME"] = "neo4j"
os.environ["NEO4J_PASSWORD"] = "mast-codes-trails"

graph = Neo4jGraph(refresh_schema=False)

Dataset

We will use a news article dataset that I created some time ago using Diffbot's API. I have uploaded it to my GitHub for easier reuse:

import pandas as pd
import tiktoken

def num_tokens_from_string(string: str, model: str = "gpt-4o") -> int:
    """Count tokens with tiktoken (assumed helper; its definition is not shown in this post)."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(string))

news = pd.read_csv(
    "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
)
news["tokens"] = [
    num_tokens_from_string(f"{row['title']} {row['text']}")
    for i, row in news.iterrows()
]
news.head()

Let's examine the first couple of rows from the dataset.

Sample rows from the dataset

We have the title and text of the articles available, along with their publishing date and a token count computed with the tiktoken library.

Text Chunking

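The text chunking step is crucial and significantly impacts downstream results. The paper's authors found that using smaller text chunks results in more entities being extracted overall.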

Number of extract entities given the size of text chunks — Image from the GraphRAG paper, licensed under CC BY 4.0

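As you can see, text chunks of 2,400 tokens yield fewer extracted entities than chunks of 600 tokens. Additionally, the authors found that LLMs might not extract all entities on the first run. For that case, they introduce a heuristic to perform the extraction multiple times. We will talk about that more in the next section.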

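However, there are always trade-offs. Using smaller text chunks can result in losing the context and coreferences of specific entities spread across a document. For example, if a document mentions "John" and "he" in separate sentences, breaking the text into smaller chunks might make it unclear that "he" refers to John. Some coreference issues can be solved with an overlapping text chunking strategy, but not all of them.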

Let's examine the size of our article texts:

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of token counts per article
sns.histplot(news["tokens"], kde=False)
plt.title('Distribution of chunk sizes')
plt.xlabel('Token count')
plt.ylabel('Frequency')
plt.show()

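The distribution of article token counts is approximately normal, with a peak at around 400 tokens. The frequency gradually increases up to this peak and then decreases symmetrically, indicating that most articles are near the 400-token mark.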

Due to this distribution, we will not perform any text chunking here, which avoids coreference issues. By default, the GraphRAG project uses chunk sizes of 300 tokens with 100 tokens of overlap.
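If your documents are longer than ours, a sliding window with overlap is one way to chunk them. The following is only a minimal sketch and not the GraphRAG project's code: chunk_tokens is a hypothetical helper that works on any pre-tokenized list, and its defaults of 300 and 100 simply mirror the numbers quoted above.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[str], chunk_size: int = 300, overlap: int = 100
) -> List[List[str]]:
    """Split a token sequence into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


tokens = [f"t{i}" for i in range(700)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

Each chunk repeats the last 100 tokens of the previous one, which is what gives a pronoun like "he" a chance to appear in the same chunk as the name it refers to.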

Extracting Nodes and Relationships

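The next step is constructing knowledge from text chunks. For this use case, we use an LLM to extract structured information in the form of nodes and relationships from the text. You can examine the LLM prompt the authors used in the paper. Their prompts allow node labels to be predefined if needed, but by default that is optional. Additionally, the extracted relationships in the original implementation don't really have a type, only a description. I imagine the reason behind this choice is to allow the LLM to extract and retain richer and more nuanced information as relationships. However, it is difficult to have a clean knowledge graph without relationship type specifications (the descriptions could go into a property).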

In our implementation, we will use the LLMGraphTransformer, which is available in the LangChain library. Instead of using pure prompt engineering, as the implementation in the paper does, the LLMGraphTransformer uses the built-in function calling support to extract structured information (structured output LLMs in LangChain). You can inspect the system prompt:

from typing import List

from langchain_community.graphs.graph_document import GraphDocument
from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0, model_name="gpt-4o")

# Ask the LLM to also extract a description for nodes and relationships
llm_transformer = LLMGraphTransformer(
    llm=llm,
    node_properties=["description"],
    relationship_properties=["description"]
)

def process_text(text: str) -> List[GraphDocument]:
    doc = Document(page_content=text)
    return llm_transformer.convert_to_graph_documents([doc])

In this example, we use GPT-4o for graph extraction. The authors specifically instruct the LLM to extract entities and relationships and their descriptions. With the LangChain implementation, you can use the node_properties and relationship_properties attributes to specify which node or relationship properties you want the LLM to extract.

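The difference with the LLMGraphTransformer implementation is that all node and relationship properties are optional, so not all nodes will have the description property. If we wanted, we could define a custom extraction that makes the description property mandatory, but we will skip that in this implementation.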

We will parallelize the requests to make the graph extraction faster and store the results in Neo4j:

from concurrent.futures import ThreadPoolExecutor, as_completed

from tqdm import tqdm

MAX_WORKERS = 10
NUM_ARTICLES = 2000
graph_documents = []

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    # Submitting all tasks and creating a list of future objects
    futures = [
        executor.submit(process_text, f"{row['title']} {row['text']}")
        for i, row in news.head(NUM_ARTICLES).iterrows()
    ]

    # Collect results as they finish, in completion order
    for future in tqdm(
        as_completed(futures), total=len(futures), desc="Processing documents"
    ):
        graph_document = future.result()
        graph_documents.extend(graph_document)

# Store extracted nodes and relationships, linked to their source documents
graph.add_graph_documents(
    graph_documents,
    baseEntityLabel=True,
    include_source=True
)

In this example, we extract graph information from 2,000 articles and store the results in Neo4j. We have extracted around 13,000 entities and 16,000 relationships. Here is an example of an extracted document in the graph.

The document (blue) points to extracted entities and relationships

It takes about 35 (+/- 5) minutes to complete the extraction and costs about $30 with GPT-4o.

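In this step, the authors introduce heuristics to decide whether to extract graph information in more than one pass. For simplicity's sake, we will only do one pass. However, if we wanted to do multiple passes, we could put the first extraction results in the conversational history and simply instruct the LLM that many entities are missing and that it should extract more, as the GraphRAG authors do.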
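The control flow of such a multi-pass extraction could look roughly like the sketch below. It is not the authors' heuristic: extract is a hypothetical callable standing in for an LLM call that receives the text plus everything found so far, and fake_extract is a toy replacement so that the loop can be run without an API key.

```python
from typing import Callable, List, Set


def extract_with_gleanings(
    text: str,
    extract: Callable[[str, List[str]], List[str]],
    max_passes: int = 3,
) -> List[str]:
    """Call `extract` repeatedly, feeding back what was already found."""
    found: List[str] = []
    seen: Set[str] = set()
    for _ in range(max_passes):
        added = False
        for item in extract(text, list(found)):
            if item not in seen:
                seen.add(item)
                found.append(item)
                added = True
        if not added:
            break  # nothing new came back, so stop early
    return found


def fake_extract(text: str, already_found: List[str]) -> List[str]:
    """Toy stand-in for an LLM: returns at most two unseen capitalized words per call."""
    candidates = [w.strip(".,") for w in text.split() if w[0].isupper()]
    return [c for c in candidates if c not in already_found][:2]


print(extract_with_gleanings("Alice met Bob and Carol in Paris.", fake_extract))
```

With a real LLM, the already-found entities would be passed as conversation history, and max_passes caps the cost of the extra calls.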

Earlier, I mentioned how vital the text chunk size is and how it affects the number of entities extracted. Since we didn't perform any additional text chunking, we can evaluate the distribution of extracted entities based on text chunk size:

entity_dist = graph.query(
    """
    MATCH (d:Document)
    RETURN d.text AS text,
           count {(d)-[:MENTIONS]->()} AS entity_count
    """
)
entity_dist_df = pd.DataFrame.from_records(entity_dist)
entity_dist_df["token_count"] = [
    num_tokens_from_string(str(el)) for el in entity_dist_df["text"]
]
# Scatter plot with regression line
sns.lmplot(
    x="token_count",
    y="entity_count",
    data=entity_dist_df,
    line_kws={"color": "red"}
)
plt.title("Entity Count vs Token Count Distribution")
plt.xlabel("Token Count")
plt.ylabel("Entity Count")
plt.show()

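The scatter plot shows that while there is a positive trend, indicated by the red line, the relationship is sublinear. Most data points cluster at lower entity counts, even as token counts increase. This indicates that the number of entities extracted does not scale proportionally with the size of the text chunks. Although some outliers exist, the general pattern shows that higher token counts do not consistently lead to higher entity counts. This supports the authors' finding that smaller text chunks lead to more extracted information.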

I also thought it would be interesting to inspect the node degree distribution of the constructed graph. The following code retrieves and visualizes the node degree distribution:

import numpy as np

degree_dist = graph.query(
    """
    MATCH (e:__Entity__)
    RETURN count {(e)-[:!MENTIONS]-()} AS node_degree
    """
)
degree_dist_df = pd.DataFrame.from_records(degree_dist)

# Calculate mean and percentiles (the 50th percentile is the median)
mean_degree = np.mean(degree_dist_df['node_degree'])
percentiles = np.percentile(degree_dist_df['node_degree'], [25, 50, 75, 90])
# Create a histogram
plt.figure(figsize=(12, 6))
sns.histplot(degree_dist_df['node_degree'], bins=50, kde=False, color='blue')
# Use a logarithmic scale for the y-axis
plt.yscale('log')
# Adding labels and title
plt.xlabel('Node Degree')
plt.ylabel('Count (log scale)')
plt.title('Node Degree Distribution')
# Add mean and percentile lines
plt.axvline(mean_degree, color='red', linestyle='dashed', linewidth=1, label=f'Mean: {mean_degree:.2f}')
plt.axvline(percentiles[0], color='purple', linestyle='dashed', linewidth=1, label=f'25th Percentile: {percentiles[0]:.2f}')
plt.axvline(percentiles[1], color='orange', linestyle='dashed', linewidth=1, label=f'50th Percentile: {percentiles[1]:.2f}')
plt.axvline(percentiles[2], color='yellow', linestyle='dashed', linewidth=1, label=f'75th Percentile: {percentiles[2]:.2f}')
plt.axvline(percentiles[3], color='brown', linestyle='dashed', linewidth=1, label=f'90th Percentile: {percentiles[3]:.2f}')
# Add legend
plt.legend()
# Show the plot
plt.show()

The node degree distribution follows a power-law pattern, indicating that most nodes have very few connections while a few nodes are highly connected. The mean degree is 2.45 and the median is 1.00, showing that at least half of the nodes have at most one connection. Most nodes (75 percent) have two or fewer connections, and 90 percent have five or fewer. This distribution is typical of many real-world networks, where a small number of hubs have many connections and most nodes have few.

Since node and relationship descriptions are not mandatory properties, we will also examine how many were extracted:

graph.query("""
MATCH (n:`__Entity__`)
RETURN "node" AS type,
       count(*) AS total_count,
       count(n.description) AS non_null_descriptions
UNION ALL
MATCH (n)-[r:!MENTIONS]->()
RETURN "relationship" AS type,
       count(*) AS total_count,
       count(r.description) AS non_null_descriptions
""")

The results show that 5,926 nodes out of 12,994 (45.6 percent) have the description property. On the other hand, only 5,569 relationships out of 15,921 (35 percent) have such a property.

Note that due to the probabilistic nature of LLMs, the numbers can vary between runs and with different source data, LLMs, and prompts.

Entity Resolution

Entity resolution (de-duplication) is crucial when constructing knowledge graphs because it ensures that each entity is uniquely and accurately represented, preventing duplicates and merging records that refer to the same real-world entity. This process is essential for maintaining data integrity and consistency within the graph. Without entity resolution, knowledge graphs would suffer from fragmented and inconsistent data, leading to errors and unreliable insights.

Potential entity duplicates

This image demonstrates how a single real-world entity might appear under slightly different names in different documents and, consequently, in our graph.

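Moreover, sparse data becomes a significant issue without entity resolution. Incomplete or partial data from various sources can result in scattered and disconnected pieces of information, making it difficult to form a coherent and comprehensive understanding of entities. Accurate entity resolution addresses this by consolidating data, filling in gaps, and creating a unified view of each entity.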

Before/after using Senzing entity resolution to connect the International Consortium of Investigative Journalists (ICIJ) offshore leaks data — Image from Paco Nathan

The left part of the visualization presents a sparse and unconnected graph. However, as shown on the right side, such a graph can become well connected with effective entity resolution.

Overall, entity resolution enhances the efficiency of data retrieval and integration, providing a cohesive view of information across different sources. It ultimately enables more effective question answering based on a reliable and complete knowledge graph.

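Unfortunately, the authors of the GraphRAG paper did not include any entity resolution code in their repo, although they mention it in their paper. One reason for leaving this code out could be that it is tough to implement a robust and well-performing entity resolution for any given domain. You can implement custom heuristics for different nodes when dealing with predefined types of nodes (when types aren't predefined, they aren't consistent enough: company, organization, business, and so on). However, if the node labels or types aren't known in advance, as in our case, this becomes an even harder problem. Nonetheless, we will implement a version of entity resolution in our project here, combining text embeddings and graph algorithms with word distance and LLMs.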
Entity resolution flow

Our process for entity resolution involves the following steps:

  1. Entities in the graph: Start with all entities within the graph.
  2. K-nearest graph: Construct a k-nearest neighbor graph, connecting similar entities based on text embeddings.
  3. Weakly Connected Components: Identify weakly connected components in the k-nearest graph, grouping entities that are likely to be similar. Add a word distance filtering step after these components have been identified.
  4. LLM evaluation: Use an LLM to evaluate these components and decide whether the entities within each component should be merged, resulting in a final decision on entity resolution (for example, merging "Silicon Valley Bank" and "Silicon_Valley_Bank" while rejecting the merge for different dates like "September 16, 2023" and "September 2, 2023").

We begin by calculating text embeddings for the name and description properties of entities. We can use the from_existing_graph method of the Neo4jVector integration in LangChain to achieve this:

from langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAIEmbeddings

# Embed the id and description of every __Entity__ node
vector = Neo4jVector.from_existing_graph(
    OpenAIEmbeddings(),
    node_label='__Entity__',
    text_node_properties=['id', 'description'],
    embedding_node_property='embedding'
)

We can use those embeddings to find potential candidates that are similar based on the cosine distance of these embeddings. We will use graph algorithms available in the Graph Data Science (GDS) library; therefore, we can use the GDS Python client for ease of use in a Pythonic way:

from graphdatascience import GraphDataScience

gds = GraphDataScience(
    os.environ["NEO4J_URI"],
    auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"])
)

If you are not familiar with the GDS library, we first have to project an in-memory graph before we can execute any graph algorithms.

Graph Data Science algorithm execution workflow

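First, the Neo4j stored graph is projected into an in-memory graph for faster processing and analysis. Next, a graph algorithm is executed on the in-memory graph. Optionally, the algorithm's results can be stored back into the Neo4j database. Learn more about it in the documentation.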

To create the k-nearest neighbor graph, we will project all entities along with their text embeddings:

G, result = gds.graph.project(
    "entities",                   # Graph name
    "__Entity__",                 # Node projection
    "*",                          # Relationship projection
    nodeProperties=["embedding"]  # Configuration parameters
)

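Now that the graph is projected under the name entities, we can execute graph algorithms. We will begin by constructing a k-nearest graph. The two most important parameters influencing how sparse or dense the k-nearest graph will be are similarityCutoff and topK. The topK parameter is the number of neighbors to find for each node, with a minimum value of 1. The similarityCutoff parameter filters out relationships with a similarity below this threshold. Here, we will use the default topK of 10 and a relatively high similarity cutoff of 0.95. A high similarity cutoff, such as 0.95, ensures that only highly similar pairs are considered matches, which reduces false positives.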

Constructing k-nearest graph and storing new relationships in the project graph

Since we want to store the results back to the projected in-memory graph instead of the knowledge graph, we will use the mutate mode of the algorithm:

similarity_threshold = 0.95

gds.knn.mutate(
    G,
    nodeProperties=['embedding'],
    mutateRelationshipType='SIMILAR',
    mutateProperty='score',
    similarityCutoff=similarity_threshold
)
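To see what topK and similarityCutoff do, here is a brute-force sketch on toy two-dimensional vectors. It only illustrates the two parameters: GDS uses an approximate algorithm rather than comparing every pair, and knn_pairs, cosine, and the toy embeddings are made up for this example.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_pairs(
    embeddings: Dict[str, List[float]],
    top_k: int = 10,
    similarity_cutoff: float = 0.95,
) -> List[Tuple[str, str, float]]:
    """Keep each node's top_k most similar neighbors, dropping pairs below the cutoff."""
    pairs = []
    for source, vec in embeddings.items():
        scored = [
            (cosine(vec, other_vec), target)
            for target, other_vec in embeddings.items()
            if target != source
        ]
        scored.sort(reverse=True)  # most similar first
        for score, target in scored[:top_k]:
            if score >= similarity_cutoff:
                pairs.append((source, target, score))
    return pairs


toy = {
    "Apple": [1.0, 0.0],
    "Apple Inc": [0.99, 0.05],
    "Banana": [0.0, 1.0],
}
for source, target, score in knn_pairs(toy, top_k=1):
    print(source, target, round(score, 3))
```

Lowering similarity_cutoff or raising top_k makes the resulting graph denser, which produces larger candidate groups in the next step.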

The next step is to identify groups of entities that are connected by the newly inferred similarity relationships. Identifying groups of connected nodes is a frequent process in network analysis, often called community detection or clustering, which involves finding subgroups of densely connected nodes. In this example, we will use the Weakly Connected Components algorithm, which helps us find parts of a graph where all nodes are connected, even if we ignore the direction of the connections.

Writing the results of WCC back to the database

We use the algorithm's write mode to store the results back to the database (stored graph):

gds.wcc.write(
    G,
    writeProperty="wcc",
    relationshipTypes=["SIMILAR"]
)
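For intuition, weakly connected components can be computed with a small union-find structure. This is a sketch of the general technique, not of how GDS implements it, and it assumes every edge endpoint is listed in nodes.

```python
from typing import Dict, Iterable, List, Tuple


def weakly_connected_components(
    nodes: List[str], edges: Iterable[Tuple[str, str]]
) -> Dict[str, int]:
    """Assign a component id to every node, ignoring edge direction."""
    parent = {n: n for n in nodes}

    def find(n: str) -> str:
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving keeps trees shallow
            n = parent[n]
        return n

    for source, target in edges:
        root_s, root_t = find(source), find(target)
        if root_s != root_t:
            parent[root_t] = root_s  # union: direction does not matter

    # Relabel roots as consecutive component ids
    ids: Dict[str, int] = {}
    return {n: ids.setdefault(find(n), len(ids)) for n in nodes}


nodes = ["Sinn Fein", "Sinn Féin", "Brighton", "Brixton", "Gephi"]
edges = [("Sinn Fein", "Sinn Féin"), ("Brixton", "Brighton")]
print(weakly_connected_components(nodes, edges))
```

Nodes that share a component id here play the same role as entities that share a wcc property value in the database.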

Text embedding comparison helps find potential duplicates, but it is only part of the entity resolution process. For example, Google and Apple are very close in the embedding space (0.96 cosine similarity using the ada-002 embedding model). The same goes for BMW and Mercedes Benz (0.97 cosine similarity). High text embedding similarity is a good start, but we can improve on it. Therefore, we will add an additional filter that only allows pairs of words with a text distance below three (meaning that only a couple of characters can differ):

word_edit_distance = 3
potential_duplicate_candidates = graph.query(
    """MATCH (e:`__Entity__`)
    WHERE size(e.id) > 3 // longer than 3 characters
    WITH e.wcc AS community, collect(e) AS nodes, count(*) AS count
    WHERE count > 1
    UNWIND nodes AS node
    // Add text distance
    WITH distinct
      [n IN nodes WHERE apoc.text.distance(toLower(node.id), toLower(n.id)) < $distance
                  OR node.id CONTAINS n.id | n.id] AS intermediate_results
    WHERE size(intermediate_results) > 1
    WITH collect(intermediate_results) AS results
    // combine groups together if they share elements
    UNWIND range(0, size(results)-1, 1) as index
    WITH results, index, results[index] as result
    WITH apoc.coll.sort(reduce(acc = result, index2 IN range(0, size(results)-1, 1) |
            CASE WHEN index <> index2 AND
                size(apoc.coll.intersection(acc, results[index2])) > 0
                THEN apoc.coll.union(acc, results[index2])
                ELSE acc
            END
    )) as combinedResult
    WITH distinct(combinedResult) as combinedResult
    // extra filtering
    WITH collect(combinedResult) as allCombinedResults
    UNWIND range(0, size(allCombinedResults)-1, 1) as combinedResultIndex
    WITH allCombinedResults[combinedResultIndex] as combinedResult, combinedResultIndex, allCombinedResults
    WHERE NOT any(x IN range(0,size(allCombinedResults)-1,1)
        WHERE x <> combinedResultIndex
        AND apoc.coll.containsAll(allCombinedResults[x], combinedResult)
    )
    RETURN combinedResult
    """, params={'distance': word_edit_distance})

This Cypher statement is slightly more involved, and its interpretation is beyond the scope of this blog post. You can always ask an LLM to interpret it.

Anthropic Claude Sonnet 3.5 — Explaining the duplicate entity determination statement

Additionally, the word distance cutoff could be a function of the length of the word instead of a single number, and the implementation could be more scalable.
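As a sketch of what a length-dependent cutoff could look like, the snippet below implements the Levenshtein edit distance (the measure that, as far as I know, apoc.text.distance computes) and a hypothetical is_candidate rule. The 0.2 ratio is an arbitrary illustration, not a tuned value.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def is_candidate(a: str, b: str, ratio: float = 0.2) -> bool:
    """Hypothetical rule: allow edits up to `ratio` of the longer name's length."""
    a, b = a.lower(), b.lower()
    cutoff = max(1, int(ratio * max(len(a), len(b))))
    return levenshtein(a, b) <= cutoff


print(is_candidate("Unreal Engine", "Unreal_Engine"))
print(is_candidate("Brighton", "Brixton"))
```

A relative cutoff is stricter for short names such as "Brighton" and "Brixton", but it still lets through pairs like "New York Jets" and "New York Mets", so a final LLM check remains useful.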

What is important is that it outputs groups of potential entities we might want to merge. Here is a list of potential nodes to merge:

 {'combinedResult': ['Sinn Fein', 'Sinn Féin']},
{'combinedResult': ['Government', 'Governments']},
{'combinedResult': ['Unreal Engine', 'Unreal_Engine']},
{'combinedResult': ['March 2016', 'March 2020', 'March 2022', 'March_2023']},
{'combinedResult': ['Humana Inc', 'Humana Inc.']},
{'combinedResult': ['New York Jets', 'New York Mets']},
{'combinedResult': ['Asia Pacific', 'Asia-Pacific', 'Asia_Pacific']},
{'combinedResult': ['Bengaluru', 'Mangaluru']},
{'combinedResult': ['U.S. Securities And Exchange Commission',
'Us Securities And Exchange Commission']},
{'combinedResult': ['Jp Morgan', 'Jpmorgan']},
{'combinedResult': ['Brighton', 'Brixton']},

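Note that our resolution works better for some node types than others. Based on a quick examination, it seems to work well for people and organizations, while it's pretty bad for dates. If we used predefined node types, we could prepare different heuristics for the various node types. In this example, we do not have predefined node labels, so we will rely on an LLM to make the final decision about whether entities should be merged.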

First, we need to formulate the LLM prompt to effectively guide and inform the final decision regarding the merging of the nodes:

system_prompt = """You are a data processing assistant. Your task is to identify duplicate entities in a list and decide which of them should be merged.
The entities might be slightly different in format or content, but essentially refer to the same thing. Use your analytical skills to determine duplicates.

Here are the rules for identifying duplicates:
1. Entities with minor typographical differences should be considered duplicates.
2. Entities with different formats but the same content should be considered duplicates.
3. Entities that refer to the same real-world object or concept, even if described differently, should be considered duplicates.
4. If it refers to different numbers, dates, or products, do not merge results
"""
user_template = """
Here is the list of entities to process:
{entities}

Please identify duplicates, merge them, and provide the merged list.
"""

I always like to use the with_structured_output method in LangChain when expecting structured data output, to avoid having to parse the outputs manually.

Here, we will define the output as a list of lists, where each inner list contains the entities that should be merged. This structure handles scenarios where, for example, the input might be [Sony, Sony Inc, Google, Google Inc]. In such cases, you would want to merge "Sony" and "Sony Inc" separately from "Google" and "Google Inc".

from typing import List, Optional

# Depending on your LangChain version, these may need to come from
# langchain_core.pydantic_v1 instead
from pydantic import BaseModel, Field


class DuplicateEntities(BaseModel):
    entities: List[str] = Field(
        description="Entities that represent the same object or real-world entity and should be merged"
    )


class Disambiguate(BaseModel):
    merge_entities: Optional[List[DuplicateEntities]] = Field(
        description="Lists of entities that represent the same object or real-world entity and should be merged"
    )


extraction_llm = ChatOpenAI(model_name="gpt-4o").with_structured_output(
    Disambiguate
)

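Next, we integrate the LLM prompt with the structured output to create a chain using LangChain Expression Language (LCEL) syntax and encapsulate it within an entity_resolution function.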

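from langchain_core.prompts import ChatPromptTemplate

# The prompt definition is not shown in this post; this assumes it simply
# pairs the system prompt with the user template defined above.
extraction_prompt = ChatPromptTemplate.from_messages(
    [("system", system_prompt), ("human", user_template)]
)

extraction_chain = extraction_prompt | extraction_llm


def entity_resolution(entities: List[str]) -> Optional[List[List[str]]]:
    result = extraction_chain.invoke({"entities": entities})
    # merge_entities is optional, so guard against None
    return [el.entities for el in result.merge_entities or []]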

We need to run all potential candidate nodes through the entity_resolution function to decide whether they should be merged. To speed up the process, we will again parallelize the LLM calls:

merged_entities = []
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    # Submitting all tasks and creating a list of future objects
    futures = [
        executor.submit(entity_resolution, el['combinedResult'])
        for el in potential_duplicate_candidates
    ]

    for future in tqdm(
        as_completed(futures), total=len(futures), desc="Processing documents"
    ):
        to_merge = future.result()
        if to_merge:
            merged_entities.extend(to_merge)

The final step of entity resolution involves taking the results from the entity_resolution LLM and writing them back to the database by merging the specified nodes:

graph.query("""
UNWIND $data AS candidates
CALL {
  WITH candidates
  MATCH (e:__Entity__) WHERE e.id IN candidates
  RETURN collect(e) AS nodes
}
CALL apoc.refactor.mergeNodes(nodes, {properties: {
    description:'combine',
    `.*`: 'discard'
}})
YIELD node
RETURN count(*)
""", params={"data": merged_entities})

This entity resolution is not perfect, but it gives us a starting point upon which we can improve. Additionally, we can improve the logic for determining which entities should be retained.

Element Summarization

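In the next step, the authors perform an element summarization step. Essentially, every node and relationship gets passed through an entity summarization prompt. The authors describe what makes their approach novel and interesting: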

"Overall, our use of rich descriptive text for homogeneous nodes in a potentially noisy graph structure is aligned with both the capabilities of LLMs and the needs of global, query-focused summarization. These qualities also differentiate our graph index from typical knowledge graphs, which rely on concise and consistent knowledge triples (subject, predicate, object) for downstream reasoning tasks."

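The idea is exciting. We still extract subject and object IDs or names from text, which allows us to link relationships to the correct entities, even when entities appear across multiple text chunks. However, the relationships aren't reduced to a single type. Instead, the relationship type is actually free-form text, which allows us to retain richer and more nuanced information.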

Additionally, the entity information is summarized using an LLM, allowing us to embed and index this information and the entities more efficiently for more accurate retrieval.

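One could argue that this richer and more nuanced information could also be retained by adding additional, possibly arbitrary, node and relationship properties. One issue with arbitrary node and relationship properties is that it could be hard to extract the information consistently, because the LLM might use different property names or focus on different details on every execution.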

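Some of these problems could be solved using predefined property names with additional type and description information. In that case, you would need a subject-matter expert to help define those properties, leaving little room for the LLM to extract any vital information outside the predefined descriptions.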

It's an exciting approach to representing richer information in a knowledge graph.

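One potential issue with the element summarization step is that it does not scale well, since it requires an LLM call for every entity and relationship in the graph. Our graph is relatively tiny, with 13,000 nodes and 16,000 relationships. Even for such a small graph, we would require 29,000 LLM calls, and each call would use a couple hundred tokens, making it quite expensive and time-intensive. Therefore, we will avoid this step here. We can still use the description properties extracted during the initial text processing.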
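A quick back-of-the-envelope calculation makes the scaling concern concrete. The tokens-per-call figure and the price below are placeholder assumptions, not measured or quoted values; plug in your own numbers.

```python
from typing import Tuple


def summarization_cost(
    nodes: int,
    relationships: int,
    tokens_per_call: int,
    price_per_million_tokens: float,
) -> Tuple[int, int, float]:
    """Return (LLM calls, total tokens, estimated cost) for one call per element."""
    llm_calls = nodes + relationships
    total_tokens = llm_calls * tokens_per_call
    cost = total_tokens * price_per_million_tokens / 1_000_000
    return llm_calls, total_tokens, cost


calls, tokens, cost = summarization_cost(
    nodes=13_000,
    relationships=16_000,
    tokens_per_call=300,           # assumption: "a couple hundred tokens"
    price_per_million_tokens=5.0,  # assumption: placeholder price in USD
)
print(f"{calls:,} calls, {tokens:,} tokens, about ${cost:.2f}")
```

The number of calls grows linearly with the size of the graph, so a graph ten times larger needs ten times as many calls.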

Constructing and Summarizing Communities

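The final step in the graph construction and indexing process involves identifying communities within the graph. In this context, a community is a group of nodes that are more densely connected to each other than to the rest of the graph, indicating a higher level of interaction or similarity. The following visualization shows an example of community detection results.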

Countries are colored based on the community they belong to

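Once these entity communities are identified with a clustering algorithm, an LLM generates a summary for each community, providing insights into their individual characteristics and relationships.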

Again, we use the Graph Data Science library. We start by projecting an in-memory graph. To follow the original article precisely, we will project the graph of entities as an undirected weighted network, where the weight represents the number of connections between two entities:

G, result = gds.graph.project(
    "communities",  # Graph name
    "__Entity__",  # Node projection
    {
        "_ALL_": {
            "type": "*",
            "orientation": "UNDIRECTED",
            "properties": {"weight": {"property": "*", "aggregation": "COUNT"}},
        }
    },
)
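Conceptually, the projection above collapses all relationships between a pair of entities, regardless of direction, into one weighted connection. The sketch below illustrates that idea in plain Python on a made-up edge list; it is not how GDS implements the projection internally.

```python
from collections import Counter

# Hypothetical directed relationships between entities; parallel and
# opposite-direction edges are deliberately included.
relationships = [
    ("A", "B"), ("B", "A"), ("A", "B"),
    ("B", "C"),
]

# Treat each relationship as undirected by sorting its endpoints, then count
# how many relationships connect each pair. The count plays the role of the
# "weight" property in the projection.
weights = Counter(tuple(sorted(pair)) for pair in relationships)

for (u, v), weight in sorted(weights.items()):
    print(u, v, weight)
```

Pairs that are mentioned together more often end up with a heavier edge, which is what the weighted variant of the community detection algorithm later takes into account.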

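The authors employed the Leiden algorithm, a hierarchical clustering method, to identify communities within the graph. One advantage of using a hierarchical community detection algorithm is the ability to examine communities at multiple levels of granularity. The authors suggest summarizing all communities at each level, providing a comprehensive understanding of the graph's structure.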

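First, we will use the Weakly Connected Components (WCC) algorithm to assess the connectivity of our graph. This algorithm identifies isolated sections of the graph, meaning it detects subsets of nodes, or components, that are connected to each other but not to the rest of the graph. These components help us understand the fragmentation within the network and identify groups of nodes that are independent of the others. WCC is vital for analyzing the overall structure and connectivity of the graph.
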
wcc = gds.wcc.stats(G)
print(f"Component count: {wcc['componentCount']}")
print(f"Component distribution: {wcc['componentDistribution']}")
# Component count: 1119
# Component distribution: {
# "min":1,
# "p5":1,
# "max":9109,
# "p999":43,
# "p99":19,
# "p1":1,
# "p10":1,
# "p90":7,
# "p50":2,
# "p25":1,
# "p75":4,
# "p95":10,
# "mean":11.3 }

The WCC algorithm results identified 1,119 distinct components. Notably, the largest component comprises 9,109 nodes, which is common in real-world networks where a single super component coexists with numerous smaller isolated components. The smallest component has one node, and the average component size is about 11.3 nodes.
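For readers who want to see what "weakly connected" means operationally, here is a minimal union-find sketch on a toy graph. It is an illustration of the concept under made-up data, not the GDS implementation.

```python
from collections import Counter

def weakly_connected_components(nodes, edges):
    """Return a dict mapping each node to a component representative."""
    parent = {node: node for node in nodes}

    def find(node):
        # Walk up to the root, halving the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # merge the two components

    return {node: find(node) for node in nodes}

# Toy graph: edge direction is ignored, and "f" is an isolated node.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]

components = weakly_connected_components(nodes, edges)
sizes = sorted(Counter(components.values()).values())
print(f"Component count: {len(sizes)}")
print(f"Component sizes: {sizes}")
```

The same kind of size distribution, computed at scale, is what the `componentDistribution` output above summarizes.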

Next, we will run the Leiden algorithm, which is also available in the GDS library, and enable the includeIntermediateCommunities parameter to return and store communities at all levels. We have also included the relationshipWeightProperty parameter to run the weighted variant of the Leiden algorithm. Using the write mode of the algorithm stores the results as a node property.

gds.leiden.write(
    G,
    writeProperty="communities",
    includeIntermediateCommunities=True,
    relationshipWeightProperty="weight",
)

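The algorithm identified five levels of communities, with the highest level (the least granular one, where communities are largest) having 1,188 communities (as opposed to 1,119 components). Here is the visualization of the communities on the last level using Gephi.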

Community structure visualization in Gephi

Visualizing more than 1,000 communities is hard; even picking a color for each one is practically impossible. However, they make for nice artistic renditions.

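Building on this, we will create a distinct node for each community and represent their hierarchical structure as an interconnected graph. Later, we will also store community summaries and other attributes as node properties.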
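Before looking at the Cypher, it can help to see the intended hierarchy on in-memory data. The following is a minimal Python sketch, assuming each entity carries a list of community IDs ordered from the most granular level upward; the sample entities and IDs are made up for illustration.

```python
# Hypothetical Leiden output: entity -> community id at each level
# (index 0 is the most granular level).
entity_communities = {
    "Alice": [12, 7, 3],
    "Bob": [12, 7, 3],
    "Carol": [15, 7, 3],
}

community_level = {}          # community node id -> level
entity_to_community = set()   # (entity, level-0 community)
community_to_parent = set()   # (community, community one level up)

for entity, communities in entity_communities.items():
    for index, community in enumerate(communities):
        # Community node ids combine level and community id, e.g. "0-12".
        current = f"{index}-{community}"
        community_level[current] = index
        if index == 0:
            # Entities attach only to their most granular community.
            entity_to_community.add((entity, current))
        else:
            # Each lower-level community points to its parent community.
            previous = f"{index - 1}-{communities[index - 1]}"
            community_to_parent.add((previous, current))

print(sorted(community_level))
print(sorted(community_to_parent))
```

The query that follows expresses the same two cases with `MERGE`: index 0 links entities to communities, and every higher index links a community to its parent.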

graph.query("""
MATCH (e:`__Entity__`)
UNWIND range(0, size(e.communities) - 1, 1) AS index
CALL {
  WITH e, index
  WITH e, index
  WHERE index = 0
  MERGE (c:`__Community__` {id: toString(index) + '-' + toString(e.communities[index])})
  ON CREATE SET c.level = index
  MERGE (e)-[:IN_COMMUNITY]->(c)
  RETURN count(*) AS count_0
}
CALL {
  WITH e, index
  WITH e, index
  WHERE index > 0
  MERGE (current:`__Community__` {id: toString(index) + '-' + toString(e.communities[index])})
  ON CREATE SET current.level = index
  MERGE (previous:`__Community__` {id: toString(index - 1) + '-' + toString(e.communities[index - 1])})
  ON CREATE SET previous.level = index - 1
  MERGE (previous)-[:IN_COMMUNITY]->(current)
  RETURN count(*) AS count_1
}
RETURN count(*)
""")

The authors also introduce a community rank, indicating the number of distinct text chunks in which the entities within the community appear:

graph.query("""
MATCH (c:__Community__)<-[:IN_COMMUNITY*]-(:__Entity__)<-[:MENTIONS]-(d:Document)
WITH c, count(distinct d) AS rank
SET c.community_rank = rank;
""")

Now let's examine a sample hierarchical structure with many intermediate communities merging at higher levels. The communities are non-overlapping, meaning that each entity belongs to exactly one community at each level.

Hierarchical community structure; communities are orange and entities are purple

The image represents a hierarchical structure resulting from the Leiden community detection algorithm. The purple nodes represent individual entities, while the orange nodes represent hierarchical communities.

The hierarchy shows the organization of these entities into various communities, with smaller communities merging into larger ones at higher levels.

Let's now examine how smaller communities merge at higher levels.

Hierarchical community structure

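The image illustrates that less connected entities, and consequently smaller communities, experience minimal changes across levels. For example, the community structure here only changes in the first two levels but remains identical for the last three. Consequently, the hierarchical levels often appear redundant for these entities, as the overall organization does not change significantly from one tier to the next.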

Let's examine the number of communities, their sizes, and the different levels in more detail:

import numpy as np
import pandas as pd

# Count the distinct entities under every community node
community_size = graph.query(
    """
MATCH (c:__Community__)<-[:IN_COMMUNITY*]-(e:__Entity__)
WITH c, count(distinct e) AS entities
RETURN split(c.id, '-')[0] AS level, entities
"""
)
community_size_df = pd.DataFrame.from_records(community_size)
percentiles_data = []
for level in community_size_df["level"].unique():
    subset = community_size_df[community_size_df["level"] == level]["entities"]
    num_communities = len(subset)
    percentiles = np.percentile(subset, [25, 50, 75, 90, 99])
    percentiles_data.append(
        [
            level,
            num_communities,
            percentiles[0],
            percentiles[1],
            percentiles[2],
            percentiles[3],
            percentiles[4],
            max(subset),
        ]
    )

# Create a DataFrame with the percentiles
percentiles_df = pd.DataFrame(
    percentiles_data,
    columns=[
        "Level",
        "Number of communities",
        "25th Percentile",
        "50th Percentile",
        "75th Percentile",
        "90th Percentile",
        "99th Percentile",
        "Max",
    ],
)
percentiles_df

Community size distribution by levels

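In the original implementation, communities on every level were summarized. In our case, that would be 8,590 communities and, consequently, 8,590 LLM calls. I would argue that, depending on the hierarchical community structure, not every level needs to be summarized. For example, the difference between the last and the next-to-last level is only four communities (1,192 vs. 1,188). Therefore, we would be creating a lot of redundant summaries. One solution is to create an implementation that can make a single summary for communities that don't change across levels; another is to collapse community hierarchies that don't change.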
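The first of those ideas can be sketched in a few lines: group community nodes by their exact member set and summarize each distinct set once. This is only an illustration on made-up membership data, not something the original implementation or this post does.

```python
# Hypothetical membership: community node id -> set of member entities.
members = {
    "0-1": {"Alice", "Bob"},
    "1-1": {"Alice", "Bob"},           # identical to 0-1
    "2-1": {"Alice", "Bob", "Carol"},  # grew at this level
    "0-2": {"Dave"},
    "1-2": {"Dave"},                   # identical to 0-2
}

# Group community ids by their exact member set; each group needs one summary.
groups = {}
for community_id, entities in members.items():
    groups.setdefault(frozenset(entities), []).append(community_id)

for entities, community_ids in groups.items():
    print(sorted(entities), "->", sorted(community_ids))

print(f"{len(members)} communities, {len(groups)} distinct summaries needed")
```

One generated summary could then be written to every community ID in the same group, which removes the redundant calls for levels that did not change.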

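Also, I am unsure whether we want to summarize communities with only one member, as they might not provide much value or information. Here, we will summarize communities on levels 0, 1, and 4. First, we need to retrieve their information from the database: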

community_info = graph.query("""
MATCH (c:`__Community__`)<-[:IN_COMMUNITY*]-(e:__Entity__)
WHERE c.level IN [0,1,4]
WITH c, collect(e) AS nodes
WHERE size(nodes) > 1
CALL apoc.path.subgraphAll(nodes[0], {
  whitelistNodes: nodes
})
YIELD relationships
RETURN c.id AS communityId,
       [n in nodes | {id: n.id, description: n.description, type: [el in labels(n) WHERE el <> '__Entity__'][0]}] AS nodes,
       [r in relationships | {start: startNode(r).id, type: type(r), end: endNode(r).id, description: r.description}] AS rels
""")

At the moment, the community information has the following structure:

{'communityId': '0-6014',
 'nodes': [{'id': 'Darrell Hughes', 'description': None, 'type': 'Person'},
  {'id': 'Chief Pilot', 'description': None, 'type': 'Person'},
  ...
  }],
 'rels': [{'start': 'Ryanair Dac',
   'description': 'Informed of the change in chief pilot',
   'type': 'INFORMED',
   'end': 'Irish Aviation Authority'},
  {'start': 'Ryanair Dac',
   'description': 'Dismissed after internal investigation found unacceptable behaviour',
   'type': 'DISMISSED',
   'end': 'Aidan Murray'},
  ...
 ]}

Now, we need to prepare an LLM prompt that generates a natural language summary based on the information provided by the elements of our community. We can take some inspiration from the prompt the researchers used.

The authors not only summarized communities but also generated findings for each of them. A finding can be defined as concise information regarding a specific event or piece of information. Here is one such example:

"summary": "Abila City Park as the central location",
"explanation": "Abila City Park is the central entity in this community, serving as the location for the POK rally. This park is the common link between all other entities, suggesting its significance in the community. The park's association with the rally could potentially lead to issues such as public disorder or conflict, depending on the nature of the rally and the reactions it provokes. [records: Entities (5), Relationships (37, 38, 39, 40)]"

My intuition suggests that extracting findings with just a single pass might not be as comprehensive as we need, much like extracting entities and relationships.

Furthermore, I haven't found any references to findings, or examples of their use, in the code of either the local or the global search retriever. As a result, we'll refrain from extracting findings in this instance. Or, as academics often put it: this exercise is left to the reader. Additionally, we've also skipped the extraction of claims, or covariate information, which looks similar to findings.

The prompt we'll use to produce the community summaries is fairly straightforward:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

community_template = """Based on the provided nodes and relationships that belong to the same graph community,
generate a natural language summary of the provided information:
{community_info}

Summary:"""  # noqa: E501

community_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Given an input triples, generate the information summary. No pre-amble.",
        ),
        ("human", community_template),
    ]
)

# `llm` is the chat model instance defined earlier in the post
community_chain = community_prompt | llm | StrOutputParser()

The only thing left is to turn the community representations into strings, which reduces the number of tokens by avoiding JSON token overhead, and to wrap the chain as a function:

def prepare_string(data):
    # One line per node: id, type, and the description when present
    nodes_str = "Nodes are:\n"
    for node in data['nodes']:
        node_id = node['id']
        node_type = node['type']
        if 'description' in node and node['description']:
            node_description = f", description: {node['description']}"
        else:
            node_description = ""
        nodes_str += f"id: {node_id}, type: {node_type}{node_description}\n"

    # One line per relationship in a Cypher-like pattern notation
    rels_str = "Relationships are:\n"
    for rel in data['rels']:
        start = rel['start']
        end = rel['end']
        rel_type = rel['type']
        if 'description' in rel and rel['description']:
            description = f", description: {rel['description']}"
        else:
            description = ""
        rels_str += f"({start})-[:{rel_type}]->({end}){description}\n"

    return nodes_str + "\n" + rels_str

def process_community(community):
    stringify_info = prepare_string(community)
    summary = community_chain.invoke({'community_info': stringify_info})
    return {"community": community['communityId'], "summary": summary}
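To get a feel for the overhead being avoided, the sketch below compares a JSON dump with a line-per-element layout for a tiny made-up community. `to_compact_text` is a hypothetical helper that mirrors the idea of the function above, and character count is used only as a rough stand-in for token count, which is an assumption.

```python
import json

# Made-up community in the same shape as the retrieved community information.
community = {
    "communityId": "0-1",
    "nodes": [
        {"id": "Alice", "description": None, "type": "Person"},
        {"id": "Acme", "description": "A hypothetical company", "type": "Organization"},
    ],
    "rels": [
        {"start": "Alice", "type": "WORKS_AT", "end": "Acme", "description": None},
    ],
}

def to_compact_text(data):
    """Hypothetical helper: one line per node and per relationship."""
    lines = ["Nodes are:"]
    for node in data["nodes"]:
        line = f"id: {node['id']}, type: {node['type']}"
        if node.get("description"):
            line += f", description: {node['description']}"
        lines.append(line)
    lines.append("")
    lines.append("Relationships are:")
    for rel in data["rels"]:
        line = f"({rel['start']})-[:{rel['type']}]->({rel['end']})"
        if rel.get("description"):
            line += f", description: {rel['description']}"
        lines.append(line)
    return "\n".join(lines)

json_text = json.dumps(community)
compact_text = to_compact_text(community)
print(len(json_text), len(compact_text))
```

Most of the savings come from dropping repeated keys, quotes, braces, and empty descriptions; for an exact figure you would count tokens with the tokenizer of the model you actually use.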

Now we can generate community summaries for the selected levels. Again, we parallelize the calls for faster execution.

from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

summaries = []
with ThreadPoolExecutor() as executor:
    futures = {executor.submit(process_community, community): community for community in community_info}

    for future in tqdm(as_completed(futures), total=len(futures), desc="Processing communities"):
        summaries.append(future.result())
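In the loop above, `future.result()` re-raises any exception from a worker, so a single failed LLM call stops the whole run. A possible variation, sketched here with a stub in place of the LLM call, collects failures separately so they can be retried; `summarize_stub` and the sample data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def summarize_stub(community):
    """Stand-in for an LLM call; fails on purpose for an empty community."""
    if not community["nodes"]:
        raise ValueError("empty community")
    return {
        "community": community["communityId"],
        "summary": f"{len(community['nodes'])} nodes",
    }

communities = [
    {"communityId": "0-1", "nodes": ["Alice", "Bob"]},
    {"communityId": "0-2", "nodes": []},
    {"communityId": "0-3", "nodes": ["Carol"]},
]

summaries, failed = [], []
with ThreadPoolExecutor(max_workers=4) as executor:
    # Map each future back to its input so failures can be identified.
    futures = {executor.submit(summarize_stub, c): c for c in communities}
    for future in as_completed(futures):
        community = futures[future]
        try:
            summaries.append(future.result())
        except Exception as error:
            failed.append((community["communityId"], str(error)))

print(sorted(s["community"] for s in summaries), failed)
```

Keeping the future-to-input mapping is what makes it possible to report which community failed, since results arrive in completion order rather than submission order.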

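One aspect I didn't mention is that the authors also address the potential issue of exceeding the context size when inputting community information. As graphs expand, the communities can grow significantly as well. In our case, the largest community comprised 545 members. Given that GPT-4o has a context size exceeding 100,000 tokens, we decided to skip this step.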
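If a community did outgrow the context window, one simple option would be to cap the input at a token budget. The sketch below is a naive truncation with a whitespace-based token estimate; both the strategy and the `fit_to_budget` helper are assumptions for illustration and are not taken from the paper or its code.

```python
def fit_to_budget(lines, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep lines in order until the (approximate) token budget is used up."""
    kept, used = [], 0
    for line in lines:
        cost = count_tokens(line)
        if used + cost > max_tokens:
            break
        kept.append(line)
        used += cost
    return kept

# Made-up relationship lines in the same layout as the community string.
lines = [
    "(A)-[:KNOWS]->(B), description: met at a conference",
    "(B)-[:WORKS_AT]->(C)",
    "(C)-[:LOCATED_IN]->(D), description: headquarters moved in 2020",
]
print(fit_to_budget(lines, max_tokens=8))
```

In practice you would sort the lines by some notion of importance before truncating and pass a real tokenizer as `count_tokens`, so that the most relevant elements survive the cut.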

As our final step, we will store the community summaries back to the database:

graph.query("""
UNWIND $data AS row
MERGE (c:__Community__ {id: row.community})
SET c.summary = row.summary
""", params={"data": summaries})

The final graph structure:

The graph now contains the original documents, the extracted entities and relationships, and the hierarchical community structure with its summaries.

Summary

The authors of the "From Local to Global" paper have done a great job of demonstrating a new approach to GraphRAG. They show how we can combine and summarize information from various documents into a hierarchical knowledge graph structure.

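One thing that isn't explicitly mentioned is that we can also integrate structured data sources into the graph; the input doesn't have to be limited to unstructured text.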

What I particularly appreciate about their extraction approach is that they capture descriptions for both nodes and relationships. The descriptions allow the LLM to retain more information than it could by reducing everything to just node IDs and relationship types.

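Additionally, they demonstrate that a single extraction pass over the text might not capture all the relevant information, and they introduce logic to perform multiple passes if necessary. The authors also present an interesting idea for performing summarization over graph communities, allowing us to embed and index condensed topical information across multiple data sources.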

In the next blog post, we will go over the local and global search retriever implementations and talk about other approaches we could implement based on the given graph structure.

As always, the code is available on GitHub.

This time, I've also uploaded the database dump so you can explore the results and experiment with different retriever options.

You can also import this dump into a forever-free Neo4j AuraDB instance, which we can use for the retrieval explorations, since we don't need Graph Data Science algorithms for those: just graph pattern matching, vector indexes, and full-text indexes.

Learn more about the Neo4j integrations with all the GenAI frameworks and practical graph algorithms in my book “Graph Algorithms for Data Science.”

2024-07-11 04:30:42 Translated from the original by AI中文站