Create Your Own Dataset for Fine-Tuning Llama2 in Google Colab: A Step-by-Step Guide (Part 1)

**How would you make a Llama watch movies?**

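What you will learn:

How to build your own custom dataset for instruction fine-tuning Llama2
The end-to-end process from dataset construction to fine-tuning, all on your favorite Google Colab (the free version!) :)

(Read Part 2 here)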

Why fine-tune?

I could list a hundred reasons why fine-tuning an open-source LLM is the secret to great performance, but let's focus on a few popular ones:

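You want a format expert that outputs data in a specific format, such as JSON, CSV, or whatever else you need.
You want a domain expert, or a model with the latest and greatest insights.
You want to mimic ChatGPT's capabilities for free!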

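Whatever the reason, I am here to show you how to build a custom dataset for fine-tuning the Llama2-7b model. More specifically, we will make our own Llama watch the movie "Barbie"!

Before we continue: does our Llama2 know about the 2023 movie "Barbie"? We asked it: Tell me about the movie "Barbie". Llama replied:

Barbie is a movie about a little girl who dreams of becoming a ballerina. Her grandmother gives her a ballerina doll. The doll suddenly comes to life and takes Barbie on a journey to become a real ballerina.

That is not what we wanted! Because GenAI models are non-deterministic, we asked again. Here is what we got:

I'm sorry, I don't know what you are talking about. Can you be more specific?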

Building the custom dataset:

First, let's set up our environment. We need to install a few important packages in Google Colab.

!pip install langchain_openai langchain

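LangChain is an excellent framework for a wide range of LLM applications. If you are completely unfamiliar with it, check out this hands-on tutorial.

These are the imports we need. You will also need an OpenAI API key. Keys are easy to obtain and offer a quick, low-cost way to generate text.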

import pandas as pd
import time
from tqdm import tqdm
import langchain
from langchain_openai import OpenAI
import re
from typing import List, Tuple

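import os
os.environ["OPENAI_API_KEY"] = 'get-your-own-api-here'

# The generation loops below call `llm`; this assumes the default OpenAI completion model and settings
llm = OpenAI()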

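Now it is time to build the dataset. In this tutorial, we will use parts of the Wikipedia article on the movie Barbie. We need three descriptive sections: the introduction, the plot, and the cast.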

intro_description = f"""Barbie is a 2023 fantasy comedy film...""" #the full intro description
plot_description = f"""Stereotypical Barbie ("Barbie") ...""" #the full plot description
cast_description = f"""Margot Robbie as Barbie, often ...""" #the full cast description

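Typically, you need about 1,000 instruction-response pairs to build an instruction-tuned dataset. Depending on the application, however, you may need more. Here, we go with one thousand.

From the introduction section, we will get 300 pairs.
From the plot, we will get 600 pairs.
From the cast, we will get 100 pairs.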
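The split above determines how many API calls each section needs. Here is a minimal sketch of that arithmetic, assuming 5 pairs per call as in the prompt used in this tutorial. The names `PAIRS_PER_CALL`, `targets`, and `calls_needed` are illustrative and not taken from the original notebook.

```python
import math

PAIRS_PER_CALL = 5  # assumption: each prompt asks the model for 5 pairs

# Target number of pairs per section, as listed above
targets = {"introductory": 300, "plot": 600, "cast": 100}

# Calls needed per section, rounding up if a target is not a multiple of 5
calls_needed = {
    section: math.ceil(n_pairs / PAIRS_PER_CALL)
    for section, n_pairs in targets.items()
}

print(calls_needed)
print("total pairs:", sum(targets.values()))
print("total calls:", sum(calls_needed.values()))
```

Keeping the per-call count in one constant makes it easier to experiment with larger or smaller batches per prompt.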

We are now ready to design our prompt. ChatGPT likes structured, detailed prompts, and we will respect that!

focus = None #can be introductory, plot (start/middle/end) or cast
describe = None #can be intro_description, plot_description or cast_description

prompt = f"""### Instruction: Based on the {focus} information of the movie 
  "Barbie" below, generate 5 instruction-detailed response pairs.
  Make sure the Instruction-Response are in the json format:\n\n
  ### Example: {{"Instruction": "the instruction", "Response": "the response"}}\n\n
  ### Description:{describe}\n\n
  ### Response:"""

Change the "focus" and "describe" variables as needed (more on this later).
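Since the same template is reused for every section, one option is to wrap it in a small function. This is only a sketch: `build_prompt` and its `n_pairs` parameter are hypothetical names introduced here, and the wording mirrors the template above.

```python
def build_prompt(focus: str, describe: str, n_pairs: int = 5) -> str:
    """Fill the prompt template for one section of the article.

    `build_prompt` and `n_pairs` are illustrative names, not part of the
    original notebook.
    """
    return (
        f"### Instruction: Based on the {focus} information of the movie "
        f'"Barbie" below, generate {n_pairs} instruction-detailed response pairs. '
        "Make sure the Instruction-Response are in the json format:\n\n"
        # Plain string (not an f-string), so the braces are kept literally
        '### Example: {"Instruction": "the instruction", "Response": "the response"}\n\n'
        f"### Description:{describe}\n\n"
        "### Response:"
    )


# Placeholder description, used only to show the filled template
print(build_prompt("introductory", "Barbie is a 2023 fantasy comedy film..."))
```

With a helper like this, each generation loop only has to pass a different `focus` and `describe`.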

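We also need to extract the instruction-response pairs from the JSON-formatted string that the model returns. Here is a helper function for that purpose: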

def extract_instruction_response_pairs(string: str)-> Tuple[List[str], List[str]]:
    """
    Extracts pairs of instructions and responses from a JSON-formatted string.

    Parameters:
        - string (str): A string containing JSON-formatted instruction and response pairs.

    Returns:
        - instructions (list): A list of extracted instructions.
        - responses (list): A list of extracted responses corresponding to the instructions.
    """

    pattern = r'{"Instruction": "(.*?)", "Response": "(.*?)"}'

    # Use re.findall to extract matches
    matches = re.findall(pattern, string)

    # Extract lists of "Instruction" and "Response"
    instructions = [match[0] for match in matches]
    responses = [match[1] for match in matches]

    return instructions, responses
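The regex above only matches pairs that use exactly the spacing shown in the example. As a possible alternative (not the approach used in the original notebook), the sketch below scans the text with the standard library's `json.JSONDecoder.raw_decode`, which tolerates different whitespace and skips incomplete objects. The function name `extract_pairs_json` and the sample text are made up for illustration.

```python
import json
from typing import List, Tuple


def extract_pairs_json(text: str) -> Tuple[List[str], List[str]]:
    """Collect every well-formed JSON object that has both expected keys."""
    decoder = json.JSONDecoder()
    instructions: List[str] = []
    responses: List[str] = []
    pos = text.find("{")
    while pos != -1:
        try:
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # Not valid JSON starting at this brace; try the next one
            pos = text.find("{", pos + 1)
            continue
        if isinstance(obj, dict) and "Instruction" in obj and "Response" in obj:
            instructions.append(str(obj["Instruction"]))
            responses.append(str(obj["Response"]))
        pos = text.find("{", end)
    return instructions, responses


# Made-up model output: one regular pair, one without spaces, one truncated
sample = (
    '1. {"Instruction": "What kind of film is Barbie?", '
    '"Response": "Barbie is a 2023 fantasy comedy film."}\n'
    '2. {"Instruction":"Name one cast member.","Response":"Margot Robbie as Barbie."}\n'
    '3. {"Instruction": "Truncated pair", "Response": \n'
)

ins, res = extract_pairs_json(sample)
print(len(ins), "pairs extracted")
for i, r in zip(ins, res):
    print("-", i, "->", r)
```

The second pair has no space after the colons, so the strict regex would skip it. The truncated third object is dropped by both approaches.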

The following code generates 300 instruction-response pairs from the introduction section.

## Generating based on the intro section
All_instructions = []
All_reponses = []
start = time.time()
for idx in tqdm(range(60)): # 5 pairs per iteration will result in 5*60=300 pairs
  focus = "introductory"
  describe = intro_description
  prompt = f"""### Instruction: Based on the {focus} information of the movie 
  "Barbie" below, generate 5 instruction-detailed response pairs.
  Make sure the Instruction-Response are in the json format:\n\n
  ### Example: {{"Instruction": "the instruction", "Response": "the response"}}\n\n
  ### Description:{describe}\n\n
  ### Response:"""
  generated_text = llm.invoke(prompt)
  ins, res = extract_instruction_response_pairs(generated_text)
  All_instructions.extend(ins)
  All_reponses.extend(res)

print("\n\n===Time: {} seconds===".format(time.time()-start))

In a similar way, I also generated text for the plot. To be more "detailed", I further split the plot into three separate parts:

The first two paragraphs of the plot
The middle three paragraphs of the plot
The last two paragraphs of the plot

## Generating based on the plot section
focus_list = ["first 2 paragraph of plot", "middle 3 paragraph of plot",
              "last 2 paragraph of plot"]
how_many_iteration = [20, 60, 40] # we want more data from the middle section; 5*(20+60+40)=600 pairs
describe = plot_description

for focus, iteration in zip(focus_list, how_many_iteration):
  for idx in tqdm(range(iteration)):
    prompt = f"""### Instruction: Based on the {focus} information of the movie 
    "Barbie" below, generate 5 instruction-detailed response pairs.
    Make sure the Instruction-Response are in the json format:\n\n
    ### Example: {{"Instruction": "the instruction", "Response": "the response"}}\n\n
    ### Description:{describe}\n\n
    ### Response:"""
    generated_text = llm.invoke(prompt)
    ins, res = extract_instruction_response_pairs(generated_text)
    All_instructions.extend(ins)
    All_reponses.extend(res)

The code for the cast follows the same approach as the introduction section.
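A fixed number of iterations can fall short of the target whenever a response fails to parse. One possible variation, sketched here and not part of the original notebook, keeps calling until a target count is reached or a call budget runs out. `generate_until` and the fake batch generator are hypothetical stand-ins; in the notebook, the batch function would wrap the LLM call and the extraction helper.

```python
from typing import Callable, List, Tuple

Pairs = Tuple[List[str], List[str]]


def generate_until(target: int, generate_batch: Callable[[], Pairs], max_calls: int) -> Pairs:
    """Call `generate_batch` until `target` pairs are collected or `max_calls` is hit."""
    instructions: List[str] = []
    responses: List[str] = []
    calls = 0
    while len(instructions) < target and calls < max_calls:
        ins, res = generate_batch()
        instructions.extend(ins)
        responses.extend(res)
        calls += 1
    # Trim any overshoot from the final batch
    return instructions[:target], responses[:target]


def make_fake_batch() -> Callable[[], Pairs]:
    """Stand-in for 'call the LLM, then parse': every third call yields nothing."""
    state = {"calls": 0}

    def fake_batch() -> Pairs:
        state["calls"] += 1
        n = state["calls"]
        if n % 3 == 0:
            return [], []  # imitates a response that failed to parse
        return (
            [f"instruction {n}-{i}" for i in range(5)],
            [f"response {n}-{i}" for i in range(5)],
        )

    return fake_batch


ins, res = generate_until(target=20, generate_batch=make_fake_batch(), max_calls=10)
print(len(ins), len(res))
```

The `max_calls` cap matters because every extra call costs API credits, so the loop should never run unbounded.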

Finally, let's put everything together and build a pandas DataFrame.

df = pd.DataFrame({
    "Instructions": All_instructions,
    "Responses": All_reponses  # must match the list name defined in the loops above (note its spelling)
})

df.to_csv("Barbie_ChatGPT_genAI.csv", index=False)
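Because the same description is sent to the model many times, the collected lists may contain repeated or empty pairs. The following sketch shows a cleanup pass that could run on the two lists before the DataFrame is built. It is an optional extra, not a step from the original notebook, and `dedupe_pairs` plus the sample lists are made up for illustration.

```python
from typing import List, Tuple


def dedupe_pairs(instructions: List[str], responses: List[str]) -> Tuple[List[str], List[str]]:
    """Drop empty pairs and repeated instructions, keeping first occurrences in order."""
    seen = set()
    kept_ins: List[str] = []
    kept_res: List[str] = []
    for ins, res in zip(instructions, responses):
        key = ins.strip().lower()  # case-insensitive comparison of instructions
        if not key or not res.strip() or key in seen:
            continue
        seen.add(key)
        kept_ins.append(ins.strip())
        kept_res.append(res.strip())
    return kept_ins, kept_res


# Made-up example lists: one duplicate instruction and one empty instruction
raw_ins = ["What is Barbie?", "what is barbie? ", "Who plays Barbie?", ""]
raw_res = [
    "Barbie is a 2023 fantasy comedy film.",
    "A 2023 film.",
    "Margot Robbie as Barbie.",
    "orphan response",
]

clean_ins, clean_res = dedupe_pairs(raw_ins, raw_res)
print(clean_ins)
print(clean_res)
```

If the data is already in a DataFrame, pandas' `drop_duplicates` with a `subset` column does similar work, although it compares strings exactly rather than case-insensitively.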

You can download the data from Google Colab. If you have not done this before, check out this short demo.

Data: https://github.com/sadat1971/Llama2_custom_finetuning/tree/main/Data

Code: https://github.com/sadat1971/Llama2_custom_finetuning/blob/main/Barbie_QA_chatGPT.ipynb

Part 2: How to fine-tune on custom data

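Important notes:

Although we aimed for 1,000 pairs, we actually got only 954. This is due to the non-deterministic nature of LLMs. Still, a 95.4% success rate is not bad!
The total OpenAI API cost for this tutorial was only $0.27 (yes, 27 cents!).
The generation took about 6 minutes.
You can tune OpenAI's temperature and top-p, as well as parts of the prompt structure. This is a good read on the topic.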
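As a rough illustration of the last note, the sketch below packages sampling settings and passes them to the LangChain `OpenAI` wrapper. The helper names `sampling_kwargs` and `make_llm` are hypothetical, and the values are illustrative, not settings from the original notebook. The range checks follow the OpenAI API's documented bounds (temperature 0 to 2, top-p 0 to 1).

```python
def sampling_kwargs(temperature: float = 0.7, top_p: float = 1.0) -> dict:
    """Validate sampling settings and return them as keyword arguments."""
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0 and 2")
    if not 0.0 <= top_p <= 1.0:
        raise ValueError("top_p must be between 0 and 1")
    return {"temperature": temperature, "top_p": top_p}


def make_llm(temperature: float = 0.7, top_p: float = 1.0):
    """Build a LangChain OpenAI LLM with the given sampling settings."""
    # Imported here so the validation helper works without the package installed
    from langchain_openai import OpenAI

    return OpenAI(**sampling_kwargs(temperature, top_p))


# Higher temperature tends to give more varied pairs; lower gives more repeatable ones
print(sampling_kwargs(temperature=0.9, top_p=0.95))
```

Calling `make_llm(temperature=0.9, top_p=0.95)` requires `langchain_openai` to be installed and the `OPENAI_API_KEY` environment variable to be set.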
