Synthetic dataset generation techniques: generating custom sentence similarity data

Community article, published May 23, 2024

Daniel van Strien

davanstrien

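This post is part of a series on synthetic data generation techniques. You may also want to check out Awesome Synthetic (text) datasets, where I will be collecting these posts.

One of the most exciting use cases for LLMs is generating synthetic datasets that can be used to train non-LLM models. In the past, gathering enough data was one of the biggest barriers to training task-specific models, such as text classification models. LLMs can potentially help here.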

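Creating data for training and fine-tuning embedding models using an LLM?

One compelling application of synthetic data is generating data for training sentence similarity models.

"Sentence similarity is the task of determining how similar two texts are. Sentence similarity models convert input texts into vectors (embeddings) that capture semantic information and calculate how close (similar) they are to each other. This task is particularly useful for information retrieval and clustering/grouping." (Source)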
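To make "how close two embeddings are" concrete, here is a minimal sketch using cosine similarity, which is one common closeness measure (the definition quoted above does not name a specific one). The three vectors are made-up toy values, not the output of a real embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings", invented for illustration only.
# A real embedding model would typically produce hundreds of dimensions.
emb_cat = [0.9, 0.1, 0.0]
emb_kitten = [0.8, 0.2, 0.1]
emb_invoice = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(emb_cat, emb_kitten)
sim_far = cosine_similarity(emb_cat, emb_invoice)
print(f"cat vs kitten:  {sim_close:.3f}")
print(f"cat vs invoice: {sim_far:.3f}")
```

The two "related" vectors point in nearly the same direction and so score higher than the unrelated pair. Fine-tuning an embedding model on similarity data aims to move real text embeddings in the same way.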

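While there are strong open embedding models that can be used for sentence similarity tasks, there are times when additional data for fine-tuning a model can help:

When working in a domain where general-purpose models don't perform well.
When you want to optimize a model for a particular use, e.g. retrieval vs. classification.
Scale, scale, scale: you want to train a general embedding model but need more data.

For the last of these, LLMs are useful not only because they let you scale the amount of data but also because they let you control what goes into your training data. Many embedding models rely on some weak supervision from data found "in the wild". While this data allows a model to learn how to model similarity, it also contains a lot of noise. A recent paper, Improving Text Embeddings with Large Language Models, showed that generating data designed to diversify what the embedding model is exposed to can reduce the amount of data needed, compared with using a larger but noisier weakly labeled dataset.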

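What is similar?

I sometimes find discussions of the sentence similarity task frustrating because what "similarity" means is usually defined quite vaguely (sorry, this is my humanities training showing). This is one of the reasons I really like the paper Description-Based Text Similarity. In that paper, the authors describe a problem with existing approaches:

"The notion of similarity... is not explicitly defined but rather learned from vast datasets containing pairs of texts labeled as similar, often mixing various different kinds of similarity (Kaster et al., 2021; Opitz & Frank, 2022). This makes them sub-optimal for information seeking queries, as it is hard to control or predict the results of a given similarity-based query. What is a good query representation and similarity definition for a semantic-search use case?"

The approach they take in the paper is to use an LLM to generate new query sentences that aim to be "abstract descriptions of sentences", which can then be used for training alongside the sentences that instantiate them. The paper illustrates this with examples of sentences paired with their generated descriptions.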

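Generating custom sentence similarity data

While the paper focuses on the task of generating "abstract" queries for sentences, the approach can be adapted to other, more targeted similarity datasets. In the rest of this post, I'll briefly walk through some examples of how to generate this kind of data (the full notebook in the Awesome Synthetic Datasets repo contains the complete code).

Using Inference Endpoints via the huggingface_hub library

In the paper, the authors use OpenAI's GPT-3.5. In this post, we'll swap that for an open model, meta-llama/Meta-Llama-3-70B-Instruct, which we'll call via the huggingface_hub library.

First, we can import the required libraries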

from huggingface_hub import get_token
from huggingface_hub import InferenceClient

Then, we can use the InferenceClient to specify which model we want to use.

client = InferenceClient("meta-llama/Meta-Llama-3-70B-Instruct", token=get_token())

The prompts

To generate descriptions of Wikipedia sentences, the following prompt was used

# A plain string template (not an f-string): {sentence} is filled in later with .format()
wiki_prompt = """
Let's write abstract descriptions of sentences. Example:
Sentence: Pilate's role in the events leading to the crucifixion lent themselves to melodrama , even tragedy , and Pilate often has a role in medieval mystery plays .
Description: A description of a historical religious figure's involvement in a significant event and its later portrayal in art.
Note: Descriptions can differ in the level of abstraction, granularity and the part of the sentence they focus on. Some descriptions need to be abstract, while others should be concrete and detailed.
For the following sentence, write up 5 good and stand-alone, independent descriptions and 5 bad descriptions (which may be related, but are clearly wrong). Output a json file with keys 'good', 'bad'.
Sentence: {sentence}
Start your answer with a curly bracket.
"""

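Let's use this prompt to generate some descriptions. We'll use this sentence as an example

"In Greek mythology, Achilles ( ) or Achilleus () was a hero of the Trojan War who was known as being the greatest of all the Greek warriors. A central character in Homer's Iliad, he was the son of the Nereid Thetis and Peleus, king of Phthia and famous Argonaut. Achilles was raised in Phthia along his childhood companion Patroclus and received his education by the centaur Chiron. In the Iliad, he is presented as the commander of the mythical tribe of the Myrmidons."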

resp = client.text_generation(wiki_prompt.format(sentence=sentence))
print(resp)

{
"good": [
"A description of a mythological figure's background and characteristics",
"A summary of a legendary hero's life and exploits",
"A passage about a character from ancient Greek literature",
"A biographical sketch of a famous warrior from mythology",
"A description of a central character in a famous epic poem"
],
"bad": [
"A description of a real person's life",
"A summary of a historical event",
"A passage about a character from a novel",
"A biographical sketch of a king",
"A

We can see that we get roughly what we asked for in the prompt, but let's try to load it as JSON

import json

json.loads(resp)

---------------------------------------------------------------------------

JSONDecodeError                           Traceback (most recent call last)

Cell In[82], line 3
      1 import json
----> 3 json.loads(resp)


File ~/.pyenv/versions/3.11.1/lib/python3.11/json/__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    341     s = s.decode(detect_encoding(s), 'surrogatepass')
    343 if (cls is None and object_hook is None and
    344         parse_int is None and parse_float is None and
    345         parse_constant is None and object_pairs_hook is None and not kw):
--> 346     return _default_decoder.decode(s)
    347 if cls is None:
    348     cls = JSONDecoder


File ~/.pyenv/versions/3.11.1/lib/python3.11/json/decoder.py:337, in JSONDecoder.decode(self, s, _w)
    332 def decode(self, s, _w=WHITESPACE.match):
    333     """Return the Python representation of ``s`` (a ``str`` instance
    334     containing a JSON document).
    335 
    336     """
--> 337     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338     end = _w(s, end).end()
    339     if end != len(s):


File ~/.pyenv/versions/3.11.1/lib/python3.11/json/decoder.py:353, in JSONDecoder.raw_decode(self, s, idx)
    344 """Decode a JSON document from ``s`` (a ``str`` beginning with
    345 a JSON document) and return a 2-tuple of the Python
    346 representation and the index in ``s`` where the document ended.
   (...)
    350 
    351 """
    352 try:
--> 353     obj, end = self.scan_once(s, idx)
    354 except StopIteration as err:
    355     raise JSONDecodeError("Expecting value", s, err.value) from None


JSONDecodeError: Unterminated string starting at: line 14 column 1 (char 489)
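When generating many responses, it can help to catch this failure instead of letting it stop the run. The following is a minimal sketch using only the standard library; `try_parse_json` is a hypothetical helper name, and the two strings are hand-written stand-ins for a truncated and a complete model response.

```python
import json


def try_parse_json(text: str):
    """Return (parsed, None) on success or (None, error message) on failure."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as err:
        # err.msg describes the problem, err.pos is the character offset
        return None, f"{err.msg} at char {err.pos}"


# Hypothetical response cut off mid-string, similar in shape to the output above.
truncated = '{"good": ["A summary of a legendary hero\'s life"], "bad": ["A'
# Hypothetical complete response with the same structure.
complete = '{"good": ["A summary of a legendary hero\'s life"], "bad": ["A summary of a historical event"]}'

parsed, error = try_parse_json(truncated)
print(parsed, error)

parsed_ok, error_ok = try_parse_json(complete)
print(parsed_ok["bad"], error_ok)
```

A caller can then decide whether to retry the request, skip the sentence, or log the error message for later inspection.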

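Structured generation

The output above was cut off partway through a string, which is what the "Unterminated string" error is reporting. One way to help the model produce valid JSON is to increase the number of tokens it is allowed to generate. However, we can also use another approach: structured text generation. This can be used to constrain the model's output to a more specific format.

We can use structured text generation with Inference API models that are hosted with Text Generation Inference. This post won't go into how it works internally (see https://huggingface.co/docs/text-generation-inference/conceptual/guidance for a detailed guide). We'll focus only on how to use it to improve the results we get from open LLMs.

When doing structured text generation, we use something called a "grammar" to specify the format we want the output to take. There are several ways to create these grammars, and one of them is to use a Pydantic model. Pydantic is a widely used Python data validation library that can check whether data conforms to a particular format. The library was originally used mainly to validate data coming in through APIs and the like, but it can also be very useful in an LLM context.

A simple way to define our data is to create a model called `Sentences` and specify that we want two attributes: `good` and `bad`. Each attribute should be a list of strings. You'll notice that in this example the attributes are specified using standard Python types.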

from pydantic import BaseModel

class Sentences(BaseModel):
    good: list[str]
    bad: list[str]

To use this model via the huggingface_hub library, we need to pass it as a JSON Schema. Let's look at what the schema for this model looks like

schema = Sentences.model_json_schema()
schema

{'properties': {'good': {'items': {'type': 'string'},
   'title': 'Good',
   'type': 'array'},
  'bad': {'items': {'type': 'string'}, 'title': 'Bad', 'type': 'array'}},
 'required': ['good', 'bad'],
 'title': 'Sentences',
 'type': 'object'}

We can pass this schema to the client's text_generation method.

resp = client.text_generation(
    wiki_prompt.format(sentence=sentence),
    grammar={"type": "json", "value": Sentences.model_json_schema()},
    max_new_tokens=2000,
)

We can see that our response now loads as a valid JSON object

json.loads(resp)

{'bad': ["Achilles' biography",
  'A description of a person',
  'A passage about a book',
  'A story about a king',
  'A summary of a myth'],
 'good': ["A description of a mythological figure's background and character in ancient Greek literature",
  'A characterization of a legendary warrior in Greek mythology',
  'A summary of the early life and education of a hero in ancient Greek mythology',
  'A description of a central character in a famous epic poem',
  "A portrayal of a mythological hero's family and upbringing"]}

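Once the response parses, it still needs to be turned into rows a similarity model can train on. The sketch below shows one possible way to do that with the standard library only. The helper name `to_labeled_pairs`, the column names, and the 1/0 labels are assumptions made for illustration and are not taken from the paper or the notebook; the example sentence is shortened.

```python
def to_labeled_pairs(sentence: str, response: dict) -> list[dict]:
    """Flatten a {'good': [...], 'bad': [...]} response into labeled rows."""
    rows = []
    # Assumed labeling: good descriptions are positives (1), bad ones negatives (0)
    for key, label in (("good", 1), ("bad", 0)):
        descriptions = response.get(key)
        if not isinstance(descriptions, list):
            raise ValueError(f"expected a list under {key!r}")
        for description in descriptions:
            if not isinstance(description, str) or not description.strip():
                raise ValueError(f"expected non-empty strings under {key!r}")
            rows.append(
                {"sentence": sentence, "description": description.strip(), "label": label}
            )
    return rows


# Hand-written stand-in with the same shape as the parsed response above.
example_response = {
    "good": ["A description of a central character in a famous epic poem"],
    "bad": ["A story about a king"],
}
rows = to_labeled_pairs("In Greek mythology, Achilles ...", example_response)
for row in rows:
    print(row)
```

The type checks repeat what the JSON Schema already constrains, which is a deliberate extra safeguard in case a response comes from a call made without the grammar.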
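Abstract descriptions

The authors of the paper go a step further and combine the descriptions generated by the first prompt with a second prompt, which focuses on generating a more abstract representation of the sentence. We'll take a quick look at what this produces, using one of our own examples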

prompt_abstract = "Sentence: in spite of excellent pediatric health care , several educational problems could be noted in this tertiary pediatric center .\nDescription: Despite having advanced healthcare resources, certain deficiencies in education were identified at a medical center that serves children.\nA very abstract description: The provision of care at a specialized medical center was not optimal in one particular area, despite the presence of advanced resources.\nSentence: {sentence}\nDescription: {description}\nA very abstract description:"

def generate_abstract_description(sentence, description):
    return client.text_generation(
        prompt_abstract.format(sentence=sentence, description=description),
    )

description = json.loads(resp).get('good')[1]

Our original sentence and description are as follows

print(f"Sentence: {sentence}\nDescription: {description}\n")

Sentence: In Greek mythology, Achilles ( ) or Achilleus () was a hero of the Trojan War who was known as being the greatest of all the Greek warriors. A central character in Homer's Iliad, he was the son of the Nereid Thetis and Peleus, king of Phthia and famous Argonaut. Achilles was raised in Phthia along his childhood companion Patroclus and received his education by the centaur Chiron. In the Iliad, he is presented as the commander of the mythical tribe of the Myrmidons. 
Description: A characterization of a legendary hero in a famous epic poem

print(f"Abstract version: {generate_abstract_description(sentence, description)}")

Abstract version:  A figure from ancient mythology is described in terms of their family, upbringing, and role in a famous story.
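The two steps can be chained over a list of sentences. The following is a rough sketch of that loop, not the notebook's implementation: `build_records`, its parameters, and the record field names are hypothetical. The generators are passed in as plain callables so the sketch runs offline with stand-ins. In practice `describe` would wrap the structured `client.text_generation` call and `abstract` would be `generate_abstract_description`.

```python
import json
from typing import Callable


def build_records(
    sentences: list[str],
    describe: Callable[[str], str],
    abstract: Callable[[str, str], str],
    max_attempts: int = 3,
) -> list[dict]:
    """Run both generation steps per sentence, retrying when the JSON is malformed."""
    records = []
    for sentence in sentences:
        for _ in range(max_attempts):
            try:
                good = json.loads(describe(sentence))["good"]
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # malformed response: try again
            for description in good:
                records.append(
                    {
                        "sentence": sentence,
                        "description": description,
                        "abstract_description": abstract(sentence, description).strip(),
                    }
                )
            break  # this sentence is done
        # A sentence that fails every attempt is silently skipped in this sketch.
    return records


# Offline stand-ins for the two model calls (illustrative only).
def fake_describe(sentence: str) -> str:
    return json.dumps({"good": [f"A description of: {sentence}"], "bad": []})


def fake_abstract(sentence: str, description: str) -> str:
    return " An abstract take on the description. "


records = build_records(
    ["Achilles was a hero of the Trojan War."], fake_describe, fake_abstract
)
print(records)
```

Each record keeps the sentence, its description, and the more abstract description together, so the result can be saved as is or reshaped later for training.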

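Conclusion

While there are still some details to take care of if you want to generate this kind of data at scale, the examples above hopefully show how an open LLM can be used to generate more tailored data for training similarity models. While the prompts here are borrowed from the paper, they could of course be adapted to focus on generating other kinds of similarity data.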
