Information Extraction with Haystack and NuExtract


Author: Stefano Fiorucci

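In this notebook, we will see how to use Language Models to automate information extraction from textual data.

🎯 Goal: create an application that extracts specific information from a given text or URL, following a user-defined structure.

🧰 Stack

Haystack 🏗️: a customizable orchestration framework for building LLM applications. We will use Haystack to build the information extraction pipeline.
NuExtract: a small Language Model, specifically fine-tuned for structured data extraction.

Install dependencies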

! pip install haystack-ai trafilatura transformers pyvis

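Components

Haystack has two main concepts: Components and Pipelines.

🧩 Components are building blocks that perform a single task: file conversion, text generation, embedding creation...

➿ Pipelines let you define the flow of data through your LLM application by combining Components into a directed (cyclic) graph.

We will now introduce the various components of our information extraction application. Afterwards, we will integrate them into a Pipeline.

LinkContentFetcher and HTMLToDocument: extract text from web pages

In our experiment, we will extract data from startup funding announcements found on the web.

To download web pages and extract their text, we use two components:

LinkContentFetcher: fetches the content of some URLs and returns a list of content streams (as ByteStream objects).
HTMLToDocument: converts HTML sources into textual Documents.
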
>>> from haystack.components.fetchers import LinkContentFetcher
>>> from haystack.components.converters import HTMLToDocument


>>> fetcher = LinkContentFetcher()

>>> streams = fetcher.run(urls=["https://example.com/"])["streams"]

>>> converter = HTMLToDocument()
>>> docs = converter.run(sources=streams)

>>> print(docs)

{'documents': [Document(id=65bb1ce4b6db2f154d3acfa145fa03363ef93f751fb8599dcec3aaf75aa325b9, content: 'This domain is for use in illustrative examples in documents. You may use this domain in literature ...', meta: {'content_type': 'text/html', 'url': 'https://example.com/'})]}

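HuggingFaceLocalGenerator: load and try the model

We use the HuggingFaceLocalGenerator, a text generation component that loads a model hosted on Hugging Face using the Transformers library.

Haystack supports many other Generators, including the HuggingFaceAPIGenerator (compatible with Hugging Face APIs and TGI).

We load NuExtract, a model fine-tuned from microsoft/Phi-3-mini-4k-instruct to perform structured data extraction from text. The model has 3.8B parameters. Other variants are also available: NuExtract-tiny (0.5B) and NuExtract-large (7B).

The model is loaded in bfloat16 precision so that it fits in Colab, with negligible performance loss compared to FP32, as suggested in the model card.

Notes on Flash Attention

At inference time, you will probably see a warning saying: "You are not running the flash-attention implementation".

GPUs available in free environments such as Colab or Kaggle do not support it, so we decided not to use it in this notebook.

If your GPU architecture supports it (see the flash-attn project page for details), you can install it and get a speed-up as follows:

pip install flash-attn --no-build-isolation

Then add "attn_implementation": "flash_attention_2" to model_kwargs.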

from haystack.components.generators import HuggingFaceLocalGenerator
import torch

generator = HuggingFaceLocalGenerator(
    model="numind/NuExtract", huggingface_pipeline_kwargs={"model_kwargs": {"torch_dtype": torch.bfloat16}}
)

# effectively load the model (warm_up is automatically invoked when the generator is part of a Pipeline)
generator.warm_up()
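
As a minimal sketch of the Flash Attention note above, the keyword arguments for the generator could be assembled conditionally. The helper name build_pipeline_kwargs is hypothetical and not part of Haystack; a plain string stands in for torch.bfloat16 so that the snippet runs without torch installed.

```python
def build_pipeline_kwargs(dtype, use_flash_attention=False):
    """Build the dict passed as huggingface_pipeline_kwargs (hypothetical helper)."""
    model_kwargs = {"torch_dtype": dtype}
    if use_flash_attention:
        # Only meaningful when flash-attn is installed and the GPU supports it.
        model_kwargs["attn_implementation"] = "flash_attention_2"
    return {"model_kwargs": model_kwargs}


# The string "bfloat16" is a stand-in for torch.bfloat16 in this sketch.
kwargs = build_pipeline_kwargs("bfloat16", use_flash_attention=True)
print(kwargs)
```

In the notebook, the result would be passed as huggingface_pipeline_kwargs when constructing HuggingFaceLocalGenerator, with torch.bfloat16 as the dtype.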

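The model expects a specific prompt structure, which can be inferred from the model card.

Let's manually create a prompt to try the model. Later, we will see how to build the prompt dynamically from different inputs.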

>>> prompt = """<|input|>\n### Template:
... {
...     "Car": {
...         "Name": "",
...         "Manufacturer": "",
...         "Designers": [],
...         "Number of units produced": "",
...     }
... }
... ### Text:
... The Fiat Panda is a city car manufactured and marketed by Fiat since 1980, currently in its third generation. The first generation Panda, introduced in 1980, was a two-box, three-door hatchback designed by Giorgetto Giugiaro and Aldo Mantovani of Italdesign and was manufactured through 2003 — receiving an all-wheel drive variant in 1983. SEAT of Spain marketed a variation of the first generation Panda under license to Fiat, initially as the Panda and subsequently as the Marbella (1986–1998).

... The second-generation Panda, launched in 2003 as a 5-door hatchback, was designed by Giuliano Biasio of Bertone, and won the European Car of the Year in 2004. The third-generation Panda debuted at the Frankfurt Motor Show in September 2011, was designed at Fiat Centro Stilo under the direction of Roberto Giolito and remains in production in Italy at Pomigliano d'Arco.[1] The fourth-generation Panda is marketed as Grande Panda, to differentiate it with the third-generation that is sold alongside it. Developed under Stellantis, the Grande Panda is produced in Serbia.

... In 40 years, Panda production has reached over 7.8 million,[2] of those, approximately 4.5 million were the first generation.[3] In early 2020, its 23-year production was counted as the twenty-ninth most long-lived single generation car in history by Autocar.[4] During its initial design phase, Italdesign referred to the car as il Zero. Fiat later proposed the name Rustica. Ultimately, the Panda was named after Empanda, the Roman goddess and patroness of travelers.
... <|output|>
... """

>>> result = generator.run(prompt=prompt)
>>> print(result)

{'replies': ['{\n    "Car": {\n        "Name": "Fiat Panda",\n        "Manufacturer": "Fiat",\n        "Designers": [\n            "Giorgetto Giugiaro",\n            "Aldo Mantovani",\n            "Giuliano Biasio",\n            "Roberto Giolito"\n        ],\n        "Number of units produced": "over 7.8 million"\n    }\n}\n']}

Nice ✅

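PromptBuilder: dynamically create prompts

The PromptBuilder is initialized with a Jinja2 prompt template and renders it by filling in the parameters passed as keyword arguments.

Our prompt template reproduces the structure shown in the model card.

In our experiments, we found that the indentation of the schema is particularly important for good results. This probably stems from how the model was trained.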

from haystack.components.builders import PromptBuilder
from haystack import Document

prompt_template = """<|input|>
### Template:
{{ schema | tojson(indent=4) }}
{% for example in examples %}
### Example:
{{ example | tojson(indent=4) }}\n
{% endfor %}
### Text
{{documents[0].content}}
<|output|>
"""

prompt_builder = PromptBuilder(template=prompt_template)

>>> example_document = Document(content="The Fiat Panda is a city car...")

>>> example_schema = {
...     "Car": {
...         "Name": "",
...         "Manufacturer": "",
...         "Designers": [],
...         "Number of units produced": "",
...     }
... }

>>> prompt = prompt_builder.run(documents=[example_document], schema=example_schema)["prompt"]

>>> print(prompt)

<|input|>
### Template:
{
    "Car": {
        "Designers": [],
        "Manufacturer": "",
        "Name": "",
        "Number of units produced": ""
    }
}

### Text
The Fiat Panda is a city car...
<|output|>

Works well ✅
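
To make the structure of the rendered prompt easier to see, here is a plain-Python approximation that uses only the standard library. It is a sketch, not what PromptBuilder does internally: build_prompt is a hypothetical helper, and sort_keys=True is an assumption chosen to mirror the alphabetically ordered keys in the printed prompt above. Blank lines may differ slightly from the Jinja2 rendering.

```python
import json


def build_prompt(schema, text, examples=()):
    """Assemble a NuExtract-style prompt with 4-space indented JSON (hypothetical helper)."""
    parts = ["<|input|>", "### Template:", json.dumps(schema, indent=4, sort_keys=True)]
    for example in examples:
        # Optional examples of the expected output, one block each.
        parts += ["### Example:", json.dumps(example, indent=4, sort_keys=True)]
    parts += ["### Text", text, "<|output|>"]
    return "\n".join(parts) + "\n"


example_schema = {
    "Car": {"Name": "", "Manufacturer": "", "Designers": [], "Number of units produced": ""}
}
prompt = build_prompt(example_schema, "The Fiat Panda is a city car...")
print(prompt)
```

The optional examples argument corresponds to the examples loop in the template, which the notebook leaves empty.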

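OutputAdapter

You may have noticed that the extraction result is the first element of the replies list and consists of a JSON string.

We would like to have a dictionary for each source document. To perform this transformation in a pipeline, we can use the OutputAdapter.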

>>> import json
>>> from haystack.components.converters import OutputAdapter


>>> adapter = OutputAdapter(
...     template="""{{ replies[0]| replace("'",'"') | json_loads}}""",
...     output_type=dict,
...     custom_filters={"json_loads": json.loads},
... )

>>> print(adapter.run(**result))

{'output': {'Car': {'Name': 'Fiat Panda', 'Manufacturer': 'Fiat', 'Designers': ['Giorgetto Giugiaro', 'Aldo Mantovani', 'Giuliano Biasio', 'Roberto Giolito'], 'Number of units produced': 'over 7.8 million'}}}
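
One caveat with replacing every single quote by a double quote is that an apostrophe inside a value (for example "Pomigliano d'Arco") turns valid JSON into invalid JSON. The sketch below shows a more defensive parser that uses only the standard library. The name parse_reply is hypothetical; registering it through custom_filters in place of json.loads is an untested assumption.

```python
import json


def parse_reply(reply: str) -> dict:
    """Parse a model reply into a dict; return {} if it is not a valid JSON object."""
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


# A reply that is already valid JSON but contains an apostrophe in a value.
reply = '{"Car": {"Name": "Fiat Panda", "Plant": "Pomigliano d\'Arco"}}'
print(parse_reply(reply))

# The blanket quote replacement corrupts the same reply, so parsing falls back to {}.
naive = reply.replace("'", '"')
print(parse_reply(naive))
```

Returning an empty dict keeps a single malformed reply from stopping a batch run, at the cost of silently dropping that record, so logging the failure may be worth adding.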

Information Extraction Pipeline

Build the Pipeline

We can now create our Pipeline by adding and connecting the individual components.

from haystack import Pipeline

ie_pipe = Pipeline()
ie_pipe.add_component("fetcher", fetcher)
ie_pipe.add_component("converter", converter)
ie_pipe.add_component("prompt_builder", prompt_builder)
ie_pipe.add_component("generator", generator)
ie_pipe.add_component("adapter", adapter)

ie_pipe.connect("fetcher", "converter")
ie_pipe.connect("converter", "prompt_builder")
ie_pipe.connect("prompt_builder", "generator")
ie_pipe.connect("generator", "adapter")

# IN CASE YOU NEED TO RECREATE THE PIPELINE FROM SCRATCH, YOU CAN UNCOMMENT THIS CELL

# ie_pipe = Pipeline()
# ie_pipe.add_component("fetcher", LinkContentFetcher())
# ie_pipe.add_component("converter", HTMLToDocument())
# ie_pipe.add_component("prompt_builder", PromptBuilder(template=prompt_template))
# ie_pipe.add_component("generator", HuggingFaceLocalGenerator(model="numind/NuExtract",
#                                       huggingface_pipeline_kwargs={"model_kwargs": {"torch_dtype":torch.bfloat16}})
# )
# ie_pipe.add_component("adapter", OutputAdapter(template="""{{ replies[0]| replace("'",'"') | json_loads}}""",
#                                          output_type=dict,
#                                          custom_filters={"json_loads": json.loads}))

# ie_pipe.connect("fetcher", "converter")
# ie_pipe.connect("converter", "prompt_builder")
# ie_pipe.connect("prompt_builder", "generator")
# ie_pipe.connect("generator", "adapter")

Let's review our pipeline setup:

>>> ie_pipe.show()

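Define the sources and the extraction schema

We select a list of URLs related to recent startup funding announcements.

We also define a schema for the structured information we aim to extract.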

urls = [
    "https://techcrunch.com/2023/04/27/pinecone-drops-100m-investment-on-750m-valuation-as-vector-database-demand-grows/",
    "https://techcrunch.com/2023/04/27/replit-funding-100m-generative-ai/",
    "https://www.cnbc.com/2024/06/12/mistral-ai-raises-645-million-at-a-6-billion-valuation.html",
    "https://techcrunch.com/2024/01/23/qdrant-open-source-vector-database/",
    "https://www.intelcapital.com/anyscale-secures-100m-series-c-at-1b-valuation-to-radically-simplify-scaling-and-productionizing-ai-applications/",
    "https://techcrunch.com/2023/04/28/openai-funding-valuation-chatgpt/",
    "https://techcrunch.com/2024/03/27/amazon-doubles-down-on-anthropic-completing-its-planned-4b-investment/",
    "https://techcrunch.com/2024/01/22/voice-cloning-startup-elevenlabs-lands-80m-achieves-unicorn-status/",
    "https://techcrunch.com/2023/08/24/hugging-face-raises-235m-from-investors-including-salesforce-and-nvidia",
    "https://www.prnewswire.com/news-releases/ai21-completes-208-million-oversubscribed-series-c-round-301994393.html",
    "https://techcrunch.com/2023/03/15/adept-a-startup-training-ai-to-use-existing-software-and-apis-raises-350m/",
    "https://www.cnbc.com/2023/03/23/characterai-valued-at-1-billion-after-150-million-round-from-a16z.html",
]


schema = {
    "Funding": {
        "New funding": "",
        "Investors": [],
    },
    "Company": {"Name": "", "Activity": "", "Country": "", "Total valuation": "", "Total funding": ""},
}

Run the Pipeline!

We pass the required data to each component.

Note that most of them receive their data from previously executed components.

from tqdm import tqdm

extracted_data = []

for url in tqdm(urls):
    result = ie_pipe.run({"fetcher": {"urls": [url]}, "prompt_builder": {"schema": schema}})

    extracted_data.append(result["adapter"]["output"])

Let's inspect some of the extracted data:

extracted_data[:2]
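
Fetching pages from the web can fail, and a single error would interrupt the loop above. As a sketch under that assumption, the loop could be wrapped so that failures are collected instead of raised. The names extract_all and fake_run are hypothetical, and the example.com URLs are placeholders; in the notebook, run_one would call ie_pipe.run for one URL and return result["adapter"]["output"].

```python
def extract_all(urls, run_one):
    """Run run_one on each URL, collecting results and per-URL errors separately."""
    results, failures = [], {}
    for url in urls:
        try:
            results.append(run_one(url))
        except Exception as exc:  # broad on purpose: any failure is recorded, not raised
            failures[url] = repr(exc)
    return results, failures


def fake_run(url):
    """Stand-in for the pipeline call so the sketch runs offline."""
    if "broken" in url:
        raise ValueError("could not fetch page")
    return {"Company": {"Name": url.rsplit("/", 1)[-1]}}


results, failures = extract_all(
    ["https://example.com/ok", "https://example.com/broken"], fake_run
)
print(results)
print(failures)
```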

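Data exploration and visualization

Let's explore the extracted data to assess its correctness and gain insights.

Dataframe

We start by creating a Pandas DataFrame. For simplicity, we flatten the extracted data.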

def flatten_dict(d, parent_key=""):
    items = []
    for k, v in d.items():
        new_key = f"{parent_key} - {k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key).items())
        elif isinstance(v, list):
            items.append((new_key, ", ".join(v)))
        else:
            items.append((new_key, v))
    return dict(items)
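
To illustrate what the flattening produces, here is a small worked example. The function is repeated so that the snippet runs on its own, and the sample record is invented for illustration; it is not real extracted data.

```python
def flatten_dict(d, parent_key=""):
    """Flatten nested dicts into 'Parent - Child' keys; lists become comma-separated strings."""
    items = []
    for k, v in d.items():
        new_key = f"{parent_key} - {k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key).items())
        elif isinstance(v, list):
            items.append((new_key, ", ".join(v)))
        else:
            items.append((new_key, v))
    return dict(items)


# Hypothetical record following the shape of the extraction schema.
sample = {
    "Funding": {"New funding": "$10M", "Investors": ["Investor A", "Investor B"]},
    "Company": {"Name": "ExampleCo", "Country": ""},
}
flat = flatten_dict(sample)
print(flat)
# {'Funding - New funding': '$10M', 'Funding - Investors': 'Investor A, Investor B',
#  'Company - Name': 'ExampleCo', 'Company - Country': ''}
```

These flattened keys become the DataFrame column names, which is why the sort below uses "Company - Name".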

import pandas as pd

df = pd.DataFrame([flatten_dict(el) for el in extracted_data])
df = df.sort_values(by="Company - Name")

df

[Figure: dataframe]

Apart from some errors in "Company - Country", the extracted data looks good.

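Construct a simple graph

To understand the relationships between companies and investors, we construct a graph and visualize it.

First, we build the graph using NetworkX.

NetworkX is a Python package for creating and manipulating networks/graphs in a simple way.

Our simple graph has companies and investors as nodes. We connect an investor to a company if they are mentioned in the same document.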

import networkx as nx

# Create a new graph
G = nx.Graph()

# Add nodes and edges
for el in extracted_data:
    company_name = el["Company"]["Name"]
    G.add_node(company_name, label=company_name, title="Company")

    investors = el["Funding"]["Investors"]
    for investor in investors:
        if not G.has_node(investor):
            G.add_node(investor, label=investor, title="Investor", color="red")
        G.add_edge(company_name, investor)

Next, we visualize the graph using Pyvis.

Pyvis is a Python package for interactive visualization of networks/graphs. It integrates well with NetworkX.

from pyvis.network import Network
from IPython.display import display, HTML


net = Network(notebook=True, cdn_resources="in_line")
net.from_nx(G)

net.show("simple_graph.html")
display(HTML("simple_graph.html"))

[Figure: graph visualization]

It looks like Andreessen Horowitz is quite active in the selected funding announcements 😊
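
An observation like this can also be checked numerically by counting how many announcements mention each investor. The following is a sketch using only the standard library. The helper count_investors and the sample records are hypothetical; in the notebook, extracted_data would be passed instead.

```python
from collections import Counter


def count_investors(records):
    """Count the number of records in which each investor appears."""
    counts = Counter()
    for record in records:
        # set() so an investor listed twice in one announcement is counted once.
        counts.update(set(record["Funding"]["Investors"]))
    return counts


# Invented records following the shape of the extraction schema.
sample_records = [
    {"Funding": {"Investors": ["Investor A", "Investor B"]}},
    {"Funding": {"Investors": ["Investor A"]}},
    {"Funding": {"Investors": ["Investor C", "Investor A"]}},
]
print(count_investors(sample_records).most_common(1))
# [('Investor A', 3)]
```

Because the names come from model output, the same investor may appear under slightly different spellings, so some normalization may be needed before counting.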

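Conclusion and ideas

In this notebook, we demonstrated how to set up an information extraction system using a small Language Model (NuExtract) and Haystack, a customizable orchestration framework for LLM applications.

How can we use the extracted data?

Some ideas:

The extracted data can be added to the original documents, stored in a Document Store. This enables advanced search with metadata filtering.
Expanding on the previous idea, you can do RAG (Retrieval Augmented Generation) with metadata extracted from the query, as explained in a related blog post.
Store the documents and the extracted data in a knowledge graph and perform Graph RAG (Neo4j-Haystack integration).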
