Information Extraction with Haystack and NuExtract

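Authored by: Stefano Fiorucci

In this notebook, we will see how to automate information extraction from textual data using language models.

🎯 Goal: create an application that extracts specific information from a given text or URL, following a user-defined structure.

🧰 Stack

  • Haystack 🏗️: a customizable orchestration framework for building LLM applications. We will use Haystack to build the information extraction pipeline.

  • NuExtract: a small language model, specifically fine-tuned for structured data extraction.

Install dependencies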

! pip install haystack-ai trafilatura transformers pyvis

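Components

Haystack has two main concepts: Components and Pipelines.

🧩 Components are building blocks that perform a single task: file conversion, text generation, embedding creation, and so on.

Pipelines let you define the flow of data through your LLM application by combining Components into directed graphs, which may contain cycles.

We will now introduce the Components of our information extraction application one by one. Afterwards, we will integrate them into a Pipeline.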
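
As a conceptual aid only, the idea of "single-task building blocks connected in a directed graph" can be sketched in plain Python. This is not Haystack's implementation or API: the functions, the `edges` mapping, and `run_pipeline` are hypothetical names, the graph is restricted to an acyclic chain where each node has at most one predecessor, and the "fetcher" returns a canned string instead of downloading anything.

```python
from graphlib import TopologicalSorter


# Each "component" does exactly one thing (all three are toy stand-ins).
def fetch(url):
    return f"<html><body>Page at {url}</body></html>"


def convert(html):
    return html.replace("<html><body>", "").replace("</body></html>", "")


def build_prompt(text):
    return f"### Text:\n{text}"


components = {"fetcher": fetch, "converter": convert, "prompt_builder": build_prompt}

# Directed edges, expressed as: node -> set of nodes it receives data from.
edges = {"converter": {"fetcher"}, "prompt_builder": {"converter"}}


def run_pipeline(components, edges, first_input):
    """Run components in dependency order, feeding each one its predecessor's output."""
    outputs = {}
    for name in TopologicalSorter(edges).static_order():
        predecessors = edges.get(name, set())
        # Nodes without predecessors receive the user-supplied input.
        arg = outputs[next(iter(predecessors))] if predecessors else first_input
        outputs[name] = components[name](arg)
    return outputs


outputs = run_pipeline(components, edges, "https://example.com/")
print(outputs["prompt_builder"])
```

A real Haystack Pipeline additionally validates input and output types on each connection and can handle loops, which this toy version does not attempt.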

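LinkContentFetcher and HTMLToDocument: extract text from web pages

In this experiment, we will extract data from startup funding announcements found on the web.

To download web pages and extract their text, we use two Components, LinkContentFetcher and HTMLToDocument: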

>>> from haystack.components.fetchers import LinkContentFetcher
>>> from haystack.components.converters import HTMLToDocument


>>> fetcher = LinkContentFetcher()

>>> streams = fetcher.run(urls=["https://example.com/"])["streams"]

>>> converter = HTMLToDocument()
>>> docs = converter.run(sources=streams)

>>> print(docs)
{'documents': [Document(id=65bb1ce4b6db2f154d3acfa145fa03363ef93f751fb8599dcec3aaf75aa325b9, content: 'This domain is for use in illustrative examples in documents. You may use this domain in literature ...', meta: {'content_type': 'text/html', 'url': 'https://example.com/'})]}

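HuggingFaceLocalGenerator: load and try the model

We use HuggingFaceLocalGenerator, a text generation Component that loads a model hosted on Hugging Face using the Transformers library.

Haystack supports many other Generators, including HuggingFaceAPIGenerator (compatible with the Hugging Face APIs and TGI).

We load NuExtract, a model fine-tuned from microsoft/Phi-3-mini-4k-instruct to perform structured data extraction from text. The model has 3.8B parameters. Other variants are also available: NuExtract-tiny (0.5B) and NuExtract-large (7B).

As suggested in the model card, the model is loaded with bfloat16 precision so that it fits in Colab, with negligible performance loss compared to FP32.

Notes about Flash Attention

At inference time, you will probably see a warning saying: "You are not running the flash-attention implementation".

GPUs available in free environments such as Colab or Kaggle do not support it, so we decided not to use it in this notebook.

If your GPU architecture supports it, you can install it and get a speed-up as follows: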

pip install flash-attn --no-build-isolation

Then add "attn_implementation": "flash_attention_2" to model_kwargs.

from haystack.components.generators import HuggingFaceLocalGenerator
import torch

generator = HuggingFaceLocalGenerator(
    model="numind/NuExtract", huggingface_pipeline_kwargs={"model_kwargs": {"torch_dtype": torch.bfloat16}}
)

# effectively load the model (warm_up is automatically invoked when the generator is part of a Pipeline)
generator.warm_up()
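
For readers who want to switch Flash Attention on or off without editing the generator call by hand, the kwargs described in the note above could be assembled by a small helper. This is a minimal sketch: `build_pipeline_kwargs` is a hypothetical name, and the string "bfloat16" is only a stand-in so the snippet runs without torch installed (in the notebook you would pass `torch.bfloat16`).

```python
def build_pipeline_kwargs(dtype, use_flash_attention=False):
    """Return a dict shaped like the `huggingface_pipeline_kwargs` used above."""
    model_kwargs = {"torch_dtype": dtype}
    if use_flash_attention:
        # Only meaningful if flash-attn is installed and the GPU supports it.
        model_kwargs["attn_implementation"] = "flash_attention_2"
    return {"model_kwargs": model_kwargs}


# Stand-in dtype so the sketch runs anywhere; use torch.bfloat16 in practice.
kwargs = build_pipeline_kwargs("bfloat16", use_flash_attention=True)
print(kwargs)
```

The result would then be passed as `huggingface_pipeline_kwargs=build_pipeline_kwargs(torch.bfloat16, use_flash_attention=True)` when creating the HuggingFaceLocalGenerator.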

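The model supports a specific prompt structure, which can be inferred from the model card.

Let's manually create a prompt to try the model. Later, we will see how to build the prompt dynamically from different inputs.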

>>> prompt = """<|input|>\n### Template:
... {
...     "Car": {
...         "Name": "",
...         "Manufacturer": "",
...         "Designers": [],
...         "Number of units produced": "",
...     }
... }
... ### Text:
... The Fiat Panda is a city car manufactured and marketed by Fiat since 1980, currently in its third generation. The first generation Panda, introduced in 1980, was a two-box, three-door hatchback designed by Giorgetto Giugiaro and Aldo Mantovani of Italdesign and was manufactured through 2003 — receiving an all-wheel drive variant in 1983. SEAT of Spain marketed a variation of the first generation Panda under license to Fiat, initially as the Panda and subsequently as the Marbella (1986–1998).

... The second-generation Panda, launched in 2003 as a 5-door hatchback, was designed by Giuliano Biasio of Bertone, and won the European Car of the Year in 2004. The third-generation Panda debuted at the Frankfurt Motor Show in September 2011, was designed at Fiat Centro Stilo under the direction of Roberto Giolito and remains in production in Italy at Pomigliano d'Arco.[1] The fourth-generation Panda is marketed as Grande Panda, to differentiate it with the third-generation that is sold alongside it. Developed under Stellantis, the Grande Panda is produced in Serbia.

... In 40 years, Panda production has reached over 7.8 million,[2] of those, approximately 4.5 million were the first generation.[3] In early 2020, its 23-year production was counted as the twenty-ninth most long-lived single generation car in history by Autocar.[4] During its initial design phase, Italdesign referred to the car as il Zero. Fiat later proposed the name Rustica. Ultimately, the Panda was named after Empanda, the Roman goddess and patroness of travelers.
... <|output|>
... """

>>> result = generator.run(prompt=prompt)
>>> print(result)
{'replies': ['{\n    "Car": {\n        "Name": "Fiat Panda",\n        "Manufacturer": "Fiat",\n        "Designers": [\n            "Giorgetto Giugiaro",\n            "Aldo Mantovani",\n            "Giuliano Biasio",\n            "Roberto Giolito"\n        ],\n        "Number of units produced": "over 7.8 million"\n    }\n}\n']}

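Nice ✅

PromptBuilder: create the prompt dynamically

The PromptBuilder is initialized with a Jinja2 prompt template and renders it by filling in the parameters passed as keyword arguments.

Our prompt template reproduces the structure shown in the model card.

In our experiments, we found that the indentation of the schema is particularly important for good results. This probably stems from how the model was trained.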

from haystack.components.builders import PromptBuilder
from haystack import Document

prompt_template = """<|input|>
### Template:
{{ schema | tojson(indent=4) }}
{% for example in examples %}
### Example:
{{ example | tojson(indent=4) }}\n
{% endfor %}
### Text
{{documents[0].content}}
<|output|>
"""

prompt_builder = PromptBuilder(template=prompt_template)
>>> example_document = Document(content="The Fiat Panda is a city car...")

>>> example_schema = {
...     "Car": {
...         "Name": "",
...         "Manufacturer": "",
...         "Designers": [],
...         "Number of units produced": "",
...     }
... }

>>> prompt = prompt_builder.run(documents=[example_document], schema=example_schema)["prompt"]

>>> print(prompt)
<|input|>
### Template:
{
    "Car": {
        "Designers": [],
        "Manufacturer": "",
        "Name": "",
        "Number of units produced": ""
    }
}

### Text
The Fiat Panda is a city car...
<|output|>

Works well ✅
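
Note that the keys in the rendered template are alphabetically sorted and indented by four spaces. Jinja2's `tojson` filter sorts keys by default, and `indent=4` controls the indentation. The plain-Python sketch below approximates the same rendering with `json.dumps`, which can be handy for inspecting a prompt outside of a pipeline. `render_nuextract_prompt` is a hypothetical helper, and the exact blank lines between sections may differ from what the Jinja2 template produces.

```python
import json


def render_nuextract_prompt(schema, text, examples=()):
    """Approximate the prompt template above without Jinja2."""
    parts = ["<|input|>", "### Template:", json.dumps(schema, indent=4, sort_keys=True)]
    for example in examples:
        parts += ["### Example:", json.dumps(example, indent=4, sort_keys=True)]
    parts += ["### Text", text, "<|output|>"]
    return "\n".join(parts) + "\n"


example_schema = {
    "Car": {
        "Name": "",
        "Manufacturer": "",
        "Designers": [],
        "Number of units produced": "",
    }
}

print(render_nuextract_prompt(example_schema, "The Fiat Panda is a city car..."))
```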

OutputAdapter

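You may have noticed that the extraction result is the first element of the replies list and consists of a JSON string.

We would like to have a dictionary for each source document. To perform this transformation in a pipeline, we can use the OutputAdapter.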

>>> import json
>>> from haystack.components.converters import OutputAdapter


>>> adapter = OutputAdapter(
...     template="""{{ replies[0]| replace("'",'"') | json_loads}}""",
...     output_type=dict,
...     custom_filters={"json_loads": json.loads},
... )

>>> print(adapter.run(**result))
{'output': {'Car': {'Name': 'Fiat Panda', 'Manufacturer': 'Fiat', 'Designers': ['Giorgetto Giugiaro', 'Aldo Mantovani', 'Giuliano Biasio', 'Roberto Giolito'], 'Number of units produced': 'over 7.8 million'}}}
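
One caveat of the template above: replacing every single quote with a double quote also rewrites apostrophes inside values (for example "Pomigliano d'Arco"), which would turn otherwise valid JSON into invalid JSON. A possible mitigation, sketched here under the assumption that the model usually emits valid JSON, is to try `json.loads` first and fall back to the quote replacement only on failure. `parse_reply` is a hypothetical helper, not part of Haystack.

```python
import json


def parse_reply(reply: str) -> dict:
    """Parse a model reply as JSON, tolerating single-quoted pseudo-JSON."""
    try:
        # Preferred path: the reply is already valid JSON.
        return json.loads(reply)
    except json.JSONDecodeError:
        # Fallback: mimic the template's quote replacement.
        return json.loads(reply.replace("'", '"'))


print(parse_reply('{"Plant": "Pomigliano d\'Arco"}'))
print(parse_reply("{'Name': 'Fiat Panda'}"))
```

Such a function could presumably be registered through `custom_filters`, in the same way `json.loads` is registered above, and used in the template in place of the `replace` and `json_loads` filters; this has not been verified here.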

Information Extraction Pipeline

Build the Pipeline

We can now create our Pipeline by adding and connecting the individual Components.

from haystack import Pipeline

ie_pipe = Pipeline()
ie_pipe.add_component("fetcher", fetcher)
ie_pipe.add_component("converter", converter)
ie_pipe.add_component("prompt_builder", prompt_builder)
ie_pipe.add_component("generator", generator)
ie_pipe.add_component("adapter", adapter)

ie_pipe.connect("fetcher", "converter")
ie_pipe.connect("converter", "prompt_builder")
ie_pipe.connect("prompt_builder", "generator")
ie_pipe.connect("generator", "adapter")
# IN CASE YOU NEED TO RECREATE THE PIPELINE FROM SCRATCH, YOU CAN UNCOMMENT THIS CELL

# ie_pipe = Pipeline()
# ie_pipe.add_component("fetcher", LinkContentFetcher())
# ie_pipe.add_component("converter", HTMLToDocument())
# ie_pipe.add_component("prompt_builder", PromptBuilder(template=prompt_template))
# ie_pipe.add_component("generator", HuggingFaceLocalGenerator(model="numind/NuExtract",
#                                       huggingface_pipeline_kwargs={"model_kwargs": {"torch_dtype":torch.bfloat16}})
# )
# ie_pipe.add_component("adapter", OutputAdapter(template="""{{ replies[0]| replace("'",'"') | json_loads}}""",
#                                          output_type=dict,
#                                          custom_filters={"json_loads": json.loads}))

# ie_pipe.connect("fetcher", "converter")
# ie_pipe.connect("converter", "prompt_builder")
# ie_pipe.connect("prompt_builder", "generator")
# ie_pipe.connect("generator", "adapter")

Let's review our pipeline setup.

>>> ie_pipe.show()

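Define the sources and the extraction schema

We select a list of URLs related to recent startup funding announcements.

We also define a schema for the structured information we want to extract.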

urls = [
    "https://techcrunch.com/2023/04/27/pinecone-drops-100m-investment-on-750m-valuation-as-vector-database-demand-grows/",
    "https://techcrunch.com/2023/04/27/replit-funding-100m-generative-ai/",
    "https://www.cnbc.com/2024/06/12/mistral-ai-raises-645-million-at-a-6-billion-valuation.html",
    "https://techcrunch.com/2024/01/23/qdrant-open-source-vector-database/",
    "https://www.intelcapital.com/anyscale-secures-100m-series-c-at-1b-valuation-to-radically-simplify-scaling-and-productionizing-ai-applications/",
    "https://techcrunch.com/2023/04/28/openai-funding-valuation-chatgpt/",
    "https://techcrunch.com/2024/03/27/amazon-doubles-down-on-anthropic-completing-its-planned-4b-investment/",
    "https://techcrunch.com/2024/01/22/voice-cloning-startup-elevenlabs-lands-80m-achieves-unicorn-status/",
    "https://techcrunch.com/2023/08/24/hugging-face-raises-235m-from-investors-including-salesforce-and-nvidia",
    "https://www.prnewswire.com/news-releases/ai21-completes-208-million-oversubscribed-series-c-round-301994393.html",
    "https://techcrunch.com/2023/03/15/adept-a-startup-training-ai-to-use-existing-software-and-apis-raises-350m/",
    "https://www.cnbc.com/2023/03/23/characterai-valued-at-1-billion-after-150-million-round-from-a16z.html",
]


schema = {
    "Funding": {
        "New funding": "",
        "Investors": [],
    },
    "Company": {"Name": "", "Activity": "", "Country": "", "Total valuation": "", "Total funding": ""},
}

Run the Pipeline!

We pass the required data to each Component.

Note that most of them receive data from previously executed Components.

from tqdm import tqdm

extracted_data = []

for url in tqdm(urls):
    result = ie_pipe.run({"fetcher": {"urls": [url]}, "prompt_builder": {"schema": schema}})

    extracted_data.append(result["adapter"]["output"])

Let's inspect some of the extracted data

extracted_data[:2]
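
The loop above stops at the first exception, which may happen if a page cannot be downloaded or the model returns a string that is not valid JSON. A hedged sketch of a more forgiving loop is shown below. `extract_all` and `fake_run` are hypothetical names; `fake_run` merely simulates a pipeline call so the snippet runs on its own.

```python
def extract_all(urls, run_one):
    """Call run_one(url) for each URL, collecting results and failures separately."""
    results, failures = [], {}
    for url in urls:
        try:
            results.append(run_one(url))
        except Exception as exc:  # broad on purpose: network and parsing errors differ
            failures[url] = repr(exc)
    return results, failures


def fake_run(url):
    # Stand-in for the real pipeline call; fails for one made-up URL.
    if "broken" in url:
        raise ValueError("could not parse model output")
    return {"Company": {"Name": url}}


results, failures = extract_all(["https://a.example", "https://broken.example"], fake_run)
print(len(results), "succeeded;", len(failures), "failed")
```

In the notebook, `run_one` would wrap the actual call, for instance a function that returns `ie_pipe.run({"fetcher": {"urls": [url]}, "prompt_builder": {"schema": schema}})["adapter"]["output"]`.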

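Data exploration and visualization

Let's explore the extracted data to assess its correctness and gain insights.

DataFrame

We first create a Pandas DataFrame. For simplicity, we flatten the extracted data.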

def flatten_dict(d, parent_key=""):
    items = []
    for k, v in d.items():
        new_key = f"{parent_key} - {k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key).items())
        elif isinstance(v, list):
            items.append((new_key, ", ".join(v)))
        else:
            items.append((new_key, v))
    return dict(items)


import pandas as pd

df = pd.DataFrame([flatten_dict(el) for el in extracted_data])
df = df.sort_values(by="Company - Name")

df

Apart from some errors in "Company - Country", the extracted data looks good.
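
Since the model's output is not guaranteed to follow the schema exactly, a list may occasionally contain non-string items, in which case `", ".join(v)` in `flatten_dict` raises a TypeError. The variant below is a sketch that converts list items to strings before joining. `flatten_dict_safe` is a hypothetical name and the sample record is invented purely for illustration.

```python
def flatten_dict_safe(d, parent_key=""):
    """Flatten nested dicts into 'Parent - Child' keys, joining lists as strings."""
    flat = {}
    for key, value in d.items():
        new_key = f"{parent_key} - {key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_dict_safe(value, new_key))
        elif isinstance(value, list):
            # str() guards against numbers or other non-string items.
            flat[new_key] = ", ".join(str(item) for item in value)
        else:
            flat[new_key] = value
    return flat


# Invented sample record, not real extracted data.
sample = {
    "Funding": {"New funding": "$10M", "Investors": ["Investor A", 42]},
    "Company": {"Name": "Acme AI"},
}
print(flatten_dict_safe(sample))
```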

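Construct a simple graph

To understand the relationships between companies and investors, we construct a graph and visualize it.

First, we build a graph using NetworkX.

NetworkX is a Python package for creating and manipulating networks/graphs in a simple way.

Our simple graph will have companies and investors as nodes. We connect an investor to a company if the two are mentioned in the same document.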

import networkx as nx

# Create a new graph
G = nx.Graph()

# Add nodes and edges
for el in extracted_data:
    company_name = el["Company"]["Name"]
    G.add_node(company_name, label=company_name, title="Company")

    investors = el["Funding"]["Investors"]
    for investor in investors:
        if not G.has_node(investor):
            G.add_node(investor, label=investor, title="Investor", color="red")
        G.add_edge(company_name, investor)

Next, we visualize the graph using Pyvis.

Pyvis is a Python package for interactive visualization of networks/graphs. It integrates well with NetworkX.

from pyvis.network import Network
from IPython.display import display, HTML


net = Network(notebook=True, cdn_resources="in_line")
net.from_nx(G)

net.show("simple_graph.html")
display(HTML("simple_graph.html"))

It looks like Andreessen Horowitz is quite active in the selected funding announcements 😊
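
An observation like this one can also be checked numerically rather than by eye. The sketch below counts in how many announcements each investor appears, which is comparable to inspecting node degrees in the graph. `investor_counts` is a hypothetical helper and the sample records are invented placeholders, not results from the notebook.

```python
from collections import Counter


def investor_counts(extracted_data):
    """Count, for each investor, the number of announcements that mention it."""
    counts = Counter()
    for el in extracted_data:
        # set() avoids counting an investor twice within one announcement.
        counts.update(set(el["Funding"]["Investors"]))
    return counts


# Invented placeholder records with the same shape as extracted_data.
sample = [
    {"Company": {"Name": "Acme AI"}, "Funding": {"Investors": ["Investor A", "Investor B"]}},
    {"Company": {"Name": "Beta Labs"}, "Funding": {"Investors": ["Investor A"]}},
]
print(investor_counts(sample).most_common(1))
```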

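Conclusion and ideas

In this notebook, we demonstrated how to set up an information extraction system using a small language model (NuExtract) and Haystack, a customizable orchestration framework for LLM applications.

How can we use the extracted data?

Some ideas:

  • The extracted data can be added to the original documents stored in a Document Store. This enables advanced search capabilities through metadata filtering.
  • Expanding on the previous idea, you can do RAG (Retrieval-Augmented Generation) using metadata extracted from the query, as explained in this blog post.
  • Store the documents and the extracted data in a knowledge graph and perform Graph RAG (Neo4j-Haystack integration).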