Information Extraction with Haystack and NuExtract
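In this notebook, we will see how to automatically extract information from text data using language models.
🎯 Goal: create an application that extracts specific information from a given text or URL, following a user-defined structure.
🧰 Stack
Haystack 🏗️: a customizable orchestration framework for building LLM applications. We will use Haystack to build the information extraction pipeline.
NuExtract: a small language model fine-tuned specifically for structured data extraction.
Install dependencies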
! pip install haystack-ai trafilatura transformers pyvis
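Components
Haystack has two main concepts: Components and Pipelines.
🧩 Components are building blocks that each perform a single task: file conversion, text generation, embedding creation, and so on.
➿ Pipelines let you define the flow of data through your LLM application by combining Components into a directed graph, which may contain cycles.
We will now introduce the individual Components of our information extraction application. Afterwards, we will integrate them into a Pipeline.
LinkContentFetcher and HTMLToDocument: extract text from web pages
In our experiment, we will extract data from startup funding announcements found on the web.
To download web pages and extract text, we use two components:
- LinkContentFetcher: fetches the content of some URLs and returns a list of content streams (as ByteStream objects).
- HTMLToDocument: converts HTML sources into textual Documents.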
>>> from haystack.components.fetchers import LinkContentFetcher
>>> from haystack.components.converters import HTMLToDocument
>>> fetcher = LinkContentFetcher()
>>> streams = fetcher.run(urls=["https://example.com/"])["streams"]
>>> converter = HTMLToDocument()
>>> docs = converter.run(sources=streams)
>>> print(docs)
{'documents': [Document(id=65bb1ce4b6db2f154d3acfa145fa03363ef93f751fb8599dcec3aaf75aa325b9, content: 'This domain is for use in illustrative examples in documents. You may use this domain in literature ...', meta: {'content_type': 'text/html', 'url': 'https://example.com/'})]}
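HuggingFaceLocalGenerator: load and try the model
We use HuggingFaceLocalGenerator, a text generation component that loads a model hosted on Hugging Face using the Transformers library.
Haystack supports many other Generators, including HuggingFaceAPIGenerator (compatible with the Hugging Face APIs and TGI).
We load NuExtract, a model fine-tuned from `microsoft/Phi-3-mini-4k-instruct` to perform structured data extraction from text. The model has 3.8B parameters. Other variants are also available: `NuExtract-tiny` (0.5B) and `NuExtract-large` (7B).
The model is loaded with `bfloat16` precision so that it fits in Colab, with negligible performance loss compared to FP32, as suggested in the model card.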
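Notes on Flash Attention
At inference time, you will probably see a warning saying: "You are not running the flash-attention implementation".
GPUs available in free environments such as Colab or Kaggle do not support it, so we decided not to use it in this notebook.
If your GPU architecture supports it, you can install it and get a speed-up as follows:
pip install flash-attn --no-build-isolation
Then add `"attn_implementation": "flash_attention_2"` to `model_kwargs`.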
from haystack.components.generators import HuggingFaceLocalGenerator
import torch
generator = HuggingFaceLocalGenerator(
model="numind/NuExtract", huggingface_pipeline_kwargs={"model_kwargs": {"torch_dtype": torch.bfloat16}}
)
# effectively load the model (warm_up is automatically invoked when the generator is part of a Pipeline)
generator.warm_up()
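As a rough sketch of the Flash Attention note above, the keyword arguments can be assembled by a small helper before they are handed to the generator. The helper name `build_pipeline_kwargs` is hypothetical and not part of Haystack or Transformers; only the `torch_dtype` and `attn_implementation` keys come from the text above.

```python
def build_pipeline_kwargs(dtype, use_flash_attention=False):
    """Assemble the dict passed as huggingface_pipeline_kwargs.

    Hypothetical helper for illustration; not part of Haystack or Transformers.
    """
    model_kwargs = {"torch_dtype": dtype}
    if use_flash_attention:
        # Only enable this if flash-attn is installed and the GPU supports it.
        model_kwargs["attn_implementation"] = "flash_attention_2"
    return {"model_kwargs": model_kwargs}


# A string stands in for torch.bfloat16 so the sketch runs without PyTorch.
kwargs = build_pipeline_kwargs("bfloat16", use_flash_attention=True)
print(kwargs)
```

In the notebook, one would pass `torch.bfloat16` as `dtype` and give the result to `HuggingFaceLocalGenerator(model="numind/NuExtract", huggingface_pipeline_kwargs=...)`, as in the cell above.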
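The model supports a specific prompt structure, which can be inferred from the model card.
Let's manually create a prompt to try the model. Later, we will see how to create the prompt dynamically based on different inputs.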
>>> prompt = """<|input|>\n### Template:
... {
...     "Car": {
...         "Name": "",
...         "Manufacturer": "",
...         "Designers": [],
...         "Number of units produced": "",
...     }
... }
... ### Text:
... The Fiat Panda is a city car manufactured and marketed by Fiat since 1980, currently in its third generation. The first generation Panda, introduced in 1980, was a two-box, three-door hatchback designed by Giorgetto Giugiaro and Aldo Mantovani of Italdesign and was manufactured through 2003 — receiving an all-wheel drive variant in 1983. SEAT of Spain marketed a variation of the first generation Panda under license to Fiat, initially as the Panda and subsequently as the Marbella (1986–1998).
... The second-generation Panda, launched in 2003 as a 5-door hatchback, was designed by Giuliano Biasio of Bertone, and won the European Car of the Year in 2004. The third-generation Panda debuted at the Frankfurt Motor Show in September 2011, was designed at Fiat Centro Stilo under the direction of Roberto Giolito and remains in production in Italy at Pomigliano d'Arco.[1] The fourth-generation Panda is marketed as Grande Panda, to differentiate it with the third-generation that is sold alongside it. Developed under Stellantis, the Grande Panda is produced in Serbia.
... In 40 years, Panda production has reached over 7.8 million,[2] of those, approximately 4.5 million were the first generation.[3] In early 2020, its 23-year production was counted as the twenty-ninth most long-lived single generation car in history by Autocar.[4] During its initial design phase, Italdesign referred to the car as il Zero. Fiat later proposed the name Rustica. Ultimately, the Panda was named after Empanda, the Roman goddess and patroness of travelers.
... <|output|>
... """
>>> result = generator.run(prompt=prompt)
>>> print(result)
{'replies': ['{\n "Car": {\n "Name": "Fiat Panda",\n "Manufacturer": "Fiat",\n "Designers": [\n "Giorgetto Giugiaro",\n "Aldo Mantovani",\n "Giuliano Biasio",\n "Roberto Giolito"\n ],\n "Number of units produced": "over 7.8 million"\n }\n}\n']}
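Nice ✅
PromptBuilder: dynamically create prompts
The PromptBuilder is initialized with a Jinja2 prompt template and renders it by filling in parameters passed as keyword arguments.
Our prompt template reproduces the structure shown in the model card.
In our experiments, we found that the indentation pattern is particularly important for good results. This probably stems from how the model was trained.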
from haystack.components.builders import PromptBuilder
from haystack import Document
prompt_template = """<|input|>
### Template:
{{ schema | tojson(indent=4) }}
{% for example in examples %}
### Example:
{{ example | tojson(indent=4) }}\n
{% endfor %}
### Text
{{documents[0].content}}
<|output|>
"""
prompt_builder = PromptBuilder(template=prompt_template)
>>> example_document = Document(content="The Fiat Panda is a city car...")
>>> example_schema = {
...     "Car": {
...         "Name": "",
...         "Manufacturer": "",
...         "Designers": [],
...         "Number of units produced": "",
...     }
... }
>>> prompt = prompt_builder.run(documents=[example_document], schema=example_schema)["prompt"]
>>> print(prompt)
<|input|> ### Template: { "Car": { "Designers": [], "Manufacturer": "", "Name": "", "Number of units produced": "" } } ### Text The Fiat Panda is a city car... <|output|>
Works well ✅
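To make the role of indentation more tangible, here is a minimal sketch that builds a similar prompt with the standard library only. The function name `build_nuextract_prompt` is hypothetical; it mirrors the section markers of the template above and sorts keys as seen in the printed prompt, but its whitespace is not guaranteed to match the Jinja2 rendering byte for byte.

```python
import json


def build_nuextract_prompt(schema, text, examples=()):
    """Render a NuExtract-style prompt with 4-space indented JSON.

    Hypothetical helper; approximates the Jinja2 template using only the standard library.
    """
    parts = ["<|input|>", "### Template:", json.dumps(schema, indent=4, sort_keys=True)]
    for example in examples:
        parts += ["### Example:", json.dumps(example, indent=4, sort_keys=True)]
    # The trailing empty string makes the prompt end with a newline.
    parts += ["### Text", text, "<|output|>", ""]
    return "\n".join(parts)


schema = {"Car": {"Name": "", "Manufacturer": "", "Designers": []}}
prompt = build_nuextract_prompt(schema, "The Fiat Panda is a city car...")
print(prompt)
```

Printing the result shows the nested keys indented by four and eight spaces, which is the pattern the model appears to expect.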
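OutputAdapter
You may have noticed that the extraction result is the first element of the `replies` list and is a JSON string.
We would like a dictionary for each source document. To perform this transformation in a pipeline, we can use the OutputAdapter.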
>>> import json
>>> from haystack.components.converters import OutputAdapter
>>> adapter = OutputAdapter(
... template="""{{ replies[0]| replace("'",'"') | json_loads}}""",
... output_type=dict,
... custom_filters={"json_loads": json.loads},
... )
>>> print(adapter.run(**result))
{'output': {'Car': {'Name': 'Fiat Panda', 'Manufacturer': 'Fiat', 'Designers': ['Giorgetto Giugiaro', 'Aldo Mantovani', 'Giuliano Biasio', 'Roberto Giolito'], 'Number of units produced': 'over 7.8 million'}}}
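One caveat with replacing every single quote is that a reply that is already valid JSON can be corrupted when a value contains an apostrophe (for example "Pomigliano d'Arco"). A minimal sketch of a more defensive parser follows; `parse_reply` is a hypothetical helper, not a Haystack function, and it simply tries strict parsing before falling back to the quote replacement.

```python
import json


def parse_reply(reply):
    """Parse a model reply into a dict, preferring the untouched string.

    Hypothetical helper: tries strict JSON first and only falls back to
    swapping single quotes for double quotes if that fails.
    """
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        return json.loads(reply.replace("'", '"'))


valid = '{"Plant": {"City": "Pomigliano d\'Arco"}}'
single_quoted = "{'Car': {'Name': 'Fiat Panda'}}"
print(parse_reply(valid))
print(parse_reply(single_quoted))
```

Such a function could presumably be registered through `custom_filters` in the same way as `json_loads` above; that wiring is an assumption and is not tested here.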
Information Extraction Pipeline
Build the Pipeline
We can now create our Pipeline by adding and connecting the individual Components.
from haystack import Pipeline
ie_pipe = Pipeline()
ie_pipe.add_component("fetcher", fetcher)
ie_pipe.add_component("converter", converter)
ie_pipe.add_component("prompt_builder", prompt_builder)
ie_pipe.add_component("generator", generator)
ie_pipe.add_component("adapter", adapter)
ie_pipe.connect("fetcher", "converter")
ie_pipe.connect("converter", "prompt_builder")
ie_pipe.connect("prompt_builder", "generator")
ie_pipe.connect("generator", "adapter")
# IN CASE YOU NEED TO RECREATE THE PIPELINE FROM SCRATCH, YOU CAN UNCOMMENT THIS CELL
# ie_pipe = Pipeline()
# ie_pipe.add_component("fetcher", LinkContentFetcher())
# ie_pipe.add_component("converter", HTMLToDocument())
# ie_pipe.add_component("prompt_builder", PromptBuilder(template=prompt_template))
# ie_pipe.add_component("generator", HuggingFaceLocalGenerator(model="numind/NuExtract",
# huggingface_pipeline_kwargs={"model_kwargs": {"torch_dtype":torch.bfloat16}})
# )
# ie_pipe.add_component("adapter", OutputAdapter(template="""{{ replies[0]| replace("'",'"') | json_loads}}""",
# output_type=dict,
# custom_filters={"json_loads": json.loads}))
# ie_pipe.connect("fetcher", "converter")
# ie_pipe.connect("converter", "prompt_builder")
# ie_pipe.connect("prompt_builder", "generator")
# ie_pipe.connect("generator", "adapter")
Let's review our pipeline setup:
>>> ie_pipe.show()
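Define the sources and the extraction schema
We select a list of URLs related to recent startup funding announcements.
In addition, we define a schema for the structured information we aim to extract.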
urls = [
"https://techcrunch.com/2023/04/27/pinecone-drops-100m-investment-on-750m-valuation-as-vector-database-demand-grows/",
"https://techcrunch.com/2023/04/27/replit-funding-100m-generative-ai/",
"https://www.cnbc.com/2024/06/12/mistral-ai-raises-645-million-at-a-6-billion-valuation.html",
"https://techcrunch.com/2024/01/23/qdrant-open-source-vector-database/",
"https://www.intelcapital.com/anyscale-secures-100m-series-c-at-1b-valuation-to-radically-simplify-scaling-and-productionizing-ai-applications/",
"https://techcrunch.com/2023/04/28/openai-funding-valuation-chatgpt/",
"https://techcrunch.com/2024/03/27/amazon-doubles-down-on-anthropic-completing-its-planned-4b-investment/",
"https://techcrunch.com/2024/01/22/voice-cloning-startup-elevenlabs-lands-80m-achieves-unicorn-status/",
"https://techcrunch.com/2023/08/24/hugging-face-raises-235m-from-investors-including-salesforce-and-nvidia",
"https://www.prnewswire.com/news-releases/ai21-completes-208-million-oversubscribed-series-c-round-301994393.html",
"https://techcrunch.com/2023/03/15/adept-a-startup-training-ai-to-use-existing-software-and-apis-raises-350m/",
"https://www.cnbc.com/2023/03/23/characterai-valued-at-1-billion-after-150-million-round-from-a16z.html",
]
schema = {
"Funding": {
"New funding": "",
"Investors": [],
},
"Company": {"Name": "", "Activity": "", "Country": "", "Total valuation": "", "Total funding": ""},
}
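Model output is not guaranteed to follow the schema exactly, so it can be useful to check which expected keys are absent from a result. The sketch below uses a hypothetical helper, `missing_keys`, and a placeholder result with two fields deliberately left out; it compares key names only, not value types.

```python
def missing_keys(schema, result, prefix=""):
    """Return dotted paths of schema keys that are absent from result.

    Hypothetical helper; compares key names only, not value types.
    """
    missing = []
    for key, template_value in schema.items():
        path = f"{prefix}.{key}" if prefix else key
        if not isinstance(result, dict) or key not in result:
            missing.append(path)
        elif isinstance(template_value, dict):
            # Recurse into nested sections such as "Funding" or "Company".
            missing.extend(missing_keys(template_value, result[key], path))
    return missing


schema = {
    "Funding": {"New funding": "", "Investors": []},
    "Company": {"Name": "", "Activity": "", "Country": "", "Total valuation": "", "Total funding": ""},
}
# Placeholder result, not real model output.
sample_result = {
    "Funding": {"New funding": "", "Investors": []},
    "Company": {"Name": "", "Activity": "", "Country": ""},
}
print(missing_keys(schema, sample_result))
# → ['Company.Total valuation', 'Company.Total funding']
```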
Run the Pipeline!
We pass the required data to each component.
Note that most of them receive data from previously executed components.
from tqdm import tqdm
extracted_data = []
for url in tqdm(urls):
    result = ie_pipe.run({"fetcher": {"urls": [url]}, "prompt_builder": {"schema": schema}})
    extracted_data.append(result["adapter"]["output"])
Let's inspect some of the extracted data:
extracted_data[:2]
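A single failing URL (an unreachable page, or a reply that is not valid JSON) would stop the loop above. A minimal sketch of a more tolerant loop follows; `run_for_each` and `fake_run` are hypothetical names, and `fake_run` only stands in for the pipeline call so that the snippet runs on its own.

```python
def run_for_each(urls, run_one):
    """Call run_one(url) per URL, collecting results and failures separately.

    Hypothetical helper; run_one stands in for a call to the pipeline.
    """
    results, failures = [], []
    for url in urls:
        try:
            results.append(run_one(url))
        except Exception as exc:  # broad on purpose: fetching, generation or parsing may fail
            failures.append((url, repr(exc)))
    return results, failures


def fake_run(url):
    # Stand-in for the pipeline call; fails for one placeholder URL.
    if "broken" in url:
        raise ValueError("could not parse model reply")
    return {"Company": {"Name": url}}


results, failures = run_for_each(["https://example.com/a", "https://example.com/broken"], fake_run)
print(len(results), len(failures))
```

In the notebook, `run_one` could wrap the same call used above, that is `ie_pipe.run(...)` followed by `["adapter"]["output"]`.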
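Data exploration and visualization
Let's explore the extracted data to assess its correctness and gain insights.
Dataframe
We start by creating a Pandas Dataframe. For simplicity, we flatten the extracted data.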
def flatten_dict(d, parent_key=""):
    items = []
    for k, v in d.items():
        new_key = f"{parent_key} - {k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key).items())
        elif isinstance(v, list):
            items.append((new_key, ", ".join(v)))
        else:
            items.append((new_key, v))
    return dict(items)
import pandas as pd
df = pd.DataFrame([flatten_dict(el) for el in extracted_data])
df = df.sort_values(by="Company - Name")
df
Apart from some errors in "Company - Country", the extracted data looks good.
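To make the flattening step concrete, the sketch below applies the same function to one placeholder record (not real extracted data). The function is repeated so that the snippet is self-contained; nested keys are joined with " - " and lists are joined into a comma-separated string.

```python
def flatten_dict(d, parent_key=""):
    # Same logic as the function above, repeated so the snippet is self-contained.
    items = []
    for k, v in d.items():
        new_key = f"{parent_key} - {k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key).items())
        elif isinstance(v, list):
            items.append((new_key, ", ".join(v)))
        else:
            items.append((new_key, v))
    return dict(items)


# Placeholder record, not real extracted data.
record = {
    "Funding": {"New funding": "10M", "Investors": ["Investor A", "Investor B"]},
    "Company": {"Name": "Example Corp", "Country": ""},
}
flat = flatten_dict(record)
print(flat)
```

The resulting keys, such as "Company - Name", are the column names used when sorting the dataframe.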
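Construct a simple graph
To understand the relationships between companies and investors, we build a graph and visualize it.
First, we build the graph using NetworkX.
NetworkX is a Python package for easily creating and manipulating networks/graphs.
Our simple graph will have companies and investors as nodes. If an investor and a company are mentioned in the same document, we connect them.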
import networkx as nx
# Create a new graph
G = nx.Graph()
# Add nodes and edges
for el in extracted_data:
    company_name = el["Company"]["Name"]
    G.add_node(company_name, label=company_name, title="Company")
    investors = el["Funding"]["Investors"]
    for investor in investors:
        if not G.has_node(investor):
            G.add_node(investor, label=investor, title="Investor", color="red")
        G.add_edge(company_name, investor)
Next, we visualize the graph using Pyvis.
Pyvis is a Python package for interactive visualization of networks/graphs. It integrates well with NetworkX.
from pyvis.network import Network
from IPython.display import display, HTML
net = Network(notebook=True, cdn_resources="in_line")
net.from_nx(G)
net.show("simple_graph.html")
display(HTML("simple_graph.html"))
It looks like Andreessen Horowitz appears quite often in the selected funding announcements 😊
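A visual impression like this can also be checked numerically. The sketch below counts in how many announcements each investor is mentioned; the records are placeholders with made-up names, and in the notebook one would iterate over `extracted_data` instead of `sample_data`.

```python
from collections import Counter

# Placeholder records shaped like the pipeline output; all names are made up.
sample_data = [
    {"Company": {"Name": "Company A"}, "Funding": {"Investors": ["Investor X", "Investor Y"]}},
    {"Company": {"Name": "Company B"}, "Funding": {"Investors": ["Investor X"]}},
    {"Company": {"Name": "Company C"}, "Funding": {"Investors": ["Investor Z", "Investor X"]}},
]

# Count in how many announcements each investor appears (set() avoids double counting within one).
investor_counts = Counter(
    investor for el in sample_data for investor in set(el["Funding"]["Investors"])
)
print(investor_counts.most_common(1))
# → [('Investor X', 3)]
```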
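Conclusion and ideas
In this notebook, we demonstrated how to set up an information extraction system using a small language model (NuExtract) and Haystack, a customizable orchestration framework for LLM applications.
How can we use the extracted data?
Some ideas:
- The extracted data can be added to the original documents stored in a Document Store. This enables advanced search capabilities through metadata filtering.
- Building on the previous idea, you can do RAG (Retrieval-Augmented Generation) and extract metadata from queries, as explained in this blog post.
- Store the documents and extracted data in a knowledge graph and perform Graph RAG (Neo4j-Haystack integration).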