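Taxonomy Completion with Embedding Quantization and an LLM-Pipeline: A Case Study in Computational Linguistics

Community Article Published July 22, 2024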

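The ever-growing volume of research publications calls for efficient methods of structuring academic knowledge. This task typically involves developing a supervised underlying scheme of classes and allocating publications to the most relevant class. In this article, we implement an end-to-end automated solution using embedding quantization and a Large Language Model (LLM) pipeline. Our case study starts with a dataset of 25,000 arXiv publications from Computational Linguistics (cs.CL), published before July 2024, which we organize under a novel scheme of classes.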

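Methodology

Our approach centers on three key tasks: (i) unsupervised clustering of the arXiv dataset into related collections, (ii) discovering the latent thematic structure within each cluster, and (iii) creating a candidate taxonomy scheme based on said thematic structures.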

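At its core, the clustering task requires identifying a sufficient number of similar examples within an unlabeled dataset. This is a natural task for embeddings, as they capture semantic relationships in a corpus and can be provided as input features to a clustering algorithm that establishes similarity links among examples. We begin by transforming the (title:abstract) pairs of our dataset into an embeddings representation using Jina-Embeddings-v2, a BERT-ALiBi based attention model, and then apply scalar quantization using both Sentence Transformers and a custom implementation.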

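For clustering, we run HDBSCAN in a reduced dimensional space and compare the results of the eom and leaf cluster selection methods. Additionally, we examine whether applying (u)int8 embedding quantization instead of float32 representations affects this process.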

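To uncover latent topics within each cluster of arXiv publications, we combine LangChain and Pydantic with Mistral-7B-Instruct-v0.3 (and GPT-4o, included for comparison) into an LLM-pipeline. The output is then incorporated into a refined prompt template that guides Claude Sonnet 3.5 in generating a hierarchical taxonomy.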

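The results reveal 35 emerging research topics, each comprising at least 100 publications. These topics are organized within 7 parent classes and 20 subclasses in the field of Computational Linguistics (cs.CL). This approach may serve as a baseline for automatically generating hierarchical candidate schemes in high-level arXiv categories and for efficiently completing taxonomies, addressing the challenge posed by the increasing volume of academic literature.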

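1. Embedding Transformation

Embeddings are numerical representations of real-world objects such as text, images, and audio that encapsulate semantic information about the data they represent. AI models use them to understand complex knowledge domains in downstream applications such as clustering, information retrieval, and semantic understanding tasks, among others.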

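Supporting Large Sequences

We will map (title:abstract) pairs from arXiv publications to a 768-dimensional space using Jina-Embeddings-v2 [1], an open-source text embedding model capable of accommodating up to 8192 tokens. This provides a sufficiently large sequence length for titles, abstracts, and other document sections that might be relevant. To overcome the conventional 512-token limit present in other models, Jina-Embeddings-v2 incorporates bidirectional ALiBi [2] into the BERT framework. ALiBi (Attention with Linear Biases) enables input length extrapolation (i.e., sequences exceeding 2048 tokens) by encoding positional information directly within the self-attention layer, instead of introducing positional embeddings. In practice, it biases query-key attention scores with a penalty proportional to their distance, favoring stronger mutual attention between nearby tokens.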
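To make the distance penalty concrete, here is a minimal NumPy sketch of a symmetric (bidirectional) ALiBi bias. It is an illustration rather than the Jina implementation: the function name `alibi_bias` is hypothetical, and the per-head slopes follow the geometric sequence described in the ALiBi paper, which assumes the number of heads is a power of two.

```python
import numpy as np

def alibi_bias(seq_len: int, n_heads: int) -> np.ndarray:
    """Return an additive attention bias of shape (n_heads, seq_len, seq_len)."""
    # Head-specific slopes: 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    positions = np.arange(seq_len)
    # |i - j|: symmetric distance, so tokens on both sides are penalized equally
    distance = np.abs(positions[:, None] - positions[None, :])
    return -slopes[:, None, None] * distance[None, :, :]

bias = alibi_bias(seq_len=4, n_heads=8)
# The bias is added to the query-key scores of head h before the softmax:
# scores = q @ k.T / sqrt(d_head) + bias[h]
```

Because the penalty depends only on relative distance, nothing in it is tied to a maximum position, which is what allows sequences longer than those seen during training.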

Encoding with Sentence Transformers

The first step in using the Jina-Embeddings-v2 model is to load it through Sentence Transformers, a framework for accessing state-of-the-art models available on the Hugging Face Hub:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)

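We now encode the (title:abstract) pairs of our dataset using batch_size = 64. This allows for parallel computation on hardware accelerators such as GPUs (albeit at the cost of requiring more memory):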

from datasets import load_dataset
ds = load_dataset("dcarpintero/arxiv.cs.CL.25k", split="train")

corpus = [title + ':' + abstract for title, abstract in zip(ds['title'], ds['abstract'])]
f32_embeddings = model.encode(corpus,
                              batch_size=64,
                              show_progress_bar=True)

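Computing Semantic Similarity

The semantic similarity between any two documents in the corpus can now be computed simply as the inner product of their embeddings. In the accompanying heat map, each entry [x, y] is colored according to the embedding product of the exemplary "title" sentences [x] and [y].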
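A minimal sketch of that computation follows, with random vectors standing in for the real embeddings so that the snippet runs on its own. Normalizing the rows first turns the inner product into cosine similarity; whether the original heat map used raw or normalized products is not stated, so the normalization is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 768)).astype(np.float32)  # placeholder for f32_embeddings[:5]

# L2-normalize each row so that the inner product equals cosine similarity
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
similarity = unit @ unit.T                          # shape (5, 5), entry [x, y]

print(similarity.round(2))
```

The resulting matrix is symmetric with ones on the diagonal, and it can be passed directly to a heat map plotting function.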

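2. Embedding Quantization for Memory Saving

Scaling up embeddings can be challenging. Currently, state-of-the-art models represent each embedding dimension as a float32 value, which requires 4 bytes of memory. Given that Jina-Embeddings-v2 maps text to a 768-dimensional space, the memory requirements for our dataset are around 73 MB, excluding indexes and other metadata related to the publication records: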

25,000 embeddings * 768 dimensions/embedding * 4 bytes/dimension = 76,800,000 bytes
76,800,000 bytes / (1024^2) ≈ 73.24 MB

However, working with larger datasets can significantly increase the memory requirements and associated costs:

Embedding dimension	Embedding model	2.5M ArXiv abstracts	60.9M Wikipedia pages	100M embeddings
384	all-MiniLM-L12-v2	3.57 GB	85.26 GB	142.88 GB
768	all-mpnet-base-v2	7.15 GB	170.52 GB	285.76 GB
768	jina-embeddings-v2	7.15 GB	170.52 GB	285.76 GB
1536	openai-text-embedding-3-small	14.31 GB	341.04 GB	571.53 GB
3072	openai-text-embedding-3-large	28.61 GB	682.08 GB	1.143 TB
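The arithmetic behind these estimates can be wrapped in a small helper. The function name `embedding_memory_bytes` is hypothetical; it counts raw matrix storage only, without indexes or metadata.

```python
def embedding_memory_bytes(n_embeddings: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for a dense embedding matrix (no index or metadata)."""
    return n_embeddings * dimensions * bytes_per_value

f32_bytes = embedding_memory_bytes(25_000, 768, bytes_per_value=4)   # float32
int8_bytes = embedding_memory_bytes(25_000, 768, bytes_per_value=1)  # (u)int8

print(f"float32: {f32_bytes / 1024**2:.2f} MB")  # float32: 73.24 MB
print(f"int8:    {int8_bytes / 1024**2:.2f} MB")  # int8:    18.31 MB
```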

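A technique used to achieve memory savings is quantization. The intuition behind this approach is that we can discretize floating-point values by mapping their range [f_min, f_max] onto a smaller range of fixed-point numbers [q_min, q_max], distributing all values in between linearly. In practice, this typically reduces the precision of 32-bit floating-point values to lower bit widths such as 8 bits (scalar quantization) or 1 bit (binary quantization).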
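For the 1-bit case, a common scheme thresholds each value at zero and packs eight dimensions into one byte. The sketch below assumes that scheme and uses random placeholder data; binary quantization is not used in the rest of this article, where we work with scalar (8-bit) quantization.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 768)).astype(np.float32)  # placeholder embeddings

# 1 bit per dimension: positive -> 1, otherwise 0; then pack 8 bits into each byte
bits = emb > 0
binary_emb = np.packbits(bits, axis=-1)              # shape (4, 96), dtype uint8

print(emb.nbytes, binary_emb.nbytes)                 # 12288 384
```

This amounts to a 32x reduction in raw storage, at the cost of discarding all magnitude information.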

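By plotting the frequency distribution of the Jina-generated embeddings, we observe that the values are indeed concentrated in a relatively narrow range, [-2.0, +2.0]. This means we can effectively map float32 values to 256 (u)int8 buckets without significant loss of information: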

import matplotlib.pyplot as plt

plt.hist(f32_embeddings.flatten(), bins=250, edgecolor='C0')
plt.xlabel('float-32 jina-embeddings-v2')
plt.title('distribution')
plt.show()

We can calculate the exact [min, max] values of the distribution:

>>> import numpy as np
>>> np.min(f32_embeddings), np.max(f32_embeddings)
(-2.0162134, 2.074683)

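The first step in implementing scalar quantization is to define a calibration set of embeddings. A typical starting point is a subset of 10k embeddings, which in our case covers nearly 99.98% of the original float32 embedding values. The purpose of calibration is to obtain representative f_min and f_max values along each dimension, reducing the computational overhead and the issues caused by outliers that may appear in larger datasets.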

def calibration_accuracy(embeddings: np.ndarray, k: int = 10000) -> float:
    # Per-dimension [f_min, f_max] taken from the first k embeddings
    calibration_embeddings = embeddings[:k]
    f_min = np.min(calibration_embeddings, axis=0)
    f_max = np.max(calibration_embeddings, axis=0)

    # Percentage of values within the calibration range, for each dimension
    in_range = (embeddings >= f_min) & (embeddings <= f_max)
    dim_percentage = in_range.mean(axis=0) * 100

    # Average over all dimensions
    return float(np.mean(dim_percentage))

acc = calibration_accuracy(f32_embeddings, k=10000)
print(f"Average percentage of embeddings within [f_min, f_max] calibration: {acc:.5f}%")
>>> Average percentage of embeddings within [f_min, f_max] calibration: 99.98636%

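The second and third steps of scalar quantization (computing scales and zero point, and encoding) can be easily applied with Sentence Transformers, resulting in a 4x memory saving compared to the original float32 representation. Moreover, we also benefit from faster arithmetic, since matrix multiplication can be performed more quickly with integer operations.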

from sentence_transformers.quantization import quantize_embeddings

# quantization is applied in a post-processing step
int8_embeddings = quantize_embeddings(
    np.array(f32_embeddings),
    precision="int8",
    calibration_embeddings=np.array(f32_embeddings[:10000]),
)

f32_embeddings.dtype, f32_embeddings.shape, f32_embeddings.nbytes
>>> (dtype('float32'), (25107, 768), 77128704) # 73.5 MB

int8_embeddings.dtype, int8_embeddings.shape, int8_embeddings.nbytes
>>> (dtype('int8'), (25107, 768), 19282176)    # 18.3 MB

# calculate compression
(f32_embeddings.nbytes - int8_embeddings.nbytes) / f32_embeddings.nbytes * 100
>>> 75.0

For completeness, we implement a scalar quantization method to illustrate these three steps:

def scalar_quantize_embeddings(embeddings: np.ndarray,
                               calibration_embeddings: np.ndarray) -> np.ndarray:

    # Step 1: Calculate [f_min, f_max] per dimension from the calibration set
    f_min = np.min(calibration_embeddings, axis=0)
    f_max = np.max(calibration_embeddings, axis=0)

    # Step 2: Map [f_min, f_max] to [q_min, q_max] => (scaling factors, zero point)
    q_min = 0
    q_max = 255
    scales = (f_max - f_min) / (q_max - q_min)
    zero_point = 0 # uint8 quantization inherently maps f_min to zero

    # Step 3: encode (scale, clip, truncate to uint8)
    # Values outside the calibration range are clipped first, since casting
    # out-of-range floats to uint8 is not well defined
    scaled = (embeddings - f_min) / scales
    quantized_embeddings = np.clip(scaled, q_min, q_max).astype(np.uint8)

    return quantized_embeddings

calibration_embeddings = f32_embeddings[:10000]
beta_uint8_embeddings = scalar_quantize_embeddings(f32_embeddings, calibration_embeddings)

beta_uint8_embeddings[5000][64:128].reshape(8, 8)

array([[187, 111,  96, 128, 116, 129, 130, 122],
       [132, 153,  72, 136,  94, 120, 112,  93],
       [143, 121, 137, 143, 195, 159,  90,  93],
       [178, 189, 143,  99,  99, 151,  93, 102],
       [179, 104, 146, 150, 176,  94, 148, 118],
       [161, 138,  90, 122,  93, 146, 140, 129],
       [121, 115, 153, 118, 107,  45,  70, 171],
       [207,  53,  67, 115, 223, 105, 124, 158]], dtype=uint8)
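As a sanity check on a scheme like the one above, quantized values can be mapped back to floats and compared with the originals. The sketch below is self-contained, uses random data in place of the real embeddings, and the helper names `quantize` and `dequantize` are hypothetical. Because the encoding truncates, the reconstruction error of in-range values stays within roughly one quantization step.

```python
import numpy as np

def quantize(x, f_min, scales):
    # Same encoding idea as above: shift, scale, clip to the uint8 range, truncate
    return np.clip((x - f_min) / scales, 0, 255).astype(np.uint8)

def dequantize(q, f_min, scales):
    # Inverse mapping: bucket index back to the lower edge of its float interval
    return q.astype(np.float32) * scales + f_min

rng = np.random.default_rng(0)
x = rng.normal(scale=0.5, size=(1000, 16)).astype(np.float32)  # placeholder data

f_min, f_max = x.min(axis=0), x.max(axis=0)
scales = (f_max - f_min) / 255

x_hat = dequantize(quantize(x, f_min, scales), f_min, scales)
max_err = np.abs(x - x_hat).max()
print(f"max abs error: {max_err:.4f} | largest step: {scales.max():.4f}")
```

Rounding to the nearest bucket instead of truncating would halve the worst-case error, at the price of producing slightly different codes from the ones shown above.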

We will continue with the version of the embeddings quantized with Sentence Transformers (our custom implementation is also included in the results analysis):

# `f32_embeddings` => if you prefer to not use quantization
# `beta_uint8_embeddings` => to check our custom implementation
embeddings = int8_embeddings

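3. Projecting Embeddings for Dimensionality Reduction

In this section, we perform a two-stage projection of the (title:abstract) embeddings from their original high-dimensional space (768) to lower dimensions, namely:

5 dimensions for reducing computational complexity during clustering, and
2 dimensions for enabling visual representation in (x, y) coordinates.

For both projections, we employ UMAP [3], a popular dimensionality reduction technique known for its effectiveness in preserving both local and global data structure. In practice, this makes it a preferred choice for handling complex datasets with high-dimensional embeddings: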

import umap

embedding_5d = umap.UMAP(n_neighbors=100, # consider 100 nearest neighbors for each point
                         n_components=5,  # reduce embedding space from 768 to 5 dimensions
                         min_dist=0.1,    # maintain local and global balance
                         metric='cosine').fit_transform(embeddings)

embedding_2d = umap.UMAP(n_neighbors=100,
                         n_components=2,
                         min_dist=0.1,
                         metric='cosine').fit_transform(embeddings)

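Note that when we apply HDBSCAN clustering in the next step, the clusters found will be influenced by how UMAP preserved the local structure. A smaller n_neighbors value means UMAP will focus more on local structure, whereas a larger value captures a more global representation, which can be beneficial for understanding overall patterns in the data.

4. Semantic Clustering

The reduced (title:abstract) embeddings can now be used as input features for a clustering algorithm, enabling the identification of related categories based on embedding distances.

We have opted for HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) [4], an advanced clustering algorithm that extends DBSCAN by adapting to clusters of varying density. Unlike K-Means, which requires the number of clusters to be specified in advance, HDBSCAN has only one important hyperparameter, n, which establishes the minimum number of examples to include in a cluster (min_cluster_size in the code below).

HDBSCAN works by first transforming the data space according to the density of the data points, making denser regions (areas where data points are tightly packed) more attractive for cluster formation. The algorithm then builds a hierarchy of clusters based on the minimum cluster size established by the hyperparameter n. This allows it to distinguish between noise (sparse areas) and dense regions (potential clusters). Finally, HDBSCAN condenses this hierarchy to derive the most persistent clusters, identifying clusters of different densities and shapes. As a density-based method, it can also detect outliers.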
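The first step, transforming the space according to density, is usually formulated with the mutual reachability distance: each point gets a core distance (the distance to its k-th nearest neighbor), and the distance between two points is raised to at least the larger of their two core distances. Below is a minimal NumPy sketch on toy data. The function name and the choice k=3 are for illustration only, conventions differ on whether a point counts as its own neighbor, and the hdbscan library computes this internally and far more efficiently.

```python
import numpy as np

def mutual_reachability(X: np.ndarray, k: int = 3) -> np.ndarray:
    """Pairwise mutual reachability distances for a small dataset X of shape (n, d)."""
    # Plain Euclidean distance matrix
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Core distance: distance to the k-th nearest neighbor (index 0 is the point itself)
    core = np.sort(dist, axis=1)[:, k]
    # d_mreach(a, b) = max(core(a), core(b), d(a, b))
    return np.maximum(dist, np.maximum(core[:, None], core[None, :]))

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.1, size=(20, 2))   # tightly packed group
sparse = rng.uniform(low=3.0, high=9.0, size=(5, 2))   # scattered points
points = np.vstack([dense, sparse])
mreach = mutual_reachability(points, k=3)
```

Points in sparse regions have large core distances, so they are pushed away from everything else, while distances inside dense regions remain almost unchanged. The cluster hierarchy is then built on these transformed distances.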

import hdbscan

hdbs = hdbscan.HDBSCAN(min_cluster_size=100,            # conservative clusters' size
                       metric='euclidean',              # points distance metric
                       cluster_selection_method='leaf') # favour fine grained clustering
clusters = hdbs.fit_predict(embedding_5d)               # apply HDBSCAN on reduced UMAP

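The cluster_selection_method determines how HDBSCAN selects flat clusters from the tree hierarchy. In our case, using the eom (Excess of Mass) cluster selection method in combination with embedding quantization tended to create a few larger, less specific clusters. These clusters would have required a further "reclustering" process to extract meaningful latent topics. By switching to the leaf selection method instead, we guide the algorithm to select leaf nodes from the cluster hierarchy, which produces a more fine-grained clustering than the Excess of Mass method.

Comparison of the HDBSCAN eom and leaf clustering methods using int8-embedding-quantization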
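When comparing selection methods, it helps to reduce each run to a few numbers. The helper below is a hypothetical sketch (its name and returned fields are our own choice) that summarizes HDBSCAN-style labels, where -1 marks noise points.

```python
from collections import Counter

def summarize_clusters(labels):
    """Count clusters, noise points, and cluster sizes from HDBSCAN-style labels."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)            # HDBSCAN marks noise points with -1
    return {
        "n_clusters": len(counts),
        "noise_ratio": n_noise / len(labels),
        "largest": max(counts.values(), default=0),
        "smallest": min(counts.values(), default=0),
    }

# Toy labels; in the article's setting these would come from hdbs.fit_predict(embedding_5d)
toy_labels = [0, 0, 0, 1, 1, -1, 2, 2, 2, 2]
summary = summarize_clusters(toy_labels)
print(summary)
# {'n_clusters': 3, 'noise_ratio': 0.1, 'largest': 4, 'smallest': 2}
```

Running it once on the eom labels and once on the leaf labels shows at a glance whether one method yields fewer, larger clusters than the other.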

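5. Uncovering Latent Topics with an LLM-Pipeline

Having performed the clustering step, we now illustrate how to infer the latent topic of each cluster by combining an LLM such as Mistral-7B-Instruct [5] with Pydantic and LangChain, creating an LLM pipeline that generates output in a composable structured format.

5.1 Pydantic Model

Pydantic models are classes that derive from pydantic.BaseModel and define fields as type-annotated attributes. They are similar to Python dataclasses, but they are designed with subtle yet significant differences that optimize operations such as validation, serialization, and JSON schema generation. Our Topic class defines a single field named label. This makes the LLM produce output in a structured format rather than as a free-form block of text, which facilitates processing and analysis.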

from pydantic import BaseModel, Field

class Topic(BaseModel):
    """
    Pydantic Model to generate a structured Topic Model
    """
    label: str = Field(..., description="Identified topic")

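5.2 LangChain Prompt Template

LangChain prompt templates are predefined recipes for translating user input and parameters into instructions for a language model. We define here the prompt for our intended task.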
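One detail is worth noting before reading the prompt that follows: its OUTPUT FORMAT section uses doubled braces. LangChain's default template format follows Python's `str.format` conventions, so `{{` and `}}` render as literal braces while `{titles}` is a placeholder. The snippet illustrates this with plain `str.format` on a shortened, hypothetical template; the example titles are illustrative and not taken from the dataset.

```python
# Shortened stand-in for the full prompt; only the brace handling matters here
template = 'OUTPUT FORMAT:\n{{"label": "Topic Name"}}\n\nTITLES:\n{titles}'

titles = [
    "Attention Is All You Need",
    "BERT: Pre-training of Deep Bidirectional Transformers",
]
rendered = template.format(titles="\n".join(f"- {t}" for t in titles))
print(rendered)
```

Without the doubling, the JSON example itself would be interpreted as a template variable and rendering would fail.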

from langchain_core.prompts import PromptTemplate

topic_prompt = """
  You are a helpful research assistant. Your task is to analyze a set of research paper
  titles related to Natural Language Processing, and determine the overarching topic. 
            
  INSTRUCTIONS:

  1. Based on the titles provided, identify the most relevant topic:
    - Ensure the topic is concise and clear.
            
  2. Format Response:
    - Ensure the title response is in JSON as in the 'OUTPUT FORMAT' section below.
    - No follow up questions are needed.

  OUTPUT FORMAT:

  {{"label": "Topic Name"}}

  TITLES:
  {titles}
  """

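5.3 Inference Chain using LangChain Expression Language

Let's now compose a topic modeling pipeline with the LangChain Expression Language (LCEL) that renders our prompt template into LLM input and parses the inference output as JSON.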

import os
from typing import List

from langchain_core.output_parsers import PydanticOutputParser
from langchain_huggingface import HuggingFaceEndpoint

def TopicModeling(titles: List[str]) -> Topic:
    """
    Infer the common topic of the given titles w/ LangChain, Pydantic, Mistral-7B-Instruct
    """
    repo_id = "mistralai/Mistral-7B-Instruct-v0.3"
    llm = HuggingFaceEndpoint(
        repo_id=repo_id,
        temperature=0.2,
        huggingfacehub_api_token=os.environ["HUGGINGFACEHUB_API_TOKEN"]
    )
    prompt = PromptTemplate.from_template(topic_prompt)
    parser = PydanticOutputParser(pydantic_object=Topic)

    # LCEL chain: render the prompt, run inference, parse the output into a Topic
    topic_chain = prompt | llm | parser
    return topic_chain.invoke({"titles": titles})
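Conceptually, the parsing step locates the JSON object in the model's raw text and validates it against the expected schema. The following standard-library sketch illustrates that idea only. It is not how `PydanticOutputParser` is implemented, and both `parse_topic` and the sample response are hypothetical.

```python
import json
import re

def parse_topic(raw_output: str) -> str:
    """Extract the 'label' field from an LLM response that contains a JSON object."""
    match = re.search(r"\{.*\}", raw_output, flags=re.DOTALL)  # first '{' to last '}'
    if match is None:
        raise ValueError("no JSON object found in model output")
    data = json.loads(match.group(0))
    label = data.get("label")
    if not isinstance(label, str) or not label.strip():
        raise ValueError("'label' must be a non-empty string")
    return label.strip()

raw = 'Sure! Here is the result:\n{"label": "Machine Translation"}'
print(parse_topic(raw))  # Machine Translation
```

Failing loudly on malformed output makes it easy to retry a cluster, instead of silently storing an unusable topic label.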

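To enable the model to infer the topic of each cluster, we include a subset of 25 paper titles from each cluster as part of the LLM input:

# `df` holds one row per publication, with its 'title' and HDBSCAN 'cluster' label
topic_map = {}
for cluster_id, cluster in df.groupby('cluster'):
    if cluster_id == -1:  # HDBSCAN labels noise points as -1; they have no topic
        continue
    titles = cluster['title'].sample(25).tolist()
    topic_map[cluster_id] = TopicModeling(titles).label

Let's assign each arXiv publication the topic of its corresponding cluster:

# Keyed by cluster id, so the mapping stays correct even if noise points are present
df['topic'] = df['cluster'].map(topic_map)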

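6. Generating a Taxonomy

To create a hierarchical taxonomy, we craft a prompt that guides Claude Sonnet 3.5 in organizing the research topics identified for each cluster into a hierarchical scheme.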

from langchain_core.prompts import PromptTemplate

taxonomy_prompt = """
    Create a comprehensive and well-structured taxonomy
    for the ArXiv cs.CL (Computational Linguistics) category.
    This taxonomy should organize subtopics in a logical manner.

    INSTRUCTIONS:

    1. Review and Refine Subtopics:
      - Examine the provided list of subtopics in computational linguistics.
      - Ensure each subtopic is clearly defined and distinct from others.

    2. Create Definitions:
      - For each subtopic, provide a concise definition (1-2 sentences).

    3. Develop a Hierarchical Structure:
      - Group related subtopics into broader categories.
      - Create a multi-level hierarchy, with top-level categories and nested subcategories.
      - Ensure that the structure is logical and intuitive for researchers in the field.

    4. Validate and Refine:
      - Review the entire taxonomy for consistency, completeness, and clarity.

    OUTPUT FORMAT:

    - Present the final taxonomy in a clear, hierarchical format, with:

      . Main categories
        .. Subcategories
          ... Individual topics with their definitions

    SUBTOPICS:
    {taxonomy_subtopics}
    """

7. Results

7.1 Clustering Analysis

Let's create an interactive scatter plot:

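import altair as alt

# Altair raises an error above 5,000 rows by default; lift the limit for the 25k dataset
alt.data_transformers.disable_max_rows()

# `df` is expected to carry the 2D UMAP coordinates in its 'x' and 'y' columns
chart = alt.Chart(df).mark_circle(size=5).encode(
    x='x',
    y='y',
    color='topic:N',
    tooltip=['title', 'topic']
).interactive().properties(
    title='Clustering and Topic Modeling | 25k arXiv cs.CL publications',
    width=600,
    height=400,
)
chart.display()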

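And compare the clustering results obtained with the float32 embedding representation and with int8 Sentence Transformers quantization:

HDBSCAN leaf clustering using float32 and quantized-int8 embeddings (sentence-transformers-quantization)

We now perform the same comparison with our custom quantization implementation:

HDBSCAN leaf clustering using float32 and quantized-uint8 embeddings (custom quantization implementation)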

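The clustering results using float32 and (u)int8 quantized embeddings show a similar overall layout of well-defined clusters, indicating that (i) the HDBSCAN clustering algorithm was effective in both cases, and (ii) the core relationships in the data were maintained after quantization (with both Sentence Transformers and our custom implementation).

Notably, embedding quantization resulted in both cases in a slightly more granular clustering (35 clusters versus 31 with float32) that appears to be semantically coherent. Our tentative hypothesis for this difference is that scalar quantization might, paradoxically, guide the HDBSCAN clustering algorithm to separate points that were previously grouped together.

This could be due to (i) noise (quantization creates small noisy variations in the data, which might have a kind of regularization effect and lead to more sensitive clustering decisions), or (ii) changes in numerical precision and distance calculations (which could amplify certain differences between points that were less pronounced in the float32 representation). Further investigation would be necessary to fully understand the implications of quantization for clustering.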
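One possible starting point for such an investigation is to quantify how much two labelings of the same publications agree, for instance with the adjusted Rand index (1.0 for identical partitions, around 0 for chance-level agreement). This metric is our suggestion and was not part of the original analysis. The pure-Python sketch below runs on toy labels and treats the noise label -1 as a group of its own.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b) -> float:
    """Adjusted Rand index between two flat clusterings of the same points."""
    n = len(labels_a)
    pairs = Counter(zip(labels_a, labels_b))               # contingency table
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:                              # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

# Toy labelings: the second splits cluster 1 of the first into two clusters
float32_labels = [0, 0, 0, 1, 1, 1, 1, -1]
int8_labels    = [0, 0, 0, 1, 1, 2, 2, -1]
ari = adjusted_rand_index(float32_labels, int8_labels)
print(round(ari, 3))  # 0.629
```

A score close to 1 between the float32 and quantized runs would suggest that the additional clusters come from splitting existing groups rather than from a different overall structure.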

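7.2 Taxonomy Scheme

The entire hierarchical scheme is available at cs.CL.taxonomy. This approach may serve as a baseline for automatically identifying candidate classification schemes in high-level arXiv categories.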

. Foundations of Language Models
  .. Model Architectures and Mechanisms 
    ... Transformer Models and Attention Mechanisms
    ... Large Language Models (LLMs)
  .. Model Optimization and Efficiency
    ... Compression and Quantization
    ... Parameter-Efficient Fine-Tuning
    ... Knowledge Distillation
  .. Learning Paradigms
    ... In-Context Learning
    ... Instruction Tuning

. AI Ethics, Safety, and Societal Impact
  .. Ethical Considerations
    ... Bias and Fairness in Models
    ... Alignment and Preference Optimization
  .. Safety and Security
    ... Hallucination in LLMs
    ... Adversarial Attacks and Robustness
    ... Detection of AI-Generated Text
  .. Social Impact
    ... Hate Speech and Offensive Language Detection
    ... Fake News Detection

[...]

Citation

@article{carpintero2024,
  author = {Diego Carpintero},
  title = {Taxonomy Completion with Embedding Quantization and an LLM-Pipeline: A Case Study in Computational Linguistics},
  journal = {Hugging Face Blog},
  year = {2024},
  note = {https://huggingface.co/blog/dcarpintero/taxonomy-completion},
}

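References

[1] Günther et al. 2024. Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents. arXiv:2310.19923.
[2] Press et al. 2021. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. arXiv:2108.12409.
[3] McInnes et al. 2018. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426.
[4] Campello et al. 2013. Density-Based Clustering Based on Hierarchical Density Estimates. Advances in Knowledge Discovery and Data Mining. Vol. 7819. Berlin, Heidelberg: Springer Berlin Heidelberg. pp. 160-172. doi:10.1007/978-3-642-37456-2_14.
[5] Jiang et al. 2023. Mistral 7B. arXiv:2310.06825.
[6] Shakir et al. 2024. Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval. hf:shakir-embedding-quantization
[7] Liu, Yue, et al. 2024. Agent Design Pattern Catalogue: A Collection of Architectural Patterns for Foundation Model based Agents. arXiv:2405.10467.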

Resources

GitHub Repo
