Detecting LLM-Generated Text with Binoculars

Community Article · Published February 17, 2024

Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text

Credit to authors Abhimanyu Hans, Avi Schwarzschild, Valeriia Cherepanova, Hamid Kazemi, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein.

The ability to detect AI-generated text is an important problem, not only because of academic integrity but also because of concerns around misinformation, security, and copyright. A new method called Binoculars detects machine-generated text with over 90% accuracy at a false positive rate of just 0.01%. In this Jupyter Notebook, I annotate the key parts of the paper, explain the mechanism behind this new method, and implement it step by step. The code for the original paper is available here, and this Jupyter Notebook is available here.

The paper's authors have also created a Space on Hugging Face Spaces to try out the method.

The new Binoculars paper shows significant improvement over previous SoTA models.


LLM Detection

The motivation behind LLM detection is harm reduction: tracing text origins, blocking spam, and identifying fake news produced by LLMs. Preemptive detection methods attempt to "watermark" generated text, but this requires full control of the generating models, which already seems impossible. More recent work has therefore focused on post-hoc detection methods, which can be used without the cooperation of the text's author. The paper's authors note that post-hoc detectors fall into two main groups. The first fine-tunes a pretrained language model to perform binary classification. Many additional techniques make this approach more effective, but every implementation requires training on text produced by the target model, which is both computationally expensive and limited by the pace at which new models are open-sourced.

The second group uses statistical signatures of machine-generated text, aiming for zero-shot detection. This would allow a wide range of models to be detected with little to no training data. These methods use measures such as perplexity, perplexity curvature, log rank, intrinsic dimensionality, and n-gram analysis. The Binoculars paper proposes focusing on a low false positive rate (FPR) and strong performance on out-of-domain samples, rather than on classifier AUC, for the high-stakes application of LLM detection.

Part 1: Understanding Binoculars

Perplexity

Diagram of LLM inference

An LLM uses a tokenizer $T$ to parse a string $s$ into tokens, producing a list of tokens $\vec{x}$:

$$\mathcal{M}(T(s)) = \mathcal{M}(\vec{x}) = Y, \qquad Y_{ij} = P(v_j \mid x_{0:i-1}) \;\; \text{for all } j \in V$$

A natural way to predict whether text is machine-generated is then to measure how likely a language model $\mathcal{M}$ is to produce each token in $s$ given all of the tokens preceding it. Perplexity is a common benchmark for this, and we define the log perplexity ($\log \text{PPL}$) as the average negative log-likelihood of all tokens in $s$.

$$\log \text{PPL}_{\mathcal{M}}(s) = -\frac{1}{L} \sum_{i=1}^{L} \log\left(Y_{i x_i}\right), \quad \text{where } \vec{x} = T(s),\; Y = \mathcal{M}(\vec{x}),\; \text{and } L = \text{number of tokens in } s$$
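
As a quick illustration, here is a toy sketch of this formula (not the paper's code): Y is a random stand-in for the model's output matrix, and x holds hypothetical token ids.

import torch

# Toy sketch of log-PPL: Y[i, j] stands in for P(v_j | x_{0:i-1}),
# and x holds the observed token ids of the string s.
V = 5                                               # toy vocabulary size
x = torch.tensor([2, 0, 4, 1])                      # toy token ids, L = 4
Y = torch.softmax(torch.randn(len(x), V), dim=-1)   # stand-in model output

log_ppl = -torch.log(Y[torch.arange(len(x)), x]).mean()
print(float(log_ppl))  # average negative log-likelihood of the observed tokens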

Perplexity is a reasonable first approach for a couple of reasons:

  • Humans typically produce text with higher perplexity ("more surprising") than LLMs do.
  • Log perplexity is exactly the loss function LLMs are trained on, since the model is effectively asked to reproduce its test data given only each token's prefix.

However, perplexity alone is not sufficient as an LLM detection method. Many prompts can produce high-perplexity outputs, for example prompts that introduce highly specialized domain knowledge, complex vocabulary, or novel ideas, or that create heavily context-dependent outputs. Setting the perplexity threshold high enough to catch these LLM outputs would inevitably raise the false positive rate on human-written text that happens to be more predictable and less domain-specific.

As an example from the original paper shows, a prompt like "1, 2, 3" with a completion such as "4, 5, 6" has the lowest possible perplexity. A prompt about a capybara who is also an astrophysicist, however, produces a surprising output with high perplexity. Given the context, the output's perplexity would be much lower, but in practice LLM detection must work without access to the context or prompt that was given to the language model.

Motivation

Binoculars uses a mechanism motivated by the problem above to estimate the "baseline perplexity" induced by a text's context and prompt. By comparing the actual perplexity against this expected perplexity, we can better judge whether the text in question was generated by an LLM. This works because, under the same prompt and context, we can expect human-written text to have higher perplexity than machine-generated text.

To measure the baseline perplexity, the authors introduce cross-perplexity, a cross-entropy measurement between two models on the same string $s$:

$$\log \text{X-PPL}_{\mathcal{M}_1,\mathcal{M}_2}(s) = -\frac{1}{L} \sum_{i=1}^{L} \mathcal{M}_1(s)_i \cdot \log\left(\mathcal{M}_2(s)_i\right)$$

Note: this measure relies on the two models sharing the same tokenizer $T$.

This score essentially measures how "surprising" $\mathcal{M}_2$'s token predictions are to $\mathcal{M}_1$, giving some notion of how much perplexity machine-generated tokens might carry. Combined with perplexity, this gives the proposed Binoculars score:

$$B_{\mathcal{M}_1,\mathcal{M}_2}(s) = \frac{\log \text{PPL}_{\mathcal{M}_1}(s)}{\log \text{X-PPL}_{\mathcal{M}_1,\mathcal{M}_2}(s)}$$

With this mechanism, $\mathcal{M}_1$ acts as the observer model and $\mathcal{M}_2$ as the performer model. The score compares how surprising the string's actual tokens are (to the observer) with how surprising the performer model's predictions for the string are (to the observer).
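
Putting the three formulas together, here is a minimal end-to-end toy sketch using random stand-ins for both models' output distributions, following the ordering of the formulas as written above (see the note at the end of this post about observer/performer ordering in the reference code):

import torch

# Toy sketch of the Binoculars score; obs_p / perf_p stand in for the
# per-token probability distributions of M1 (observer) and M2 (performer).
L, V = 8, 50
x = torch.randint(V, (L,))                          # toy token ids
obs_p  = torch.softmax(torch.randn(L, V), dim=-1)   # M1(s)
perf_p = torch.softmax(torch.randn(L, V), dim=-1)   # M2(s)

log_ppl  = -torch.log(obs_p[torch.arange(L), x]).mean()   # log PPL_{M1}(s)
log_xppl = -(obs_p * torch.log(perf_p)).sum(-1).mean()    # log X-PPL_{M1,M2}(s)
print(float(log_ppl / log_xppl))                          # Binoculars score B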

Part 2: Implementation

Setup

%pip install sentencepiece transformers torch numpy gradio gradio_client
from typing import Union
import numpy as np
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
# Change according to hardware
DEVICE_1 = "cuda:0"
DEVICE_2 = "cpu"

Observer and Performer Models

We begin the implementation by choosing two language models: the observer model ($\mathcal{M}_1$) and the performer model ($\mathcal{M}_2$). This implementation follows the best-performing pairing from the Binoculars paper: Falcon-7B-Instruct as the observer and Falcon-7B as the performer. We first verify that the two tokenizers are identical.

torch.set_grad_enabled(False)

observer_name = "tiiuae/falcon-7b-instruct"
performer_name = "tiiuae/falcon-7b"

identical_tokens = (AutoTokenizer.from_pretrained(observer_name).vocab ==
                    AutoTokenizer.from_pretrained(performer_name).vocab)

identical_tokens
True
observer_model = AutoModelForCausalLM.from_pretrained(observer_name,
                                                                   device_map={"": DEVICE_1},
                                                                   trust_remote_code=True,
                                                                   torch_dtype=torch.bfloat16)

performer_model = AutoModelForCausalLM.from_pretrained(performer_name,
                                                                     device_map={"": DEVICE_2},
                                                                     trust_remote_code=True,
                                                                     torch_dtype=torch.bfloat16)

observer_model.eval()
performer_model.eval()

tokenizer = AutoTokenizer.from_pretrained(observer_name)

We then instantiate the tokenizer shared by the two models, which converts input text into tokens.

def tokenize(text):
    return tokenizer(text, return_tensors="pt")

tokenize("Hello, my dog is cute")
{'input_ids': tensor([[9856,   23,  491, 3696,  304, 7209]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}

Perplexity and Cross-Perplexity

To implement perplexity, we need the observer model's log probability for each token in the sequence. Using the observer model, we first obtain the logits for some string.

@torch.inference_mode()
def get_logits(encodings):
    observer_logits = observer_model(**encodings.to(DEVICE_1)).logits
    performer_logits = performer_model(**encodings.to(DEVICE_2)).logits
    return observer_logits, performer_logits

encoding = tokenize('''Dr. Capy Cosmos, a capybara unlike any other, astounded the scientific community with his 
groundbreaking research in astrophysics. With his keen sense of observation and unparalleled ability to interpret 
cosmic data, he uncovered new insights into the mysteries of black holes and the origins of the universe. As he 
peered through telescopes with his large, round eyes, fellow researchers often remarked that it seemed as if the 
stars themselves whispered their secrets directly to him. Dr. Cosmos not only became a beacon of inspiration to 
aspiring scientists but also proved that intellect and innovation can be found in the most unexpected of creatures.'''[:100])


observer_logits, performer_logits = get_logits(encoding)
observer_logits, performer_logits
(tensor([[[-23.1250, -18.1250,  -9.6250,  ..., -10.6875, -12.1250,  -9.1875],
          [-13.7500, -19.6250, -14.1875,  ..., -15.0000, -16.6250, -10.9375],
          [-13.0625, -16.8750, -14.8750,  ..., -17.8750, -15.3750, -14.5000],
          ...,
          [-12.9375, -12.3125, -12.6875,  ..., -16.0000, -14.8750, -18.0000],
          [-16.7500, -15.5000, -16.2500,  ..., -16.7500, -18.3750, -15.8125],
          [-16.1250, -17.6250, -14.5000,  ..., -18.1250, -19.1250, -17.2500]]],
        device='cuda:0', dtype=torch.bfloat16),
 tensor([[[-20.3750, -21.0000, -14.5625,  ..., -13.2500, -15.8750,  -8.5625],
          [-13.0625, -18.7500, -15.1875,  ..., -15.3125, -17.1250, -11.5625],
          [-12.3750, -17.6250, -16.6250,  ..., -17.3750, -16.1250, -14.3125],
          ...,
          [-10.9375, -12.4375, -11.9375,  ..., -14.1875, -14.6875, -16.6250],
          [-14.6250, -13.8750, -16.7500,  ..., -17.6250, -19.1250, -15.6250],
          [-13.6250, -15.5000, -14.5625,  ..., -18.5000, -19.0000, -16.6250]]],
        dtype=torch.bfloat16))
encoding.input_ids.shape, observer_logits.shape
(torch.Size([1, 26]), torch.Size([1, 26, 65024]))

These logits have shape $B \times S \times V$, where $B$ is the batch size, $S$ is the sequence length in tokens, and $V$ is the vocabulary size. Since we are only processing a single string, the batch size is 1, and the first 100 characters are tokenized into 26 tokens. We care about the logits at each position in the sequence (they provide the prediction for the next token), and we want $S$ vectors of size $V$ (roughly 65k here) containing the log probability of every token in the vocabulary.
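
As a standalone shape check (toy tensors, not the model outputs above), turning logits into per-position log probabilities is just a log_softmax over the vocabulary dimension:

import torch

# Toy shape check: (B, S, V) logits -> (B, S, V) log-probabilities,
# one distribution over the vocabulary per sequence position.
B, S, V = 1, 26, 65024
toy_logits = torch.randn(B, S, V)
log_probs = torch.log_softmax(toy_logits, dim=-1)
print(log_probs.shape, float(log_probs[0, 0].exp().sum()))  # each row sums to ~1.0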

S = observer_logits.shape[-2]
V = observer_logits.shape[-1]

observer_logits[..., :-1, :].contiguous().shape
torch.Size([1, 25, 65024])

We then compare these against the actual next tokens in the sequence (from the string's encoding), ignoring the first token since it is never predicted.

encoding.input_ids[..., 1:].shape
torch.Size([1, 25])

We ignore the attention mask for now, but it becomes a key component once we work with larger batches. To compute perplexity, we use the formula written earlier.

$$\log \text{PPL}_{\mathcal{M}}(s) = -\frac{1}{L} \sum_{i=1}^{L} \log\left(Y_{i x_i}\right), \quad \text{where } \vec{x} = T(s),\; Y = \mathcal{M}(\vec{x}),\; \text{and } L = \text{number of tokens in } s$$

Fortunately, PyTorch provides a class for exactly this purpose: torch.nn.CrossEntropyLoss. It can be called with model outputs of shape $(S, V)$ and targets of shape $(S)$; with a batch dimension, the class (vocabulary) dimension must come second, so we transpose the last two dimensions of the logits tensor to match.
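
For reference, here is a standalone shape check of the batched calling convention (toy tensors only, not the notebook's data):

import torch

# Toy shape check: with a batch dimension, CrossEntropyLoss expects the
# class (vocabulary) dimension second: inputs (B, V, S), targets (B, S).
B, S, V = 1, 25, 100
loss_check = torch.nn.CrossEntropyLoss(reduction='none')
out = loss_check(torch.randn(B, S, V).transpose(1, 2), torch.randint(V, (B, S)))
print(out.shape)  # torch.Size([1, 25]) -> one loss term per predicted token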

loss = torch.nn.CrossEntropyLoss(reduction='none')

ppl = loss(observer_logits[..., :-1, :].contiguous().transpose(1, 2).to("cpu"), 
     encoding.input_ids[..., 1:].contiguous().to("cpu")).float()

ppl, ppl.sum(1)
(tensor([[ 0.7148,  9.8750,  6.0312,  9.8125,  1.8359,  2.9688,  1.9375,  6.9375,
           0.0430,  0.0270, 11.5000,  0.2676,  0.1396,  0.3066,  9.0625,  1.7656,
           1.1250,  2.4531,  0.7109,  1.0859,  1.4297,  4.7188,  9.6875,  5.5938,
           8.8750]]),
 tensor([98.9049]))

With this, we have measured how "surprised" the observer model was by each token when doing next-token prediction, and summed the values. While some tokens have a perplexity as low as roughly 0.03, others are above 10.

We now implement cross-perplexity by running the same cross-entropy function, except that the target tensor is no longer a one-hot vector (the observed next token supplied by the user) but the softmax of the performer's logits.
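
A quick sanity check with toy tensors: with probability targets, CrossEntropyLoss reduces to $-\sum_j p_j \log(\text{softmax}(z)_j)$, which is exactly the inner term of the cross-perplexity formula above.

import torch

# Toy check: CrossEntropyLoss with soft (probability) targets equals
# -(target * log_softmax(input)).sum(-1), the inner term of log X-PPL.
toy_logits  = torch.randn(4, 10)                      # "observer" logits, (S, V)
toy_targets = torch.softmax(torch.randn(4, 10), -1)   # "performer" probabilities

ce  = torch.nn.CrossEntropyLoss(reduction='none')(toy_logits, toy_targets)
ref = -(toy_targets * torch.log_softmax(toy_logits, dim=-1)).sum(-1)
print(torch.allclose(ce, ref, atol=1e-6))  # True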

softmax = torch.nn.Softmax(dim=-1)

performer_probs = softmax(performer_logits).view(-1, V)

performer_probs, performer_probs.shape
(tensor([[5.8265e-12, 3.1122e-12, 1.9500e-09,  ..., 7.2177e-09, 5.2387e-10,
          7.8604e-07],
         [7.5903e-08, 2.5648e-10, 9.0804e-09,  ..., 7.9744e-09, 1.3024e-09,
          3.4086e-07],
         [6.6124e-08, 3.4743e-10, 9.4587e-10,  ..., 4.4565e-10, 1.5571e-09,
          9.5461e-09],
         ...,
         [1.3039e-07, 2.8987e-08, 4.7730e-08,  ..., 5.0350e-09, 3.0559e-09,
          4.4020e-10],
         [2.1071e-08, 4.4703e-08, 2.5175e-09,  ..., 1.0477e-09, 2.3465e-10,
          7.7416e-09],
         [2.5332e-07, 3.8883e-08, 9.9186e-08,  ..., 1.9354e-09, 1.1714e-09,
          1.2631e-08]], dtype=torch.bfloat16),
 torch.Size([26, 65024]))
observer_scores = observer_logits.view(-1, V).to("cpu")
observer_scores, observer_scores.shape
(tensor([[-23.1250, -18.1250,  -9.6250,  ..., -10.6875, -12.1250,  -9.1875],
         [-13.7500, -19.6250, -14.1875,  ..., -15.0000, -16.6250, -10.9375],
         [-13.0625, -16.8750, -14.8750,  ..., -17.8750, -15.3750, -14.5000],
         ...,
         [-12.9375, -12.3125, -12.6875,  ..., -16.0000, -14.8750, -18.0000],
         [-16.7500, -15.5000, -16.2500,  ..., -16.7500, -18.3750, -15.8125],
         [-16.1250, -17.6250, -14.5000,  ..., -18.1250, -19.1250, -17.2500]],
        dtype=torch.bfloat16),
 torch.Size([26, 65024]))

We use all of the scores (the observer logits and the performer softmax probabilities) except for the final position, which corresponds to a token after the last one.

xppl = loss(observer_scores[:-1], performer_probs[:-1]).view(-1, S - 1)

xppl, xppl.sum(1)
(tensor([[3.1406, 7.3438, 5.1875, 7.5000, 3.9375, 5.0312, 6.2500, 7.1250, 0.2930,
          0.2598, 4.4688, 1.2188, 0.9102, 1.6719, 4.4688, 1.8281, 4.0312, 4.6875,
          1.0078, 2.7656, 1.9531, 6.0000, 3.8125, 9.8125, 0.9219]],
        dtype=torch.bfloat16),
 tensor([95.5000], dtype=torch.bfloat16))

Binoculars Score

To obtain the Binoculars score from our work so far, we simply divide the observer model's perplexity score by the cross-perplexity score between the observer and performer models.

binocular_score = ppl.sum(1) / xppl.sum(1)

binocular_score
tensor([1.0357])

From here, all that remains is to determine a threshold on the Binoculars score, below which text is classified as machine-generated.
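
For illustration only, a minimal thresholding helper; the 0.90 cutoff is an assumed value taken from the 0.85-0.9 range discussed below, not a tuned one.

# Illustrative thresholding sketch; the cutoff value is an assumption.
BINOCULARS_THRESHOLD = 0.90  # assumed, from the 0.85-0.9 range discussed below

def classify(score: float, threshold: float = BINOCULARS_THRESHOLD) -> str:
    # Lower scores mean the text was more predictable given the baseline,
    # i.e. more likely to be machine-generated.
    return "machine-generated" if score < threshold else "human-written"

print(classify(1.0357))  # the score computed above -> "human-written"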

Let's package this process into functions so that we can quickly evaluate any given text.

# redefine to handle batch of strings
def tokenize(batch):
    encodings = tokenizer(batch, return_tensors="pt", 
    padding="longest" if len(batch) > 1 else False, truncation=True,
    max_length=512, return_token_type_ids=False).to(DEVICE_1)
    return encodings

# redefinition with cuda sync
@torch.inference_mode()
def get_logits(encodings):
    observer_logits = observer_model(**encodings.to(DEVICE_1)).logits
    performer_logits = performer_model(**encodings.to(DEVICE_2)).logits
    torch.cuda.synchronize()

    return observer_logits, performer_logits

loss_fn = torch.nn.CrossEntropyLoss(reduction='none')
softmax_fn = torch.nn.Softmax(dim=-1)

def perplexity(encoding, logits):
    shifted_logits = logits[..., :-1, :].contiguous()
    shifted_labels = encoding.input_ids[..., 1:].contiguous()
    shifted_attention_mask = encoding.attention_mask[..., 1:].contiguous()

    ppl = loss_fn(shifted_logits.transpose(1, 2).to("cpu"), shifted_labels) * shifted_attention_mask
    ppl = ppl.sum(1) / shifted_attention_mask.sum(1)
    
    return ppl.to("cpu").float().numpy()

def cross_perplexity(observer_logits, performer_logits, encoding):
    V = observer_logits.shape[-1]
    S = observer_logits.shape[-2]

    performer_probs = softmax_fn(performer_logits).view(-1, V).to("cpu")
    observer_scores = observer_logits.view(-1, V).to("cpu")
    
    xppl = loss_fn(observer_scores, performer_probs).view(-1, S)
    padding_mask = (encoding.input_ids != tokenizer.pad_token_id).type(torch.uint8)
    
    xppl = (xppl * padding_mask).sum(1) / padding_mask.sum(1)
    
    return xppl.to("cpu").float().numpy()

def binocular_score(text):
    batch = [text] if isinstance(text, str) else text
    encodings = tokenize(batch)
    observer_logits, performer_logits = get_logits(encodings)
    ppl = perplexity(encodings, observer_logits)
    xppl = cross_perplexity(observer_logits, performer_logits, encodings)

    return (ppl / xppl).tolist()

tokenizer.pad_token = tokenizer.eos_token
tests = ['''The motivation behind LLM Detection is harm reduction, to trace text origins, block spam, and identify fake news produced by LLMs. *Preemptive detection* methods attempt to "watermark" generated text, but requires full control of the generating models, which already seems to be impossible. Therefore, more recent works have been on *post-hoc detection* methods, which could be used without the cooperation of the text's author. The paper's authors suggest that there are two main groups for post-hoc detectors, the first being finetuning a pretrained language model to perform binary classification. There are many additional techniques that make this approach more effective, but all implementations will require training on text produced by the target model, which is both computationally expensive and limited by the number of new models that are being open-sourced.
The second group uses statistical signatures of machine-generated text, with the aim of zero-shot learning. This would allow for the detection of a wide range of models, with little to no training data. These methods use measures such as perplexity, perplexity curvature, log rank, intrinsic dimensionality, and n-gram analysis. The Binoculars paper proposes a focus on low false positive rate (FPR) and high performance on out-of-domain samples, rather than focusing on classifier AUCs for the high-stakes application of LLM detection.''',
'''Dr. Capy Cosmos, a capybara unlike any other, astounded the scientific community with his 
groundbreaking research in astrophysics. With his keen sense of observation and unparalleled ability to interpret 
cosmic data, he uncovered new insights into the mysteries of black holes and the origins of the universe. As he 
peered through telescopes with his large, round eyes, fellow researchers often remarked that it seemed as if the 
stars themselves whispered their secrets directly to him. Dr. Cosmos not only became a beacon of inspiration to 
aspiring scientists but also proved that intellect and innovation can be found in the most unexpected of creatures.''',
'''We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America.'''
]
binocular_score(tests)
[0.9417475461959839, 0.7566137313842773, 0.4742990732192993]

The strings above are:

  1. Human-written (technical and domain-specific)
  2. Machine-generated (the capybara example from the paper)
  3. Memorized, human-written (the US Constitution)

The threshold suggested by the paper's authors for this model pairing (between 0.85 and 0.9) correctly classifies the first two examples. It is worth noting, however, that text which appears frequently in training data and is memorized by language models, such as the US Constitution, is extremely predictable to most LLMs and has very low perplexity, which leads to the very low Binoculars score of 0.474.

Part 3: A Deeper Look

Let's take another visual look at the capybara prompt. We will examine the perplexity and cross-perplexity of each individual token to probe the mechanics of the method.

import matplotlib.pyplot as plt
plt.rcParams["figure.dpi"] = 300

capybara = '''Dr. Capy Cosmos, a capybara unlike any other, astounded the scientific community with his 
groundbreaking research in astrophysics. With his keen sense of observation and unparalleled ability to interpret 
cosmic data, he uncovered new insights into the mysteries of black holes and the origins of the universe. As he 
peered through telescopes with his large, round eyes, fellow researchers often remarked that it seemed as if the 
stars themselves whispered their secrets directly to him. Dr. Cosmos not only became a beacon of inspiration to 
aspiring scientists but also proved that intellect and innovation can be found in the most unexpected of creatures.'''

encoding = tokenize([capybara])

observer_logits, performer_logits = get_logits(encoding)

S = observer_logits.shape[-2]
V = observer_logits.shape[-1]

(S, V)
(136, 65024)
shifted_logits = observer_logits[..., :-1, :].contiguous()
shifted_labels = encoding.input_ids[..., 1:].contiguous()

ppl = loss_fn(shifted_logits.transpose(1, 2).to("cpu"), shifted_labels).float()

ppl, ppl.sum(1)
(tensor([[7.2266e-01, 9.8750e+00, 6.0312e+00, 9.7500e+00, 1.8438e+00, 2.9688e+00,
          1.9297e+00, 6.9375e+00, 4.1504e-02, 2.5757e-02, 1.1438e+01, 2.7734e-01,
          1.4062e-01, 3.0469e-01, 9.0000e+00, 1.7812e+00, 1.1172e+00, 2.4531e+00,
          7.1875e-01, 1.0859e+00, 1.4297e+00, 4.6875e+00, 9.6875e+00, 5.6250e+00,
          5.2734e-01, 1.6641e+00, 2.2656e+00, 1.9297e+00, 5.2344e-01, 3.3789e-01,
          3.2500e+00, 1.4453e+00, 3.1562e+00, 2.2656e+00, 3.4424e-02, 3.1094e+00,
          1.1172e+00, 3.8906e+00, 3.5625e+00, 2.3730e-01, 5.5000e+00, 5.2188e+00,
          1.1250e+00, 3.1875e+00, 4.4922e-02, 2.4062e+00, 7.4219e-02, 1.6328e+00,
          3.7812e+00, 4.5938e+00, 1.0469e+00, 6.7578e-01, 1.8555e-01, 3.2812e+00,
          4.7852e-02, 3.4531e+00, 8.3008e-03, 5.8984e-01, 6.4453e-01, 1.9766e+00,
          8.9111e-03, 1.7773e-01, 2.9297e-02, 2.8320e-01, 3.5000e+00, 2.5312e+00,
          1.3281e+00, 9.9487e-03, 7.2188e+00, 2.2461e-01, 1.2734e+00, 4.4062e+00,
          2.1973e-01, 3.5000e+00, 9.1797e-01, 3.3281e+00, 4.6875e-01, 2.5625e+00,
          9.2285e-02, 9.5215e-02, 8.6875e+00, 2.5000e+00, 4.8750e+00, 2.4531e+00,
          1.3516e+00, 2.6094e+00, 1.9219e+00, 1.0625e+00, 2.6758e-01, 2.3594e+00,
          7.8906e-01, 2.9053e-02, 1.4688e+00, 5.6250e-01, 5.7500e+00, 2.4375e+00,
          2.5513e-02, 4.7500e+00, 1.2451e-01, 3.0078e-01, 4.0527e-02, 2.6406e+00,
          8.8501e-03, 2.7734e-01, 3.9978e-03, 6.0625e+00, 6.5918e-03, 2.9844e+00,
          8.2812e-01, 4.1250e+00, 1.6699e-01, 5.2812e+00, 1.7812e+00, 1.2734e+00,
          1.5747e-02, 7.1250e+00, 5.4932e-03, 7.3828e-01, 2.0469e+00, 3.5156e-01,
          5.7188e+00, 8.8281e-01, 9.1250e+00, 6.6406e-01, 6.3438e+00, 7.9688e-01,
          1.4453e+00, 9.6191e-02, 4.9609e-01, 6.9922e-01, 1.4746e-01, 8.5938e-01,
          1.5234e+00, 2.7656e+00, 5.0049e-02]]),
 tensor([302.5422]))

To visualize the plain observer perplexity, we can normalize the ppl tensor and generate HTML with perplexity-based shading.

from IPython.display import HTML

normalized_ppl = ppl / torch.max(ppl)

def generate_html(tokens, scores):
    html = "<p>" + tokens[0]
    for token, score in zip(tokens[1:], scores.squeeze().tolist()):
        color_value = 255 * score 
        html += f"<span style='background-color: rgb(255, {255-color_value}, {255-color_value}); color: black;'>{token}</span>"
    html += "</p>"
    return html

tokens = [tokenizer.decode([tok], clean_up_tokenization_spaces=False) for tok in encoding.input_ids.squeeze().tolist()]
html_output = generate_html(tokens, normalized_ppl)

display(HTML(html_output))

(HTML output: the capybara paragraph with each token shaded in proportion to its normalized per-token perplexity.)

A few observations: the beginning of the text clearly catches the language model off guard, until it starts picking up on more predictable patterns (e.g., "any other" right after "unlike", "ybara" right after "Dr. Capy…"). Verbs appear to have particularly high perplexity, since many possible tokens would make sense there. Below is the same visualization for cross-perplexity.

performer_probs = softmax_fn(performer_logits).view(-1, V).to("cpu")
observer_scores = observer_logits.view(-1, V).to("cpu")

xppl = loss_fn(observer_scores[:-1], performer_probs[:-1]).view(-1, S - 1).to("cpu").float()
    
xppl, xppl.sum(1)
(tensor([[3.1406, 7.3750, 5.1875, 7.5312, 3.9375, 5.0312, 6.2188, 7.1250, 0.2852,
          0.2480, 4.4688, 1.2188, 0.9258, 1.6797, 4.4688, 1.8672, 4.0312, 4.6562,
          1.0078, 2.7656, 1.9531, 6.0000, 3.8125, 9.8125, 0.9180, 4.0938, 2.8594,
          4.6875, 1.7031, 1.9375, 4.7500, 3.5312, 7.6250, 4.0000, 0.4648, 5.2500,
          1.2109, 6.4062, 5.1562, 0.8477, 6.4062, 4.0625, 1.4766, 4.5312, 0.7070,
          4.3438, 0.8750, 2.0625, 4.6875, 3.8906, 4.5000, 1.6641, 1.8594, 4.7188,
          0.2734, 1.7812, 0.2559, 1.4141, 4.4062, 5.4375, 0.0659, 1.2578, 0.8359,
          0.8047, 3.6719, 2.9375, 4.2500, 0.1196, 5.6562, 2.8750, 1.8828, 2.0000,
          0.2461, 2.7188, 3.8125, 7.5000, 2.5156, 5.9062, 2.1562, 0.9570, 2.8906,
          2.2500, 5.2500, 5.5625, 2.4219, 2.5469, 1.8281, 2.4062, 1.0938, 2.1719,
          1.6094, 0.1494, 5.5000, 2.5000, 1.7500, 3.6406, 1.3984, 1.6953, 1.1328,
          1.5000, 0.7109, 4.1875, 0.0378, 0.7070, 0.0197, 4.5625, 0.1177, 6.1250,
          2.4844, 4.8438, 0.8867, 3.5625, 1.7812, 1.8125, 0.0718, 4.0938, 0.0422,
          2.8438, 2.4219, 2.2500, 4.0938, 2.2656, 5.2500, 2.3125, 6.1562, 3.0781,
          3.7969, 3.7812, 2.2344, 3.4688, 3.1406, 1.7344, 1.5078, 1.3594, 0.5312]]),
 tensor([399.2843]))
normalized_xppl = xppl / torch.max(xppl)

display(HTML(html_output))

html_output = generate_html(tokens, normalized_xppl)
display(HTML(html_output))

binocular_score = normalized_ppl / normalized_xppl
normalized_binocular_score = binocular_score / torch.max(binocular_score)

html_output = generate_html(tokens, normalized_binocular_score)
display(HTML(html_output))

(HTML outputs: the capybara paragraph rendered three times, with tokens shaded by normalized per-token perplexity, cross-perplexity, and Binoculars score, respectively.)

These show, in order, the per-token perplexity, cross-perplexity, and Binoculars score. The middle output therefore shows how surprised the observer model is by the performer model's predictions. Recalling the definition of the Binoculars score, the final output shows the perplexity given the baseline perplexity induced by the prompt, which dramatically changes how each token is scored. The words with the highest Binoculars scores (red) contribute most to the "human-generated" label, and we find that the capybara-related words are nowhere near the top scorers, while "fellow", "whispered", and "directly" contribute the most to the likelihood that the text was written by a human.

plt.scatter(xppl.float(), ppl.float())
plt.title("Cross-Perplexity vs Perplexity")
plt.xlabel("Cross-Perplexity")
plt.ylabel("Perplexity")
plt.xlim(0, 12)
plt.ylim(0, 12)
plt.show()

(Scatter plot: per-token cross-perplexity vs. perplexity for the capybara text.)

From the scatter plot, we see a higher density of tokens at the low-perplexity, low-cross-perplexity end, with the distribution spreading out as perplexity increases.

Next, we investigate the behavior of a human-written string from the Ghostbuster dataset.

human = '''The healthcare industry typically draws sufficient attention to patients' education, especially when it comes to representatives of minority groups. That is why the article by McCurley et al. (2017) offers valuable information. The researchers demonstrate that Hispanic individuals deal with improved diabetes prevention when they participate in individual and group face-to-face sessions (McCurley et al., 2017). I believe that there is an apparent reason why such positive outcomes are achieved. It seems that face-to-face interventions are effective because patients have an opportunity to ask questions if they require explanations. Simultaneously, such educational sessions demonstrate that a patient is not unique with such a health issue. As a result, such interventions can improve people's morale, which, in turn, will lead to increased motivation to take preventive measures and protect health.'''

encoding = tokenize([human])

observer_logits, performer_logits = get_logits(encoding)

S = observer_logits.shape[-2]
V = observer_logits.shape[-1]
shifted_logits = observer_logits[..., :-1, :].contiguous()
shifted_labels = encoding.input_ids[..., 1:].contiguous()

ppl = loss_fn(shifted_logits.transpose(1, 2).to("cpu"), shifted_labels).float()

normalized_ppl = ppl / torch.max(ppl)

tokens = [tokenizer.decode([tok], clean_up_tokenization_spaces=False) for tok in encoding.input_ids.squeeze().tolist()]
html_output = generate_html(tokens, normalized_ppl)

display(HTML(html_output))

performer_probs = softmax_fn(performer_logits).view(-1, V).to("cpu")
observer_scores = observer_logits.view(-1, V).to("cpu")

xppl = loss_fn(observer_scores[:-1], performer_probs[:-1]).view(-1, S - 1).to("cpu").float()
normalized_xppl = xppl / torch.max(xppl)

html_output = generate_html(tokens, normalized_xppl)
display(HTML(html_output))

binocular_score = normalized_ppl / normalized_xppl
normalized_binocular_score = binocular_score / torch.max(binocular_score)

html_output = generate_html(tokens, normalized_binocular_score)
display(HTML(html_output))

(HTML outputs: the healthcare paragraph rendered three times, with tokens shaded by normalized per-token perplexity, cross-perplexity, and Binoculars score, respectively.)

plt.scatter(xppl.float(), ppl.float())
plt.title("Cross-Perplexity vs Perplexity")
plt.xlabel("Cross-Perplexity")
plt.ylabel("Perplexity")
plt.xlim(0, 12)
plt.ylim(0, 12)
plt.show()

ppl.sum(1) / xppl.sum(1)

(Scatter plot: per-token cross-perplexity vs. perplexity for the human-written text.)

tensor([0.9926])

Compared with the earlier scatter plot, the spread here is much smaller, with more values clustered near zero perplexity and cross-perplexity.

Conclusion

The Binoculars method shows promise as a new way to detect machine-generated text, with potential applications in academic integrity and content moderation. It lays the groundwork for more reliable and fairer AI text-detection tools, achieving a far lower false positive rate (FPR) than other detection methods and services such as GPTZero.

  1. The paper's GitHub repository appears to contain an error in its choice of performer/observer models: in the code, the performer model is used to compute perplexity and to observe the observer model, whereas the paper states that this is done by the observer.
