Sentiment Analysis on Encrypted Data with Homomorphic Encryption

Published November 17, 2022

Jordan Frery

jfrery-zama

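A sentiment analysis model determines whether a piece of text is positive, negative, or neutral. However, this process typically requires access to the unencrypted text, which can raise privacy concerns.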

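Homomorphic encryption is a type of encryption that allows computation on encrypted data without decrypting it first. This makes it well suited to applications where users' personal and potentially sensitive data is at risk (for example, sentiment analysis of private messages).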
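To make the idea of computing on ciphertexts concrete, here is a minimal sketch of a toy additively homomorphic scheme (textbook Paillier) in pure Python. It is only an illustration: the primes are tiny and insecure, the function names are hypothetical, and it supports addition only, whereas Concrete-ML relies on a different, fully homomorphic scheme.

```python
import math
import random

# Toy Paillier parameters: tiny primes, for illustration only (completely insecure).
P, Q = 17, 19
N = P * Q
N_SQ = N * N
G = N + 1
LAMBDA = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P - 1, Q - 1)
MU = pow(LAMBDA, -1, N)  # modular inverse; this shortcut holds because G = N + 1


def encrypt(message: int) -> int:
    """Encrypt an integer in [0, N) with fresh randomness."""
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (pow(G, message, N_SQ) * pow(r, N, N_SQ)) % N_SQ


def decrypt(ciphertext: int) -> int:
    """Recover the plaintext from a ciphertext."""
    u = pow(ciphertext, LAMBDA, N_SQ)
    return ((u - 1) // N * MU) % N


def add_encrypted(c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts (mod N)."""
    return (c1 * c2) % N_SQ


# The party doing the addition never sees 5 or 7, only their ciphertexts.
encrypted_sum = add_encrypted(encrypt(5), encrypt(7))
print(decrypt(encrypted_sum))
```

The point of the sketch is the last two lines: the sum is computed on ciphertexts and only the key holder can read the result.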

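This blog post uses the Concrete-ML library, which lets data scientists use machine learning models in fully homomorphic encryption (FHE) settings without any prior knowledge of cryptography. We provide a practical tutorial on how to use the library to build a sentiment analysis model that runs on encrypted data.

The post covers:

transformers
how to use transformers with XGBoost to perform sentiment analysis
how to do the training
how to use Concrete-ML to turn predictions into predictions over encrypted data
how to deploy to the cloud using a client/server protocol

Last but not least, we finish with a complete demo on Hugging Face Spaces to show this functionality in action.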

Setting up the environment

First, make sure your pip and setuptools are up to date by running:

pip install -U pip setuptools

Now we can install all the libraries required for this blog post with the following command.

pip install concrete-ml transformers datasets

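Using a public dataset

The dataset we use in this notebook can be found here.

To represent the text for sentiment analysis, we chose to use a transformer's hidden representation, as it yields high accuracy for the final model in a very efficient way. For a comparison of this representation with a more common approach such as TF-IDF, please see this full notebook.

We can start by opening the dataset and visualizing some statistics.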

from datasets import load_dataset
train = load_dataset("osanseviero/twitter-airline-sentiment")["train"].to_pandas()
text_X = train['text']
y = train['airline_sentiment']
y = y.replace(['negative', 'neutral', 'positive'], [0, 1, 2])
pos_ratio = y.value_counts()[2] / y.value_counts().sum()
neg_ratio = y.value_counts()[0] / y.value_counts().sum()
neutral_ratio = y.value_counts()[1] / y.value_counts().sum()
print(f'Proportion of positive examples: {round(pos_ratio * 100, 2)}%')
print(f'Proportion of negative examples: {round(neg_ratio * 100, 2)}%')
print(f'Proportion of neutral examples: {round(neutral_ratio * 100, 2)}%')

The output then looks like this:

Proportion of positive examples: 16.14%
Proportion of negative examples: 62.69%
Proportion of neutral examples: 21.17%

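The proportions of positive and neutral examples are fairly close, but there are far more negative examples. Let's keep this in mind when selecting the final evaluation metric.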
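As a rough illustration of why the imbalance matters for metric selection, the sketch below compares plain accuracy with balanced accuracy (the mean of per-class recall) for a trivial classifier that always predicts the majority class. The labels are hypothetical and only approximate the proportions above; the helper names are not from the original notebook.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Hypothetical labels: 63 negative (0), 21 neutral (1), 16 positive (2)
y_true = [0] * 63 + [1] * 21 + [2] * 16
# A useless model that always answers "negative"
y_pred = [0] * len(y_true)

print(f"Accuracy: {accuracy(y_true, y_pred):.2f}")
print(f"Balanced accuracy: {balanced_accuracy(y_true, y_pred):.2f}")
```

The always-negative model reaches 0.63 accuracy but only about 0.33 balanced accuracy, so any reported accuracy should be read against the 62.69% majority-class baseline.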

Now we can split the dataset into training and test sets. We use a seed for this code to ensure it is perfectly reproducible.

from sklearn.model_selection import train_test_split
text_X_train, text_X_test, y_train, y_test = train_test_split(text_X, y,
    test_size=0.1, random_state=42)

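Text representation using a transformer

Transformers are neural networks that are often trained to predict missing or upcoming words in a text (a form of self-supervised learning). They can also be fine-tuned on specific subtasks so that they specialize and perform better on a given problem.

They are powerful tools for all kinds of natural language processing tasks. In fact, we can leverage their representation of any text and feed it to a more FHE-friendly machine learning model for classification. In this notebook, we will use XGBoost.

We start by importing the requirements for transformers. Here, we use the popular library from Hugging Face to get a transformer quickly.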

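The model we have chosen is a RoBERTa-base transformer (a BERT variant) trained on tweets and fine-tuned for sentiment analysis, as the checkpoint name in the code below indicates.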

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# Load the tokenizer (converts text to tokens)
tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment-latest")

# Load the pre-trained model
transformer_model = AutoModelForSequenceClassification.from_pretrained(
   "cardiffnlp/twitter-roberta-base-sentiment-latest"
)

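This should download the model, which is now ready to be used.

Using the hidden representation of a text can be tricky at first, mainly because there are many different ways to go about it. Below is the approach we chose.

First, we tokenize the text. Tokenizing means splitting the text into tokens (a token can be a word or a specific sequence of characters) and replacing each token with a number. Then, we send the tokenized text to the transformer model, which outputs a hidden representation for each token (the output of the self-attention layers, which is often used as input to the classification layers). Finally, we average the representations of all tokens to get a text-level representation.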

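The result is a matrix of shape (number of examples, hidden size). The hidden size is the number of dimensions in the hidden representation. For BERT-base and RoBERTa-base models, the hidden size is 768. The hidden representation is a vector of numbers that represents the text and can be used for many different tasks. In this case, we will use it for classification with XGBoost.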
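The averaging step can be sketched on toy data before looking at the real function. The vectors below are hypothetical stand-ins for hidden states (hidden size 4 instead of 768), and mean_pool is an illustrative helper, not part of the original notebook.

```python
def mean_pool(token_vectors):
    """Average a list of per-token vectors into one text-level vector."""
    n_tokens = len(token_vectors)
    return [sum(dimension) / n_tokens for dimension in zip(*token_vectors)]


# Hypothetical hidden states for two texts with different token counts
texts_hidden_states = [
    [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0]],                        # 2 tokens
    [[0.0, 0.0, 0.0, 0.0], [3.0, 3.0, 3.0, 3.0], [6.0, 6.0, 6.0, 6.0]],  # 3 tokens
]

features = [mean_pool(hidden_states) for hidden_states in texts_hidden_states]

# One fixed-size row per text, regardless of how many tokens it had
print(len(features), len(features[0]))
print(features)
```

Whatever the length of the text, each row has the same number of columns, which is what allows a tabular model such as XGBoost to consume the output.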

import numpy as np
import tqdm
# Function that transforms a list of texts to their representation
# learned by the transformer.
def text_to_tensor(
   list_text_X_train: list,
   transformer_model: AutoModelForSequenceClassification,
   tokenizer: AutoTokenizer,
   device: str,
) -> np.ndarray:
   # Tokenize each text in the list one by one
   tokenized_text_X_train_split = [
       tokenizer.encode(text_x_train, return_tensors="pt")
       for text_x_train in list_text_X_train
   ]

   # Send the model to the device
   transformer_model = transformer_model.to(device)
   output_hidden_states_list = [None] * len(tokenized_text_X_train_split)

   for i, tokenized_x in enumerate(tqdm.tqdm(tokenized_text_X_train_split)):
       # Pass the tokens through the transformer model and get the hidden states
       # Only keep the last hidden layer state for now
       outputs = transformer_model(tokenized_x.to(device), output_hidden_states=True)
       output_hidden_states = outputs.hidden_states[-1]
       # Average over the tokens axis to get a representation at the text level.
       output_hidden_states = output_hidden_states.mean(dim=1)
       output_hidden_states = output_hidden_states.detach().cpu().numpy()
       output_hidden_states_list[i] = output_hidden_states

   return np.concatenate(output_hidden_states_list, axis=0)

# Let's vectorize the text using the transformer
list_text_X_train = text_X_train.tolist()
list_text_X_test = text_X_test.tolist()

X_train_transformer = text_to_tensor(list_text_X_train, transformer_model, tokenizer, device)
X_test_transformer = text_to_tensor(list_text_X_test, transformer_model, tokenizer, device)

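This transformation of the text (from text to transformer representation) needs to be executed on the client machine, because the encryption is applied to the transformer representation.

Classifying with XGBoost

Now that our training and test sets are properly built, the next step is to train our FHE model. This is straightforward with a hyperparameter tuning tool such as GridSearchCV from scikit-learn.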

from concrete.ml.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV
# Let's build our model
model = XGBClassifier()

# A gridsearch to find the best parameters
parameters = {
    "n_bits": [2, 3],
    "max_depth": [1],
    "n_estimators": [10, 30, 50],
    "n_jobs": [-1],
}

# Now we have a representation for each tweet, we can train a model on these.
grid_search = GridSearchCV(model, parameters, cv=5, n_jobs=1, scoring="accuracy")
grid_search.fit(X_train_transformer, y_train)

# Check the accuracy of the best model
print(f"Best score: {grid_search.best_score_}")

# Check best hyperparameters
print(f"Best parameters: {grid_search.best_params_}")

# Extract best model
best_model = grid_search.best_estimator_

The output is as follows:

Best score: 0.8378111718275654
Best parameters: {'max_depth': 1, 'n_bits': 3, 'n_estimators': 50, 'n_jobs': -1}
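The n_bits entries in the grid are small because values must be quantized to a few bits before they can be processed in FHE. As a rough illustration only, and assuming n_bits refers to the bit width of this kind of quantization, the sketch below maps floats onto 2**n_bits integer levels with a simple uniform min/max scheme. This is not Concrete-ML's actual quantizer, and the function name is hypothetical.

```python
def quantize(values, n_bits):
    """Map floats onto integer levels 0 .. 2**n_bits - 1 (uniform, min/max based)."""
    low, high = min(values), max(values)
    max_level = 2 ** n_bits - 1
    if high == low:
        return [0] * len(values)
    return [round((v - low) * max_level / (high - low)) for v in values]


feature_values = [0.0, 0.1, 0.9, 1.0]  # hypothetical feature column

print(quantize(feature_values, n_bits=2))  # 4 levels
print(quantize(feature_values, n_bits=3))  # 8 levels
```

With 2 bits, 0.0 and 0.1 collapse onto the same level, while 3 bits keep them apart. Fewer bits lose more detail, which is one plausible reason the grid search above settles on n_bits=3.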

Now, let's see how the model performs on the test set.

from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
# Compute the metrics on the test set
y_pred = best_model.predict(X_test_transformer)
y_proba = best_model.predict_proba(X_test_transformer)

# Compute and plot the confusion matrix
matrix = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(matrix).plot()

# Compute the accuracy
accuracy_transformer_xgboost = np.mean(y_pred == y_test)
print(f"Accuracy: {accuracy_transformer_xgboost:.4f}")

The output is as follows:

Accuracy: 0.8504

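Predicting over encrypted data

Now let's predict over encrypted text. The idea is that we encrypt the representation given by the transformer rather than the raw text itself. In Concrete-ML, you can do this quickly by setting the parameter execute_in_fhe=True in the predict function. This is only a developer feature (mainly used to check the running time of the FHE model). We will see how to make this work in a deployment setting further down.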

import time
# Compile the model to get the FHE inference engine
# (this may take a few minutes depending on the selected model)
start = time.perf_counter()
best_model.compile(X_train_transformer)
end = time.perf_counter()
print(f"Compilation time: {end - start:.4f} seconds")

# Let's write a custom example and predict in FHE
tested_tweet = ["AirFrance is awesome, almost as much as Zama!"]
X_tested_tweet = text_to_tensor(tested_tweet, transformer_model, tokenizer, device)
clear_proba = best_model.predict_proba(X_tested_tweet)

# Now let's predict with FHE over a single tweet and print the time it takes
start = time.perf_counter()
decrypted_proba = best_model.predict_proba(X_tested_tweet, execute_in_fhe=True)
end = time.perf_counter()
fhe_exec_time = end - start
print(f"FHE inference time: {fhe_exec_time:.4f} seconds")

The output becomes:

Compilation time: 9.3354 seconds
FHE inference time: 4.4085 seconds

It is also necessary to check that the FHE predictions are the same as the clear predictions.

print(f"Probabilities from the FHE inference: {decrypted_proba}")
print(f"Probabilities from the clear model: {clear_proba}")

This output reads:

Probabilities from the FHE inference: [[0.08434131 0.05571389 0.8599448 ]]
Probabilities from the clear model: [[0.08434131 0.05571389 0.8599448 ]]
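Rather than comparing the two printed arrays by eye, the check can be automated. The following is a minimal sketch that uses the probabilities printed above as plain lists; the helper names and the tolerance are illustrative assumptions, not part of the original notebook.

```python
import math

# Values copied from the printed output above
fhe_proba = [0.08434131, 0.05571389, 0.8599448]
clear_proba = [0.08434131, 0.05571389, 0.8599448]


def probabilities_match(a, b, abs_tol=1e-6):
    """True if both vectors have the same length and agree element-wise."""
    return len(a) == len(b) and all(
        math.isclose(x, y, abs_tol=abs_tol) for x, y in zip(a, b)
    )


def predicted_class(proba):
    """Index of the highest probability (0=negative, 1=neutral, 2=positive)."""
    return max(range(len(proba)), key=proba.__getitem__)


print(probabilities_match(fhe_proba, clear_proba))
print(predicted_class(fhe_proba) == predicted_class(clear_proba))
```

Checking the predicted class as well as the raw probabilities is useful because the class is what the end user ultimately sees.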

Deployment

At this point, our model is fully trained and compiled, ready to be deployed. In Concrete-ML, you can use a deployment API to do this easily.

# Let's save the model to be pushed to a server later
from concrete.ml.deployment import FHEModelDev
fhe_api = FHEModelDev("sentiment_fhe_model", best_model)
fhe_api.save()

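These few lines are enough to export all the files needed by both the client and the server. You can check out the notebook explaining this deployment API in detail here.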